Transcript of tdt4260
TDT 4260 – lecture 1 – 2011
• Course introduction
– course goals – staff – contents – evaluation – web, ITSL
1 Lasse Natvig
• Textbook – Computer Architecture, A Quantitative Approach, Fourth Edition
• by John Hennessy & David Patterson (editions 1990, 1996, 2003, 2006)
• Today: Introduction (Chapter 1) – partly covered
Course goal
• To get a general and deep understanding of the organization of modern computers and the motivation for different computer architectures. Give a basis for understanding research themes within the field.
• High level • Mostly HW and low-level SW • HW/SW interplay • Parallelism • Principles, not details
• Inspire to learn more
Contents
• Computer architecture fundamentals, trends, measuring performance, quantitative principles. Instruction set architectures and the role of compilers. Instruction-level parallelism, thread-level parallelism, VLIW.
• Memory hierarchy design, cache. Multiprocessors, shared memory architectures, vector processors, NTNU/Notur supercomputers, distributed shared memory, synchronization, multithreading.
• Interconnection networks, topologies
• Multicores, homogeneous and heterogeneous, principles and product examples
• Green computing (introduction)
• Miniproject – prefetching
TDT-4260 / DT8803
• Recommended background
– Course TDT4160 Computer Fundamentals, or equivalent.
• http://www.idi.ntnu.no/emner/tdt4260/ – and It's Learning
• Friday 1215-1400 – and/or some Thursdays 1015-1200
– 12 lectures planned
– some exceptions may occur
• Evaluation
– Obligatory exercise (counts 20%). Written exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit examination, the examination form may change from written to oral.
Lecture plan (subject to change)
Date and lecturer / Topic
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 28 Jan (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG?) Prefetching + Energy Micro guest lecture
6: 18 Feb (LN) Multiprocessors continued
7: 25 Feb (IB) Piranha CMP + interconnection networks
8: 4 Mar (IB) Memory and cache, cache coherence (Chap. 5)
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 1 Apr (IB/LN) Wrap-up lecture, remaining stuff
13: 8 Apr Slack – no lecture planned
EMECS, new European Master's Course in Embedded Computing Systems
Preliminary reading list, subject to change!!!
• Chap. 1: Fundamentals, sections 1.1 - 1.12 (pages 2-54)
• Chap. 2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11 - 2.12 (pages 138-141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course)
• Chap. 3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5 - 3.8 (pages 172-185).
• Chap. 4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10
• Chap. 5: Memory hierarchy, sections 5.1 - 5.3 (pages 288-315).
• App. A: section A.1 (expected to be repetition from other courses)
• Appendix E, interconnection networks, pages E2-E14, E20-E25, E29-E37 and E45-E51.
• App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F-44 - F-45)
• Data prefetch mechanisms (ACM Computing Surveys)
• Piranha (to be announced)
• Multicores (new book chapter) (to be announced)
• (App. D: embedded systems?) see our new course TDT4258 Microcontroller System Design
People involved
Lasse Natvig: course responsible, [email protected]
Ian Bratt: lecturer (also at Tilera.com), [email protected]
Alexandru Iordan: teaching assistant (also PhD student), [email protected]
http://www.idi.ntnu.no/people/
research.idi.ntnu.no/multicore
Some few highlights:
- Green computing, 2 x PhD + master students
- Multicore memory systems, 3 x PhD theses
- Multicore programming and parallel computing
- Cooperation with industry
Prefetching – pfjudge
"Computational computer architecture"
• Computational science and engineering (CSE)
– Computational X, X = comp.arch.
• Simulates new multicore architectures
– Last-level, shared cache fairness (PhD student M. Jahre)
– Bandwidth-aware prefetching (PhD student M. Grannæs)
• Complex cycle-accurate simulators
– 80 000 lines C++, 20 000 lines Python
– Open source, Linux-based
• Design space exploration (DSE)
– one dimension for each arch. parameter
– DSE sample point = specific multicore configuration
– performance of a selected set of configurations evaluated by simulating the execution of a set of workloads
Experiment Infrastructure• Stallo compute cluster
– 60 Teraflop/s peak
– 5632 processing cores
– 12 TB total memory
– 128 TB centralized disk
– Weighs 16 tons
• Multi-core research– About 60 CPU years allocated per
year to our projects
– Typical research paper uses 5 to 12 CPU years for simulation (extensive, detailed design space exploration)
The end of Moore's law for single-core microprocessors
But Moore's law still holds for FPGAs, memory and multicore processors
Motivational background
• Why multicores
– in all market segments, from mobile phones to supercomputers
• The "end" of Moore's law
• The power wall
• The memory wall
• The bandwidth problem
• ILP limitations
• The complexity wall
Energy & heat problems
• Large power consumption
– Costly
– Heat problems
– Restricted battery operation time
• Google "Open House Trondheim 2006"
– "Performance/Watt is the only flat trend line"
The Memory Wall
[Figure: processor vs. DRAM performance, 1980-2000, log scale ("Moore's Law"): CPU performance grows 60%/year, DRAM only 9%/year, so the processor-memory gap grows roughly 50%/year]
• The processor-memory gap
• Consequence: deeper memory hierarchies
– P – Registers – L1 cache – L2 cache – L3 cache – Memory - - -
– Complicates understanding of performance
• cache usage has an increasing influence on performance
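The "grows 50%/year" figure follows directly from the two growth rates on the slide; a quick sanity check in C:

```c
/* Yearly growth factors from the slide: CPU +60%/year, DRAM +9%/year.
   The processor-memory gap widens by their ratio every year. */
double gap_growth(double cpu_growth, double dram_growth) {
    return cpu_growth / dram_growth;
}
/* gap_growth(1.60, 1.09) is about 1.47, i.e. roughly 50% per year */
```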
The I/O pin (bandwidth) problem
• # I/O signaling pins
– limited by physical technology
– speeds have not increased at the same rate as processor clock rates
• Projections
– from ITRS (International Technology Roadmap for Semiconductors)
[Huh, Burger and Keckler 2001]
The limitations of ILP (Instruction Level Parallelism) in applications
[Figure: two plots. Left: fraction of total cycles (%) vs. number of instructions issued per cycle (0 to 6+). Right: speedup vs. instructions issued per cycle (0-15), with the speedup curve flattening out around 3x]
Reduced Increase in Clock Frequency
Solution: multicore architectures (also called chip multiprocessors, CMP)
• More power-efficient
– Two cores with clock frequency f/2 can potentially achieve the same speed as one core at frequency f, with a 50% reduction in total energy consumption [Olukotun & Hammond 2005]
• Exploits thread-level parallelism (TLP)
– in addition to ILP
– requires multiprogramming or parallel programming
• Opens new possibilities for architectural innovations
Why heterogeneous multicores?
• Specialized HW is faster than general HW
– Math co-processor
– GPU, DSP, etc.
• Benefits of customization
– Similar to ASIC vs. general-purpose programmable HW
• Amdahl's law
– Parallel speedup limited by serial fraction
• 1 super-core
[Figure: Cell BE processor]
CPU-GPU convergence (performance vs. programmability)
Processors: Larrabee, Fermi, ...
Languages: CUDA, OpenCL, ...
Parallel processing – conflicting goals
The P6 model: parallel processing challenges: Performance, Portability, Programmability and Power efficiency
• Examples:
– Performance tuning may reduce portability
• E.g. data structures adapted to cache block size
– New languages for higher programmability may reduce performance and increase power consumption
Multicore programming challenges
• Instability, diversity, conflicting goals ... what to do?
• What kind of parallel programming?
– Homogeneous vs. heterogeneous
– DSL vs. general languages
– Memory locality
• What to teach?
– Teaching should be founded on active research
• Two layers of programmers
– The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]
• Krste Asanovic presentation at ACACES Summer School 2007
– 1) Programmability layer (productivity layer) (80-90%)
• "Joe the programmer"
– 2) Performance layer (efficiency layer) (10-20%)
• Both layers involved in HPC
• Programmability an issue also at the performance layer
[Figure: Parallel Computing Laboratory, U.C. Berkeley (slide adapted from Dave Patterson). Goal: easy to write correct programs that run efficiently on manycore. Stack from top to bottom: applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser); Design Patterns/Motifs; Composition & Coordination Language (C&CL) with a C&CL compiler/interpreter; Parallel Libraries and Parallel Frameworks; Sketching; Autotuners; Efficiency Languages with Efficiency Language Compilers; Legacy Code; Schedulers; Communication & Synch. Primitives; Diagnosing Power/Performance; Legacy OS; OS Libraries & Services; Hypervisor; Multicore/GPGPU; RAMP Manycore]
Classes of computers
• Servers
– storage servers
– compute servers (supercomputers)
– web servers
– high availability
– scalability
– throughput oriented (response time of less importance)
• Desktop (price 3 000 NOK – 50 000 NOK)
– the largest market
– price/performance focus
– latency oriented (response time)
• Embedded systems
– the fastest growing market ("everywhere")
– TDT 4258 Microcontroller system design
– ATMEL, Nordic Semic., ARM, EM, ++
Falanx (Mali) ARM Norway
Borgar, FXI Technologies
"An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and applications ecosystems (i.e. build an ARM-based SoC, put it in a memory card, connect it to the web, and voila, you have an iPhone for the masses)."
• http://www.fxitech.com/
– "Headquartered in Trondheim
• But also an office in Silicon Valley ..."
Trends
• For technology, costs, use
• Help predicting the future
• Product development time: 2-3 years
– design for the next technology
– Why should an architecture live longer than a product?
Comp. arch. is an integrated approach
• What really matters is the functioning of the complete system
– hardware, runtime system, compiler, operating system, and application
– In networking, this is called the "end-to-end argument"
• Computer architecture is not just about transistors, individual instructions, or particular implementations
– E.g., the original RISC projects replaced complex instructions with a compiler + simple instructions
Computer architecture is design and analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: design/analysis cycle: creativity produces good, mediocre and bad ideas; cost/performance analysis filters them]
TDT4260 course focus
Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century
[Figure: computer architecture (organization, hardware/software boundary, interface design (ISA), measurement & evaluation) at the intersection of technology, programming languages, parallelism, operating systems, compilers, applications and history]
Holistic approach, e.g., to programmability
Multicore, interconnect, memory
Operating System & system software
Parallel & concurrent programming
Moore's Law: 2x transistors / "year"
• "Cramming More Components onto Integrated Circuits"
– Gordon Moore, Electronics, 1965
• # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
Tracking technology performance trends
• 4 critical implementation technologies:
– Disks
– Memory
– Network
– Processors
• Compare bandwidth vs. latency improvements in performance over time
• Bandwidth: number of events per unit time
– E.g., Mbits/second over network, Mbytes/second from disk
• Latency: elapsed time for a single event
– E.g., one-way network delay in microseconds, average disk access time in milliseconds
Latency lags bandwidth (last ~20 years)
[Figure: log-log plot of relative bandwidth (1-10000) vs. relative latency improvement (1-100) for processor, memory, network and disk; the reference line "latency improvement = bandwidth improvement" lies far above all four]
• Performance milestones
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
• Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
• Memory module: 16-bit plain DRAM, page mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
CPU high, memory low ("memory wall")
(Processor latency = typical # of pipeline stages * time per clock cycle)
COST and COTS
• Cost
– to produce one unit
– include (development cost / # sold units)
– benefit of large volume
• COTS
– commodity off the shelf
Speedup
• General definition:
Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
– Speedup_fixed_problem(p processors) = Time(1 processor) / Time(p processors)
• Superlinear speedup?
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in computation
– serial fraction s
– parallel fraction p
– s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
= (s + p) / [s + (p/n)]
= 1 / [s + (1-s)/n]
= n / [1 + (n-1)s]
• "pessimistic and famous"
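The formula above is easy to tabulate; a minimal C helper:

```c
/* Amdahl's law: speedup on n processors for serial fraction s,
   S(n) = 1 / (s + (1-s)/n) */
double amdahl(double s, double n) {
    return 1.0 / (s + (1.0 - s) / n);
}
/* e.g. amdahl(0.05, 100) is about 16.8: even 100 processors cannot beat
   the 1/s = 20 bound when 5% of the work is serial */
```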
Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on a parallel computer with n processors is fixed
– serial fraction s'
– parallel fraction p'
– s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
= (s' + p'n) / (s' + p')
= s' + p'n
= s' + (1-s')n
= n + (1-n)s'
• Reevaluating Amdahl's Law, John L. Gustafson, CACM May 1988, pp. 532-533. "Not a new law, but Amdahl's law with changed assumptions"
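The scaled-speedup formula makes a striking contrast with the Amdahl numbers; a minimal sketch:

```c
/* Gustafson's scaled speedup: n processors, with s being the serial
   fraction of the (fixed) parallel execution time:
   S'(n) = n + (1-n)s */
double gustafson(double s, double n) {
    return n + (1.0 - n) * s;
}
/* e.g. gustafson(0.05, 100) = 95.05: with the problem scaled up,
   5% serial time still allows near-linear speedup */
```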
How the serial fraction limits speedup
• Amdahl's law
• Work hard to reduce the serial part of the application
– remember I/O
– think differently (than traditionally or sequentially)
[Figure: speedup curves vs. number of processors for varying serial fraction]
TDT4260 Computer Architecture – Mini-project
PhD candidate Alexandru Ciprian Iordan, Institutt for datateknikk og informasjonsvitenskap
What is it...? How much...?
• The mini-project is the exercise part of the TDT4260 course
• This year the students will need to develop and evaluate a PREFETCHER
• The mini-project accounts for 20% of the final grade in TDT4260
• 80% for report
• 20% for oral presentation
What will you work with…
• Modified version of M5 (for development and evaluation)
• Computing time on Kongull cluster (for benchmarking)
• More at: http://dm-ark.idi.ntnu.no/
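As a taste of the exercise, a sequential (next-line) prefetcher can be sketched as below. The hook name, block size and prefetch degree are made up for illustration; the real M5/pfjudge interface is documented on the course page.

```c
#include <stddef.h>

#define BLOCK_SIZE   64   /* cache block size in bytes (assumed) */
#define MAX_PREFETCH 2    /* prefetch degree (assumed) */

typedef unsigned long addr_t;

/* Hypothetical hook called on every cache access (NOT the real M5 API).
   Fills `out` with up to MAX_PREFETCH block addresses to prefetch and
   returns how many were generated. */
size_t next_line_prefetcher(addr_t access_addr, addr_t out[MAX_PREFETCH]) {
    addr_t block = access_addr / BLOCK_SIZE;      /* align to block */
    for (size_t i = 0; i < MAX_PREFETCH; i++)
        out[i] = (block + 1 + i) * BLOCK_SIZE;    /* the next blocks */
    return MAX_PREFETCH;
}
```

A real submission would add filtering (don't prefetch what is already cached or in flight) and perhaps stride detection; this sketch only shows the shape of the problem.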
M5
• Initially developed by the University of Michigan
• Enjoys a large community of users and developers
• Flexible object-oriented architecture
• Supports 3 ISAs: ALPHA, SPARC and MIPS
Team work…
• You need to work in groups of 2-4 students
• Grade is based on written paper AND oral presentation (choose your best speaker)
Time Schedule and Deadlines
More on It's Learning
Web page presentation
TDT 4260 – App A.1, Chap 2
Instruction Level Parallelism
Contents
• Instruction level parallelism Chap 2
• Pipelining (repetition) App A
▫ Basic 5-step pipeline
• Dependencies and hazards Chap 2.1
▫ Data, name, control, structural
• Compiler techniques for ILP Chap 2.2
• (Static prediction Chap 2.3)
▫ Read this on your own
• Project introduction
Instruction level parallelism (ILP)
• A program is a sequence of instructions, typically written to be executed one after the other
• Poor usage of CPU resources! (Why?)
• Better: Execute instructions in parallel
▫ 1: Pipeline – partial overlap of instruction execution
▫ 2: Multiple issue – total overlap of instruction execution
• Today: Pipelining
Pipelining (1/3)
Pipelining (2/3)
• Multiple different stages executed in parallel
▫ Laundry in 4 different stages
▫ Wash / Dry / Fold / Store
• Assumptions:
▫ Task can be split into stages
▫ Storage of temporary data
▫ Stages synchronized
▫ Next operation known before last finished?
Pipelining (3/3)
• Good Utilization: All stages are ALWAYS in use
▫ Washing, drying, folding, ...
▫ Great usage of resources!
• Common technique, used everywhere
▫ Manufacturing, CPUs, etc
• Ideal: time_stage = time_instruction / stages
▫ But stages are not perfectly balanced
▫ But transfer between stages takes time
▫ But pipeline may have to be emptied
▫ ...
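The ideal-vs-real point in the bullets above can be made concrete; a sketch with made-up stage times:

```c
/* Pipelined execution time: the clock is set by the slowest stage plus
   latch (transfer) overhead; the first instruction fills the pipe, then
   one instruction completes per cycle. */
double pipelined_time(const double stage_ns[], int stages,
                      double latch_ns, long instructions) {
    double cycle = 0.0;
    for (int i = 0; i < stages; i++)
        if (stage_ns[i] > cycle) cycle = stage_ns[i];  /* slowest stage */
    cycle += latch_ns;                                 /* stage transfer */
    return (stages + instructions - 1) * cycle;
}
```

With illustrative stage times {10, 8, 10, 10, 7} ns and 1 ns latch overhead, 1000 instructions take 1004 * 11 = 11044 ns, against 45 * 1000 = 45000 ns unpipelined: about 4.1x, not the ideal 5x, exactly because the stages are unbalanced and transfers cost time.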
Example: MIPS64 (1/2)
• RISC
• Load/store
• Few instruction formats
• Fixed instruction length
• 64-bit
▫ DADD = 64-bit ADD
▫ LD = 64-bit L(oad)
• 32 registers (R0 = 0)
• EA = offset(Register)
• Pipeline
▫ IF: Instruction fetch
▫ ID: Instruction decode / register fetch
▫ EX: Execute / effective address (EA)
▫ MEM: Memory access
▫ WB: Write back (reg)
Example: MIPS64 (2/2)
[Figure: four instructions flowing through the 5-stage pipeline (Ifetch, Reg, ALU, DMem, Reg) across clock cycles 1-7, one instruction entering per cycle]
Big picture:
• What are some real-world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?

Big picture (continued):
• Computer architecture is the study of design tradeoffs!!!!
• There is no "philosophy of architecture" and no "perfect architecture". This is engineering, not science.
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?
Improve speedup?
• Why not perfect speedup?
▫ Sequential programs
▫ One instruction dependent on another
▫ Not enough CPU resources
• What can be done?
▫ Forwarding (HW)
▫ Scheduling (SW / HW)
▫ Prediction (SW / HW)
• Both hardware (dynamic) and compiler (static) can help
Dependencies and hazards
• Dependencies
▫ Parallel instructions can be executed in parallel
▫ Dependent instructions are not parallel
  I1: DADD R1, R2, R3
  I2: DSUB R4, R1, R5
▫ Property of the instructions
• Hazards
▫ Situation where a dependency causes an instruction to give a wrong result
▫ Property of the pipeline
▫ Not all dependencies give hazards
  Dependencies must be close enough in the instruction stream to cause a hazard
Dependencies
• (True) data dependencies
▫ One instruction reads what an earlier has written
• Name dependencies
▫ Two instructions use the same register / memory location
▫ But no flow of data between them
▫ Two types: Anti and output dependencies
• Control dependencies
▫ Instructions dependent on the result of a branch
• Again: Independent of pipeline implementation
Hazards
• Data hazards
▫ Overlap will give different result from sequential
▫ RAW / WAW / WAR
• Control hazards
▫ Branches
▫ Ex: Started executing the wrong instruction
• Structural hazards
▫ Pipeline does not support this combination of instr.
▫ Ex: Register with one port, two stages want to read
Data dependency – hazard? (Figure A.6, page A-16)
[Figure: add r1,r2,r3 followed by sub r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11 in the 5-stage pipeline; the instructions reading r1 reach their register-read stage before add has written r1]
Data Hazards (1/3)
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
• Caused by a true data dependency
• This hazard results from an actual need for communication

Data Hazards (2/3)
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
• Caused by an anti dependency; results from reuse of the name "r1"
• Can't happen in the MIPS 5-stage pipeline because:
▫ All instructions take 5 stages, and
▫ Reads are always in stage 2, and
▫ Writes are always in stage 5

Data Hazards (3/3)
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
  I: sub r1,r4,r3
  J: add r1,r2,r3
• Caused by an output dependency
• Can't happen in the MIPS 5-stage pipeline because:
▫ All instructions take 5 stages, and
▫ Writes are always in stage 5
• WAR and WAW can occur in more complicated pipes
Forwarding (Figure A.7, page A-18)
IF ID/RF EX MEM WB
[Figure: add r1,r2,r3 / sub r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11, with forwarding paths from the add's ALU output directly to the ALU inputs of the dependent instructions, removing the stalls]
Can all data hazards be solved via forwarding???
IF ID/RF EX MEM WB
[Figure: Ld r1,r2 followed by add r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11; the load's data is not available until after MEM, so forwarding alone cannot cover the immediately following add]
Structural Hazards (Memory Port) (Figure A.4, page A-14)
[Figure: Load followed by Instr 1-4 in the 5-stage pipeline across cycles 1-7; with a single memory port, the Load's DMem access collides with a later instruction's Ifetch in the same cycle]
Hazards, bubbles (similar to Figure A.5, page A-15)
[Figure: Ld r1,r2 followed by Add r1,r1,r1; the dependent add is stalled and bubbles propagate through the pipeline across cycles 1-7]
How do you "bubble" the pipe? How can we avoid this hazard?
Control hazards (1/2)
• Sequential execution is predictable, (conditional) branches are not
• May have fetched instructions that should not be executed
• Simple solution (figure): stall the pipeline (bubble)
▫ Performance loss depends on the number of branches in the program and the pipeline implementation
▫ Branch penalty
[Figure: possibly wrong instruction fetched, then the correct instruction once the branch resolves]
Control hazards (2/2)
• What can be done?
▫ Always stop (previous slide)
  Also called freezing or flushing the pipeline
▫ Assume no branch (= assume sequential)
  Must not change state before the branch instr. is complete
▫ Assume branch
  Only smart if the target address is ready early
▫ Delayed branch
  Execute a different instruction while the branch is evaluated
  Static techniques (fixed rule or compiler)
Example
• Assume branch conditionals are evaluated in the EX stage, and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• Assume branch not taken: how many bubbles for an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to improve the stall penalty?
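A back-of-envelope model for the questions above (the branch frequency and taken rate are illustrative, not from the slides):

```c
/* Effective CPI with base CPI 1: add the fraction of instructions that
   stall times the stall length in cycles */
double effective_cpi(double stall_fraction, double penalty_cycles) {
    return 1.0 + stall_fraction * penalty_cycles;
}
/* If the branch resolves in EX (stage 3), fetch restarts 2 cycles late.
   Always stalling with 20% branches:   effective_cpi(0.20, 2.0)       = 1.40
   Predict not-taken, 60% taken:        effective_cpi(0.20 * 0.60, 2.0) = 1.24 */
```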
Dynamic scheduling
• So far: static scheduling
▫ Instructions executed in program order
▫ Any reordering is done by the compiler
• Dynamic scheduling
▫ CPU reorders to get a more optimal order
  Fewer hazards, fewer stalls, ...
▫ Must preserve order of operations where reordering could change the result
▫ Covered by TDT 4255 Hardware design
Compiler techniques for ILP
• For a given pipeline and superscalarity
▫ How can these be best utilized?
▫ As few stalls from hazards as possible
• Dynamic scheduling
▫ Tomasulo's algorithm etc. (TDT4255)
▫ Makes the CPU much more complicated
• What can be done by the compiler?
▫ Has "ages" to spend, but less knowledge
▫ Static scheduling, but what else?
Example
Source code:

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Notice:
• Lots of dependencies
• No dependencies between iterations
• High loop overhead
→ Loop unrolling

MIPS:
Loop: L.D F0,0(R1)     ; F0 = x[i]
      ADD.D F4,F0,F2   ; F2 = s
      S.D F4,0(R1)     ; Store x[i] + s
      DADDUI R1,R1,#-8 ; x[i] is 8 bytes
      BNE R1,R2,Loop   ; R1 = R2?
Static scheduling

Loop: L.D F0,0(R1)
      (stall)
      ADD.D F4,F0,F2
      (stall)
      (stall)
      S.D F4,0(R1)
      DADDUI R1,R1,#-8
      (stall)
      BNE R1,R2,Loop

Rescheduled:

Loop: L.D F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D F4,F0,F2
      (stall)
      (stall)
      S.D F4,8(R1)
      BNE R1,R2,Loop

Result: from 9 cycles per iteration to 7 (delays from the table in Figure 2.2)
Loop unrolling

Original:

Loop: L.D F0,0(R1)
      ADD.D F4,F0,F2
      S.D F4,0(R1)
      DADDUI R1,R1,#-8
      BNE R1,R2,Loop

Unrolled 4 times:

Loop: L.D F0,0(R1)
      ADD.D F4,F0,F2
      S.D F4,0(R1)
      L.D F6,-8(R1)
      ADD.D F8,F6,F2
      S.D F8,-8(R1)
      L.D F10,-16(R1)
      ADD.D F12,F10,F2
      S.D F12,-16(R1)
      L.D F14,-24(R1)
      ADD.D F16,F14,F2
      S.D F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE R1,R2,Loop

• Reduced loop overhead
• Requires number of iterations divisible by n (here n = 4)
• Register renaming
• Offsets have changed
• Stalls not shown
Unrolled and scheduled:

Loop: L.D F0,0(R1)
      L.D F6,-8(R1)
      L.D F10,-16(R1)
      L.D F14,-24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D F4,0(R1)
      S.D F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D F12,-16(R1)
      S.D F16,-24(R1)
      BNE R1,R2,Loop

Unrolled only (for comparison):

Loop: L.D F0,0(R1)
      ADD.D F4,F0,F2
      S.D F4,0(R1)
      L.D F6,-8(R1)
      ADD.D F8,F6,F2
      S.D F8,-8(R1)
      L.D F10,-16(R1)
      ADD.D F12,F10,F2
      S.D F12,-16(R1)
      L.D F14,-24(R1)
      ADD.D F16,F14,F2
      S.D F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE R1,R2,Loop

Avoids stalls after: L.D (1), ADD.D (2), DADDUI (1)
Loop unrolling: summary
• Original code: 9 cycles per element
• Scheduling: 7 cycles per element
• Loop unrolling: 6.75 cycles per element
▫ Unrolled 4 iterations
• Combination: 3.5 cycles per element
▫ Avoids stalls entirely
The compiler reduced execution time by 61%
Loop unrolling in practice
• Do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
▫ 1st executes (n mod k) times and has a body that is the original loop
▫ 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
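The pair-of-loops transformation can be sketched in C for the running x[i] = x[i] + s example, with k = 4:

```c
#include <stddef.h>

/* x[i] += s for arbitrary n: a remainder loop runs (n mod k) times with
   the original body, then a loop unrolled k = 4 times runs n/k times. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    size_t i = 0;
    size_t rem = n % 4;
    for (; i < rem; i++)        /* 1st loop: original body */
        x[i] += s;
    for (; i < n; i += 4) {     /* 2nd loop: body unrolled 4 times */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
}
```

This is the same idea the compiler applies to the MIPS loop above, just written at the source level so the two loops are visible.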
TDT 4260 – Chap 2, Chap 3
Instruction Level Parallelism (cont)
Review
• Name real-world examples of pipelining
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?
Contents
• Very Large Instruction Word Chap 2.7
▫ IA-64 and EPIC
• Instruction fetching Chap 2.9
• Limits to ILP Chap 3.1/2
• Multi-threading Chap 3.5
Getting CPI below 1
• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
1. Statically-scheduled superscalar processors
• In-order execution
• Varying number of instructions issued (compiler)
2. Dynamically-scheduled superscalar processors
• Out-of-order execution
• Varying number of instructions issued (CPU)
3. VLIW (very long instruction word) processors
• In-order execution
• Fixed number of instructions issued
VLIW: Very Long Instruction Word (1/2)
• Each VLIW has explicit coding for multiple operations
▫ Several instructions combined into packets
▫ Possibly with parallelism indicated
• Tradeoff: instruction space for simple decoding
▫ Room for many operations
▫ Independent operations => execute in parallel
▫ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
VLIW: Very Long Instruction Word (2/2)
• Assume 2 load/store, 2 FP, 1 int/branch
▫ VLIW with 0-5 operations
▫ Why 0?
• Important to avoid empty instruction slots
▫ Loop unrolling
▫ Local scheduling
▫ Global scheduling
  Scheduling across branches
• Difficult to find all dependencies in advance
▫ Solution 1: block on memory accesses
▫ Solution 2: CPU detects some dependencies
Recall: unrolled loop that minimizes stalls for scalar

Loop: L.D F0,0(R1)
      L.D F6,-8(R1)
      L.D F10,-16(R1)
      L.D F14,-24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D F4,0(R1)
      S.D F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D F12,-16(R1)
      S.D F16,-24(R1)
      BNE R1,R2,Loop

Source code:

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Register mapping: s → F2, i → R1
Loop unrolling in VLIW
Slots: MEM1/MEM2 (memory refs), FP1/FP2 (FP operations), INT (int op/branch)

clock 1: MEM1: L.D F0,0(R1)    | MEM2: L.D F6,-8(R1)
clock 2: MEM1: L.D F10,-16(R1) | MEM2: L.D F14,-24(R1)
clock 3: MEM1: L.D F18,-32(R1) | MEM2: L.D F22,-40(R1) | FP1: ADD.D F4,F0,F2 | FP2: ADD.D F8,F6,F2
clock 4: MEM1: L.D F26,-48(R1) | FP1: ADD.D F12,F10,F2 | FP2: ADD.D F16,F14,F2
clock 5: FP1: ADD.D F20,F18,F2 | FP2: ADD.D F24,F22,F2
clock 6: MEM1: S.D 0(R1),F4    | MEM2: S.D -8(R1),F8   | FP1: ADD.D F28,F26,F2
clock 7: MEM1: S.D -16(R1),F12 | MEM2: S.D -24(R1),F16
clock 8: MEM1: S.D -32(R1),F20 | MEM2: S.D -40(R1),F24 | INT: DSUBUI R1,R1,#48
clock 9: MEM1: S.D -0(R1),F28  | INT: BNEZ R1,LOOP

• Unrolled 7 iterations to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8x)
• Average: 2.5 ops per clock, 50% efficiency
• Note: need more registers in VLIW (15 vs. 6 in SS)
Problems with 1st-generation VLIW
• Increase in code size
▫ Loop unrolling
▫ Partially empty VLIWs
• Operated in lock-step; no hazard detection HW
▫ A stall in any functional unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
▫ The compiler might predict functional units, but caches are hard to predict
▫ Modern VLIWs are "interlocked" (identify dependences between bundles and stall)
• Binary code compatibility
▫ Strict VLIW => different numbers of functional units and unit latencies require different versions of the code
VLIW tradeoffs
• Advantages
▫ "Simpler" hardware, because the HW does not have to identify independent instructions
• Disadvantages
▫ Relies on a smart compiler
▫ Code incompatibility between generations
▫ There are limits to what the compiler can do (can't move loads above branches, can't move loads above stores)
• Common uses
▫ Embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue
IA-64 and EPIC
• 64-bit instruction set architecture
▫ Not a CPU, but an architecture
▫ Itanium and Itanium 2 are CPUs based on IA-64
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
• Uses EPIC: Explicitly Parallel Instruction Computing
• Departure from the x86 architecture
• Meant to achieve out-of-order performance with in-order HW + compiler smarts
▫ Stop bits to help with code density
▫ Support for control speculation (moving loads above branches)
▫ Support for data speculation (moving loads above stores)
• Details in Appendix G.6
Instruction bundle (VLIW)
• Functional units:
▫ I (integer), M (integer + memory), F (FP), B (branch), L + X (64-bit operands + special inst.)
• Template field:
▫ Maps instructions to functional units
▫ Indicates stops: limitations to ILP
Code example (1/2)
Code example (2/2)

Control speculation
• Can the compiler schedule an independent load above a branch?

  Bne R1, R2, TARGET
  Ld  R3, R4(0)

• What are the problems?
• EPIC provides speculative loads:

  Ld.s  R3, R4(0)
  Bne   R1, R2, TARGET
  Check R4(0)
Data speculation
• Can the compiler schedule an independent load above a store?

  St R5, R6(0)
  Ld R3, R4(0)

• What are the problems?
• EPIC provides "advanced loads" and an ALAT (Advanced Load Address Table):

  Ld.a R3, R4(0)   ; creates an entry in the ALAT
  St   R5, R6(0)   ; looks up the ALAT; if match, jump to fixup code
EPIC conclusions
• The goal of EPIC was to maintain the advantages of VLIW, but achieve the performance of out-of-order.
• Results:
▫ Complicated bundling rules save some space, but make the hardware more complicated
▫ Special hardware and instructions for scheduling loads above stores and branches (new, complicated hardware)
▫ Special hardware to remove branch penalties (predication)
▫ The end result is a machine as complicated as an out-of-order, but now also requiring a super-sophisticated compiler.
Instruction fetching
• Want to issue >1 instruction every cycle
• This means fetching >1 instruction▫ E.g. 4-8 instructions fetched every cycle
• Several problems
▫ Bandwidth / latency
▫ Determining which instructions to fetch
  – Jumps
  – Branches
• Integrated instruction fetch unit
Branch Target Buffer (BTB)
• Predicts the next instruction address and sends it out before decoding the instruction
• PC of branch sent to BTB
• When match is found, Predicted PC is returned
• If branch predicted taken, instruction fetch continues at Predicted PC
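The lookup described above can be sketched as a small direct-mapped table. This is an illustrative model only: the table size, the PC-based index and the PC + 4 fall-through are assumptions, not details from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical direct-mapped BTB mapping a branch PC to its
 * predicted target. Size and indexing are illustrative. */
#define BTB_ENTRIES 256

typedef struct {
    uint32_t tag;    /* PC of the branch */
    uint32_t target; /* predicted next PC if taken */
    int valid;
} btb_entry;

static btb_entry btb[BTB_ENTRIES];

/* Record a taken branch so later fetches of this PC predict its target. */
void btb_update(uint32_t pc, uint32_t target)
{
    btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->tag = pc;
    e->target = target;
    e->valid = 1;
}

/* Predicted next PC: the stored target on a hit, else fall through. */
uint32_t btb_predict(uint32_t pc)
{
    btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    return (e->valid && e->tag == pc) ? e->target : pc + 4;
}
```

On a miss the fetch unit simply continues sequentially; a mispredicted entry is repaired by calling btb_update again once the branch resolves.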
Possible optimizations?
Return Address Predictor
• Small buffer of return addresses acts as a stack
• Caches most recent return addresses
• Call ⇒ Push a return address on stack
• Return ⇒ Pop an address off stack & predict as new PC
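The push/pop behaviour above can be sketched as a tiny circular stack; the depth and the wrap-on-overflow policy are assumptions for illustration (overflow losing the oldest entries is exactly what makes very small buffers mispredict).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative return-address stack (RAS). The depth and the
 * wrap-around overflow policy are assumptions, not from the slides. */
#define RAS_DEPTH 8

static uint32_t ras[RAS_DEPTH];
static unsigned ras_top; /* number of pushes minus pops */

/* Call: push the return address; on overflow the oldest entry is lost. */
void ras_push(uint32_t ret_addr)
{
    ras[ras_top % RAS_DEPTH] = ret_addr;
    ras_top++;
}

/* Return: pop an address and predict it as the next PC. */
uint32_t ras_pop(void)
{
    ras_top--;
    return ras[ras_top % RAS_DEPTH];
}
```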
[Figure: misprediction frequency (0–70%) vs. number of return address buffer entries (0, 1, 2, 4, 8, 16) for the benchmarks go, m88ksim, cc1, compress, xlisp, ijpeg, perl and vortex]
Integrated Instruction Fetch Units
• Recent designs have implemented the fetch stage as a separate, autonomous unit
▫ Multiple-issue in one simple pipeline stage is too complex
• An integrated fetch unit provides:
▫ Branch prediction
▫ Instruction prefetch
▫ Instruction memory access and buffering
Limits to ILP
• Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
• However, such advances, when coupled with realistic hardware, are unlikely to overcome these limits in the near future
• How much ILP is available using existing mechanisms with increasing HW budgets?
Chapter 3
Ideal HW Model
1. Register renaming – infinite virtual registers
⇒ all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal
1&4 eliminates all but RAW
5. Perfect caches; 1-cycle latency for all instructions; unlimited instructions issued per clock cycle
Upper Limit to ILP: Ideal Machine (Figure 3.1)
[Figure 3.1: instructions per clock on the ideal machine – gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1. Integer programs: 18–60 IPC; FP programs: 75–150 IPC]
Instruction window
• Ideal HW needs to know the entire program
• Obviously not practical
▫ Register dependency checking scales quadratically
• Window: The set of instructions examined for simultaneous execution
• How does the size of the window affect IPC?
▫ Too small a window ⇒ can't see whole loops
▫ Too large a window ⇒ hard to implement
More Realistic HW: Window Impact (Figure 3.2)
[Figure 3.2: instructions per clock vs. window size (infinite, 2048, 512, 128, 32) for gcc, espresso, li, fpppp, doduc and tomcatv. Integer programs: 8–63 IPC; FP programs: 9–150 IPC]
Thread Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP explicitly represented by the use of multiple threads of execution that are inherently parallel
• Use multiple instruction streams to improve:
1. Throughput of computers that run many programs
2. Execution time of a single application implemented as a multi-threaded program (parallel program)
Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
▫ Must duplicate the independent state of each thread, e.g., a separate copy of the register file, PC and page table
▫ Memory shared through virtual memory mechanisms
▫ HW support for fast thread switching; much faster than a full process switch (≈ 100s to 1000s of clocks)
• When to switch?
▫ Alternate instructions per thread (fine grain)
▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Fine-Grained Multithreading
• Switches between threads on each instruction
▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads
• CPU must be able to switch threads every clock
• Hides both short and long stalls
▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads▫ Thread ready to execute without stalls will be delayed by
instructions from other threads
• Used on Sun’s Niagara
Coarse-Grained Multithreading
• Switch threads only on costly stalls (e.g. L2 cache miss)
• Advantages
▫ No need for very fast thread-switching
▫ Doesn't slow down a thread, since switches happen only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
▫ A new thread must fill the pipeline before instructions can complete
• ⇒ Better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Can a high-ILP processor also exploit TLP?
▫ Functional units are often idle because of stalls or
dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• ⇒ Simultaneous Multi-threading (SMT)
▫ Intel: Hyper-Threading
Simultaneous Multi-threading
[Figure: functional-unit occupancy per cycle (cycles 1–9) across units M M FX FX FP FP BR CC, first for one thread on 8 units, then for two threads on 8 units. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]
Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
▫ Large set of virtual registers
  – Virtual = not all visible at ISA level
  – Register renaming
▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Multi-threaded categories
[Figure: issue-slot usage over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading; legend: Thread 1–5, idle slot]
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
▫ How to reduce the impact on single-thread performance?
▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in:
▫ Instruction issue - more candidate instructions need to be considered
▫ Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
TDT 4260 – lecture 4 – 2011
• Contents
– Computer architecture introduction
  • Trends
  • Moore's law
  • Amdahl's law
  • Gustafson's law
– Why multiprocessor? Chap 4.1
  • Taxonomy
  • Memory architecture
  • Communication
– Cache coherence Chap 4.2
  • The problem
  • Snooping protocols
Updated lecture plan per 4/2
Date and lecturer – Topic
1: 14 Jan (LN, AI) – Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) – Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) – ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) – Multiprocessors, Chapter 4
5: 11 Feb (MG) – Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN) – Multiprocessors continued
7: 24 Feb (IB) – Memory and cache, cache coherence (Chap. 5)
8: 4 Mar (IB) – Piranha CMP + Interconnection networks
9: 11 Mar (LN) – Multicore architectures (Wiley book chapter) + Hill & Marty on Amdahl and multicore, Fedorova on asymmetric multicore
10: 18 Mar (IB) – Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) – (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) – Wrap-up lecture, remaining stuff
13: 8 Apr – Slack – no lecture planned
Trends
• For technology, costs, use
• Helps predict the future
• Product development time – 2-3 years
– design for the next technology
– Why should an architecture live longer than a product?
Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system – hardware, runtime system, compiler, operating
system, and application
– In networking this is called the “End to End argument”
• Computer architecture is not just about transistors(not at all), individual instructions, or particular implementations– E.g., Original RISC projects replaced complex
instructions with a compiler + simple instructions
Computer Architecture is Design and Analysis
Design
Analysis
Architecture is an iterative process:
• Searching the huge space of possible designs
• At all levels of computer systems
[Diagram: creativity feeds design; cost/performance analysis filters good ideas from mediocre and bad ideas]
TDT4260 Course Focus
Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century.
[Diagram: computer architecture (organization, hardware/software boundary, interface design (ISA)) at the center, surrounded by technology, programming languages, parallelism, operating systems, history, applications, compilers, and measurement & evaluation]
Holistic approach
e.g., to programmability combined with performance
• NTNU principle: teaching based on research; example: PhD project of Alexandru Iordan: TBP (Wool, TBB) – energy-aware task pool implementation
[Diagram: layered stack – multicore, interconnect, memory; operating system & system software; parallel & concurrent programming. Related work: multicore memory systems (Dybdahl PhD, Grannæs PhD, Jahre PhD, M5-sim, pfJudge)]
Moore’s Law: 2X transistors / “year”
• “Cramming More Components onto Integrated Circuits”
– Gordon Moore, Electronics, 1965
• # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
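The doubling rule can be turned into a toy growth model; the function below counts whole doubling periods only, and the starting count is an arbitrary parameter, not a figure from the slides.

```c
#include <assert.h>

/* Transistor count after `months` months if it doubles every n
 * months (whole doubling periods only; parameters are illustrative). */
double moore_transistors(double start, int months, int n)
{
    double t = start;
    int m;
    for (m = n; m <= months; m += n) /* one doubling per n months */
        t *= 2.0;
    return t;
}
```

With N = 24 this gives a 32x increase per decade (five doublings in 120 months).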
Tracking Technology Performance Trends
• 4 critical implementation technologies:
– Disks
– Memory
– Network
– Processors
• Compare bandwidth vs. latency improvements in performance over time
• Bandwidth: number of events per unit time
– E.g., Mbits/second over a network, Mbytes/second from disk
• Latency: elapsed time for a single event
– E.g., one-way network delay in microseconds, average disk access time in milliseconds
Latency Lags Bandwidth (last ~20 years)
[Figure: relative bandwidth improvement (log scale, 100–10,000) vs. relative latency improvement (1–100) for processor, memory, network and disk, with the reference line latency improvement = bandwidth improvement. CPU high, memory low ("Memory Wall")]
• Performance milestones
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
• Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10,000 Mb/s (16x, 1000x)
• Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10,000, 15,000 RPM (8x, 143x)
(Processor latency = typical # of pipeline stages × time per clock cycle)
COST and COTS
• Cost
– to produce one unit
– includes (development cost / # units sold)
– benefit of large volume
• COTS
– commodity off-the-shelf
  • much better performance/price per component
  • strong influence on the selection of components for building supercomputers for more than 20 years
Speedup
• General definition:
Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
– Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
• Superlinear speedup?
Amdahl’s Law (1967) (fixed problem size)
• “If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s”
• Total work in computation
– serial fraction s
– parallel fraction p
– s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
      = (s + p) / [s + (p/n)]
      = 1 / [s + (1 - s)/n]
      = n / [1 + (n - 1)s]
• ”pessimistic and famous”
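The closed form above is easy to evaluate numerically; a minimal sketch:

```c
#include <assert.h>

/* Amdahl's law: speedup on n processors when a fraction s of the
 * computation is inherently serial.  S(n) = 1 / (s + (1 - s)/n). */
double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}
```

Even a 5% serial fraction caps the speedup at 20 no matter how many processors are used, which is why the law is called pessimistic.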
Gustafson’s “law” (1987) (scaled problem size, fixed execution time)
• Total execution time on a parallel computer with n processors is fixed
– serial fraction s’
– parallel fraction p’
– s’ + p’ = 1 (100%)
• S’(n) = Time’(1) / Time’(n)
       = (s’ + p’n) / (s’ + p’)
       = s’ + p’n
       = s’ + (1 - s’)n
       = n + (1 - n)s’
• Reevaluating Amdahl’s Law, John L. Gustafson, CACM, May 1988, pp. 532–533. ”Not a new law, but Amdahl’s law with changed assumptions”
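The scaled-speedup formula can be evaluated the same way; a minimal sketch:

```c
#include <assert.h>

/* Gustafson's law: scaled speedup when the parallel execution time
 * is fixed and s is the serial fraction of that time.
 * S'(n) = s + (1 - s) * n. */
double gustafson_speedup(double s, int n)
{
    return s + (1.0 - s) * n;
}
```

Unlike Amdahl's fixed-size speedup, this grows almost linearly in n: with s = 0.05 and 100 processors the scaled speedup is about 95, not 20.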
How the serial fraction limits speedup
• Amdahl’s law
[Figure: speedup vs. number of processors for several values of the serial fraction]
• Work hard to reduce the serial part of the application
– remember I/O
– think different(ly) (than traditionally or sequentially)
Single/ILP vs. Multi/TLP
• Uniprocessor trends
– Getting too complex
– Speed of light
– Diminishing returns from ILP
• Multiprocessor
– Focus in the textbook: 4-32 CPUs
– Increased performance through parallelism
– Multichip
– Multicore ((Single) Chip Multiprocessors – CMP)
– Cost effective
• Right balance of ILP and TLP is unclear today– Desktop vs. server?
Other Factors – Multiprocessors
• Growth in data-intensive applications
– Databases, file servers, multimedia, …
• Growing interest in servers and server performance
• Increasing desktop performance less important
– Outside of graphics
• Improved understanding of how to use multiprocessors effectively
– Especially in servers, where there is significant natural TLP
• Advantage of leveraging design investment by replication
– Rather than unique design
• Power/cooling issues ⇒ multicore
Multiprocessor – Taxonomy
• Flynn’s taxonomy (1966, 1972)
– Taxonomy = classification
– Widely used, but perhaps a bit coarse
• Single Instruction Single Data (SISD)
– Common uniprocessor
• Single Instruction Multiple Data (SIMD)
– “ = Data Level Parallelism (DLP)”
• Multiple Instruction Single Data (MISD)– Not implemented?– Pipeline / Stream processing / GPU ?
• Multiple Instruction Multiple Data (MIMD)– Used today– “ = Thread Level Parallelism (TLP)”
Flynn’s taxonomy (1/2) – Single/Multiple Instruction/Data Stream
[Diagram: SISD uniprocessor; SIMD with distributed memory; MIMD with shared memory]
Flynn’s taxonomy (2/2), MISD – Single/Multiple Instruction/Data Stream
[Diagram: MISD (software pipeline)]
Advantages of MIMD
• Flexibility
– High single-user performance, multiple programs, multiple threads
– High multiple-user performance
– Combination
• Built using commercial off-the-shelf (COTS) components
– 2 x uniprocessor = multi-CPU
– 2 x uniprocessor core on a single chip = multicore
MIMD: Memory architecture
[Diagram: centralized memory – processors P1…Pn, each with a cache ($), connected through an interconnection network (IN) to shared memory modules; distributed memory – each processor has its own cache and local memory, and the nodes are connected by an interconnection network]
Centralized Memory Multiprocessor
• Also called:
• Symmetric Multiprocessors (SMPs)
• Uniform Memory Access (UMA) architecture
• Shared memory becomes a bottleneck
• Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and by using many memory banks
• Scaling beyond that is hard
Distributed (Shared) Memory Multiprocessor
• Pro: Cost-effective way to scale memory bandwidth
– If most accesses are to local memory
• Pro: Reduces latency of local memory accesses
• Con: Communication becomes more complex
• Pro/Con: Possible to change software to take advantage of memory that is close, but this can also make SW less portable
– Non-Uniform Memory Access (NUMA)
MP (MIMD), cluster of SMPs
[Diagram: two SMP nodes, each with several processors with caches on a node interconnect network sharing memory and I/O, joined by a cluster interconnection network]
• Combination of centralized and distributed
• Like an early version of the kongull cluster
Distributed memory
1. Shared address space
• Logically shared, physically distributed
• Distributed Shared Memory (DSM)
• NUMA architecture
2. Separate address spaces
• Every P-M module is a separate computer
• Multicomputer
• Clusters
• Not a focus in this course
[Diagrams: conceptual model vs. implementation – processor-memory (P-M) pairs connected through a network]
Communication models
• Shared memory
– Centralized or Distributed Shared Memory
– Communication using LOAD/STORE
– Coordinated using traditional OS methods
  • Semaphores, monitors, etc.
– Busy-waiting more acceptable than on a uniprocessor
• Message passing
– Using send (put) and receive (get)
  • Asynchronous / synchronous
– Libraries, standards
  • …, PVM, MPI, …
Limits to parallelism
• We need separate processes and threads!
– Can’t split one thread among CPUs/cores
• Parallel algorithms needed
– A separate field
– Some problems are inherently serial
  • P-complete problems
  – Part of parallel complexity theory
• See minicourse TDT6 – Heterogeneous and green computing
• http://www.idi.ntnu.no/emner/tdt4260/tdt6
• Amdahl’s law
– The serial fraction of the code limits speedup
– Example: a speedup of 80 with 100 processors requires that at most 0.25% of the time is spent on serial code
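The 0.25% figure in the example can be checked by inverting Amdahl's law: from S = n / (1 + (n-1)s), the largest admissible serial fraction is s = (n/S - 1)/(n - 1). A minimal sketch:

```c
#include <assert.h>

/* Largest serial fraction s that still allows speedup S on n
 * processors, from Amdahl's law S = n / (1 + (n - 1)s). */
double max_serial_fraction(double speedup, double n)
{
    return (n / speedup - 1.0) / (n - 1.0);
}
```

For S = 80 and n = 100 this gives 0.25/99 ≈ 0.0025, i.e. about 0.25% serial time, matching the slide.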
SMP: Cache Coherence Problem
[Diagram: three processors P1–P3 with private caches on a bus with memory and I/O devices. Events: (1) P1 reads u = 5, (2) P3 reads u = 5, (3) P3 writes u = 7, (4) P1 reads u, (5) P2 reads u]
• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
– the correct value (if write-through caches)
– the old value (if write-back caches)
• Unacceptable to programs, and frequent!
Enforcing coherence
• Separate caches make multiple copies frequent
– Migration
  • Moved from shared memory to local cache
  • Speeds up access, reduces memory bandwidth requirements
– Replication
  • Several local copies when an item is read by several processors
  • Speeds up access, reduces memory contention
• Need coherence protocols to track shared data
– Directory based
  • Status in a shared location (Chap. 4.4)
– (Bus) snooping
  • Each cache maintains local status
  • All caches monitor a broadcast medium
  • Write invalidate / write update
Snooping: Write invalidate
• Several reads or one write: no change
• Writes require exclusive access
• Writes to shared data: all other cache copies are invalidated
– Invalidate command and address are broadcast
– All caches listen (snoop) and invalidate if necessary
• Read miss:
– Write-through: memory is always up to date
– Write-back: caches listen and any exclusive copy is put on the bus
Snooping: Write update
• Also called write broadcast
• Must know which cache blocks are shared
• Usually write-through
– Write to shared data: broadcast; all caches listen and update their copy (if any)
– Read miss: main memory is up to date
Snooping: Invalidate vs. Update
• Repeated writes to the same address (no reads) require several updates, but only one invalidate
• Invalidates are done at cache-block level, while updates are done on individual words
• The delay from when a word is written until it can be read is shorter for updates
• Invalidate is most common
– Less bus traffic
– Less memory traffic
– Bus and memory bandwidth are the typical bottlenecks
An Example Snoopy Protocol
• Invalidation protocol, write-back cache
• Each cache block is in one state:
– Shared: clean in all caches and up to date in memory; the block can be read
– Exclusive: one cache has the only copy; it is writeable and dirty
– Invalid: the block contains no data
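The three states can be sketched as a tiny transition function. This is a deliberate simplification for illustration: bus signalling, write-backs and local read misses are omitted.

```c
#include <assert.h>

/* Block states from the example protocol. */
typedef enum { INVALID, SHARED, EXCLUSIVE } block_state;

/* Local write: the cache broadcasts an invalidate and gains the
 * only (dirty) copy, whatever state the block was in. */
block_state on_local_write(block_state s)
{
    (void)s;
    return EXCLUSIVE;
}

/* Snooped invalidate caused by another cache's write. */
block_state on_remote_write(block_state s)
{
    (void)s;
    return INVALID;
}

/* Snooped read miss: an Exclusive (dirty) copy supplies the block
 * on the bus and drops to Shared; other states are unchanged. */
block_state on_remote_read(block_state s)
{
    return (s == EXCLUSIVE) ? SHARED : s;
}
```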
Snooping: Invalidation protocol (1/6)
[Diagram: processors 0…N-1 above an interconnection network with main memory and an I/O system. One processor issues read x ⇒ read miss on the bus]
Snooping: Invalidation protocol (2/6)
[The block x is loaded from main memory into that cache in state shared]
Snooping: Invalidation protocol (3/6)
[A second processor issues read x ⇒ read miss on the bus]
Snooping: Invalidation protocol (4/6)
[Both caches now hold x in state shared]
Snooping: Invalidation protocol (5/6)
[One of the processors writes x ⇒ an invalidate is broadcast; the other cache's copy is invalidated]
Snooping: Invalidation protocol (6/6)
[The writing cache holds the new value of x in state exclusive]
Prefetching
Marius Grannæs
Feb 11th, 2011
www.ntnu.no M. Grannæs, Prefetching
About Me
• PhD from NTNU in Computer Architecture in 2010
• “Reducing Memory Latency by Improving Resource Utilization”
• Supervised by Lasse Natvig
• Now working for Energy Micro
• Working on energy profiling, caching and prefetching
• Software development
About Energy Micro
• Fabless semiconductor company
• Founded in 2007 by ex-Chipcon founders
• 50 employees
• Offices around the world
• Designing the world’s most energy-friendly microcontrollers
• Today: EFM32 Gecko
• Next Friday: EFM32 Tiny Gecko (cache)
• May(ish): EFM32 Giant Gecko (cache + prefetching)
• Ambition: 1% market share...
• ...of a $30 bn market.
What is Prefetching?

Prefetching
Prefetching is a technique for predicting future memory accesses and fetching the data into the cache before it is needed.
The Memory Wall
[Figure: relative performance (log scale, 1–100,000) vs. year (1980–2010) for CPU performance and memory performance; the gap grows steadily]
W. Wulf and S. McKee, “Hitting the Memory Wall: Implications of the Obvious”
A Useful Analogy
• An Intel Core i7 can execute 147,600 million instructions per second.
• ⇒ A carpenter can hammer one nail per second.
• DDR3-1600 RAM can perform 65 million transfers per second.
• ⇒ The carpenter must wait 38 minutes per nail.
Solution

Solution outline:
1 You bring an entire box of nails.
2 Keep the box close to the carpenter.
Analysis: Carpenting
How long (on average) does it take to get one nail?
Nail latency
L_Nail = L_Box + p_Box is empty · (L_Shop + L_Traffic)

L_Nail — time to get one nail.
L_Box — time to check and fetch one nail from the box.
p_Box is empty — probability that the box you have is empty.
L_Shop — time to go to the shop (38 minutes).
L_Traffic — time lost due to traffic.
Solution (for computers)
• Faster, but smaller memory closer to the processor.
• Temporal locality
  • If you needed X in the past, you are probably going to need X in the near future.
• Spatial locality
  • If you need X, you probably need X + 1.
⇒ If you need X, put it in the cache, along with everything else close to it (cache line).
Analysis: Caches

System latency
L_System = L_Cache + p_Miss · (L_Main Memory + L_Congestion)

L_System — total system latency.
L_Cache — latency of the cache.
p_Miss — probability of a cache miss.
L_Main Memory — main memory latency.
L_Congestion — latency due to main memory congestion.
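The latency model above can be evaluated directly; the numbers in the sketch below are illustrative, not measured.

```c
#include <assert.h>

/* L_system = L_cache + p_miss * (L_main_memory + L_congestion),
 * all latencies in the same unit (e.g. cycles). */
double system_latency(double l_cache, double p_miss,
                      double l_mem, double l_congestion)
{
    return l_cache + p_miss * (l_mem + l_congestion);
}
```

It makes explicit why both a lower miss probability and less congestion help: each one scales the expensive main-memory term.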
DRAM in perspective
• “Incredibly slow” DRAM has a response time of 15.37 ns.
• Speed of light is 3 · 10^8 m/s.
• Physical distance from processor to DRAM chips is typically 20 cm.

2 · 20 · 10^−3 m / (3 · 10^8 m/s) = 0.13 ns    (1)

• Just 2 orders of magnitude!
• Intel Core i7 – 147,600 million instructions per second.
• Ultimate laptop – 5 · 10^50 operations per second/kg.

Lloyd, Seth, “Ultimate physical limits to computation”
When does caching not work?
The four Cs:
• Cold/Compulsory
  • The data has not been referenced before.
• Capacity
  • The data has been referenced before, but has been thrown out because of the limited size of the cache.
• Conflict
  • The data has been thrown out of a set-associative cache because it would not fit in the set.
• Coherence
  • Another processor (in a multi-processor/core environment) has invalidated the cache line.
We can buy our way out of Capacity and Conflict misses, but not Cold or Coherence misses!
Cache Sizes
[Figure: cache size in kB (log scale, 1–10,000) vs. year (1985–2010) for 80486DX, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, Pentium 4E, Core 2 and Core i7]
Core i7 (Lynnfield) – 2009
[Figure: processor die photo]

Pentium M – 2003
[Figure: processor die photo]
Prefetching
Prefetching increases the performance of caches by predicting what data is needed and fetching that data into the cache before it is referenced. Need to know:
• What to prefetch?
• When to prefetch?
• Where to put the data?
• How do we prefetch? (Mechanism)
Prefetching Terminology

Good Prefetch
A prefetch is classified as Good if the prefetched block is referenced by the application before it is replaced.

Bad Prefetch
A prefetch is classified as Bad if the prefetched block is not referenced by the application before it is replaced.
Accuracy

The accuracy of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

Accuracy = G / (G + B)
Coverage

If a conventional cache has M misses without using any prefetch algorithm, the coverage of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

Coverage = G / M
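Both metrics are simple ratios; a minimal sketch using the counts G, B and M defined above:

```c
#include <assert.h>

/* Accuracy = G / (G + B): fraction of prefetches that were useful.
 * Coverage = G / M: fraction of the original misses removed. */
double prefetch_accuracy(int good, int bad)
{
    return (double)good / (good + bad);
}

double prefetch_coverage(int good, int misses)
{
    return (double)good / misses;
}
```

The two pull in opposite directions: prefetching more aggressively tends to raise coverage but lower accuracy.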
Prefetching

System Latency
L_system = L_cache + p_miss · (L_main memory + L_congestion)

• If a prefetch is good:
  • p_miss is lowered
  • ⇒ L_system decreases
• If a prefetch is bad:
  • p_miss becomes higher because useful data might be replaced
  • L_congestion becomes higher because of useless traffic
  • ⇒ L_system increases
21
Prefetching TechniquesTypes of prefetching:• Software
• Special instructions.• Most modern high performance processors have them.• Very flexible.• Can be good at pointer chasing.• Requires compiler or programmer effort.• Processor executes prefetches instead of computation.• Static (performed at compile-time).
• Hardware• Hybrid
www.ntnu.no M. Grannæs, Prefetching
21
Prefetching Techniques
Types of prefetching:• Software• Hardware
• Dedicated hardware analyzes memory references.• Most modern high performance processors have them.• Fixed functionality.• Requires no effort by the programmer or compiler.• Off-loads prefetching to hardware.• Dynamic (performed at run-time)
• Hybrid
www.ntnu.no M. Grannæs, Prefetching
21
Prefetching Techniques
Types of prefetching:• Software• Hardware• Hybrid
• Dedicated hardware unit.• Hardware unit programmed by software.• Some effort required by the programmer or compiler.
Software Prefetching

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

       MOV   r1, 0           ; acc
       MOV   r0, #0          ; i
Label: LOAD  r2, r0(#data)   ; cache miss! (400 cycles!)
       ADD   r1, r2          ; acc += data[i]
       INC   r0              ; i++
       CMP   r0, #10000      ; i < 10000
       BL    Label           ; branch if less
Software Prefetching II

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

Simple optimization using __builtin_prefetch():

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Why add 10 (and not 1)?
Prefetch distance – memory latency >> computation latency.
Software Prefetching III

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Note:
• data[0] → data[9] will not be prefetched.
• data[10000] → data[10009] will be prefetched, but not used.

Accuracy = G / (G + B) = 9990 / 10000 = 0.999 = 99.9%
Coverage = G / M = 9990 / 10000 = 0.999 = 99.9%
Complex Software

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    if (someFunction(i) == True) {
        acc += data[i];
    }
}

Does prefetching pay off in this case?
• How many times is someFunction(i) true?
• How much memory bus access is performed in someFunction(i)?
• Does power matter?

We have to profile the program to know!
Dynamic Data Structures I

typedef struct node {
    int data;
    struct node *next;
} node_t;

while ((node = node->next) != NULL) {
    acc += node->data;
}
Dynamic Data Structures II

typedef struct node {
    int data;
    struct node *next;
    struct node *jump;
} node_t;

while ((node = node->next) != NULL) {
    __builtin_prefetch(node->jump);
    acc += node->data;
}
Hardware Prefetching
Software prefetching:
• Needs programmer effort to implement
• Prefetch instructions displace computation
• Compile-time
• Very flexible

Hardware prefetching:
• No programmer effort
• Does not displace compute instructions
• Run-time
• Not flexible
Sequential Prefetching

The simplest prefetcher, but surprisingly effective due to spatial locality.

Miss on address X ⇒ fetch X+n, X+n+1, ..., X+n+j

n: prefetch distance
j: prefetch degree
Collectively known as prefetch aggressiveness.
Sequential Prefetching II

[Bar chart: speedup (1×–5×) of sequential prefetching on the SPEC benchmarks libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3]
Reference Prediction Tables
Tien-Fu Chen and Jean-Loup Baer (1995)
• Builds upon sequential prefetching and stride-directed prefetching.
• Observation: non-unit strides occur in many applications
  • 2, 4, 6, 8, 10 (stride 2)
• Observation: each load instruction has a distinct access pattern

Reference Prediction Tables (RPT):
• Table indexed by the load instruction
• Simple state machine
• Stores a single delta of history.
Reference Prediction Tables

[Worked example; each RPT entry holds PC, last address, delta and state]
• Miss at address 1 (PC 100): entry allocated — last addr. 1, no delta, state Initial.
• Miss at address 3: delta 2 recorded — last addr. 3, delta 2, state Training.
• Miss at address 5: delta 2 confirmed — state Prefetch; prefetch address 5 + 2 = 7 is issued.
[Bar chart: speedup (1×–5×) on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3 — Sequential vs. RPT]
Global History Buffer

K. Nesbit, A. Dhodapkar and J. Smith (2004)
• Observation: predicting more complex patterns requires more history
• Observation: a lot of the history in the RPT is very old

Program Counter/Delta Correlation (PC/DC):
• Store all misses in a FIFO called the Global History Buffer (GHB)
• Linked list of all misses from one load instruction
• Traversing the linked list gives a history for that load
Global History Buffer

[Worked example: the index table entry for PC 100 points at the newest GHB entry; the GHB holds the miss addresses 1, 3 and 5, each linked to the previous miss from the same PC. Traversing the links yields the delta buffer (2, 2).]
Delta Correlation

• In the previous example, the delta buffer only contained two values (2, 2).
• Thus it is easy to guess that the next delta is also 2.
• We can then prefetch: current address + delta = 5 + 2 = 7

What if the pattern is repeating, but not regular?
1, 2, 3, 4, 5, 1, 2, 3, 4, 5
Delta Correlation

[Worked example on the miss stream 10, 11, 13, 16, 17, 19, 22 (deltas 1, 2, 3, 1, 2, 3): the most recent delta pair (2, 3) is matched against its earlier occurrence in the history; the deltas that followed it are replayed from the last address 22, predicting prefetches for 23 and 25.]
PC/DC

[Bar chart: speedup (1×–5×) on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3 — Sequential vs. RPT vs. PC/DC]
Data Prefetching Championships

• Organized by JILP
• Held in conjunction with HPCA’09
• Modeled on the earlier branch prediction championships
• Everyone uses the same API (six function calls)
• Same set of benchmarks
• Third party evaluates performance
• 20+ prefetchers submitted

http://www.jilp.org/dpc/
Delta Correlating Prediction Tables
• Our submission to DPC-1
• Observation: GHB pointer chasing is expensive.
• Observation: history doesn’t really get old.
• Observation: history would reach a steady state.
• Observation: deltas are typically small, while the address space is large.
• Table indexed by the PC of the load
• Each entry holds the history of the load in the form of deltas.
• Delta correlation

Entry format: PC | Last Addr. | Last Pref. | delta history (d1 ... dn) | Ptr
Delta Correlating Prefetch Tables

[Worked example for the entry with PC 100 on the miss stream 10, 11, 13, 16, 17, 19, 22: each miss appends a new delta to the entry’s circular delta history (1, 2, 3, 1, 2, 3) and updates the last address; delta correlation on this history then drives the prefetches.]
Delta Correlating Prefetch Tables

[Bar chart: speedup (1×–5×) on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3 — Sequential vs. RPT vs. PC/DC vs. DCPT]
DPC-1 Results

1. Access Map Pattern Matching
2. Global History Buffer - Local Delta Buffer
3. Prefetching based on a Differential Finite Context Machine
4. Delta Correlating Prediction Tables

What did the winning entries do differently?
• AMPM: massive reordering to expose more patterns.
• GHB-LDB and PDFCM: prefetch into the L1.
Access Map Pattern Matching
• Winning entry by Ishii et al.
• Divides memory into hot zones
• Each zone is tracked using a 2-bit vector
• Examines each zone for constant strides
• Ignores temporal information

Lesson
Modern processors and compilers can reorder loads, so the temporal information in the miss stream may be misleading.
Global History Buffer - Local Delta Buffer

• Second place by Dimitrov et al.
• Somewhat similar to DCPT
• Improves PC/DC prefetching by including global correlation and the most common stride
• Prefetches directly into the L1

Lesson
Prefetching into the L1 gives an extra performance boost; so does using the most common stride.
Prefetching based on a Differential Finite Context Machine
• Third place by Ramos et al.
• Table with the most recent history for each load.
• A hash of the history is computed and used to look up the predicted stride in a second table
• Repeat the process to increase prefetching degree/distance
• Separate prefetcher for the L1

Lesson
Use feedback to adjust prefetching degree and distance; prefetch into the L1.
Improving DCPT

Partial Matching: technique for handling reordering, common strides, etc.

L1 Hoisting: technique for handling L1 prefetching.
Partial Matching

• AMPM ignores all temporal information
• Reordering the delta history is very expensive:
  reordering 5 accesses gives 5! = 120 possibilities
• Solution: reduce spatial resolution by ignoring the low bits

Example delta stream:
8, 9, 10, 8, 10, 9 ⇒ (ignore lower 2 bits) 8, 8, 8, 8, 8, 8
L1 Hoisting

• All three top entries had mechanisms for prefetching into the L1
• Problem: pollution
• Solution: use the same highly accurate mechanism to prefetch into the L1.
• In the steady state, only the last predicted delta will be used.
• All other deltas have been prefetched and are either in the L2 or on their way.
• Hoist the first delta from the L2 to the L1 to increase performance.
L1 Hoisting II

Example delta stream:
2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3

Steady state:
Prefetch the last delta into the L2.
Hoist the first delta into the L1.
DCPT-P

[Bar chart: speedup (0–7×) on milc, GemsFDTD, libquantum, leslie3d, lbm and sphinx3 — DCPT-P vs. AMPM, GHB-LDB, PDFCM, RPT and PC/DC]
Interaction with the memory controller

• So far we’ve talked about what to prefetch (the address)
• When and how is equally important
• Modern DRAM is complex
• Modern DRAM controllers are even more complex
• Bandwidth limited
Modern DRAM

• Can have multiple independent memory controllers
• Can have multiple channels per controller
• Typically multiple banks
• Each bank contains several pages (rows) of data (typically 1 KB–8 KB)
• Each accessed page is put in a single page buffer
• Access time to the page buffer is much lower than a full access
The 3D structure of modern DRAM

[Figure series: controllers, channels, banks and pages (rows) of a modern DRAM device]
Example

Suppose a processor requires data at locations X1 and X2, located on the same DRAM page, at times T1 and T2.
There are two separate outcomes:
Case 1:

The requests occur at roughly the same time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Read 2 (T2) enters the memory controller
4. Data X1 is returned from DRAM
5. Data X2 is returned from DRAM
6. The page is closed

Although there are two separate reads, the page is only opened once.
Case 2:

The requests are separated in time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Data X1 is returned from DRAM
4. The page is closed
5. Read 2 (T2) enters the memory controller
6. The page is opened again
7. Data X2 is returned from DRAM
8. The page is closed

The page is opened and closed twice. By prefetching X2 we can increase performance by reducing latency and increasing memory throughput.
When does prefetching pay off?
The break-even point:

Prefetching Accuracy · Cost of Prefetching = Cost of Single Read

What is the cost of prefetching?
• Application dependent
• Less than the cost of a single read, because a prefetch can:
  • utilize open pages
  • reduce latency
  • increase throughput
  • exploit multiple banks
Performance vs. Accuracy

[Scatter plot: prefetch accuracy (0–100%) against IPC improvement (−40% to +60%) for sequential, scheduled region, CZone/Delta correlation and reference prediction tables prefetching, with a threshold line separating harmful from helpful prefetching]
Q&A
Thank you for listening!
TDT 4260 – lecture 17/2
• Contents
  – Cache coherence, Chap 4.2
    • Repetition
    • Snooping protocols
  • SMP performance, Chap 4.3
    – Cache performance
  • Directory-based cache coherence, Chap 4.4
  • Synchronization, Chap 4.5
  • UltraSPARC T1 (Niagara), Chap 4.8
1 Lasse Natvig
Updated lecture plan pr. 17/2
Date and lecturer | Topic
1: 14 Jan (LN, AI) | Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) | Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) | ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) | Multiprocessors, Chapter 4
5: 11 Feb (MG) | Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN, MJ) | Multiprocessors continued // Writing a comp.arch. paper (relevant for miniproject, by MJ)
7: 24 Feb (IB) | Memory and cache, cache coherence (Chap. 5)
8: 3 Mar (IB) | Piranha CMP + Interconnection networks
9: 11 Mar (LN) | Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
10: 18 Mar (IB) | Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) | (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) | Wrap-up lecture, remaining stuff
13: 8 Apr | Slack – no lecture planned
Miniproject groups, updates?

Rank | Prefetcher | Group | Score
1 | rpt64k4_pf | Farfetched | 1.089
2 | rpt_prefetcher_rpt_seq | L2Detour | 1.072
3 | teeest | Group 6 | 1.000
IDI Open, a challenge for you?
• http://events.idi.ntnu.no/open11/
• 2 April, programming contest, informal, fun, pizza, coke (?), party (?), 100–150 people, mostly students, low threshold
• Teams: 3 persons, one PC, Java, C/C++ ?
• Problems: some simple, some tricky
• Our team ”DM-gruppas beskjedne venner” is challenging you students!
  – And we will challenge some of all the ICT companies in Trondheim
SMP: Cache Coherence Problem

[Figure: three processors P1, P2, P3 with private caches on a shared bus to memory and I/O devices; memory initially holds u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again, (5) P2 reads u.]

• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
  – correct value (if write-through caches)
  – old value (if write-back caches)
• Unacceptable to programs, and frequent!
Enforcing coherence (recap)
• Separate caches speed up access
  – Migration
    • Moved from shared memory to local cache
  – Replication
    • Several local copies when an item is read by several processors
• Need coherence protocols to track shared data
  – (Bus) snooping
    • Each cache maintains local status
    • All caches monitor a broadcast medium
    • Write invalidate / write update
State Machine (1/3)

State machine for CPU requests, for each cache block:

[Diagram: states Invalid, Shared (read/only), Exclusive (read/write)]
• Invalid → Shared: CPU read miss — place read miss on bus
• Invalid → Exclusive: CPU write — place write miss on bus
• Shared: CPU read hit stays; CPU read miss — place read miss on bus
• Shared → Exclusive: CPU write — miss ⇒ write miss on bus; hit ⇒ invalidate on bus
• Exclusive: CPU read hit and CPU write hit stay
• Exclusive → Shared: CPU read miss — write back block, place read miss on bus
• Exclusive → Exclusive: CPU write miss — write back cache block, place write miss on bus
State Machine (2/3)

State machine for bus requests, for each cache block:

[Diagram: states Invalid, Shared (read/only), Exclusive (read/write)]
• Shared → Invalid: write miss / invalidate for this block
• Exclusive → Invalid: write miss for this block — write back the block (abort memory access)
• Exclusive → Shared: read miss for this block — write back the block (abort memory access)
State Machine (3/3)

Combined state machine for CPU requests and bus requests, for each cache block:

[Diagram: the CPU-request transitions of (1/3) and the bus-request transitions of (2/3) overlaid on the same three states Invalid, Shared (read/only) and Exclusive (read/write)]
Directory based cache coherence (1/2)

• Large MP systems, lots of CPUs
• Distributed memory preferable
  – Increases memory bandwidth
• Snooping bus with broadcast?
  – A single bus becomes a bottleneck
  – Other ways of communicating needed
• With these, broadcasting is hard/expensive
  – Can avoid broadcast if we know exactly which caches have a copy ⇒ Directory
Directory based cache coherence (2/2)
• Directory knows which blocks are in which cache and their state
• Directory can be partitioned and distributed
• Typical states:
  – Shared
  – Uncached
  – Modified
• Protocol based on messages
• Invalidate and update sent only where needed
  – Avoids broadcast, reduces traffic (Fig 4.19)
SMP performance (shared memory)

• Focus on cache performance
• 3 types of cache misses in a uniprocessor (the 3 C’s)
  – Capacity (too small for working set)
  – Compulsory (cold-start)
  – Conflict (placement strategy)
• Multiprocessors also give coherence misses
  – True sharing
    • Misses because of sharing of data
  – False sharing
    • Misses because of invalidates that would not have happened with cache block size = one word
Example: L3 cache size (fig 4.11)
• AlphaServer 4100
  – 4 x Alpha @ 300 MHz
  – L1: 8 KB I + 8 KB D
  – L2: 96 KB
  – L3: off-chip, 2 MB

[Bar chart: normalized execution time for L3 cache sizes 1, 2, 4 and 8 MB, broken into instruction execution, L2/L3 cache access, memory access, PAL code and idle time]
Example: L3 cache size (fig 4.12)

[Bar chart: memory cycles per instruction (0–3.25) for cache sizes 1, 2, 4 and 8 MB, broken into instruction, capacity/conflict, cold, false sharing and true sharing components]
Example: Increasing parallelism (fig 4.13)

[Bar chart: memory cycles per instruction (0–3) for processor counts 1, 2, 4, 6 and 8, broken into instruction, conflict/capacity, cold, false sharing and true sharing components]
Example: Increased block size (fig 4.14)

[Bar chart: misses per 1,000 instructions (0–16) for block sizes 32, 64, 128 and 256 bytes, broken into instruction, capacity/conflict, cold, false sharing and true sharing components]
2/18/2011
How to Write a Computer Architecture Paper
TDT4260 Computer Architecture
18. February 2011
Magnus Jahre
2nd Branch Prediction Championship
• International competition similar to our prefetching exercise system
• Task: Implement your best possible branch predictor and write a paper about it
• Submission deadline: 15. April 2011
• More info: http://www.jilp.org/jwac-2/
How does pfJudge work?
• Each submitted file is one Kongull job
  – Contains 12 M5 instances since there are 12 CPUs per node
  – Each M5 instance runs a different SPEC 2000 benchmark
• The Kongull job is added to the job queue
  – Status “Running” can mean running or queued, be patient
  – Running a job can take a long time depending on load
  – Kongull is usually able to empty the queue during the night
• We can give you a regular user account on Kongull
  – Remember that Kongull is a shared resource!
  – Always calculate the expected CPU-hour demand of your experiment before submitting
Storage Estimation

• We impose a storage limit of 8 KB on your prefetchers
  – This limit is not checked by the exercise system
• This is realistic: hardware components are usually designed with an area budget in mind
• Estimating storage is simple
  – Table-based prefetcher: add up the bits used in each entry and multiply by the number of entries
HOW TO USE A SIMULATOR
Research Workflow

[Flowchart, abridged: ... → Evaluate Solution on Compute Cluster → Receive PhD (get a real job)]
Why simulate?

• Model of a system
  – Model the interesting parts with high accuracy
  – Model the rest of the system with sufficient accuracy
• “All models are wrong but some are useful” (G. Box, 1979)
• The model does not necessarily have a one-to-one correspondence with the actual hardware
  – Try to model behavior
  – Simplify your code wherever possible
Know your model

• You need to figure out which system is being modeled!
• Pfsys is a help to getting started, but to draw conclusions from your work you need to understand what you are modeling
HOW TO WRITE A PAPER
Find Your Story
• A good computer architecture paper tells a story– All good stories have a bad guy: the problem– All good stories have a hero: the scheme
• Writing a good paper is all about finding and identifying your story
• Note that this story has to be told within the strict structure of a scientific article
Paper Format
• You will be pressed for space
• Try to say things as precisely as possibleYour first write up can be as much as 3x the page limit and it’s still– Your first write-up can be as much as 3x the page limit and it s still easy (possible) to get it under the limit
• Think about your plots/figures
– A good plot/figure gives a lot of information
– Is this figure the best way of conveying this idea?
– Is this plot the best way of visualizing this data?
– Plots/figures need to be area efficient (but readable!)
Typical Paper Outline
• Abstract
• Introduction
• Background/Related Work
• The Scheme (substitute with a descriptive title)
• Methodology
• Results
• Discussion
• Conclusion (with optional further work)
Abstract
• An experienced reader should be able to understand exactly what you have done from only reading the abstract
– This is different from a summary
• Should be short; the maximum varies from 150 to 200 words
• Should include a description of the problem, the solution and the main results
• Typically the last thing you write
Introduction
• Introduces the larger research area that the paper is a part of
• Introduces the problem at hand
• Explains the scheme
• Level of abstraction: “20 000 feet”
Related Work
• Reference the work that other researchers have done that is related to your scheme
• Should be complete (i.e. contain all relevant work)
– Remember: you define the scope of your work
• Can be split into two sections: Background and Related Work
– Background is an informative introduction to the field (often section 2)
– Related work is a very dense section that includes all relevant references (often section n-1)
The Scheme
• Explain your scheme in detail– Choose an informative title
• Trick: Add an informative figure that helps explain your scheme
• If your scheme is complex, an informative example may be in order
Methodology
• Explains your experimental setup
• Should answer the following questions:
– Which simulator did you use?
– How have you extended the simulator?
– Which parameters did you use for your simulations? (aim: reproducibility)
– Which benchmarks did you use?
– Why did you choose these benchmarks?
• Important: should be realistic
• If you are unsure about a parameter, run a simulation to check its impact
Results
• Show that your scheme works
• Compare to other schemes that do the same thing– Hopefully you are better, but you need to compare anyway
• Trick: “Oracle Scheme”
– Uses “perfect” information to create an upper bound on the performance of a class of schemes
– Prefetching: best case is that all L2 accesses are hits
• Sensitivity analysis
– Check the impact of model assumptions on your scheme
Discussion
• Only include this if you need it
• Can be used if:
– You have weaknesses in your model that you have not accounted for
– You tested improvements to your scheme that did not give good enough results to be included in “The Scheme” section
Conclusion
• Repeat the main results of your work
• Remember that the abstract, introduction and conclusion are usually read before the rest of the paper
• Can include Further Work:– Things you thought about doing that you did not have time to do
Thank You
Visit our website: http://research.idi.ntnu.no/multicore/
TDT 4260 – Chapter 5
TLP & Memory Hierarchy
Review on ILP
• What is ILP ?
• Let the compiler find the ILP
▫ Advantages?
▫ Disadvantages?
• Let the HW find the ILP
▫ Advantages?
▫ Disadvantages?
Contents
• Multi-threading Chap 3.5
• Memory hierarchy Chap 5.1
▫ 6 basic cache optimizations
• 11 advanced cache optimizations Chap 5.2
Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
▫ Must duplicate independent state of each thread, e.g., a separate copy of register file, PC and page table
▫ Memory shared through virtual memory mechanisms
▫ HW for fast thread switch; much faster than a full process switch (≈ 100s to 1000s of clocks)
• When to switch?
▫ Alternate instructions per thread (fine grain)
▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Fine-Grained Multithreading
• Switches between threads on each instruction
▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads
• CPU must be able to switch threads every clock
• Hides both short and long stalls▫ Other threads executed when one thread stalls
• But slows down execution of individual threads
▫ A thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun’s Niagara
Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
▫ No need for very fast thread-switching
▫ Doesn’t slow down a thread, since the CPU switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
▫ Since the CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
▫ The new thread must fill the pipeline before instructions can complete
• => Better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a system
• Can a high-ILP processor also exploit TLP?
▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)▫ Intel: Hyper-Threading
Simultaneous Multi-threading
[Figure: issue slots per cycle (1–9) for one thread on a machine with 8 functional units — many slots go unused]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
[Figure: the same 8 units over cycles 1–9 with two threads issuing simultaneously — far fewer idle slots]
Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
▫ Large set of virtual registers
– Virtual = not all visible at ISA level
– Register renaming
▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Multi-threaded categories
[Figure: issue-slot diagrams over time (processor cycles) comparing superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; shading distinguishes threads 1–5 and idle slots]
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
▫ How to reduce the impact on single-thread performance?
▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
▫ Instruction issue – more candidate instructions need to be considered
▫ Instruction completion – choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
Why memory hierarchy? (fig 5.2)
[Figure 5.2: processor vs. memory performance, 1980–2010 (log scale, 1–100,000) — the processor–memory performance gap keeps growing]
Why memory hierarchy?
• Principle of Locality
▫ Spatial Locality
– Addresses near each other are likely referenced close together in time
▫ Temporal Locality
– The same address is likely to be reused in the near future
• Idea: Store recently used elements in fast memories close to the processor
▫ Managed by software or hardware?
Memory hierarchy
We want large, fast and cheap at the same time
[Figure: processor (control + datapath) backed by successively larger memory levels]
Speed: fastest (closest to the processor) … slowest
Capacity: smallest … largest
Cost: most expensive … cheapest
Cache block placement
Block 12 placed in a cache with 8 block frames (main memory has 32 blocks, numbered 0–31):
• Fully associative: block 12 can go anywhere (frames 0–7)
• Direct mapped: block 12 can go only into frame 4 (12 mod 8)
• Set associative (2-way, sets 0–3): block 12 can go anywhere in set 0 (12 mod 4)
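The three placement policies can be checked with a small Python sketch; the frame numbering follows the slide's 8-frame example:

```python
# Which set and which frames may a memory block occupy in a cache?
def placement(block, frames, assoc):
    """Return (set index, list of candidate frames) for a given associativity."""
    num_sets = frames // assoc
    s = block % num_sets
    return s, list(range(s * assoc, (s + 1) * assoc))

print(placement(12, 8, 1))  # direct mapped: set 4, frame [4]      (12 mod 8)
print(placement(12, 8, 2))  # 2-way set assoc.: set 0, frames [0, 1] (12 mod 4)
print(placement(12, 8, 8))  # fully associative: set 0, any frame 0..7
```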
Cache performance
• Miss rate alone is not an accurate measure
• Cache performance is important for CPU performance
• More important with higher clock rates
• Cache design can also affect instructions that don’t access memory!
• Example: A set-associative L1 cache on the critical path requires extra logic which will increase the clock cycle time
• Trade-off: additional hits vs. cycle time reduction
Average access time = Hit time + Miss rate * Miss penalty
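A quick numeric check of the formula above; the numbers are illustrative, not taken from the slides:

```python
# Average memory access time (AMAT), exactly as in the slide's formula.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))  # 6.0 cycles on average
```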
6 Basic Cache Optimizations
Reducing Hit Time
1. Giving Reads Priority over Writes
– Writes in the write buffer can be handled after a newer read if not causing dependency problems
2. Avoiding Address Translation during Cache Indexing
– E.g. use the virtual memory page offset to index the cache
Reducing Miss Penalty
3. Multilevel Caches
– Both small and fast (L1) and large (& slower) (L2)
Reducing Miss Rate
4. Larger Block size (compulsory misses)
5. Larger Cache size (capacity misses)
6. Higher Associativity (conflict misses)
1: Giving Reads Priority over Writes
• Caches typically use a write buffer
▫ CPU writes to cache and write buffer
▫ Cache controller transfers from buffer to RAM
▫ Write buffer usually FIFO with N elements
▫ Works well as long as the buffer does not fill faster than it can be emptied
• Optimization
▫ Handle read misses before write-buffer writes
▫ Must check for conflicts with the write buffer first
[Figure: Processor → Cache → Write Buffer → DRAM]
Virtual memory
• Processes use a large virtual memory
• Virtual addresses are dynamically mapped to physical addresses using HW & SW
• Page, page frame, page fault, translation lookaside buffer (TLB), etc.
[Figure: two processes’ virtual address spaces (0 to 2^n−1) mapped page by page, via address translation, onto the physical address space (0 to 2^m−1)]
2: Avoiding Address Translation during Cache Indexing
• Virtual cache: use virtual addresses in caches
▫ Saves time on the VA -> PA translation
▫ Disadvantages
– Must flush cache on process switch
– Can be avoided by including the PID in the tag
– Alias problem: the OS and a process can have two VAs pointing to the same PA
• Compromise: “virtually indexed, physically tagged”
▫ Use the page offset to index the cache
▫ The page offset is the same for VA and PA
▫ At the same time as data is read from the cache, the VA -> PA translation is done for the tag
▫ Tag comparison using the PA
▫ But: page size restricts cache size
3: Multilevel Caches (1/2)
• Make the cache faster to keep up with the CPU, or larger to reduce misses?
• Why not both?
• Multilevel caches
– Small and fast L1
– Large (and cheaper) L2
3: Multilevel Caches (2/2)
• Local miss rate▫ #cache misses / # cache accesses
• Global miss rate▫ #cache misses / # CPU memory accesses
• L1 cache speed affects CPU clock rate
• L2 cache speed affects only the L1 miss penalty
▫ Can use more complex mapping for L2▫ L2 can be large
Average access time = L1 Hit time + L1 Miss rate * (L2 Hit time + L2 Miss rate * L2 Miss penalty)
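The two-level formula above can be checked numerically as well. Note that the L2 miss rate in the formula is the *local* one (misses per L2 access); the numbers below are illustrative, not from the slides:

```python
# Two-level AMAT with local miss rates, matching the slide's formula.
def amat2(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, l2_penalty):
    return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_penalty)

# Illustrative: the global L2 miss rate would be 0.05 * 0.2 = 1% of accesses.
print(amat2(1, 0.05, 10, 0.2, 200))  # 1 + 0.05*(10 + 0.2*200) = 3.5 cycles
```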
4: Larger Block size
[Figure: miss rate (0%–25%) vs. block size (16–256 bytes) for cache sizes 1K–256K, broken into compulsory, capacity and conflict misses — larger blocks reduce compulsory misses, but for small caches the miss rate eventually rises again]
• Trade-off: 32- and 64-byte blocks are common
5: Larger Cache size
• Simple method
• Square-root rule of thumb (quadrupling the size of the cache will roughly halve the miss rate)
• Disadvantages
▫ Longer hit time
▫ Higher cost
• Most used for L2/L3 caches
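The square-root rule above can be sketched numerically; the sizes and miss rates below are made up for illustration:

```python
from math import sqrt

# Rule of thumb: miss rate scales roughly with 1/sqrt(cache size),
# so quadrupling the cache roughly halves the miss rate.
def scaled_miss_rate(old_rate, old_size, new_size):
    return old_rate * sqrt(old_size / new_size)

print(scaled_miss_rate(0.08, 256, 1024))  # 0.08 -> 0.04 going from 256KB to 1MB
```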
6: Higher Associativity
• Lower miss rate
• Disadvantages
▫ Can increase hit time
▫ Higher cost
• 8-way has performance similar to fully associative
11 Advanced Cache Optimizations
Reducing hit time
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth
4. Pipelined caches
5. Non-blocking caches
6. Multibanked caches
Reducing Miss Penalty
7. Critical word first
8. Merging write buffers
Reducing Miss Rate
9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism
10. Hardware prefetching
11. Compiler prefetching
1: Small and simple caches
• Comparing the address to tag memory takes time
• ⇒ A small cache can help hit time
▫ E.g., L1 caches are the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
▫ Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
• Simple ⇒ direct mapping
▫ Can overlap tag check with data transmission since there is no choice
• Access time estimate for 90 nm using the CACTI model 4.0
▫ Median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Figure: access time (ns) vs. cache size, 16 KB–1 MB, for 1-way, 2-way, 4-way and 8-way caches]
2: Way prediction
• Extra bits are kept in the cache to predict which way (block) in a set the next access will hit
▫ Can retrieve the tag early for comparison
▫ Achieves fast hit even with just one comparator
▫ On a misprediction, several cycles are needed to check the other blocks
3: Trace caches
• Increasingly hard to feed modern superscalar processors with enough instructions
• Trace cache
▫ Stores dynamic instruction sequences rather than ”bytes of data”
▫ Instruction sequence may include branches
– Branch prediction is integrated with the cache
▫ Complex and relatively little used
▫ Used in Pentium 4: Trace cache stores up to 12K micro-ops decoded from x86 instructions (also saves decode time)
4: Pipelined caches
• Pipeline technology applied to cache lookups
▫ Several lookups in processing at once
▫ Results in faster cycle time
▫ Examples: Pentium (1 cycle), Pentium-III (2 cycles), P4 (4 cycles)
▫ L1: Increases the number of pipeline stages needed to execute an instruction
▫ L2/L3: Increases throughput
– Nearly for free, since the hit latency is on the order of 10–20 processor cycles and caches are easy to pipeline
5: Non-blocking caches (1/2)
• A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss
• “hit under miss” reduces the effective miss penalty by working during miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
▫ Requires that the lower-level memory can service multiple concurrent misses
▫ Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
▫ Pentium Pro allows 4 outstanding memory misses
5: Non-Blocking Cache Implementation
• The cache can handle as many concurrent misses as there are MSHRs
• Cache must block when all valid bits (V) are set
• Very common
[Figure: cache with a file of MSHRs, each with a valid bit (V)]
MHA = Miss Handling Architecture
MSHR = Miss Information/Status Holding Register
DMHA = Dynamic Miss Handling Architecture
5: Non-blocking cache performance
6: Multibanked caches
• Divide cache into independent banks that can support simultaneous accesses
▫ E.g.,T1 (“Niagara”) L2 has 4 banks
• Banking works best when accesses naturally spread themselves across banks ⇒ mapping of addresses to banks affects behavior of memory system
• Simple mapping that works well is “sequential interleaving”
▫ Spread block addresses sequentially across banks
▫ E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; …
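Sequential interleaving as described above amounts to a modulo, sketched here with the 4-bank example from the slide:

```python
# Sequential interleaving: spread block addresses across banks round-robin.
def bank_of(block_addr, num_banks=4):
    return block_addr % num_banks

# Blocks 0..7 land in banks 0,1,2,3 and then wrap around.
print([bank_of(b) for b in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```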
7: Critical word first
• Don’t wait for full block before restarting CPU
• Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
• Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
▫ Long blocks are more popular today ⇒ Critical Word First is widely used
8: Merging write buffers
• Write buffer allows processor to continue while waiting to write to memory
▫ If buffer contains modified blocks, the addresses can be checked to see if address of new data matches the address of a valid write buffer entry
▫ If so, new data are combined with that entry
• Multiword writes are more efficient to memory
• The Sun T1 (Niagara) processor, among many others, uses write merging
9: Compiler optimizations
• Instruction order can often be changed without affecting correctness
▫ May reduce conflict misses
▫ Profiling may help the compiler
• Compilers generate instructions grouped in basic blocks
▫ If the start of a basic block is aligned to a cache block, misses will be reduced
– Important for larger cache block sizes
• Data is even easier to move
▫ Lots of different compiler optimizations
10: Hardware prefetching
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Instruction Prefetching
▫ Typically, CPU fetches 2 blocks on a miss: the requested block and the next consecutive block.
▫ Requested block is placed in instruction cache when it returns, and prefetched block is placed into instruction stream buffer
• Data Prefetching
▫ The Pentium 4 can prefetch data into the L2 cache from up to 8 streams
▫ Prefetching is invoked after 2 successive L2 cache misses to a page
[Figure: performance improvement from hardware prefetching on the Pentium 4 — speedups of 1.16 and 1.45 on SPECint2000 (gap, mcf) and between 1.18 and 1.97 on SPECfp2000 (fma3d, wupwise, galgel, facerec, swim, applu, lucas, mgrid, equake)]
11: Compiler prefetching
• Data Prefetch
▫ Load data into register (HP PA-RISC loads)
▫ Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
▫ Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing Prefetch Instructions takes time
▫ Is cost of prefetch issues < savings in reduced misses?
Cache Coherency
• Consider the following case: I have two processors that are sharing address X
• Both cores read address X
• Address X is brought from memory into the caches of both processors
• Now, one of the processors writes to address X and changes the value.
• What happens? How does the other processor get notified that address X has changed?
Two types of cache coherence schemes
• Snooping
▫ Broadcast writes, so all copies in all caches will be properly invalidated or updated.
• Directory
▫ In a structure, keep track of which cores are caching each address.
▫ When a write occurs, query the directory and properly handle any other cached copies.
TDT 4260 – Appendix E
Interconnection Networks
Contents
• Introduction App E.1
• Two devices App E.2
• Multiple devices App E.3
• Topology App E.4
• Routing, arbitration, switching App E.5
Conceptual overview
• Basic network technology assumed known
• Motivation
▫ Increased importance
– System-to-system connections
– Intra-system connections
▫ Increased demands
– Bandwidth, latency, reliability, ...
▫ ⇒ Vital part of system design
Motivation
Types of networks
Number of devices and distance:
• OCN – On-chip network
▫ Functional units, register files, caches, …
▫ Also known as: Network on Chip (NoC)
• SAN – System/storage area network
▫ Multiprocessors and multicomputers, storage
• LAN – Local area network
• WAN – Wide area network
• Trend: Switches replace buses
E.2: Connecting two devices
Destination implicit
Software to Send and Receive
• SW Send steps
1: Application copies data to OS buffer
2: OS calculates checksum, starts timer
3: OS sends data to network interface HW and says start
• SW Receive steps
3: OS copies data from network interface HW to OS buffer
2: OS calculates checksum, if matches send ACK; if not, deletes message (sender resends when timer expires)
1: If OK, OS copies data to user address space and signals application to continue
• Sequence of steps for SW: protocol
Network media
• Twisted Pair: copper, 1 mm thick, twisted to avoid antenna effect (telephone)
• Coaxial Cable: used by cable companies; high BW, good noise immunity
▫ Copper core, insulator, braided outer conductor, plastic covering
• Fiber Optics: light; 3 parts — cable, light source, light detector
▫ Multimode fiber disperses the light (LED); single mode carries a single wave (laser)
▫ Transmitter: LED or laser diode; receiver: photodiode
[Figure: light from the source kept inside the silica core by total internal reflection at the air boundary]
Basic Network Structure and Functions: Media and Form Factor
[Figure: media type vs. distance (0.01 m to >1,000 m) for OCNs, SANs, LANs and WANs — metal layers and printed circuit boards on chip; Myrinet and InfiniBand connectors; Cat5E twisted pair and Ethernet; coaxial cables and fiber optics]
Packet latency
[Figure: sender-to-receiver timeline — sender overhead (processor busy), time of flight, transmission time (size/bandwidth), receiver overhead (processor busy); transport latency spans time of flight plus transmission time]
Total Latency = Sender Overhead + Time of Flight + Message Size / Bandwidth + Receiver Overhead
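The total-latency formula can be evaluated directly. The units (seconds, bits, bits/second) and the example numbers are my choice for illustration, not the slide's:

```python
# Total packet latency, term by term as in the slide's formula.
def total_latency(sender_oh, time_of_flight, size, bandwidth, receiver_oh):
    return sender_oh + time_of_flight + size / bandwidth + receiver_oh

# Illustrative: 1us overheads, 5us flight, 8 kbit packet on a 1 Gbit/s link.
lat = total_latency(1e-6, 5e-6, 8000, 1e9, 1e-6)
print(round(lat * 1e6, 3), "us")  # 15.0 us
```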
E.3: Connecting multiple devices (1/3)
[Figure: shared media (Ethernet) — all nodes on one bus — vs. switched media (CM-5, ATM) — nodes attached to a switch]
• New issues
▫ Topology
– What paths are possible for packets?
▫ Routing
– Which of the possible paths are allowable (valid) for packets?
▫ Arbitration
– When are paths available for packets?
▫ Switching
– How are paths allocated to packets?
E.3: Connecting multiple devices (2/3)
• Two types of topology
▫ Shared media
▫ Switched media
• Shared media (bus)
▫ Arbitration
– Carrier sensing
– Collision detection
▫ Routing is simple
– Only one possible path
Connecting multiple devices (3/3)
• Switched media
▫ “Point-to-point” connections
▫ Routing for each packet
▫ Arbitration for each connection
• Comparison
▫ Much higher aggregate BW in a switched network than in a shared media network
▫ Shared media is cheaper
▫ Distributed arbitration is simpler for switched media
E.4: Interconnection Topologies
• One switch or bus can connect a limited number of devices
▫ Complexity, cost, technology, …
• Interconnected switches are needed for larger networks
• Topology: connection structure
▫ What paths are possible for packets?
▫ All pairs of devices must have path(s) available
• A network is partitioned by a set of links if their removal disconnects the graph
▫ Bisection bandwidth
▫ Important for performance
Crossbar
• Common topology for connecting CPUs and I/O units
• Also used for interconnecting CPUs
• Fast and expensive (O(N²))
• Non-blocking
[Figure: crossbar connecting processors (P), caches (C), memories (M) and I/O units]
[Figure: 8×8 Omega network, sources 000–111 to destinations 000–111, built from 2x2 switches with four settings: straight, crossover, upper broadcast, lower broadcast]
Omega network
• Example of multistage network
• Usually log₂ n stages for n inputs – cost O(n log n)
• Can block
Linear Arrays and Rings
• Linear array = 1D grid
• 2D grid
• Torus has wrap-around connections
• CRAY with 3D torus
[Figure: node = switch + processor with cache (P$), memory controller and network interface (NI), memory, external I/O]
• Distributed switched networks
• Node = switch + 1-n end nodes
Trees
• Diameter and average distance are logarithmic
▫ k-ary tree, height d = log_k N
▫ address = d-vector of radix-k coordinates describing the path down from the root
• Fixed number of connections per node (i.e. fixed degree)
• Bisection bandwidth = 1 near the root
E.5: Routing, Arbitration, Switching
• Routing▫ Which of the possible paths are allowable for packets?
▫ Set of operations needed to compute a valid path
▫ Executed at source, intermediate, or even at destination nodes
• Arbitration▫ When are paths available for packets?
▫ Resolves packets requesting the same resources at the same time
▫ For every arbitration, there is a winner and possibly many losers
– Losers are buffered (lossless) or dropped on overflow (lossy)
• Switching▫ How are paths allocated to packets?
▫ The winning packet (from arbitration) proceeds towards destination
▫ Paths can be established one fragment at a time or in their entirety
Routing
• Shared Media
▫ Broadcast to everyone
• Switched Media needs real routing. Options:
▫ Source-based routing: message specifies path to the destination (changes of direction)
▫ Virtual Circuit: circuit established from source to destination, message picks the circuit to follow
▫ Destination-based routing: message specifies destination, switch must pick the path
– Deterministic: always follow the same path
– Adaptive: pick different paths to avoid congestion, failures
– Randomized routing: pick between several good paths to balance network load
Routing mechanism
• Need to select an output port for each input packet
▫ And fast…
• Simple arithmetic in regular topologies
▫ Ex: ∆x, ∆y routing in a grid (first ∆x, then ∆y)
– west (−x) if ∆x < 0
– east (+x) if ∆x > 0
– south (−y) if ∆x = 0, ∆y < 0
– north (+y) if ∆x = 0, ∆y > 0
• Unidirectional links are sufficient for a torus (+x, +y)
• Dimension-order routing
▫ Reduce the relative address of each dimension in order to avoid deadlock
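The ∆x/∆y rule above can be sketched in a few lines; the coordinate convention and direction names are illustrative, not prescribed by the slides:

```python
# Dimension-order (XY) routing in a 2D grid: reduce delta-x to zero first,
# then delta-y. The fixed dimension order is what avoids deadlock.
def next_hop(cur, dst):
    """Return the output direction for a packet at `cur` headed for `dst`."""
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "arrived"

print(next_hop((0, 0), (2, 1)))  # east  (correct x first)
print(next_hop((2, 0), (2, 1)))  # north (then y)
```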
Deadlock
• How can it arise?
▫ Necessary conditions:
– shared resources
– incrementally allocated
– non-preemptible
• How do you handle it?
▫ Constrain how channel resources are allocated (deadlock avoidance)
▫ Add a mechanism that detects likely deadlocks and fixes them (deadlock recovery)
[Figure: 4×4 grid of routers, TRC (0,0)–(3,3), with a cyclic channel dependency marked — a deadlocked configuration]
Arbitration (1/2)
• Several simultaneous requests to shared resource
• Ideal: Maximize usage of network resources
• Problem: Starvation
▫ Fairness needed
• Figure: Two phase arb.
▫ Request, Grant
▫ Poor usage
Arbitration (2/2)
• Three phases
• Multiple requests
• Better usage
• But: Increased latency
Switching
• Allocating paths for packets
• Two techniques:
▫ Circuit switching (connection oriented)
– Communication channel allocated before the first packet
– Packet headers don’t need routing info
– Wastes bandwidth
▫ Packet switching (connectionless)
– Each packet handled independently
– Can’t guarantee response time
– Two types – next slide
Switching: Store & Forward vs. Cut-Through Routing
• Cut-through (on blocking)
▫ Virtual cut-through (spools the rest of the packet into a buffer)
▫ Wormhole (buffers only a few flits, leaves the tail along the route)
[Figure: time-space diagram of a 4-flit packet (flits 3-2-1-0) travelling from source to destination over 3 hops — store & forward waits for the whole packet at each hop, while cut-through forwards flits as they arrive]
Piranha: Designing a Scalable CMP-based System for Commercial Workloads
Luiz André Barroso, Western Research Laboratory
April 27, 2001 Asilomar Microcomputer Workshop
What is Piranha?
• A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever-increasing processor complexity and system design/verification cycles
Importance of Commercial Applications
• Total server market size in 1999: ~$55-60B
– technical applications: less than $6B
– commercial applications: ~$40B
[Figure: Worldwide Server Customer Spending (IDC 1999) — Infrastructure 29%, Business processing 22%, Decision support 14%, Software development 14%, Collaborative 12%, Scientific & engineering 6%, Other 3%]
Price Structure of Servers
• IBM eServer 680 (220K tpmC; $43/tpmC)
– 24 CPUs, 96GB DRAM, 18 TB disk
– $9M price tag
• Compaq ProLiant ML370 (32K tpmC; $12/tpmC)
– 4 CPUs, 8GB DRAM, 2TB disk
– $240K price tag
• Software maintenance/management costs even higher (up to $100M)
• Storage prices dominate (50%-70% in customer installations)
• Price of expensive CPUs/memory system is amortized
[Figure: normalized breakdown of HW cost (base, CPU, DRAM, I/O) for the IBM eServer 680 and Compaq ProLiant ML570]
Price per component:
System | $/CPU | $/MB DRAM | $/GB Disk
IBM eServer 680 | $65,417 | $9 | $359
Compaq ProLiant ML570 | $6,048 | $4 | $64
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary
Studies of Commercial Workloads
• Collaboration with Kourosh Gharachorloo (Compaq WRL)
– ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion)
– ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
– ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve)
– HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese)
– ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)
Studies of Commercial Workloads: summary
• Memory system is the main bottleneck
– astronomically high CPI
– dominated by memory stall times
– instruction stalls as important as data stalls
– fast/large L2 caches are critical
• Very poor Instruction Level Parallelism (ILP)
– frequent hard-to-predict branches
– large L1 miss ratios
– Ld-Ld dependencies
– disappointing gains from wide-issue out-of-order techniques!
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary
Increasing Complexity of Processor Designs
• Pushing limits of instruction-level parallelism
– multiple instruction issue
– speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size
• Yielding diminishing returns in performance
Processor (SGI MIPS) | Year Shipped | Transistor Count (millions) | Design Team Size | Design Time (months) | Verification Team Size (% of total)
R2000 | 1985 | 0.10 | 20 | 15 | 15%
R4000 | 1991 | 1.40 | 55 | 24 | 20%
R10000 | 1996 | 6.80 | >100 | 36 | >35%
courtesy: John Hennessy, IEEE Computer, 32(8)
Exploiting Higher Levels of Integration
• Lower latency, higher bandwidth
• Reuse of an existing CPU core addresses complexity issues
[Figure: Alpha 21364 — a single chip with a 1GHz 21264 CPU, 64KB I$ and 64KB D$, 1.5MB L2$, memory controllers, coherence engine, network interface and I/O]
• Incrementally scalable, glueless multiprocessing
[Figure: grid of 21364 nodes with memory and I/O, connected directly to one another]
Exploiting Parallelism in Commercial Apps
[Figure: Chip Multiprocessing (CMP) — multiple CPUs with private I$/D$ sharing L2$, coherence, memory controllers, network and I/O on one chip. Example: IBM Power4]
[Figure: Simultaneous Multithreading (SMT) — issue slots over time filled from threads 1–4. Example: Alpha 21464]
• SMT is superior in single-thread performance
• CMP addresses complexity by using simpler cores
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
– Architecture
– Performance
• Design Methodology
• Summary
Piranha Project
• Explore chip multiprocessing for scalable servers
• Focus on parallel commercial workloads
• Small team, modest investment, short design time
• Address complexity by using:
– simple processor cores
– standard ASIC methodology
Give up on ILP, embrace TLP
Piranha Team Members
Research
– Luiz André Barroso (WRL)
– Kourosh Gharachorloo (WRL)
– David Lowell (WRL)
– Joel McCormack (WRL)
– Mosur Ravishankar (WRL)
– Rob Stets (WRL)
– Yuan Yu (SRC)
NonStop Hardware Development / ASIC Design Center
– Tom Heynemann
– Dan Joyce
– Harland Maxwell
– Harold Miller
– Sanjay Singh
– Scott Smith
– Jeff Sprouse
– … several contractors
Former Contributors
– Robert McNamara, Basem Nayfeh, Andreas Nowatzyk, Joan Pendleton, Shaz Qadeer, Brian Robinson, Barton Sano, Daniel Scales, Ben Verghese
Piranha Processing Node
• Alpha core: 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): µprogrammed, 1K µinstr., even/odd interleaving
• System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Figure: single chip with 8 CPUs, each with I$/D$, connected through the ICS to L2$ banks, memory controllers, home/remote protocol engines (HE/RE) and a router]
Piranha I/O Node
[Diagram: one CPU with I$ and D$, an L2$, a memory controller, the protocol engines (HE, RE), and a PCI-X interface, connected through the ICS to a router with 2 links @ 8 GB/s.]
• I/O node is a full-fledged member of the system interconnect
  – CPU indistinguishable from Processing Node CPUs
  – participates in global coherence protocol
Example Configuration
[Diagram: six processing nodes (P) and two I/O nodes (P-I/O) connected in a ring.]
• Arbitrary topologies
• Match the ratio of Processing to I/O nodes to application requirements
L2 Cache and Intra-Node Coherence
• No inclusion between L1s and L2 cache
  – total L1 capacity equals L2 capacity
  – L2 misses go directly to L1
  – L2 filled by L1 replacements
• L2 keeps track of all lines in the chip
  – sends Invalidates, Forwards
  – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization
  – cooperates with Protocol Engines to enforce system-wide coherence
Inter-Node Coherence Protocol
• 'Stealing' ECC bits for the memory directory
  – data/ECC/directory layout: 8×(64+8), 4×(128+9+7), 2×(256+10+22), 1×(512+11+53)
• Directory: 2b state + 40b sharing info
• Dual representation: limited pointer + coarse vector (2b state + 20b sharer info)
• "Cruise Missile" Invalidations (CMI)
  – limit fan-out/fan-in serialization with the coarse vector
• Several new protocol optimizations
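The dual representation can be sketched in C: with 20 bits of sharing info and 10-bit node IDs, two exact pointers fit; on overflow the entry switches to a coarse vector where each bit covers a group of nodes. This is a toy illustration — the field widths, group size and conversion policy here are assumptions, not Piranha's actual encoding:

```c
#include <stdint.h>

#define NUM_NODES  1024
#define INFO_BITS  20
#define NODE_BITS  10                                    /* log2(NUM_NODES)      */
#define MAX_PTRS   (INFO_BITS / NODE_BITS)               /* 2 exact pointers fit */
#define GROUP_SIZE ((NUM_NODES + INFO_BITS - 1) / INFO_BITS)

/* Hypothetical directory entry mirroring "2b state + 20b sharing info":
 * few sharers -> exact node pointers packed into info;
 * many sharers -> coarse bit vector, one bit per group of nodes. */
typedef struct {
    uint8_t  coarse;   /* which representation is in use (part of the state) */
    uint32_t info;     /* 20 bits of sharing information                     */
    int      nptrs;    /* pointers stored while in limited-pointer mode      */
} dir_entry;

static void add_sharer(dir_entry *e, int node)
{
    if (!e->coarse && e->nptrs < MAX_PTRS) {
        /* Limited-pointer mode: record the exact node ID. */
        e->info |= (uint32_t)node << (e->nptrs * NODE_BITS);
        e->nptrs++;
    } else if (!e->coarse) {
        /* Overflow: convert the exact pointers into a coarse vector. */
        uint32_t v = 0;
        for (int i = 0; i < e->nptrs; i++)
            v |= 1u << (((e->info >> (i * NODE_BITS)) & (NUM_NODES - 1)) / GROUP_SIZE);
        e->info   = v | (1u << (node / GROUP_SIZE));
        e->coarse = 1;
    } else {
        /* Already coarse: just set the group bit (imprecise but safe). */
        e->info |= 1u << (node / GROUP_SIZE);
    }
}
```

The trade-off this encodes: exact pointers invalidate precisely, while the coarse vector over-approximates the sharer set, so some invalidations are sent to nodes that never cached the line.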
Simulated Architectures
Single-Chip Piranha Performance
[Chart: normalized execution time (broken into CPU, L2 hit, L2 miss) for OLTP and DSS on four designs — P1 (500 MHz, 1-issue): 233 / 350; INO (1 GHz, 1-issue): 145 / 191; OOO (1 GHz, 4-issue): 100 / 100; P8 (500 MHz, 1-issue): 34 / 44.]
• Piranha's performance margin: 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes the memory system
Single-Chip Performance (Cont.)
• Near-linear scalability
  – low memory latencies
  – effectiveness of highly associative L2 and non-inclusive caching
[Charts: speedup vs. number of cores (near-linear up to 8 cores at 500 MHz, 1-issue), and normalized breakdown of L1 misses (L2 hit / L2 forward / L2 miss) for P1, P2, P4 and P8.]
Potential of a Full-Custom Piranha
• 5x margin over OOO for OLTP and DSS
• Full-custom design benefits substantially from the boost in core speed
[Chart: normalized execution time (CPU, L2 hit, L2 miss) for OLTP and DSS — OOO (1 GHz, 4-issue): 100 / 100; P8 (500 MHz, 1-issue): 34 / 43; P8F full-custom (1.25 GHz, 1-issue): 20 / 19.]
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary
Managing Complexity in the Architecture
• Use of many simpler logic modules
  – shorter design
  – easier verification
  – only short wires*
  – faster synthesis
  – simpler chip-level layout
• Simplify intra-chip communication
  – all traffic goes through the ICS (no backdoors)
• Use of microprogrammed protocol engines
• Adoption of large VM pages
• Implement a subset of the Alpha ISA
  – no VAX floating point, no multimedia instructions, etc.
Methodology Challenges
• Isolated sub-module testing
  – need to create robust bus functional models (BFM)
  – sub-modules' behavior is highly inter-dependent
  – not feasible with a small team
• System-level (integrated) testing
  – much easier to create tests
  – only one BFM at the processor interface
  – simpler to assert correct operation
  – Verilog simulation is too slow for comprehensive testing
Our Approach
• Design in stylized C++ (synthesizable RTL level)
  – use mostly system-level, semi-random testing
  – simulations in C++ (faster & cheaper than Verilog); simulation speed ~1000 clocks/second
  – employ directed tests to fill test coverage gaps
• Automatic C++ to Verilog translation
  – single design database
  – reduce translation errors
  – faster turnaround of design changes
  – risk: untested methodology
• Using industry-standard synthesis tools
• IBM ASIC process (Cu11)
Piranha Methodology: Overview
[Diagram of the tool flow:]
• C++ RTL Models: cycle-accurate and "synthesizable"
• cxx (C++ compiler) builds PS1, a fast C++ logic simulator
• CLevel: C++-to-Verilog translator; the Verilog models are machine-translated from the C++ models
• PS1V: can co-simulate the C++ and Verilog module versions and check correspondence
• Physical design: leverages industry-standard Verilog-based tools
Summary
• CMP architectures are inevitable in the near future
• Piranha investigates an extreme point in CMP design: many simple cores
• Piranha has a large architectural advantage over complex single-core designs (> 3x) for database applications
• Piranha methodology enables faster design turnaround
• Key to Piranha is application focus: one-size-fits-all solutions may soon be infeasible
Reference
• Papers on commercial workload performance & Piranha: research.compaq.com/wrl/projects/Database
TDT 4260 – lecture 11/3 - 2011
• Miniproject status, update, presentation
• Synchronization, Textbook Chap 4.5
  – And a short note on BSP (with excellent timing …)
• Short presentation of NUTS, NTNU Test Satellite System http://nuts.iet.ntnu.no/
• UltraSPARC T1 (Niagara), Chap 4.8
• And more on multicores

1 Lasse Natvig
Miniproject – after the first deadline

Implementing 1 existing prefetcher | Comparison of 2 or more existing prefetchers  | Improving on existing prefetcher
Sequential prefetcher              | RPT and DCPT                                  | Improving sequential prefetcher
RPT prefetcher                     | Sequential (tagged or adaptive), RPT and DCPT | Improving DCPT
Miniproject – after the first deadline
• Feedback
  – RPT and DCPT are popular choices; the report should properly motivate each group's choice of prefetcher (the motivation should not be: "The code was easily available")
  – Several groups work on similar methods
    • "find your story"
  – too much focus on getting the highest result in the PfJudge ranking; as stated in section 2.3 of the guidelines, the miniproject will be evaluated based on the following criteria:
    • good use of language
    • clarity of the problem statement
    • overall document structure
    • depth of understanding of the field of prefetching
    • quality of presentation
Miniproject presentations
• Friday 15/4 at 1415-1700 (max)
• OK for all?
  – No … we are working on finding a time schedule that is OK for all

IDI Open, a challenge for you?
Synchronization
• Important concept
  – Synchronize access to shared resources
  – Order events from cooperating processes correctly
• Smaller MP systems
  – Implemented by uninterrupted instruction(s) atomically accessing a value
  – Requires special hardware support
  – Simplifies construction of OS / parallel apps
• Larger MP systems: Appendix H (not in course)
Atomic exchange (swap)
• Swaps the value in a register for the value in memory
  – Mem = 0 means not locked, Mem = 1 means locked
  – How does this work?
• Register <= 1 ; processor wants to lock
• Exchange(Register, Mem)
  – If Register = 0: Success
    • Mem was 0 (was unlocked)
    • Mem is now 1 (now locked)
  – If Register = 1: Fail
    • Mem was 1 (was locked)
    • Mem is now 1 (still locked)
• Exchange must be atomic!
Implementing atomic exchange (1/2)
• One alternative: Load Linked (LL) and Store Conditional (SC)
  – Used in sequence
    • If the memory location accessed by LL changes, SC fails
    • If there is a context switch between LL and SC, SC fails
  – Implemented using a special link register
    • Contains the address used in LL
    • Reset if the matching cache block is invalidated or if we get an interrupt
    • SC checks if the link register contains the same address. If so, we have atomic execution of LL & SC
Implementing atomic exchange (2/2)
• Example code EXCH (R4, 0(R1)):

try:    MOV   R3, R4      ; move exchange value
        LL    R2, 0(R1)   ; load linked
        SC    R3, 0(R1)   ; store conditional
        BEQZ  R3, try     ; branch if SC failed
        MOV   R4, R2      ; put loaded value in R4

• This can now be used to implement e.g. spin locks:

        DADDUI R2, R0, #1 ; R0 always = 0
lockit: EXCH  R2, 0(R1)   ; atomic exchange
        BNEZ  R2, lockit  ; already locked?
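The same spin lock can be sketched in portable C11, where atomic_exchange plays the role of EXCH. This is a sketch, not textbook code; the names acquire, release and lock_word are illustrative:

```c
#include <stdatomic.h>

/* 0 = unlocked, 1 = locked -- the same convention as on the slide. */
static atomic_int lock_word = 0;

void acquire(atomic_int *l)
{
    /* atomic_exchange returns the OLD value: getting 0 back means the
     * lock was free and we now own it; getting 1 back means it was
     * already held, so spin and retry (the BNEZ loop above). */
    while (atomic_exchange(l, 1) == 1) {
        /* spin */
    }
}

void release(atomic_int *l)
{
    atomic_store(l, 0);   /* hand the lock back */
}
```

On LL/SC machines the compiler lowers atomic_exchange to exactly the LL/SC retry loop shown above.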
Barrier sync. in BSP
• The BSP model
  – Leslie G. Valiant, A bridging model for parallel computation, [CACM 1990]
  – Computations organised in supersteps
  – Algorithms adapt to the compute platform, represented through 4 parameters
  – Helps the combination of portability & performance
http://www.seas.harvard.edu/news-events/press-releases/valiant_turing
Multicore
• Important and early example: UltraSPARC T1
• Motivation (see lecture 1)
  – In all market segments from mobile phones to supercomputers
  – End of Moore's law for single-core
  – The power wall
  – The memory wall
  – The bandwidth problem
  – ILP limitations
  – The complexity wall

Why multicores?
Chip Multithreading: Opportunities and challenges
• Paper by Spracklen & Abraham, HPCA-11 (2005) [SA05]
• CMT processors = Chip Multi-Threaded processors
• A spectrum of processor architectures
  – Uniprocessors with SMT (one core)
  – (pure) Chip Multiprocessors (CMP) (one thread per core)
  – Combination of SMT and CMP (they call it CMT)
• Best suited to server workloads (with high TLP)
Off-chip Bandwidth
• A bottleneck
• Bandwidth is increasing, but so is latency [Patt04]
• Need more than 100 in-flight requests to fully utilize the available bandwidth
Sharing processor resources
• SMT
  – Hardware strand
    • "HW for storing the state of a thread of execution"
    • Several strands can share resources within the core, such as execution resources
  – This improves utilization of processor resources
  – Reduces an application's sensitivity to off-chip misses
  – Switching between threads can be very efficient
• (pure) CMP
  – Multiple cores can share chip resources such as the memory controller, off-chip bandwidth and the L2 cache
  – No sharing of HW resources between strands within a core
• Combination (CMT)
1st generation CMT
• 2 cores per chip
• Cores derived from earlier uniprocessor designs
• Cores do not share any resources, except off-chip data paths
• Examples: Sun's Gemini, Sun's UltraSPARC IV (Jaguar), AMD dual-core Opteron, Intel dual-core Itanium (Montecito), Intel dual-core Xeon (Paxville, server)
2nd generation CMT
• 2 or more cores per chip
• Cores still derived from earlier uniprocessor designs
• Cores now share the L2 cache
  – Speeds inter-core communication
  – Advantageous as most commercial applications have significant instruction footprints
• Examples: Sun's UltraSPARC IV+, IBM's Power 4/5
3rd generation CMT
• CMT processors are best designed from the ground up, optimized for a CMT design point
  – Lower power consumption
• Multiple cores per chip
• Examples:
  – Sun's Niagara (T1)
    • 8 cores, each 4-way SMT
    • Each core single-issue, short pipeline
    • Shared 3 MB L2 cache
  – IBM's Power-5
    • 2 cores, each 2-way SMT

Multicore generations (?)
CMT/Multicore design space
• Number of cores
  – Multiple simple or few complex?
    • Recent paper by Hill & Marty …
      – See http://www.youtube.com/watch?v=KfgWmQpzD74
  – Heterogeneous cores
    • Serial fraction of a parallel application
      – Remember Amdahl's law
    • One powerful core for single-threaded applications
• Resource sharing
  – L2 cache! (and L3)
    • (Terminology: LL = Last Level cache)
  – Floating point units
  – New, more expensive resources (amortized over multiple cores)
    • Shadow tags, more advanced cache techniques, HW accelerators: cryptographic, OS functions (e.g. memcopy), XML parsing, compression
  – Your innovation!!!
CMT/Multicore challenges
• Multiple threads (strands) share resources
  – Maximize overall performance
    • Good resource utilization
    • Avoid "starvation" (units without work to do)
  – Cores must be "good neighbours"
    • Fairness, research by Magnus Jahre
    • See http://research.idi.ntnu.no/multicore/pub
• Prefetching
  – Aggressive prefetching is OK in a single-thread system, since the entire system is idle on a miss
  – CMT/Multicore requires more careful prefetching
    • A prefetch operation may take resources used by other threads
  – See research by Marius Grannæs (same link as above)
• Speculative operations
  – OK if using idle resources (delay until the resource is idle)
  – Needs the same care as prefetching / seldom power efficient
UltraSPARC T1 ("Niagara")
• Target: Commercial server applications
  – High thread-level parallelism (TLP)
    • Large numbers of parallel client requests
  – Low instruction-level parallelism (ILP)
    • High cache miss rates
    • Many unpredictable branches
• Power, cooling, and space are major concerns for data centers
• Metric: (Performance / Watt) / Sq. Ft.
• Approach: Multicore, fine-grain multithreading, simple pipeline, small L1 caches, shared L2

T1 processor – "logical" overview
1.2 GHz at 72W typical, 79W peak power consumption
T1 Architecture
• Also ships with 6 or 4 processors
T1 pipeline / 4 threads
• Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
• Shared units:
  – L1 cache, L2 cache
  – TLB
  – Exec. units
  – pipe registers
• Separate units:
  – PC
  – instruction buffer
  – reg file
  – store buffer
Miss Rates: L2 Cache Size, Block Size (fig. 4.27)
[Chart: T1 L2 miss rate (0%–2.5%) for TPC-C and SPECJBB across six cache configurations: 1.5 MB/32B, 1.5 MB/64B, 3 MB/32B, 3 MB/64B, 6 MB/32B, 6 MB/64B.]
Miss Latency: L2 Cache Size, Block Size (fig. 4.28)
[Chart: T1 L2 miss latency (0–200 cycles) for TPC-C and SPECJBB across the same configurations: 1.5 MB/32B, 1.5 MB/64B, 3 MB/32B, 3 MB/64B, 6 MB/32B, 6 MB/64B.]
CPI Breakdown of Performance

Benchmark  | Per-thread CPI | Per-core CPI | Effective CPI for 8 cores | Effective IPC for 8 cores
TPC-C      | 7.20           | 1.80         | 0.23                      | 4.4
SPECJBB    | 5.60           | 1.40         | 0.18                      | 5.7
SPECWeb99  | 6.60           | 1.65         | 0.21                      | 4.8
Average thread status (fig 4.30)
Not Ready Breakdown (fig 4.31)
[Chart: fraction of cycles not ready (0%–100%) for TPC-C, SPECJBB and SPECWeb99, broken down into L2 miss, L1 D miss, L1 I miss, pipeline delay and Other.]
• Other = ?
  – TPC-C: store buffer full is the largest contributor
  – SPEC-JBB: atomic instructions are the largest contributor
  – SPECWeb99: both factors contribute
Performance Relative to Pentium D
[Chart: performance relative to Pentium D (0–6.5) for Power5+, Opteron and Sun T1 on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05 and TPC-like workloads.]
Performance/mm2, Performance/Watt
[Chart: efficiency normalized to Pentium D (0–5.5) for Power5+, Opteron and Sun T1, showing SPECIntRate, SPECFPRate, SPECJBB05 and TPC-C, each per mm^2 and per Watt.]
Cache Coherency and Memory Models

Review
● Does pipelining help instruction latency?
● Does pipelining help instruction throughput?
● What is Instruction Level Parallelism?
● What are the advantages of OoO machines?
● What are the disadvantages of OoO machines?
● What are the advantages of VLIW?
● What are the disadvantages of VLIW?
● What is an example of Data Spatial Locality?
● What is an example of Data Temporal Locality?
● What is an example of Instruction Spatial Locality?
● What is an example of Instruction Temporal Locality?
● What is a TLB?
● What is a packet switched network?
Memory Models (Memory Consistency)
Memory Model: The system supports a given model if operations on memory follow specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.
Huh??????
Sequential Consistency?
Simple Case
● Consider a simple two-processor system
● The two processors are coherent
● Programs running in parallel may communicate via memory addresses
● Special hardware is required in order to enable communication via memory addresses
● Shared memory addresses are the standard form of communication for parallel programming
[Diagram: CPU 0 and CPU 1 connected through an interconnect to memory.]
Simple Case
● CPU 0 wants to send a data word to CPU 1
● What does the code look like?
● Code on CPU 0 writes a value to an address
● Code on CPU 1 reads the address to get the new value
Simple Case

int shared_flag = 0;
int shared_value = 0;

void sender_thread()
{
    shared_value = 42;
    shared_flag = 1;
}

void receiver_thread()
{
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}

Global variables are shared when using pthreads. This means all threads within this process may access these variables.

The sender writes to the shared data, then sets a shared data flag that the receiver is polling.

The receiver is polling on the flag. When the flag is no longer zero, the receiver reads shared_value and prints it out.

Any Problems???
Simple CMP Cache Coherency
[Diagram: four CPUs (0–3), each with a private L1, connected through an interconnect to four L2 banks, each with a directory.]
● Four-core machine supporting cache coherency
● Each core has a local L1 data and instruction cache
● The L2 cache is shared amongst all cores, and physically distributed into 4 disparate banks
● The interconnect sends memory requests and responses back and forth between the caches
The Coherency Problem
[Diagram: CPU 0 issues Ld R1, X.]
● Misses in cache
● Goes to "home" L2 (home often determined by a hash of the address)
● If it misses at the home L2, read the data from memory
● Deposit the data in both the home L2 and the local L1

Mem(X) is now in both the L2 and ONE L1 cache
The Coherency Problem
[Diagram: CPU 3 issues Ld R2, X while CPU 0 still holds a copy of X.]
● CPU 3 reads the same address
● Miss in L1
● Sends request to L2
● Hits in L2
● Data is placed in the L1 cache for CPU 3
The Coherency Problem
[Diagram: CPU 0 issues Store R2, X while CPU 3 has a copy of X in its L1.]
● CPU 0 now STORES to address X

What happens?????

Special hardware is needed in order to either update or invalidate the data in CPU 3's cache.

● For this example, we will assume a directory-based invalidate protocol, with write-through L1 caches
The Coherency Problem
[Diagram: the store propagates to the home L2 bank; the directory entry for X first lists sharers 0 and 3, then only 0.]
● The store updates the local L1 and writes through to the L2
● At the L2, the directory is inspected, showing CPU 3 is sharing the line
● The data in CPU 3's cache is invalidated
● The L2 cache is updated with the new value
● The system is now "coherent"
● Note that CPU 3 was removed from the directory
Ordering
[Diagram: CPU 0 issues Store R1, X and Store R2, Y; the two stores travel separately through the interconnect toward different L2 banks.]
● Our protocol relies on stores writing through to the L2 cache
● If the stores are to different addresses, there are multiple points within the system where the stores may be reordered

The second store (to Y) leaves the network first! The stores are written to the shared L2 out-of-order (Y first, then X)!!!

The interconnect is not the only cause of out-of-order behavior:
● The processor core may issue instructions out-of-order (remember out-of-order machines??)
● The L2 pipeline may also reorder requests to different addresses
L2 Pipeline Ordering
[Diagram: requests arrive from the network into a pipeline — Resource Allocation and Conflict Detection → L2 Tag Access → L2 Data Access → Coherence Control — with a Retry FIFO feeding back into the front.]
● Two memory requests arrive on the network
● Requests are serviced in order
● Conflicts are sent to the retry FIFO
● The network is given priority
● The requests are now executing in a different order!
Simple Case (revisited)

int shared_flag = 0;
int shared_value = 0;

void sender_thread()
{
    shared_value = 42;
    shared_flag = 1;
}

void receiver_thread()
{
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited)

CPU 0:  shared_value = 42;        CPU 3:  while (shared_flag == 0) { }
        shared_flag = 1;                  new_value = shared_value;

[Diagram walkthrough on the four-core machine:]
● The receiver is spinning on shared_flag; shared_value has its reset value of 0
● The store to shared_value writes through the L1 and enters the network
● The store to shared_flag writes through the L1; both stores are now sitting in the network
● The store to shared_flag is the first to leave the network: shared_flag is updated at its home L2, and the coherence protocol invalidates the copy in CPU 3's cache
● The receiver, still polling, now misses in its cache and sends a request to the L2
● The response comes back: the flag is set, so it is time to read shared_value
● Note that the write to shared_value is still sitting in the network!
● The receiver reads shared_value and gets the stale value 0
● The write of 42 to shared_value finally escapes the network, but it is TOO LATE!

Our code doesn't always work! WTF???

The architecture needs to expose ordering properties to the programmer, so that the programmer may write correct code. This is called the "Memory Model".
Sequential Consistency
Hardware GUARANTEES that all memory operations are ordered globally.
● Benefits
  ● Simplifies programming (our initial code would have worked)

● Costs
  ● Hard to implement micro-architecturally
  ● Can hurt performance
  ● Hard to verify
Weak Consistency
Loads and stores to different addresses may be re-ordered
● Benefits
  ● Much easier to implement and build
  ● Higher performing
  ● Easy to verify

● Costs
  ● More complicated for the programmer
  ● Requires special “ordering” instructions for synchronization
Instructions for Weak Memory Models
● Write Barrier
  ● Don't issue a write until all preceding writes have completed

● Read Barrier
  ● Don't issue a read until all preceding reads have completed

● Memory Barrier
  ● Don't issue a memory operation until all preceding memory operations have completed

● Etc., etc.
Simple Case (write barrier)

int shared_flag = 0;
int shared_value = 0;

void sender_thread()              /* runs on CPU 0 */
{
    shared_value = 42;
    __write_barrier();
    shared_flag = 1;
}

void receiver_thread()            /* runs on CPU 1 */
{
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited, with write barrier)

CPU 0 (sender):                      CPU 3 (receiver):
    shared_value = 42;                   while (shared_flag == 0) { }
    __write_barrier();                   new_value = shared_value;
    shared_flag = 1;

[Slide animation, same machine as before. The frames step through the following sequence:]

1. The receiver is spinning on “shared_flag”; “shared_value” still holds its reset value of 0.
2. The store to “shared_value” writes through CPU 0's L1.
3. The write barrier prevents issue of “shared_flag = 1” until “shared_value = 42” is complete. Completion is tracked via acknowledgments, so the store to “shared_flag” is blocked.
4. The write to “shared_value” eventually leaves the network and is acknowledged; the store to “shared_flag” is still blocked until the acknowledgment arrives.
5. The barrier is now complete! The store to “shared_flag” writes through L1 and leaves the network.
6. “shared_flag” is updated, and the coherence protocol invalidates the copy in CPU 3's cache.
7. The receiver that is polling now misses in the cache and sends a request to L2.
8. The response comes back. The flag is now set, so it is time to read “shared_value”, which now holds 42.

Correct Code!!!
What about reads.....
Weak or Strong?
● The academic community pushed hard for sequential consistency:

“Multiprocessors Should Support Simple Memory Consistency Models”, Mark Hill, IEEE Computer, August 1998

WRONG!!!

Most new architectures support relaxed memory models (ARM, IA-64, TILE, etc.). They are much easier to implement and verify. This is not a programming issue, because the complexity is hidden behind a library, and 99.9% of programmers don't have to worry about these issues!
Break Problem

You are one of P recently arrested prisoners. The warden makes the following announcement:
"You may meet together today and plan a strategy, but after today you will be in isolated cells and have no communication with one another. I have set up a "switch room" which contains a light switch, which is either on or off. The switch is not connected to anything. Every now and then, I will select one prisoner at random to enter the "switch room". This prisoner may throw the switch (from on to off, or vice-versa), or may leave the switch unchanged. Nobody else will ever enter this room. Each prisoner will visit the switch room arbitrarily often. More precisely, for any N, eventually each of you will visit the switch room at least N times. At any time, any of you may declare: "we have all visited the switch room at least once." If the claim is correct, I will set you free. If the claim is incorrect, I will feed all of you to the sharks."
Devise a winning strategy when you know that the initial state of the switch is off. Hint: not all prisoners need to do the same thing.
TDT4260
Introduction to Green Computing
Asymmetric Multicore Processors
Alexandru Iordan
Introduction to Green Computing
• What do we mean by Green Computing?
• Why Green Computing?
• Measuring “greenness”
• Research into energy consumption reduction
What do we mean by Green Computing?
What do we mean by Green Computing?
The green computing movement is a multifaceted global effort to reduce energy consumption and to promote sustainable development in the IT world.
[Patrick Kurp, “Green Computing”, Communications of the ACM, 2008]
Why Green Computing?
• Heat dissipation problems
• High energy bills
• Growing environmental impact
Measuring “greenness”
• Non-standard metrics
  – Energy (Joules)
  – Power (Watts)
  – Energy-per-instruction ( Joules / No. instructions )
  – Energy-delay^N product ( Joules * seconds^N )
  – Performance^N / Watt ( (No. instructions / second)^N / Watt )

• Standard metrics
  – Data centers: Power Usage Effectiveness (PUE) metric (The Green Grid consortium)
  – Servers: ssj_ops / Watt metric (SPEC consortium)
Research into energy consumption reduction
“Maximizing Power Efficiency with Asymmetric Multicore Systems”
Fedorova et al., Communications of the ACM, 2009
• Outline
– Asymmetric multicore processors
– Scheduling for parallel and serial applications
– Scheduling for CPU- and memory-intensive applications
Asymmetric multicore processors
• What makes a multicore asymmetric?
  – a few powerful cores (high clock freq., complex pipelines, OoO execution)
  – many simple cores (low clock freq., simple pipeline, low power requirement)

• Homogeneous-ISA AMP
  – the same binary code can run on both types of cores

• Heterogeneous-ISA AMP
  – code compiled separately for each type of core
  – examples: IBM Cell, Intel Larrabee
Efficient utilization of AMPs
• Efficient mapping of threads/workloads
  – parallel applications
    • serial part → complex cores
    • scalable parallel part → simple cores
  – microarchitectural characteristics of workloads
    • CPU-intensive applications → complex cores
    • memory-intensive applications → simple cores
Sequential vs. parallel characteristics
• Sequential programs
  – high degree of ILP
  – can utilize features of a complex core (super-scalar pipeline, OoO execution, complex branch prediction)

• Parallel programs
  – high number of parallel threads/tasks (compensates for low ILP and masks memory delays)

• Having both complex and simple cores gives AMPs applicability to a wider range of applications
Parallelism-aware scheduling
• Goal: improve overall system efficiency (not the performance of a particular application)
• Idea: assign sequential applications/phases to run on the complex cores
• Does NOT provide fairness
Challenges of PA scheduling
• Detecting serial and parallel phases
  – limited scalability of threads can yield wrong solutions

• Thread migration overhead
  – migration across memory domains is expensive
  – the scheduler must be topology-aware
“Heterogeneity”-aware scheduling
• Goal: improve overall system efficiency
• Idea:
  – CPU-intensive applications/phases → complex cores
  – memory-intensive applications/phases → simple cores

• Inherently unfair
Challenges of HA scheduling
• Classifying threads/phases as CPU- or memory-bound
  – two approaches presented: direct measurement and modeling

• Long execution time (direct measurement approach) or need for offline information (modeling approach)
Summary
• Green Computing focuses on improving energy-efficiency and sustainable development in the IT world
• AMPs promise higher energy-efficiency than symmetric processors
• Schedulers must be designed to take advantage of the asymmetric hardware
References
• Kirk W. Cameron, “The Road to Greener IT Pastures”, IEEE Computer, 2009
• Dan Herrick and Mark Ritschard, “Greening Your Computing Technology, the Near and Far Perspectives”, Proceedings of the 37th ACM SIGUCCS, 2009
• Luiz A. Barroso, “The Price of Performance”, ACM Queue, 2005
NTNU HPC Infrastructure
IBM AIX Power5+, CentOS AMD Istanbul

Jørn Amundsen
IDI / NTNU IT
2011-03-25
www.ntnu.no Jørn Amundsen, NTNU IT
Contents
1 Njord Power5+ hardware
2 Kongull AMD Istanbul hardware
3 Resource Managers
4 Documentation
Power5+ hardware
Cache and memory
Chip layout
System level
Cache and memory
• 16 x 64-bit-word cache lines (32 in L3)
• Hardware cache-line prefetch on loads
• Reads from memory are written into L2
• External L3 acts as a victim cache for L2
• L2 and L3 are shared between cores
• L1 is write-through
• Cache coherence is maintained system-wide at the L2 level
• 4K page size by default; the kernel supports 64K and 16M pages
Chip design

[Figure: Power5+ chip block diagram. Each of the two cores has decode & schedule logic, 64-bit registers (32 GPR, 32 FPR) and execution units (2 FXU, 2 FPU, 1 BXU, 1 CRL, 2 LSU), backed by a 64K 2-way L1 I-cache and a 32K 4-way L1 D-cache. The cores share a 1.92M 10-way L2 cache, connected through the memory controller to a 36M 12-way external L3 cache and 16-128 GB of DDR2 main memory. The switch fabric provides 35.2 GB/s; the memory interface 25.6 GB/s.]
SMT

• In a concrete application, the processor core might be idle 50-80% of the time, waiting for memory
• An obvious solution would be to let another thread execute while our thread is waiting for memory
• This is known as hyper-threading in the Intel/AMD world, and Simultaneous Multithreading (SMT) with IBM
• SMT is supported in hardware throughout the processor core
• SMT is more efficient than hyper-threading, with less context-switch overhead
• Power5 and Power6 support 1 thread/core or SMT with 2 threads/core, while the latest Power7 supports 4 threads/core
• SMT is enabled or disabled dynamically on a node with the (privileged) command smtctl
SMT (2)
• SMT is beneficial if you are doing a lot of memory references, and your application performance is memory-bound
• Enabling SMT doubles the number of MPI tasks per node, from 16 to 32. This requires your application to be sufficiently scalable.
• SMT is only available in user space with batch processing, by adding the structured comment string:

#@ requirements = ( Feature == "SMT" )
Chip module packaging
• 4 chips and 4 L3 caches are HW integrated onto an MCM
• 90.25 cm², 89 layers of metal
The system level
• On a p575 system, a node is 2 MCMs / 8 chips / 16 1.9 GHz cores

• The Njord system is
  - 2 x 16-way 32 GiB login nodes
  - 4 x 16-way 16 GiB I/O nodes (used with GPFS)
  - 186 x 16-way 32 GiB compute nodes
  - 6 x 16-way 128 GiB compute nodes

• GPFS parallel file system, 33 TiB fiber disks + 62 TiB SATA disks

• Interconnect
  - IBM Federation, a multistage crossbar network providing 2 GiB/s bidirectional bandwidth and 5 µs latency for system-wide MPI
GPFS
• An important feature of an HPC system is the capability of moving large amounts of data from or to memory, across nodes, and from or to permanent storage
• In this respect a high-quality, high-performance global file system is essential
• GPFS is a robust parallel FS geared at high-bandwidth I/O, used extensively in HPC and in the database industry
• Disk access is ≈ 1000 times slower than memory access, hence the key factors for performance are
  - spreading (striping) files across many disk units
  - using memory to cache files
  - hiding latencies in software
GPFS and parallel I/O (2)
• High transfer rates are achieved by distributing file blocks round-robin across a large number of disk units, up to thousands of disks
• On Njord, the GPFS block size and stripe unit is 1 MB
• In addition to multiple disks servicing file I/O, multiple threads may read, write or update (R+W) a file simultaneously
• GPFS uses multiple I/O servers (4 dedicated nodes on Njord) working in parallel for performance, while maintaining file and file-metadata consistency
• High performance comes at a cost: although GPFS can handle directories with millions of files, it is usually best to use fewer and larger files, and to access files in larger chunks
File buffering
• The kernel does read-aheads and write-behinds of file blocks
• The kernel applies heuristics to I/O to discover sequential and strided forward and backward reads
• The disadvantage is memory copying of all data
• Can be bypassed with DIRECT_IO, which can be useful with large (MB-sized) I/O, utilizing application I/O patterns

[Figure: I/O path from the application buffer in user space, through the kernel file-system buffer, to the disk subsystem.]
AMD Istanbul hardware
Cache and memory
System level
Cache and memory
• 6 x 128 KiB L1 cache
• 6 x 512 KiB L2 cache
• 1 x 6 MiB L3 cache
• 24 or 48 GiB DDR3 RAM
The system level

• A node is 2 chips / 12 2.4 GHz cores
• The Kongull system is
  - 1 x 12-way 24 GiB login node
  - 4 x 12-way 24 GiB I/O nodes (used with GPFS)
  - 52 x 12-way 24 GiB compute nodes
  - 44 x 12-way 48 GiB compute nodes
• Nodes compute-0-0 – compute-0-39 and compute-1-0 – compute-1-11 are 24 GiB @ 800 MHz, while compute-1-12 – compute-1-15 and compute-2-0 – compute-2-39 are 48 GiB @ 667 MHz bus frequency

• GPFS parallel file system, 73 TiB

• Interconnect
  - A fat tree implemented with HP ProCurve switches: 1 Gb from node to rack switch, then 10 Gb from the rack switch to the top-level switch. Bandwidth and latency are left as a programming exercise.
Resource Managers
Resource Managers
Njord classes
Kongull queues
Resource Managers

• Need efficient (and fair) utilization of the large pool of resources
• This is the domain of queueing (batch) systems, or resource managers
• A resource manager administers the execution of (computational) jobs and provides resource accounting across users and accounts
• This includes distribution of parallel (OpenMP/MPI) threads/processes across physical cores and gang scheduling of parallel execution
• Jobs are Unix shell scripts with batch-system keywords embedded within structured comments
• Both Njord and Kongull employ a series of queues (classes) administering various sets of possibly overlapping nodes with possibly different priorities
• IBM LoadLeveler on Njord, Torque (a development of OpenPBS) on Kongull
Njord job class overview
class      min-max nodes   max nodes/job   max runtime   description
forecast   1-180           180             unlimited     top-priority class dedicated to forecast jobs
bigmem     1-6             4               7 days        high-priority 115 GB memory class
large      4-180           128             21 days       high-priority class for jobs of 64 processors or more
normal     1-52            42              21 days       default class
express    1-186           4               1 hour        high-priority class for debugging and test runs
small      1/2             1/2             14 days       low-priority class for serial or small SMP jobs
optimist   1-186           48              unlimited     checkpoint-restart jobs
Njord job class overview (2)
• Forecast is the highest-priority queue; it suspends everything else

• Beware: node memory (except bigmem) is split in 2, to guarantee available memory for forecast jobs

• A C-R job runs at the very lowest priority; any other job will terminate and requeue an optimist-queue job if not enough nodes are available
• Optimist-class jobs need an internal checkpoint-restart mechanism
• AIX LoadLeveler imposes node job memory limits; e.g., jobs oversubscribing available node memory are aborted with an email
LoadLeveler sample jobscript
# @ job_name = hybrid_job
# @ account_no = ntnuXXX
# @ job_type = parallel
# @ node = 3
# @ tasks_per_node = 8
# @ class = normal
# @ ConsumableCpus(2) ConsumableMemory(1664mb)
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ queue

export OMP_NUM_THREADS=2
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER/test
if [ ! -d $w ]; then mkdir -p $w; fi
cd $w
$HOME/a.out
llq -w $LOADL_STEP_ID

exit 0
LoadLeveler sample C-R email (1/2)
Date: Mon, 21 Mar 2011 18:31:37 +0100
From: [email protected]
To: [email protected]
Subject: z2rank_s_5

From: LoadLeveler

LoadLeveler Job Step: f05n02io.791345.0
Executable: /home/ntnu/joern/run/z2rank/logs/skipped/z2rank_s_5.job
Executable arguments:
State for machine: f14n06
LoadL_starter: The program, z2rank_s_5.job, exited normally and returned an exit code of 0.
State for machine: f09n06
State for machine: f13n04
State for machine: f14n04
State for machine: f08n06
State for machine: f12n06
State for machine: f15n07
State for machine: f18n04
LoadLeveler sample C-R email (2/2)
This job step was dispatched to run 18 time(s).
This job step was rejected by Starter 0 time(s).
Submitted at: Mon Mar 21 10:02:56 2011
Started at:   Mon Mar 21 18:16:59 2011
Exited at:    Mon Mar 21 18:31:37 2011

Real Time:             0 08:28:41
Job Step User Time:   16 06:34:29
Job Step System Time:  0 00:21:15
Total Job Step Time:  16 06:55:44

Starter User Time:     0 00:00:19
Starter System Time:   0 00:00:09
Total Starter Time:    0 00:00:28
Kongull job queue overview
class      min-max nodes   max nodes/job   max runtime   description
default    1-52            52              35 days       default queue except IPT, SFI IO and Sintef Petroleum
express    1-96            96              1 hour        high-priority queue for debugging and test runs
bigmem     1-44            44              7 days        default queue for IPT, SFI IO and Sintef Petroleum
optimist   1-96            48              28 days       checkpoint-restart jobs

• Oversubscribing node physical memory crashes the node

• This might happen if you do not specify the following in your job script:

#PBS -lnodes=1:ppn=12

• If all nodes are not reserved, the batch system will attempt to share nodes by default
Documentation
Njord User Guide
http://docs.notur.no/ntnu/njord-ibm-power-5

Notur load stats
http://www.notur.no/hardware/status/

Kongull support wiki
http://hpc-support.idi.ntnu.no/

Kongull load stats
http://kongull.hpc.ntnu.no/ganglia/
TDT4260 Computer Architecture
Mini-Project Guidelines
Alexandru Ciprian [email protected]
January 10, 2011
1 Introduction
The Mini-Project accounts for 20% of the final grade in TDT4260 Computer Architecture. Your task is to develop and evaluate a prefetcher using the M5 simulator. M5 is currently one of the most popular simulators for computer architecture research and has a rich feature set. Consequently, it is a very complex piece of software. To make your task easier, we have created a simple interface to the memory system that you can use to develop your prefetcher. Furthermore, you can evaluate your prefetchers by submitting your code via a web interface. This web interface runs your code on the Kongull cluster with the default simulator setup. It is also possible to experiment with other parameters, but then you will have to run the simulator yourself. The web interface, the modified M5 simulator and more documentation can be found at http://dm-ark.idi.ntnu.no/.
The Mini-Project is carried out in groups of 2 to 4 students. In some cases we will allow students to work alone. You will be graded based on both a written paper and a short oral presentation.
Make sure you clearly cite the source of information, data and figures. Failure to do so is regarded as cheating and is handled according to NTNU guidelines. If you have any questions, send an e-mail to teaching assistant Alexandru Ciprian Iordan ([email protected]).
1.1 Mini-Project Goals
The Mini-Project has the following goals:
• Many computer architecture topics are best analyzed by experiments and/or detailed studies. The Mini-Project should provide training in such exercises.

• Writing about a topic often increases the understanding of it. Consequently, we require that the result of the Mini-Project is a scientific paper.
2 Practical Guidelines
2.1 Time Schedule and Deadlines
The Mini-Project schedule is shown in Table 1. If these deadlines collide with deadlines in other subjects, we suggest that you consider handing in the Mini-Project earlier than the deadline. If you miss the final deadline, this will reduce the maximum score you can be awarded.
Deadline                        Description
Friday 21. January              List of group members delivered to Alexandru Ciprian Iordan ([email protected]) by e-mail
Friday 4. March                 Short status report and an outline of the final report delivered to Alexandru Ciprian Iordan ([email protected]) by e-mail
Friday 8. April 12:00 (noon)    Final paper deadline. Deliver the paper through It's Learning. Detailed report layout requirements can be found in section 2.2.
Week 15 (11. - 15. April)       Compulsory 10-minute oral presentations

Table 1: Mini-Project Deadlines
2.2 Paper Layout
The paper must follow the IEEE Transactions style guidelines available here:
http://www.ieee.org/publications_standards/publications/authors/authors_journals.html#sect2
Both Latex and Word templates are available, but we recommend that you use Latex. The paper must use a maximum of 8 pages. Failure to comply with these requirements will reduce the maximum score you can be awarded.
In addition, we will deduct points if:
• The paper does not have a proper scientific structure. All reports must contain the following sections: Abstract, Introduction, Related Work or Background, Prefetcher Description, Methodology, Results, Discussion and Conclusion. You may rename the “Prefetcher Description” section to a more descriptive title. Acknowledgements and Author biographies are optional.

• Citations are not used correctly. If you use a figure that somebody else has made, a citation must appear in the figure text.

• Plagiarism is detected. NTNU has acquired an automated system that checks for plagiarism. We may run this system on your papers, so make sure you write all text yourself.
2.3 Evaluation
The Mini-Project accounts for 20% of the total grade in TDT4260 Computer Architecture. Within the Mini-Project, the report counts 80% and the oral presentation 20%.
The report grade will be based on the following criteria:
• Language and use of figures
• Clarity of the problem statement
• Overall document structure
• Depth of understanding of the field of computer architecture
• Depth of understanding of the investigated problem
The oral presentation grade will be based on following criteria:
• Presentation structure
• Quality and clarity of the slides
• Presentation style
• If you use more than the provided time, you will lose points.
M5 simulator system
TDT4260 Computer Architecture
User documentation
Last modified: November 23, 2010
Contents
1 Introduction 2
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Chapter outlines . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Installing and running M5 4
2.1 Download . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 VirtualBox disk image . . . . . . . . . . . . . . . . . . 5
2.3 Build . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4.1 CPU2000 benchmark tests . . . . . . . . . . . . . . . . 6
2.4.2 Running M5 with custom test programs . . . . . . . . 7
2.5 Submitting the prefetcher for benchmarking . . . . . . . . . . 8
3 The prefetcher interface 9
3.1 Memory model . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Interface specification . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Using the interface . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.1 Example prefetcher . . . . . . . . . . . . . . . . . . . . 13
4 Statistics 14
5 Debugging the prefetcher 16
5.1 m5.debug and trace flags . . . . . . . . . . . . . . . . . . . . . 16
5.2 GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.3 Valgrind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 1
Introduction
You are now going to write your own hardware prefetcher, using a modified version of M5, an open-source hardware simulator system. This modified version presents a simplified interface to M5's cache, allowing you to concentrate on a specific part of the memory hierarchy: a prefetcher for the second-level (L2) cache.
1.1 Overview
This documentation covers the following:
• Installing and running the simulator
• Machine model and memory hierarchy
• Prefetcher interface specification
• Using the interface
• Testing and debugging the prefetcher on your local machine
• Submitting the prefetcher for benchmarking
• Statistics
1.2 Chapter outlines
The first chapter gives a short introduction, and contains an outline of the documentation.
The second chapter starts with the basics: how to install the M5 simulator. There are two possible ways to install and use it. The first is as a stand-alone VirtualBox disk image, which requires the installation of VirtualBox. This is the best option for those who use Windows as their operating system of choice. For Linux enthusiasts, there is also the option of downloading a tarball and installing a few required software packages.

The chapter then continues to walk you through the necessary steps to get M5 up and running: building from source, running with command-line options that enable prefetching, running local benchmarks, compiling and running custom test programs, and finally, how to submit your prefetcher for testing on a computing cluster.

The third chapter gives an overview of the simulated system, and describes its memory model. There is also a detailed specification of the prefetcher interface, and tips on how to use it when writing your own prefetcher. It includes a very simple example prefetcher with extensive comments.

The fourth chapter contains definitions of the statistics used to quantitatively measure prefetchers.

The fifth chapter gives details on how to debug prefetchers using advanced tools such as GDB and Valgrind, and how to use trace flags to get detailed debug printouts.
Chapter 2
Installing and running M5
2.1 Download
Download the modified M5 simulator from the PfJudgeβ website.
2.2 Installation
2.2.1 Linux
Software requirements (specific Debian/Ubuntu packages mentioned in parentheses):
• g++ >= 3.4.6
• Python and libpython >= 2.4 (python and python-dev)
• Scons > 0.98.1 (scons)
• SWIG >= 1.3.31 (swig)
• zlib (zlib1g-dev)
• m4 (m4)
To install all required packages in one go, issue instructions to apt-get:
sudo apt-get install g++ python-dev scons swig zlib1g-dev m4
The simulator framework comes packaged as a gzipped tarball. Start the adventure by unpacking with tar xvzf framework.tar.gz. This will create a directory named framework.
2.2.2 VirtualBox disk image
If you do not have convenient access to a Linux machine, you can download a virtual machine with M5 preconfigured. You can run the virtual machine with VirtualBox, which can be downloaded from http://www.virtualbox.org.
The virtual machine is available as a zip archive from the PfJudgeβ website. After unpacking the archive, you can import the virtual machine into VirtualBox by selecting “Import Appliance” in the file menu and opening “Prefetcher framework.ovf”.
2.3 Build
M5 uses the scons build system:

scons -j2 ./build/ALPHA_SE/m5.opt

builds the optimized version of the M5 binaries.
-j2 specifies that the build process should build two targets in parallel. This is a useful option to cut down on compile time if your machine has several processors or cores.
The included build script compile.sh encapsulates the necessary build commands and options.
2.4 Run
Before running M5, it is necessary to specify the architecture and parameters for the simulated system. This is a nontrivial task in itself. Fortunately there is an easy way: use the included example Python script for running M5 in syscall emulation mode, m5/configs/example/se.py. When using a prefetcher with M5, this script needs some extra options, described in Table 2.1.
For an overview of all possible options to se.py, do
./build/ALPHA_SE/m5.opt configs/example/se.py --help
When combining all these options, the command line will look something like this:
./build/ALPHA_SE/m5.opt configs/example/se.py --detailed
--caches --l2cache --l2size=1MB --prefetcher=policy=proxy
--prefetcher=on_access=True
This command will run se.py with a default program, which prints out “Hello, world!” and exits. To run something more complicated, use the
Option                        Description
--detailed                    Detailed timing simulation
--caches                      Use caches
--l2cache                     Use level two cache
--l2size=1MB                  Level two cache size
--prefetcher=policy=proxy     Use the C-style prefetcher interface
--prefetcher=on_access=True   Have the cache notify the prefetcher on all
                              accesses, both hits and misses
--cmd                         The program (an Alpha binary) to run

Table 2.1: Basic se.py command line options.
--cmd option to specify another program. See subsection 2.4.2 about cross-compiling binaries for the Alpha architecture. Another possibility is to run a benchmark program, as described in the next section.
2.4.1 CPU2000 benchmark tests
The test_prefetcher.py script can be used to evaluate the performance of your prefetcher against the SPEC CPU2000 benchmarks. It runs a selected suite of CPU2000 tests with your prefetcher, and compares the results to some reference prefetchers.
The per-test statistics that M5 generates are written to output/<testname-prefetcher>/stats.txt. The statistics most relevant for hardware prefetching are then filtered and aggregated to a stats.txt file in the framework base directory.
See chapter 4 for an explanation of the reported statistics.
Since programs often do some initialization and setup on startup, a sample from the start of a program run is unlikely to be representative for the whole program. It is therefore desirable to begin the performance tests after the program has been running for some time. To save simulation time, M5 can resume a program state from a previously stored checkpoint. The prefetcher framework comes with checkpoints for the CPU2000 benchmarks taken after 10^9 instructions.
It is often useful to run a specific test to reproduce a bug. To run the CPU2000 tests outside of test_prefetcher.py, you will need to set the M5_CPU2000 environment variable. If this is set incorrectly, M5 will give the error message “Unable to find workload”. To export this as a shell variable, do
export M5_CPU2000=lib/cpu2000
Near the top of test_prefetcher.py there is a commented-out call to dry_run(). If this is uncommented, test_prefetcher.py will print the command line it would use to run each test. This will typically look like this:
m5/build/ALPHA_SE/m5.opt --remote-gdb-port=0 -re
--outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
This uses some additional command line options; these are explained in Table 2.2.
Option                      Description
--bench=ammp                Run one of the SPEC CPU2000 benchmarks.
--checkpoint-dir=lib/cp     The directory where program checkpoints are stored.
--at-instruction            Restore at an instruction count.
--checkpoint-restore=n      The instruction count to restore at.
--standard-switch           Warm up caches with a simple CPU model, then switch
                            to an advanced model to gather statistics.
--warmup-insts=n            Number of instructions to run warmup for.
--max-inst=n                Exit after running this number of instructions.

Table 2.2: Advanced se.py command line options.
2.4.2 Running M5 with custom test programs
If you wish to run your self-written test programs with M5, it is necessary to cross-compile them for the Alpha architecture. The easiest way to achieve this is to download the precompiled compiler binaries provided by crosstool from the M5 website. Install the one that fits your host machine best (32 or 64 bit version). When cross-compiling your test program, you must use the -static option to enforce static linkage.
To run the cross-compiled Alpha binary with M5, pass it to the script with the --cmd option. Example:
./build/ALPHA_SE/m5.opt configs/example/se.py --detailed
--caches --l2cache --l2size=512kB --prefetcher=policy=proxy
--prefetcher=on_access=True --cmd /path/to/testprogram
2.5 Submitting the prefetcher for benchmarking
First of all, you need a user account on the PfJudgeβ web pages. The teaching assistant in TDT4260 Computer Architecture will create one for you. You must also be assigned to a group to submit prefetcher code or view earlier submissions.
Sign in with your username and password, then click “Submit prefetcher” in the menu. Select your prefetcher file, and optionally give the submission a name. This is the name that will be shown in the highscore list, so choose with care. If no name is given, it defaults to the name of the uploaded file. If you check “Email on complete”, you will receive an email when the results are ready. This could take some time, depending on the cluster’s current workload.
When you click “Submit”, a job will be sent to the Kongull cluster, which then compiles your prefetcher and runs it with a subset of the CPU2000 tests. You are then shown the “View submissions” page, with a list of all your submissions, the most recent at the top.
When the prefetcher is uploaded, the status is “Uploaded”. As soon as it is sent to the cluster, it changes to “Compiling”. If it compiles successfully, the status will be “Running”. If your prefetcher does not compile, the status will be “Compile error”; check “Compilation output”, found under the detailed view.
When the results are ready, the status will be “Completed”, and a score will be given. The highest scoring prefetcher for each group is listed on the highscore list, found under “Top prefetchers” in the menu. Click on the prefetcher name to go to a more detailed view, with per-test output and statistics.
If the prefetcher crashes on some or all tests, the status will be “Runtime error”. To locate the failed tests, check the detailed view. You can take a look at the output from the failed tests by clicking on the “output” link found after each test statistic.
To allow easier exploration of different prefetcher configurations, it is possible to submit several prefetchers at once, bundled into a zipped file. Each .cc file in the archive is submitted independently for testing on the cluster. The submission is named after the compressed source file, possibly prefixed with the name specified in the submission form.
There is a limit of 50 prefetchers per archive.
Chapter 3
The prefetcher interface
3.1 Memory model
The simulated architecture is loosely based on the DEC Alpha Tsunami system, specifically the Alpha 21264 microprocessor. This is a superscalar, out-of-order (OoO) CPU which can reorder a large number of instructions and do speculative execution.
The L1 cache is split into a 32kB instruction cache and a 64kB data cache. Each cache block is 64B. The L2 cache size is 1MB, also with a cache block size of 64B. The L2 prefetcher is notified on every access to the L2 cache, both hits and misses. There is no prefetching for the L1 cache.
The memory bus runs at 400MHz, is 64 bits wide, and has a latency of 30ns.
3.2 Interface specification
The interface the prefetcher will use is defined in a header file located at prefetcher/interface.hh. To use the prefetcher interface, you should include interface.hh by putting the line #include "interface.hh" at the top of your source file.
#define             Value     Description
BLOCK_SIZE          64        Size of cache blocks (cache lines) in bytes
MAX_QUEUE_SIZE      100       Maximum number of pending prefetch requests
MAX_PHYS_MEM_SIZE   2^28 − 1  The largest possible physical memory address

Table 3.1: Interface #defines.
NOTE: All interface functions that take an address as a parameter block-align the address before issuing requests to the cache.
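The alignment described in the note can be sketched as follows. This is a minimal illustration, not code from the framework: the Addr typedef and the block_align helper are stand-ins defined here so the snippet is self-contained (the real Addr type and BLOCK_SIZE come from interface.hh).

```cpp
#include <cstdint>

typedef uint64_t Addr;              // stand-in for M5's Addr type
static const Addr BLOCK_SIZE = 64;  // cache block size from Table 3.1

// Block-align an address the way the interface functions do before
// issuing requests to the cache (assumes BLOCK_SIZE is a power of two).
static Addr block_align(Addr addr)
{
    return addr & ~(BLOCK_SIZE - 1);
}
```

Because all interface functions align addresses like this, two addresses inside the same 64B block refer to the same cache line as far as the cache is concerned.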
Function                               Description
void prefetch_init(void)               Called before any memory access to let the
                                       prefetcher initialize its data structures
void prefetch_access(AccessStat stat)  Notifies the prefetcher about a cache access
void prefetch_complete(Addr addr)      Notifies the prefetcher about a prefetch load
                                       that has just completed

Table 3.2: Functions called by the simulator.
Function                            Description
void issue_prefetch(Addr addr)      Called by the prefetcher to initiate a prefetch
int get_prefetch_bit(Addr addr)     Is the prefetch bit set for addr?
int set_prefetch_bit(Addr addr)     Set the prefetch bit for addr
int clear_prefetch_bit(Addr addr)   Clear the prefetch bit for addr
int in_cache(Addr addr)             Is addr currently in the L2 cache?
int in_mshr_queue(Addr addr)        Is there a prefetch request for addr in the
                                    MSHR (miss status holding register) queue?
int current_queue_size(void)        Returns the number of queued prefetch requests
void DPRINTF(trace, format, ...)    Macro to print debug information. trace is a
                                    trace flag (HWPrefetch), and format is a
                                    printf format string.

Table 3.3: Functions callable from the user-defined prefetcher.
AccessStat member   Description
Addr pc             The address of the instruction that caused the access
                    (program counter)
Addr mem_addr       The memory address that was requested
Tick time           The simulator time cycle when the request was sent
int miss            Whether this demand access was a cache hit or miss

Table 3.4: AccessStat members.
The prefetcher must implement the three functions prefetch_init, prefetch_access and prefetch_complete. The implementation may be empty.
The function prefetch_init(void) is called at the start of the simulation to allow the prefetcher to initialize any data structures it will need.
When the L2 cache is accessed by the CPU (through the L1 cache), the function void prefetch_access(AccessStat stat) is called with an argument (AccessStat stat) that gives various information about the access.
When the prefetcher decides to issue a prefetch request, it should call issue_prefetch(Addr addr), which queues up a prefetch request for the block containing addr.
When a cache block that was requested by issue_prefetch arrives from memory, prefetch_complete is called with the address of the completed request as parameter.
Prefetches issued by issue_prefetch(Addr addr) go into a prefetch request queue. The cache will issue requests from the queue when it is not fetching data for the CPU. This queue has a fixed size (available as MAX_QUEUE_SIZE), and when it gets full, the oldest entry is evicted. If you want to check the current size of this queue, use the function current_queue_size(void).
3.3 Using the interface
Start by studying interface.hh. This is the only M5-specific header file you need to include in your source file. You might want to include standard header files for things like printing debug information and memory allocation. Have a look at the supplied example prefetcher (a very simple sequential prefetcher) to see what it does.
If your prefetcher needs to initialize something, prefetch_init is the place to do so. If not, just leave the implementation empty.
You will need to implement the prefetch_access function, which the cache calls when accessed by the CPU. This function takes an argument, AccessStat stat, which supplies information from the cache: the address of the executing instruction that accessed the cache, what memory address was accessed, the cycle tick number, and whether the access was a cache miss. The block size is available as BLOCK_SIZE. Note that you probably will not need all of this information for a specific prefetching algorithm.
If your algorithm decides to issue a prefetch request, it must call the issue_prefetch function with the address to prefetch from as argument. The cache block containing this address is then added to the prefetch request queue. This queue has a fixed limit of MAX_QUEUE_SIZE pending prefetch requests. Unless your prefetcher is using a high degree of prefetching, the number of outstanding prefetches will stay well below this limit.
Every time the cache has loaded a block requested by the prefetcher, prefetch_complete is called with the address of the loaded block.
Other functionality available through the interface includes the functions for getting, setting and clearing the prefetch bit. Each cache block has one such tag bit. You are free to use this bit as you see fit in your algorithms. Note that this bit is not automatically set when a block has been prefetched; it has to be set manually by calling set_prefetch_bit. Calling set_prefetch_bit on an address that is not in cache has no effect, and get_prefetch_bit on an address that is not in cache will always return false.
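As a sketch of one possible use of the tag bit: mark blocks as they arrive from a prefetch, and count the first demand hit on each marked block as a good prefetch. The map-backed get/set/clear functions below are hypothetical stand-ins for the interface functions of the same names, included only so the sketch is self-contained, and on_demand_access is an invented name standing for the relevant fragment of prefetch_access.

```cpp
#include <cstdint>
#include <map>

/* Hypothetical stand-ins for the interface's prefetch-bit functions,
 * backed by a simple map so the sketch compiles on its own. */
typedef uint64_t Addr;
static std::map<Addr, int> bit_stub;
static int  get_prefetch_bit(Addr a)   { return bit_stub.count(a) ? bit_stub[a] : 0; }
static void set_prefetch_bit(Addr a)   { bit_stub[a] = 1; }
static void clear_prefetch_bit(Addr a) { bit_stub[a] = 0; }

static long good_prefetches = 0;

/* Called when a block requested by the prefetcher has been loaded:
 * tag it so later demand accesses can recognize it. */
void prefetch_complete(Addr addr)
{
    set_prefetch_bit(addr);
}

/* Fragment of prefetch_access: count the first demand access that
 * touches a tagged block, then clear the bit so it counts once. */
void on_demand_access(Addr addr)
{
    if (get_prefetch_bit(addr)) {
        good_prefetches++;
        clear_prefetch_bit(addr);
    }
}
```

This kind of bookkeeping mirrors the good-prefetch statistic defined in chapter 4, but nothing in the framework requires the bit to be used this way.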
When you are ready to write code for your prefetching algorithm of choice, put it in prefetcher/prefetcher.cc. When you have several prefetchers, you may want to make prefetcher.cc a symlink.
The prefetcher is statically compiled into M5. After prefetcher.cc has been changed, recompile with ./compile.sh. No options are needed.
3.3.1 Example prefetcher
/*
* A sample prefetcher which does sequential one-block lookahead.
* This means that the prefetcher fetches the next block _after_ the one that
* was just accessed. It also ignores requests to blocks already in the cache.
*/
#include "interface.hh"
void prefetch_init(void)
{
/* Called before any calls to prefetch_access. */
/* This is the place to initialize data structures. */
DPRINTF(HWPrefetch, "Initialized sequential-on-access prefetcher\n");
}
void prefetch_access(AccessStat stat)
{
/* pf_addr is now an address within the _next_ cache block */
Addr pf_addr = stat.mem_addr + BLOCK_SIZE;
/*
* Issue a prefetch request if a demand miss occurred,
* and the block is not already in cache.
*/
if (stat.miss && !in_cache(pf_addr)) {
issue_prefetch(pf_addr);
}
}
void prefetch_complete(Addr addr) {
/*
* Called when a block requested by the prefetcher has been loaded.
*/
}
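A possible variation on the example above, shown here only as a sketch: issue several sequential blocks per miss (a higher degree of prefetching) and consult current_queue_size to avoid flooding the request queue. The stub section at the top is a hypothetical stand-in for interface.hh so the sketch compiles on its own; in the framework you would include interface.hh instead, and DEGREE is a tuning parameter invented for this example.

```cpp
#include <cstdint>
#include <set>

/* --- hypothetical stand-ins for interface.hh, for illustration only --- */
typedef uint64_t Addr;
typedef uint64_t Tick;
struct AccessStat { Addr pc; Addr mem_addr; Tick time; int miss; };
static const int BLOCK_SIZE = 64;
static const int MAX_QUEUE_SIZE = 100;
static std::set<Addr> queue_stub, cache_stub;
static Addr align(Addr a) { return a & ~(Addr)(BLOCK_SIZE - 1); }
static int  in_cache(Addr a)         { return cache_stub.count(align(a)); }
static int  current_queue_size(void) { return (int)queue_stub.size(); }
static void issue_prefetch(Addr a)   { queue_stub.insert(align(a)); }
/* --------------------------------------------------------------------- */

static const int DEGREE = 4; /* blocks to prefetch per demand miss */

void prefetch_access(AccessStat stat)
{
    if (!stat.miss)
        return;
    for (int i = 1; i <= DEGREE; i++) {
        Addr pf_addr = stat.mem_addr + i * BLOCK_SIZE;
        /* Stay below the queue limit and skip blocks already cached. */
        if (current_queue_size() >= MAX_QUEUE_SIZE)
            break;
        if (!in_cache(pf_addr))
            issue_prefetch(pf_addr);
    }
}
```

A higher degree can raise coverage at the cost of accuracy, which is exactly the trade-off the statistics in chapter 4 let you measure.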
Chapter 4
Statistics
This chapter gives an overview of the statistics by which your prefetcher ismeasured and ranked.
IPC Instructions per cycle. Since we are using a superscalar architecture, IPC rates > 1 are possible.
Speedup A commonly used proxy for overall performance when running benchmark test suites.
speedup = (execution time without prefetcher) / (execution time with prefetcher)
        = (IPC with prefetcher) / (IPC without prefetcher)
Good prefetch The prefetched block is referenced by the application be-fore it is replaced.
Bad prefetch The prefetched block is replaced without being referenced.
Accuracy The fraction of the issued prefetches that were useful:

acc = good prefetches / total prefetches
Coverage How many of the potential candidates for prefetches were actu-ally identified by the prefetcher?
cov = good prefetches / cache misses without prefetching
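The two ratios can be computed directly from raw counters, as in this small sketch (PfStats and its field names are invented for illustration; the actual numbers come from the stats.txt files M5 produces):

```cpp
// Accuracy and coverage from raw prefetch counters (illustrative only).
struct PfStats {
    long good;          // good prefetches
    long total;         // total prefetches issued
    long misses_no_pf;  // cache misses without prefetching
};

double accuracy(PfStats s) { return (double)s.good / (double)s.total; }
double coverage(PfStats s) { return (double)s.good / (double)s.misses_no_pf; }
```

Note that the two can pull in opposite directions: issuing more prefetches tends to raise coverage while lowering accuracy.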
Identified Number of prefetches generated and queued by the prefetcher.
Issued Number of prefetches issued by the cache controller. This can be significantly less than the number of identified prefetches, due to duplicate prefetches already found in the prefetch queue, duplicate prefetches found in the MSHR queue, and prefetches dropped due to a full prefetch queue.
Misses Total number of L2 cache misses.
Degree of prefetching Number of blocks fetched from memory in a single prefetch request.
Harmonic mean A kind of average used to aggregate each benchmark speedup score into a final average speedup.
Havg = n / (1/x1 + 1/x2 + ... + 1/xn) = n / (Σ_{i=1..n} 1/xi)
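A minimal sketch of this aggregation (harmonic_mean is a name chosen here for illustration, not part of the framework):

```cpp
#include <cstddef>

// Harmonic mean of n positive values, as used to aggregate the
// per-benchmark speedups into one final score.
double harmonic_mean(const double *x, size_t n)
{
    double inv_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        inv_sum += 1.0 / x[i];
    return (double)n / inv_sum;
}
```

Unlike the arithmetic mean, the harmonic mean is dominated by the smallest values, so one benchmark with a severe slowdown drags the final score down sharply.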
Chapter 5
Debugging the prefetcher
5.1 m5.debug and trace flags
When debugging M5, it is best to use binaries built with debugging support (m5.debug), instead of the standard build (m5.opt). So let us start by recompiling M5 to be better suited to debugging:

scons -j2 ./build/ALPHA_SE/m5.debug
To see in detail what’s going on inside M5, one can enable trace flags, which selectively enable output from specific parts of M5. The most useful flag when debugging a prefetcher is HWPrefetch. Pass the option --trace-flags=HWPrefetch to M5:
./build/ALPHA_SE/m5.debug --trace-flags=HWPrefetch [...]
Warning: this can produce a lot of output! It might be better to redirect stdout to a file when running with --trace-flags enabled.
5.2 GDB
The GNU Project Debugger gdb can be used to inspect the state of the simulator while running, and to investigate the cause of a crash. Pass GDB the executable you want to debug when starting it.
gdb --args m5/build/ALPHA_SE/m5.debug --remote-gdb-port=0
-re --outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
You can then use the run command to start the executable.
Some useful GDB commands:
run <args> Restart the executable with the given command line arguments.
run Restart the executable with the same arguments as last time.
where Show stack trace.
up Move up one stack frame.

down Move down one stack frame.
print <expr> Print the value of an expression.
help Get help for commands.
quit Exit GDB.
GDB has many other useful features; for more information, consult the GDB User Manual at http://sourceware.org/gdb/current/onlinedocs/gdb/.
5.3 Valgrind
Valgrind is a very useful tool for memory debugging and memory leak detection. If your prefetcher causes M5 to crash or behave strangely, it is useful to run it under Valgrind and see if it reports any potential problems.
By default, M5 uses a custom memory allocator instead of malloc. This will not work with Valgrind, which tracks memory by replacing malloc with its own allocator. Fortunately, M5 can be recompiled with NO_FAST_ALLOC=True to use normal malloc:

scons NO_FAST_ALLOC=True ./m5/build/ALPHA_SE/m5.debug
To avoid spurious warnings by Valgrind, it can be fed a file with warning suppressions. To run M5 under Valgrind, use
valgrind --suppressions=lib/valgrind.suppressions
./m5/build/ALPHA_SE/m5.debug [...]
Note that everything runs much slower under Valgrind.
Page 1 of 5
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Course responsible: Professor Lasse Natvig
Quality assurance of the exam: PhD Jon Olav Hauglid
Contact person during exam: Magnus Jahre
Deadline for examination results: 23rd of June 2009.
EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Tuesday 2nd of June 2009
Time: 0900 - 1300
Supporting materials: No written or handwritten examination support materials are permitted. A specified, simple calculator is permitted.
By answering in short sentences, it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise.
The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.
Exercise 1) Instruction level parallelism (Max 10 points)
a) (Max 5 points) What is the difference between (true) data dependencies and name
dependencies? Which of the two presents the most serious problem? Explain why such
dependencies will not always result in a data hazard.
Solution sketch:
True data dependency: One instruction reads what an earlier one has written (data flows) (RAW).

Name dependency: Two instructions use the same register or memory location, but there is no flow of data between them. One instruction writes what an earlier one has read (WAR) or written (WAW). (No data flow.)
True data dependency is the most serious problem, as name dependencies can be prevented by
register renaming. Also, many pipelines are designed so that name-dependencies will not cause a
hazard.
A dependency between two instructions will only result in a data hazard if the instructions are
close enough together and the processor executes them out of order.
b) (Max 5 points) Explain why loop unrolling can improve performance. Are there any potential
downsides to using loop unrolling?
Solution sketch:
Loop unrolling can improve performance by reducing the loop overhead (e.g. loop overhead
instructions executed every 4th element, rather than for each). It also makes it possible for
scheduling techniques to further improve instruction order as instructions for different elements
(iterations) now can be interchanged. Downsides include increased code size which may lead to
more cache misses and increased number of registers used.
Exercise 2) Multithreading (Max 15 points)
a) (Max 5 points) What are the differences between fine-grained and coarse-grained
multithreading?
Solution sketch:
Fine-grained: Switch between threads after each instruction. Coarse-grained: Switch on costly
stalls (cache miss).
b) (Max 5 points) Can techniques for instruction level parallelism (ILP) and thread level parallelism
(TLP) be used simultaneously? Why/why not?
Solution sketch:
ILP and TLP can be used simultaneously. TLP looks at parallelism between different threads,
while ILP looks at parallelism inside a single instruction stream/thread.
c) (Max 5 points) Assume that you are asked to redesign a processor from single threaded to
simultaneous multithreading (SMT). How would that change the requirements for the caches?
(I.e., what would you look at to ensure that the caches would not degrade performance when
moving to SMT)
Solution sketch:
Several threads executing at once will lead to increased cache traffic and more cache conflicts.
Techniques that could help: Increased cache size, more cache ports/banks, higher associativity,
non-blocking caches.
Exercise 3) Multiprocessors (Max 15 points)
a) (Max 5 points) Give a short example illustrating the cache coherence problem for
multiprocessors.
Solution sketch:
See Figure 4.3 on page 206 of the text book. (A reads X, B reads X, A stores X, B now has
inconsistent value for X).
b) (Max 5 points) Why does bus snooping scale badly with number of processors? Discuss how
cache block size could influence the choice between write invalidate and write update.
Solution sketch:
Bus snooping relies on a common bus where information is broadcast. As the number of devices increases, this common medium becomes a bottleneck.
Invalidates are done at cache block level, while updates are done on individual words. False sharing coherence misses only appear when using write invalidate with block sizes larger than one word. So as cache block size increases, the number of false sharing coherence misses will increase, thereby making write update increasingly more appealing.
c) (Max 5 points) What makes the architecture of UltraSPARC T1 (“Niagara”) different from most
other processor architectures?
Solution sketch:
High focus on TLP, low focus on ILP. Poor single thread performance, but great multithread
performance. Thread switch on any stall. Short pipeline, in-order, no branch prediction.
Exercise 4) Memory, vector processors and networks (Max 15 points)
a) (Max 5 points) Briefly describe 5 different optimizations of cache performance.
Solution sketch:
(1 point per optimization.) 6 techniques are listed on page 291 in the textbook, 11 more in Section 5.2 on page 293.
b) (Max 5 points) What makes vector processors fast at executing a vector operation?
Solution sketch:
A vector operation can be executed with a single instruction, reducing code size and improving cache utilization. Further, the single instruction has no loop overhead and no control dependencies, which a scalar processor would have. Hazard checks can also be done per vector, rather than per element. A vector processor also contains a deep pipeline especially designed for vector operations.
c) (Max 5 points) Discuss how the number of devices to be connected influences the choice of
topology.
Solution sketch:
This is a classic example of performance vs. cost. Different topologies scale differently with
respect to performance or cost as the number of devices grows. Crossbar scales performance
well, but cost badly. Ring or bus scale performance badly, but cost well.
Exercise 5) Multicore architectures and programming (Max 25 points)
a) (Max 6 points) Explain briefly the research method called design space exploration (DSE). When
doing DSE, explain how a cache sensitive application can be made processor bound, and how it
can be made bandwidth bound.
Solution sketch:
(Lecture 10, slide 4) DSE is to try out different points in an n-dimensional space of possible designs, where n is the number of different main design parameters, such as #cores, core types (in-order vs. OoO etc.), cache size, etc. A cache sensitive application can become processor bound by increasing the cache size, and it can be made bandwidth bound by decreasing it.
b) (Max 5 points) In connection with GPU-programming (shader programming), David Blythe uses
the concept ”computational coherence”. Explain it briefly.
LF: See lecture 10, slide 36, and possibly the paper.
c) (Max 8 points) Give an overview of the architecture of the Cell processor.
Solution sketch:
Not all details of this figure are expected, only the main elements.
* One main processor (Power architecture, called PPE = Power Processing Element) – this acts as a host (master) processor. (Power arch., 64 bit, in-order two-issue superscalar, SMT (simultaneous multithreading). Has a vector media extension (VMX). (Kahle figure 2))

* 8 identical SIMD processors (called SPE = Synergistic Processing Element), each of which consists of an SPU processing element (Synergistic Processor Unit) and local storage (LS, 256 KB SRAM --- not cache). On-chip memory controller + bus interface. (Can operate on integers in different formats: 8, 16 and 32 bit, and floating point numbers in 32 and 64 bit. (64-bit floats in a later version.))
* Interconnect is a ring bus (Element Interconnect Bus, EIB), connecting the PPE + 8 SPEs; two unidirectional busses in each direction. Worst case latency is half the distance; it can support up to three simultaneous transfers.

* Highly programmable DMA controller.
d) (Max 6 points) The Cell design team made several design decisions that were motivated by a wish
to make it easier to develop programs with predictable (more deterministic) processing time
(performance). Describe two of these.
Solution sketch:
1) They discarded the common out-of-order execution in the Power processor and developed a simpler in-order processor.
2) The local store memory (LS) in the SPE processing elements does not use HW cache-coherency snooping protocols, to avoid the indeterminate nature of cache misses. The programmer handles memory in a more explicit way.
3) Also, the large number of registers (128) may help make the processing more deterministic wrt. execution time.
4) Extensive timers and counters (probably performance counters) (that may be used by the
SW/programmer to monitor/adjust/control performance)
…---oooOOOooo---…
Page 1 of 4
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Contact person for questions regarding exam exercises: Name: Lasse Natvig. Phone: 906 44 580
EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Monday 26th of May 2008
Time: 0900 – 1300
Solution sketches in blue text
Supporting materials: No handwritten or printed materials allowed; a simple specified calculator is allowed. By answering in short sentences, it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise. The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.

Exercise 1) Parallel Architecture (Max 25 points)

a) (Max 5 points) The feature size of integrated circuits is now often 65 nanometres or smaller, and it is still decreasing. Explain briefly how the number of transistors on a chip and the wire delay change with shrinking feature size.

The number of transistors can be 4 times larger when the feature size is halved. However, the wire delay does not improve (scales poorly). (The textbook page 17 gives more details, but we here ask for the main trends.)

b) (Max 5 points) In a cache coherent multiprocessor, the concepts migration and replication of shared data items are central. Explain both concepts briefly, and also how they influence the latency of access to shared data and the bandwidth demand on the shared memory.

Migration means that data move to a place closer to the requesting/accessing unit. Replication just means storing several copies. Having a local copy in general means faster access, and it is harmless to have several copies of read-only data. (Textbook page 207)

c) (Max 5 points) Explain briefly how a write buffer can be used in cache systems to increase performance. Explain also what “write merging” is in this context.

The main purpose of the write buffer is to temporarily store data that are evicted from the cache so new data can reuse the cache space as fast as possible, i.e. to avoid waiting for the latency of the memory one level further away from the processor. If more writes are to the same cache block (address), these writes can be combined, resulting in reduced traffic towards the next memory level. (Textbook page 300) ((Also slides 11-6-3)) // Grading: 3 points for write-buffer understanding and 2 for write merging.

d) (Max 5 points) Sketch a figure that shows how a hypercube with 16 nodes is built by combining two smaller hypercubes. Compare the hypercube topology with the 2-dimensional mesh topology with respect to connectivity and node cost (number of links/ports per node).

(Figure E-14 c) A mesh has a fixed degree of connectivity and becomes slower in general when the number of nodes is increased, since the number of hops needed for reaching another node on average is increasing. For a hypercube it is the other way around: the connectivity increases for larger networks, so the communication time does not increase much, but the node cost does also increase. When going to a larger network, increasing the
dimension, every node must be extended with a new port, and this is a drawback when it comes to building computers using such networks.

e) (Max 5 points) When messages are sent between nodes in a multiprocessor, two possible strategies are source routing and distributed routing. Explain the difference between these two.

For source routing, the entire routing path is precomputed by the source (possibly by table lookup) and placed in the packet header. This usually consists of the output port or ports supplied for each switch along the predetermined path from the source to the destination, which can be stripped off by the routing control mechanism at each switch. An additional bit field can be included in the header to signify whether adaptive routing is allowed (i.e., that any one of the supplied output ports can be used). For distributed routing, the routing information usually consists of the destination address. This is used by the routing control mechanism in each switch along the path to determine the next output port, either by computing it using a finite-state machine or by looking it up in a local routing table (i.e., forwarding table). (Textbook page E-48)

Exercise 2) Parallel processing (Max 15 points)

a) (Max 5 points) Explain briefly the main difference between a VLIW processor and a dynamically scheduled superscalar processor. Include the role of the compiler in your explanation.

For VLIW, parallel execution of several operations is scheduled (analysed and planned) at compile time and assembled into very long/broad instructions. (Such work done at compile time is often called static.) In a dynamically scheduled superscalar processor, dependency and resource analysis are done at run time (dynamically) to find opportunities to do operations in parallel. (Textbook page 114 -> and VLIW paper)

b) (Max 5 points) What function has the vector mask register in a vector processor?

If you want to update just some subset of the elements in a vector register, i.e. to implement IF A[i] != 0 THEN A[i] = A[i] – B[i] for (i=0..n) in a simple way, this can be done by setting the vector mask register to 1 only for the elements with A[i] != 0. In this way, the vector instruction A = A - B can be performed without testing every element explicitly.

c) (Max 5 points) Explain briefly the principle of vector chaining in vector processors.

The execution of instructions using several/different functional and memory pipelines can be chained together directly or by using vector registers. The chaining forms one longer pipeline. (This is the technique of forwarding (used in processors, as in Tomasulo's algorithm) extended to vector registers.) (Textbook F-23) ((Slides lecture 9, slide 20)) – should be checked

Exercise 3) Multicore processors (Max 20 points)

a) (Max 5 points) In the paper Chip Multithreading: Opportunities and Challenges by Spracklen & Abraham, the concept Chip Multithreaded processor (CMT) is described. The authors describe three generations of CMT processors. Describe each of these briefly. Make simple drawings if you like.

1st generation: typically 2 cores per chip; every core is a traditional processor core, with no shared resources except the off-chip bandwidth. 2nd generation: shared L2 cache, but still traditional processor cores. 3rd generation: as 2nd generation, but the cores are now custom-made for being used in a CMP, and might also use simultaneous multithreading (SMT). (This description is a bit “biased” and colored by the background of the authors (at Sun Microsystems), who were involved in the design of Niagara 1 and 2 (T1).) // Fig. 1 in the paper, and slides // Was a sub-exercise in May 2007.

b) (Max 5 points) Outline the main architecture in SUN's T1 (Niagara) multicore processor. Describe the placement of L1 and L2 cache, as well as how the L1 caches are kept coherent.
Fig. 4.24 on page 250 in the textbook shows 8 cores, each with its own L1 cache (described in the text), 4 L2 cache banks, each with a channel to external memory, one FPU unit, and a crossbar as interconnect. Coherence is maintained by a directory associated with each L2 cache, which knows which L1 caches have a copy of data in the L2 cache. (Textbook pages 249-250, also lectures)

c) (Max 6 points) In the paper Exploring the Design Space of Future CMPs the authors perform a design space exploration where several main architectural parameters are varied, assuming a fixed total chip area of 400 mm². Outline the approach by explaining the following figure:
Technology-independent area models are found empirically; core area and cache area are measured in cache-byte-equivalents (CBE). The approach is to study the relative costs in area versus the associated performance gains, i.e. to maximize performance per unit area for future technology generations. With smaller feature sizes, the available area for cache banks and processing cores increases. Table 3 displays die area in terms of CBE, and the PIN and POUT columns show how many of each type of processor, with 32KB separate L1 instruction and data caches, could be implemented on the chip if no L2 cache area were required (PIN is a simple in-order-execution processor, POUT is a larger out-of-order-execution processor). For reference, the area is also given in lambda-squared, where lambda equals one half of the feature size. The primary goal of the paper is to determine the best balance between per-processor cache area, area consumed by different processor organizations, and the number of cores on a single die. (LF; new question / medium/difficult / slides 1-6 and 2-3)

d) (Max 4 points) Explain the argument of the authors of the paper Exploring the Design Space of Future CMPs that in the future we may have chips with useless area that performs no other function than as a placeholder for pin area.

As applications become bandwidth bound and global wire delays increase, an interesting scenario may arise. It is likely that monolithic caches cannot be grown past a certain point in 50 or 35 nm technologies, since wire delays will make them too slow. It is also likely that, given a ceiling on cache size, off-chip bandwidth will limit the number of cores. Thus, there may be useless area on the chip which cannot be used for cache or processing logic, and which performs no function other than as a placeholder for pin area. That area might instead be used for compression engines, or for intelligent controllers that manage the caches and memory channels.
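The fixed-area trade-off described above can be made concrete with a toy sweep. In the following Python sketch all numbers (die budget, per-core areas, performance weights, the L2 benefit factor) are illustrative assumptions, not values from the paper or its Table 3; it merely shows the shape of the exploration: enumerate (core type, core count) configurations under a fixed CBE budget, let leftover area become L2 cache, and keep the configuration with the best modelled performance.

```python
# Toy design-space sweep under a fixed die area budget measured in
# cache-byte-equivalents (CBE). All numbers are illustrative assumptions,
# NOT values from the "Exploring the Design Space of Future CMPs" paper.

DIE_AREA_CBE = 4_000_000          # total die budget in CBE (assumed)
CORES = {
    "P_in":  {"area": 200_000, "perf": 1.0},   # simple in-order core + 32KB L1s
    "P_out": {"area": 600_000, "perf": 1.8},   # larger out-of-order core + 32KB L1s
}

def sweep():
    """Return the best (perf, core_type, core_count, l2_cbe) configuration."""
    best = None
    for name, c in CORES.items():
        max_n = DIE_AREA_CBE // c["area"]
        for n in range(1, max_n + 1):
            l2_cbe = DIE_AREA_CBE - n * c["area"]   # leftover area becomes L2
            # Crude model: linear in cores, small bonus for more L2 cache.
            perf = n * c["perf"] * (1 + 0.1 * (l2_cbe / DIE_AREA_CBE))
            cfg = (perf, name, n, l2_cbe)
            if best is None or cfg > best:
                best = cfg
    return best

perf, core, n, l2 = sweep()
print(f"best: {n} x {core}, L2 = {l2} CBE, perf = {perf:.2f}")
```

With these made-up numbers the many small in-order cores win; a steeper L2 bonus or a higher out-of-order performance weight shifts the balance, which is exactly the sensitivity the paper explores with real area and performance models.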
(From lecture 8, slide 6 on page 4)

Exercise 4) Research prototypes (Max 20 points)

a) (Max 5 points) Sketch a figure of the main system structure of the Manchester Dataflow Machine (MDM). Include the following units: Matching Unit, Token Queue, I/O Switch, Instruction Store, Overflow Unit and Processing Unit. Show also how these are connected.

See figure 5 in the paper, and the slides. The Overflow Unit is coupled to the Matching Unit, in parallel.
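The interplay between the Matching Unit and its overflow store (see question b below in the original exam) can be illustrated with a toy model. This Python sketch is a simplification with invented capacities and data structures, not the actual MDM microarchitecture: a token carrying one operand waits, under a tag, for its partner in a small fast matching store, and spills to a larger, slower overflow store when the fast store is full.

```python
# Toy model of the MDM matching idea: tokens wait for their partner operand
# in a small fast store and spill to a slower overflow store when it is full.
# Capacities and data structures are illustrative assumptions.

class MatchingUnit:
    def __init__(self, capacity=2):
        self.capacity = capacity      # fast associative store (small)
        self.waiting = {}             # tag -> operand awaiting its partner
        self.overflow = {}            # slow, large overflow store (SW-managed in MDM)

    def arrive(self, tag, value):
        """Return a matched operand pair, or None if the token must wait."""
        if tag in self.waiting:                    # partner found in fast store
            return (self.waiting.pop(tag), value)
        if tag in self.overflow:                   # partner found in overflow store
            return (self.overflow.pop(tag), value)
        if len(self.waiting) < self.capacity:      # room in fast store: wait here
            self.waiting[tag] = value
        else:                                      # fast store full: spill over
            self.overflow[tag] = value
        return None

mu = MatchingUnit(capacity=2)
assert mu.arrive("a", 1) is None       # waits in fast store
assert mu.arrive("b", 2) is None       # waits in fast store (now full)
assert mu.arrive("c", 3) is None       # spills to overflow
assert mu.arrive("c", 4) == (3, 4)     # matched out of overflow
assert mu.arrive("a", 5) == (1, 5)     # matched out of fast store
```

A matched pair would then flow on to the Instruction Store and a Processing Unit; the point of the sketch is only that the overflow path is functionally transparent but slower.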
b) (Max 5 points) What was the function of the overflow unit in MDM? Explain very briefly how it was implemented.

If an operand does not find its corresponding operand in the Matching Unit (MU), and there is no space in the MU to store it (while waiting for the other operand), the operand is stored in the overflow store. This is a separate and much slower subsystem with much larger storage capacity. It is composed of a separate overflow bus, memory and a microcoded processor; in other words, a SW solution. See also figure 7 in the paper.

c) (Max 5 points) In the paper The Stanford FLASH Multiprocessor by Kuskin et al., the FLASH computer is described. FLASH is an abbreviation for FLexible Architecture for SHared memory. What kind of flexibility was the main goal of the project?

Flexibility in the programming paradigm: the choice between distributed shared memory (DSM), i.e. cache-coherent shared memory, and message passing, but also other alternative ways of communicating between the nodes could be explored.

d) (Max 5 points) Outline the main architecture of a node in a FLASH system. What was the most central design choice to achieve this flexibility?

Fig. 2.1 explains much of this. The PEs are interconnected in a mesh. The most central design choice was the MAGIC unit, a specially designed node controller. All memory accesses go through it, and it can, for example, realise a cache-coherence protocol. Every node is identical. The whole computer has a single address space, but the memory is physically distributed.
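The essence of MAGIC, that protocol behaviour is software running in the node controller rather than fixed hardware, can be sketched as a dispatch layer. The handler names and interfaces in this Python toy are invented for illustration and are not the actual FLASH protocol code; the point is only that the same memory-access path can run a coherence protocol, message passing, or any other loaded handler.

```python
# Toy sketch of the MAGIC idea: a programmable node controller through which
# every memory access passes. Handlers and interfaces are invented here.

class MagicController:
    def __init__(self):
        self.handlers = {}                 # protocol name -> handler function

    def load_protocol(self, name, handler):
        """The flexibility: protocols are software loaded into the controller."""
        self.handlers[name] = handler

    def access(self, protocol, addr, memory):
        # Every memory access is routed through the currently loaded handler.
        return self.handlers[protocol](addr, memory)

def dsm_read(addr, memory):
    # Cache-coherent shared-memory style read (coherence actions elided).
    return memory.get(addr, 0)

def msg_read(addr, memory):
    # Message-passing style: same hardware path, different handler.
    return ("MSG", memory.get(addr, 0))

magic = MagicController()
magic.load_protocol("dsm", dsm_read)
magic.load_protocol("msg", msg_read)

mem = {0x10: 42}
assert magic.access("dsm", 0x10, mem) == 42
assert magic.access("msg", 0x10, mem) == ("MSG", 42)
```

In the real machine the handlers run on MAGIC's embedded protocol processor and manipulate directory state and network messages, but the dispatch structure is the design choice the question asks for.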
---oooOOOooo---
[Figure: MDM system structure - Input and Output connect through the I/O Switch to a ring of Token Queue -> Matching Unit -> Instruction Store -> Processing Unit (P0...P19).]