Transcript of tdt4260
TDT 4260 – lecture 1 – 2011
• Course introduction
– course goals – staff – contents – evaluation – web, ITSL
1 Lasse Natvig
• Textbook – Computer Architecture, A Quantitative Approach, Fourth Edition
• by John Hennessy & David Patterson (editions 1990, 1996, 2003, 2006)
• Today: Introduction (Chapter 1) – partly covered
Course goal
• To get a general and deep understanding of the organization of modern computers and the motivation for different computer architectures. Give a basis for understanding research themes within the field.
• High level • Mostly HW and low-level SW • HW/SW interplay • Parallelism • Principles, not details
• Inspire to learn more
Contents
• Computer architecture fundamentals, trends, measuring performance, quantitative principles. Instruction set architectures and the role of compilers. Instruction-level parallelism, thread-level parallelism, VLIW.
• Memory hierarchy design, cache. Multiprocessors, shared memory architectures, vector processors, NTNU/Notur supercomputers, distributed shared memory, synchronization, multithreading.
• Interconnection networks, topologies
• Multicores, homogeneous and heterogeneous, principles and product examples
• Green computing (introduction)
• Miniproject – prefetching
TDT-4260 / DT8803
• Recommended background
– Course TDT4160 Computer Fundamentals, or equivalent.
• http://www.idi.ntnu.no/emner/tdt4260/ – and It's Learning
• Friday 1215-1400 – and/or some Thursdays 1015-1200
– 12 lectures planned
– some exceptions may occur
• Evaluation
– Obligatory exercise (counts 20%). Written exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit examination, the examination form may change from written to oral.
Lecture plan (subject to change)
Date and lecturer / Topic
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 28 Jan (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG?) Prefetching + Energy Micro guest lecture
6: 18 Feb (LN) Multiprocessors continued
7: 25 Feb (IB) Piranha CMP + interconnection networks
8: 4 Mar (IB) Memory and cache, cache coherence (Chap. 5)
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 1 Apr (IB/LN) Wrap-up lecture, remaining stuff
13: 8 Apr Slack – no lecture planned
EMECS, new European Master's Course in Embedded Computing Systems
Preliminary reading list, subject to change!!!
• Chap. 1: Fundamentals, sections 1.1 - 1.12 (pages 2-54)
• Chap. 2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11 - 2.12 (pages 138-141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course)
• Chap. 3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5 - 3.8 (pages 172-185).
• Chap. 4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10
• Chap. 5: Memory hierarchy, sections 5.1 - 5.3 (pages 288-315).
• App. A: section A.1 (expected to be repetition from other courses)
• Appendix E, interconnection networks, pages E2-E14, E20-E25, E29-E37 and E45-E51.
• App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F-44 - F-45)
• Data prefetch mechanisms (ACM Computing Surveys)
• Piranha (to be announced)
• Multicores (new book chapter) (to be announced)
• (App. D: embedded systems?) see our new course TDT4258 Microcontroller System Design
People involved
Lasse Natvig: course responsible, [email protected]
Ian Bratt: lecturer (also at Tilera.com), [email protected]
Alexandru Iordan: teaching assistant (also PhD student), [email protected]
http://www.idi.ntnu.no/people/
research.idi.ntnu.no/multicore
Some few highlights:
- Green computing, 2 x PhD + master students
- Multicore memory systems, 3 x PhD theses
- Multicore programming and parallel computing
- Cooperation with industry
Prefetching – pfjudge
"Computational computer architecture"
• Computational science and engineering (CSE)
– Computational X, X = comp.arch.
• Simulates new multicore architectures
– Last-level, shared cache fairness (PhD student M. Jahre)
– Bandwidth-aware prefetching (PhD student M. Grannæs)
• Complex cycle-accurate simulators
– 80 000 lines C++, 20 000 lines Python
– Open source, Linux-based
• Design space exploration (DSE)
– one dimension for each arch. parameter
– DSE sample point = specific multicore configuration
– performance of a selected set of configurations evaluated by simulating the execution of a set of workloads
Experiment Infrastructure• Stallo compute cluster
– 60 Teraflop/s peak
– 5632 processing cores
– 12 TB total memory
– 128 TB centralized disk
– Weighs 16 tons
• Multi-core research– About 60 CPU years allocated per
year to our projects
– Typical research paper uses 5 to 12 CPU years for simulation (extensive, detailed design space exploration)
The end of Moore's law for single-core microprocessors
But Moore's law still holds for FPGAs, memory and multicore processors
Motivational background
• Why multicores
– in all market segments, from mobile phones to supercomputers
• The "end" of Moore's law
• The power wall
• The memory wall
• The bandwidth problem
• ILP limitations
• The complexity wall
Energy & heat problems
• Large power consumption
– Costly
– Heat problems
– Restricted battery operation time
• Google "Open House Trondheim 2006"
– "Performance/Watt is the only flat trend line"
The Memory Wall
[Figure: processor vs. DRAM performance, 1980-2000, log scale ("Moore's Law"): CPU performance grows 60%/year, DRAM only 9%/year, so the processor-memory gap grows roughly 50%/year]
• The processor-memory gap
• Consequence: deeper memory hierarchies
– P – Registers – L1 cache – L2 cache – L3 cache – Memory - - -
– Complicates understanding of performance
• cache usage has an increasing influence on performance
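The "grows 50%/year" figure follows directly from the two growth rates on the slide; a quick sanity check in C:

```c
/* Yearly growth factors from the slide: CPU +60%/year, DRAM +9%/year.
   The processor-memory gap widens by their ratio every year. */
double gap_growth(double cpu_growth, double dram_growth) {
    return cpu_growth / dram_growth;
}
/* gap_growth(1.60, 1.09) is about 1.47, i.e. roughly 50% per year */
```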
The I/O pin (bandwidth) problem
• # I/O signaling pins
– limited by physical technology
– speeds have not increased at the same rate as processor clock rates
• Projections
– from ITRS (International Technology Roadmap for Semiconductors)
[Huh, Burger and Keckler 2001]
The limitations of ILP (Instruction Level Parallelism) in applications
[Figure: two plots. Left: fraction of total cycles (%) vs. number of instructions issued per cycle (0 to 6+). Right: speedup vs. instructions issued per cycle (0-15), with the speedup curve flattening out around 3x]
Reduced Increase in Clock Frequency
Solution: multicore architectures (also called chip multiprocessors, CMP)
• More power-efficient
– Two cores with clock frequency f/2 can potentially achieve the same speed as one core at frequency f, with a 50% reduction in total energy consumption [Olukotun & Hammond 2005]
• Exploits thread-level parallelism (TLP)
– in addition to ILP
– requires multiprogramming or parallel programming
• Opens new possibilities for architectural innovations
Why heterogeneous multicores?
• Specialized HW is faster than general HW
– Math co-processor
– GPU, DSP, etc.
• Benefits of customization
– Similar to ASIC vs. general-purpose programmable HW
• Amdahl's law
– Parallel speedup limited by serial fraction
• 1 super-core
[Figure: Cell BE processor]
CPU-GPU convergence (performance vs. programmability)
Processors: Larrabee, Fermi, ...
Languages: CUDA, OpenCL, ...
Parallel processing – conflicting goals
The P6 model: parallel processing challenges: Performance, Portability, Programmability and Power efficiency
• Examples:
– Performance tuning may reduce portability
• E.g. data structures adapted to cache block size
– New languages for higher programmability may reduce performance and increase power consumption
Multicore programming challenges
• Instability, diversity, conflicting goals ... what to do?
• What kind of parallel programming?
– Homogeneous vs. heterogeneous
– DSL vs. general languages
– Memory locality
• What to teach?
– Teaching should be founded on active research
• Two layers of programmers
– The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]
• Krste Asanovic presentation at ACACES Summer School 2007
– 1) Programmability layer (productivity layer) (80-90%)
• "Joe the programmer"
– 2) Performance layer (efficiency layer) (10-20%)
• Both layers involved in HPC
• Programmability an issue also at the performance layer
[Figure: Parallel Computing Laboratory, U.C. Berkeley (slide adapted from Dave Patterson). Goal: easy to write correct programs that run efficiently on manycore. Stack from top to bottom: applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser); Design Patterns/Motifs; Composition & Coordination Language (C&CL) with a C&CL compiler/interpreter; Parallel Libraries and Parallel Frameworks; Sketching; Autotuners; Efficiency Languages with Efficiency Language Compilers; Legacy Code; Schedulers; Communication & Synch. Primitives; Diagnosing Power/Performance; Legacy OS; OS Libraries & Services; Hypervisor; Multicore/GPGPU; RAMP Manycore]
Classes of computers
• Servers
– storage servers
– compute servers (supercomputers)
– web servers
– high availability
– scalability
– throughput oriented (response time of less importance)
• Desktop (price 3 000 NOK – 50 000 NOK)
– the largest market
– price/performance focus
– latency oriented (response time)
• Embedded systems
– the fastest growing market ("everywhere")
– TDT 4258 Microcontroller system design
– ATMEL, Nordic Semic., ARM, EM, ++
Falanx (Mali) ARM Norway
Borgar, FXI Technologies
"An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and applications ecosystems (i.e. build an ARM-based SoC, put it in a memory card, connect it to the web, and voila, you have an iPhone for the masses)."
• http://www.fxitech.com/
– "Headquartered in Trondheim
• But also an office in Silicon Valley ..."
Trends
• For technology, costs, use
• Help predicting the future
• Product development time: 2-3 years
– design for the next technology
– Why should an architecture live longer than a product?
Comp. arch. is an integrated approach
• What really matters is the functioning of the complete system
– hardware, runtime system, compiler, operating system, and application
– In networking, this is called the "end-to-end argument"
• Computer architecture is not just about transistors, individual instructions, or particular implementations
– E.g., the original RISC projects replaced complex instructions with a compiler + simple instructions
Computer architecture is design and analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: design/analysis cycle: creativity produces good, mediocre and bad ideas; cost/performance analysis filters them]
TDT4260 course focus
Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century
[Figure: computer architecture (organization, hardware/software boundary, interface design (ISA), measurement & evaluation) at the intersection of technology, programming languages, parallelism, operating systems, compilers, applications and history]
Holistic approach, e.g., to programmability
Multicore, interconnect, memory
Operating System & system software
Parallel & concurrent programming
Moore's Law: 2x transistors / "year"
• "Cramming More Components onto Integrated Circuits"
– Gordon Moore, Electronics, 1965
• # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
Tracking technology performance trends
• 4 critical implementation technologies:
– Disks
– Memory
– Network
– Processors
• Compare bandwidth vs. latency improvements in performance over time
• Bandwidth: number of events per unit time
– E.g., Mbits/second over network, Mbytes/second from disk
• Latency: elapsed time for a single event
– E.g., one-way network delay in microseconds, average disk access time in milliseconds
Latency lags bandwidth (last ~20 years)
[Figure: log-log plot of relative bandwidth (1-10000) vs. relative latency improvement (1-100) for processor, memory, network and disk; the reference line "latency improvement = bandwidth improvement" lies far above all four]
• Performance milestones
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
• Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
• Memory module: 16-bit plain DRAM, page mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
CPU high, memory low ("memory wall")
(Processor latency = typical # of pipeline stages * time per clock cycle)
COST and COTS
• Cost
– to produce one unit
– include (development cost / # sold units)
– benefit of large volume
• COTS
– commodity off the shelf
Speedup
• General definition:
Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
– Speedup_fixed_problem(p processors) = Time(1 processor) / Time(p processors)
• Superlinear speedup?
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in computation
– serial fraction s
– parallel fraction p
– s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
= (s + p) / [s + (p/n)]
= 1 / [s + (1-s)/n]
= n / [1 + (n-1)s]
• "pessimistic and famous"
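The formula above is easy to tabulate; a minimal C helper:

```c
/* Amdahl's law: speedup on n processors for serial fraction s,
   S(n) = 1 / (s + (1-s)/n) */
double amdahl(double s, double n) {
    return 1.0 / (s + (1.0 - s) / n);
}
/* e.g. amdahl(0.05, 100) is about 16.8: even 100 processors cannot beat
   the 1/s = 20 bound when 5% of the work is serial */
```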
Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on a parallel computer with n processors is fixed
– serial fraction s'
– parallel fraction p'
– s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
= (s' + p'n) / (s' + p')
= s' + p'n
= s' + (1-s')n
= n + (1-n)s'
• Reevaluating Amdahl's Law, John L. Gustafson, CACM May 1988, pp. 532-533. "Not a new law, but Amdahl's law with changed assumptions"
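The scaled-speedup formula makes a striking contrast with the Amdahl numbers; a minimal sketch:

```c
/* Gustafson's scaled speedup: n processors, with s being the serial
   fraction of the (fixed) parallel execution time:
   S'(n) = n + (1-n)s */
double gustafson(double s, double n) {
    return n + (1.0 - n) * s;
}
/* e.g. gustafson(0.05, 100) = 95.05: with the problem scaled up,
   5% serial time still allows near-linear speedup */
```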
How the serial fraction limits speedup
• Amdahl's law
• Work hard to reduce the serial part of the application
– remember I/O
– think differently (than traditionally or sequentially)
[Figure: speedup curves vs. number of processors for varying serial fraction]
TDT4260 Computer Architecture – Mini-project
PhD candidate Alexandru Ciprian Iordan, Institutt for datateknikk og informasjonsvitenskap
What is it...? How much...?
• The mini-project is the exercise part of the TDT4260 course
• This year the students will need to develop and evaluate a PREFETCHER
• The mini-project accounts for 20% of the final grade in TDT4260
• 80% for report
• 20% for oral presentation
What will you work with…
• Modified version of M5 (for development and evaluation)
• Computing time on Kongull cluster (for benchmarking)
• More at: http://dm-ark.idi.ntnu.no/
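As a taste of the exercise, a sequential (next-line) prefetcher can be sketched as below. The hook name, block size and prefetch degree are made up for illustration; the real M5/pfjudge interface is documented on the course page.

```c
#include <stddef.h>

#define BLOCK_SIZE   64   /* cache block size in bytes (assumed) */
#define MAX_PREFETCH 2    /* prefetch degree (assumed) */

typedef unsigned long addr_t;

/* Hypothetical hook called on every cache access (NOT the real M5 API).
   Fills `out` with up to MAX_PREFETCH block addresses to prefetch and
   returns how many were generated. */
size_t next_line_prefetcher(addr_t access_addr, addr_t out[MAX_PREFETCH]) {
    addr_t block = access_addr / BLOCK_SIZE;      /* align to block */
    for (size_t i = 0; i < MAX_PREFETCH; i++)
        out[i] = (block + 1 + i) * BLOCK_SIZE;    /* the next blocks */
    return MAX_PREFETCH;
}
```

A real submission would add filtering (don't prefetch what is already cached or in flight) and perhaps stride detection; this sketch only shows the shape of the problem.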
M5
• Initially developed by the University of Michigan
• Enjoys a large community of users and developers
• Flexible object-oriented architecture
• Supports 3 ISAs: ALPHA, SPARC and MIPS
Team work…
• You need to work in groups of 2-4 students
• Grade is based on written paper AND oral presentation (choose your best speaker)
Time Schedule and Deadlines
More on It's Learning
Web page presentation
TDT 4260 – App A.1, Chap 2
Instruction Level Parallelism
Contents
• Instruction level parallelism Chap 2
• Pipelining (repetition) App A
▫ Basic 5-step pipeline
• Dependencies and hazards Chap 2.1
▫ Data, name, control, structural
• Compiler techniques for ILP Chap 2.2
• (Static prediction Chap 2.3)
▫ Read this on your own
• Project introduction
Instruction level parallelism (ILP)
• A program is a sequence of instructions, typically written to be executed one after the other
• Poor usage of CPU resources! (Why?)
• Better: Execute instructions in parallel
▫ 1: Pipeline – partial overlap of instruction execution
▫ 2: Multiple issue – total overlap of instruction execution
• Today: Pipelining
Pipelining (1/3)
Pipelining (2/3)
• Multiple different stages executed in parallel
▫ Laundry in 4 different stages
▫ Wash / Dry / Fold / Store
• Assumptions:
▫ Task can be split into stages
▫ Storage of temporary data
▫ Stages synchronized
▫ Next operation known before last finished?
Pipelining (3/3)
• Good Utilization: All stages are ALWAYS in use
▫ Washing, drying, folding, ...
▫ Great usage of resources!
• Common technique, used everywhere
▫ Manufacturing, CPUs, etc
• Ideal: time_stage = time_instruction / stages
▫ But stages are not perfectly balanced
▫ But transfer between stages takes time
▫ But pipeline may have to be emptied
▫ ...
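The ideal-vs-real point in the bullets above can be made concrete; a sketch with made-up stage times:

```c
/* Pipelined execution time: the clock is set by the slowest stage plus
   latch (transfer) overhead; the first instruction fills the pipe, then
   one instruction completes per cycle. */
double pipelined_time(const double stage_ns[], int stages,
                      double latch_ns, long instructions) {
    double cycle = 0.0;
    for (int i = 0; i < stages; i++)
        if (stage_ns[i] > cycle) cycle = stage_ns[i];  /* slowest stage */
    cycle += latch_ns;                                 /* stage transfer */
    return (stages + instructions - 1) * cycle;
}
```

With illustrative stage times {10, 8, 10, 10, 7} ns and 1 ns latch overhead, 1000 instructions take 1004 * 11 = 11044 ns, against 45 * 1000 = 45000 ns unpipelined: about 4.1x, not the ideal 5x, exactly because the stages are unbalanced and transfers cost time.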
Example: MIPS64 (1/2)
• RISC
• Load/store
• Few instruction formats
• Fixed instruction length
• 64-bit
▫ DADD = 64-bit ADD
▫ LD = 64-bit L(oad)
• 32 registers (R0 = 0)
• EA = offset(Register)
• Pipeline
▫ IF: Instruction fetch
▫ ID: Instruction decode / register fetch
▫ EX: Execute / effective address (EA)
▫ MEM: Memory access
▫ WB: Write back (reg)
Example: MIPS64 (2/2)
[Figure: four instructions flowing through the 5-stage pipeline (Ifetch, Reg, ALU, DMem, Reg) across clock cycles 1-7, one instruction entering per cycle]
Big picture:
• What are some real-world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?

Big picture (continued):
• Computer architecture is the study of design tradeoffs!!!!
• There is no "philosophy of architecture" and no "perfect architecture". This is engineering, not science.
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?
Improve speedup?
• Why not perfect speedup?
▫ Sequential programs
▫ One instruction dependent on another
▫ Not enough CPU resources
• What can be done?
▫ Forwarding (HW)
▫ Scheduling (SW / HW)
▫ Prediction (SW / HW)
• Both hardware (dynamic) and compiler (static) can help
Dependencies and hazards
• Dependencies
▫ Parallel instructions can be executed in parallel
▫ Dependent instructions are not parallel
  I1: DADD R1, R2, R3
  I2: DSUB R4, R1, R5
▫ Property of the instructions
• Hazards
▫ Situation where a dependency causes an instruction to give a wrong result
▫ Property of the pipeline
▫ Not all dependencies give hazards
  Dependencies must be close enough in the instruction stream to cause a hazard
Dependencies
• (True) data dependencies
▫ One instruction reads what an earlier has written
• Name dependencies
▫ Two instructions use the same register / memory location
▫ But no flow of data between them
▫ Two types: Anti and output dependencies
• Control dependencies
▫ Instructions dependent on the result of a branch
• Again: Independent of pipeline implementation
Hazards
• Data hazards
▫ Overlap will give different result from sequential
▫ RAW / WAW / WAR
• Control hazards
▫ Branches
▫ Ex: Started executing the wrong instruction
• Structural hazards
▫ Pipeline does not support this combination of instr.
▫ Ex: Register with one port, two stages want to read
Data dependency – hazard? (Figure A.6, page A-16)
[Figure: add r1,r2,r3 followed by sub r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11 in the 5-stage pipeline; the instructions reading r1 reach their register-read stage before add has written r1]
Data Hazards (1/3)
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
• Caused by a true data dependency
• This hazard results from an actual need for communication

Data Hazards (2/3)
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
• Caused by an anti dependency; results from reuse of the name "r1"
• Can't happen in the MIPS 5-stage pipeline because:
▫ All instructions take 5 stages, and
▫ Reads are always in stage 2, and
▫ Writes are always in stage 5

Data Hazards (3/3)
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
  I: sub r1,r4,r3
  J: add r1,r2,r3
• Caused by an output dependency
• Can't happen in the MIPS 5-stage pipeline because:
▫ All instructions take 5 stages, and
▫ Writes are always in stage 5
• WAR and WAW can occur in more complicated pipes
Forwarding (Figure A.7, page A-18)
IF ID/RF EX MEM WB
[Figure: add r1,r2,r3 / sub r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11, with forwarding paths from the add's ALU output directly to the ALU inputs of the dependent instructions, removing the stalls]
Can all data hazards be solved via forwarding???
IF ID/RF EX MEM WB
[Figure: Ld r1,r2 followed by add r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11; the load's data is not available until after MEM, so forwarding alone cannot cover the immediately following add]
Structural Hazards (Memory Port) (Figure A.4, page A-14)
[Figure: Load followed by Instr 1-4 in the 5-stage pipeline across cycles 1-7; with a single memory port, the Load's DMem access collides with a later instruction's Ifetch in the same cycle]
Hazards, bubbles (similar to Figure A.5, page A-15)
[Figure: Ld r1,r2 followed by Add r1,r1,r1; the dependent add is stalled and bubbles propagate through the pipeline across cycles 1-7]
How do you "bubble" the pipe? How can we avoid this hazard?
Control hazards (1/2)
• Sequential execution is predictable, (conditional) branches are not
• May have fetched instructions that should not be executed
• Simple solution (figure): stall the pipeline (bubble)
▫ Performance loss depends on the number of branches in the program and the pipeline implementation
▫ Branch penalty
[Figure: possibly wrong instruction fetched, then the correct instruction once the branch resolves]
Control hazards (2/2)
• What can be done?
▫ Always stop (previous slide)
  Also called freezing or flushing the pipeline
▫ Assume no branch (= assume sequential)
  Must not change state before the branch instr. is complete
▫ Assume branch
  Only smart if the target address is ready early
▫ Delayed branch
  Execute a different instruction while the branch is evaluated
  Static techniques (fixed rule or compiler)
Example
• Assume branch conditionals are evaluated in the EX stage, and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• Assume branch not taken: how many bubbles for an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to improve the stall penalty?
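A back-of-envelope model for the questions above (the branch frequency and taken rate are illustrative, not from the slides):

```c
/* Effective CPI with base CPI 1: add the fraction of instructions that
   stall times the stall length in cycles */
double effective_cpi(double stall_fraction, double penalty_cycles) {
    return 1.0 + stall_fraction * penalty_cycles;
}
/* If the branch resolves in EX (stage 3), fetch restarts 2 cycles late.
   Always stalling with 20% branches:   effective_cpi(0.20, 2.0)       = 1.40
   Predict not-taken, 60% taken:        effective_cpi(0.20 * 0.60, 2.0) = 1.24 */
```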
Dynamic scheduling
• So far: static scheduling
▫ Instructions executed in program order
▫ Any reordering is done by the compiler
• Dynamic scheduling
▫ CPU reorders to get a more optimal order
  Fewer hazards, fewer stalls, ...
▫ Must preserve order of operations where reordering could change the result
▫ Covered by TDT 4255 Hardware design
Compiler techniques for ILP
• For a given pipeline and superscalarity
▫ How can these be best utilized?
▫ As few stalls from hazards as possible
• Dynamic scheduling
▫ Tomasulo's algorithm etc. (TDT4255)
▫ Makes the CPU much more complicated
• What can be done by the compiler?
▫ Has "ages" to spend, but less knowledge
▫ Static scheduling, but what else?
Example
Source code:

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Notice:
• Lots of dependencies
• No dependencies between iterations
• High loop overhead
→ Loop unrolling

MIPS:
Loop: L.D F0,0(R1)     ; F0 = x[i]
      ADD.D F4,F0,F2   ; F2 = s
      S.D F4,0(R1)     ; Store x[i] + s
      DADDUI R1,R1,#-8 ; x[i] is 8 bytes
      BNE R1,R2,Loop   ; R1 = R2?
Static scheduling

Loop: L.D F0,0(R1)
      (stall)
      ADD.D F4,F0,F2
      (stall)
      (stall)
      S.D F4,0(R1)
      DADDUI R1,R1,#-8
      (stall)
      BNE R1,R2,Loop

Rescheduled:

Loop: L.D F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D F4,F0,F2
      (stall)
      (stall)
      S.D F4,8(R1)
      BNE R1,R2,Loop

Result: from 9 cycles per iteration to 7 (delays from the table in Figure 2.2)
Loop unrolling

Original:

Loop: L.D F0,0(R1)
      ADD.D F4,F0,F2
      S.D F4,0(R1)
      DADDUI R1,R1,#-8
      BNE R1,R2,Loop

Unrolled 4 times:

Loop: L.D F0,0(R1)
      ADD.D F4,F0,F2
      S.D F4,0(R1)
      L.D F6,-8(R1)
      ADD.D F8,F6,F2
      S.D F8,-8(R1)
      L.D F10,-16(R1)
      ADD.D F12,F10,F2
      S.D F12,-16(R1)
      L.D F14,-24(R1)
      ADD.D F16,F14,F2
      S.D F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE R1,R2,Loop

• Reduced loop overhead
• Requires number of iterations divisible by n (here n = 4)
• Register renaming
• Offsets have changed
• Stalls not shown
Unrolled and scheduled:

Loop: L.D F0,0(R1)
      L.D F6,-8(R1)
      L.D F10,-16(R1)
      L.D F14,-24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D F4,0(R1)
      S.D F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D F12,-16(R1)
      S.D F16,-24(R1)
      BNE R1,R2,Loop

Unrolled only (for comparison):

Loop: L.D F0,0(R1)
      ADD.D F4,F0,F2
      S.D F4,0(R1)
      L.D F6,-8(R1)
      ADD.D F8,F6,F2
      S.D F8,-8(R1)
      L.D F10,-16(R1)
      ADD.D F12,F10,F2
      S.D F12,-16(R1)
      L.D F14,-24(R1)
      ADD.D F16,F14,F2
      S.D F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE R1,R2,Loop

Avoids stalls after: L.D (1), ADD.D (2), DADDUI (1)
Loop unrolling: summary
• Original code: 9 cycles per element
• Scheduling: 7 cycles per element
• Loop unrolling: 6.75 cycles per element
▫ Unrolled 4 iterations
• Combination: 3.5 cycles per element
▫ Avoids stalls entirely
The compiler reduced execution time by 61%
Loop unrolling in practice
• Do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
▫ 1st executes (n mod k) times and has a body that is the original loop
▫ 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
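The pair-of-loops transformation can be sketched in C for the running x[i] = x[i] + s example, with k = 4:

```c
#include <stddef.h>

/* x[i] += s for arbitrary n: a remainder loop runs (n mod k) times with
   the original body, then a loop unrolled k = 4 times runs n/k times. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    size_t i = 0;
    size_t rem = n % 4;
    for (; i < rem; i++)        /* 1st loop: original body */
        x[i] += s;
    for (; i < n; i += 4) {     /* 2nd loop: body unrolled 4 times */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
}
```

This is the same idea the compiler applies to the MIPS loop above, just written at the source level so the two loops are visible.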
TDT 4260 – Chap 2, Chap 3
Instruction Level Parallelism (cont)
Review
• Name real-world examples of pipelining
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?
Contents
• Very Large Instruction Word Chap 2.7
▫ IA-64 and EPIC
• Instruction fetching Chap 2.9
• Limits to ILP Chap 3.1/2
• Multi-threading Chap 3.5
Getting CPI below 1
• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
1. Statically-scheduled superscalar processors
• In-order execution
• Varying number of instructions issued (compiler)
2. Dynamically-scheduled superscalar processors
• Out-of-order execution
• Varying number of instructions issued (CPU)
3. VLIW (very long instruction word) processors
• In-order execution
• Fixed number of instructions issued
VLIW: Very Long Instruction Word (1/2)
• Each VLIW has explicit coding for multiple operations
▫ Several instructions combined into packets
▫ Possibly with parallelism indicated
• Tradeoff: instruction space for simple decoding
▫ Room for many operations
▫ Independent operations => execute in parallel
▫ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
VLIW: Very Long Instruction Word (2/2)
• Assume 2 load/store, 2 FP, 1 int/branch
▫ VLIW with 0-5 operations
▫ Why 0?
• Important to avoid empty instruction slots
▫ Loop unrolling
▫ Local scheduling
▫ Global scheduling
  Scheduling across branches
• Difficult to find all dependencies in advance
▫ Solution 1: block on memory accesses
▫ Solution 2: CPU detects some dependencies
Recall: unrolled loop that minimizes stalls for scalar

Loop: L.D F0,0(R1)
      L.D F6,-8(R1)
      L.D F10,-16(R1)
      L.D F14,-24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D F4,0(R1)
      S.D F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D F12,-16(R1)
      S.D F16,-24(R1)
      BNE R1,R2,Loop

Source code:

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Register mapping: s → F2, i → R1
Loop unrolling in VLIW
Slots: MEM1/MEM2 (memory refs), FP1/FP2 (FP operations), INT (int op/branch)

clock 1: MEM1: L.D F0,0(R1)    | MEM2: L.D F6,-8(R1)
clock 2: MEM1: L.D F10,-16(R1) | MEM2: L.D F14,-24(R1)
clock 3: MEM1: L.D F18,-32(R1) | MEM2: L.D F22,-40(R1) | FP1: ADD.D F4,F0,F2 | FP2: ADD.D F8,F6,F2
clock 4: MEM1: L.D F26,-48(R1) | FP1: ADD.D F12,F10,F2 | FP2: ADD.D F16,F14,F2
clock 5: FP1: ADD.D F20,F18,F2 | FP2: ADD.D F24,F22,F2
clock 6: MEM1: S.D 0(R1),F4    | MEM2: S.D -8(R1),F8   | FP1: ADD.D F28,F26,F2
clock 7: MEM1: S.D -16(R1),F12 | MEM2: S.D -24(R1),F16
clock 8: MEM1: S.D -32(R1),F20 | MEM2: S.D -40(R1),F24 | INT: DSUBUI R1,R1,#48
clock 9: MEM1: S.D -0(R1),F28  | INT: BNEZ R1,LOOP

• Unrolled 7 iterations to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8x)
• Average: 2.5 ops per clock, 50% efficiency
• Note: need more registers in VLIW (15 vs. 6 in SS)
Problems with 1st-generation VLIW
• Increase in code size
▫ Loop unrolling
▫ Partially empty VLIWs
• Operated in lock-step; no hazard detection HW
▫ A stall in any functional unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
▫ The compiler might predict functional units, but caches are hard to predict
▫ Modern VLIWs are "interlocked" (identify dependences between bundles and stall)
• Binary code compatibility
▫ Strict VLIW => different numbers of functional units and unit latencies require different versions of the code
VLIW tradeoffs
• Advantages
▫ "Simpler" hardware, because the HW does not have to identify independent instructions
• Disadvantages
▫ Relies on a smart compiler
▫ Code incompatibility between generations
▫ There are limits to what the compiler can do (can't move loads above branches, can't move loads above stores)
• Common uses
▫ Embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue
IA-64 and EPIC
• 64-bit instruction set architecture
▫ Not a CPU, but an architecture
▫ Itanium and Itanium 2 are CPUs based on IA-64
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
• Uses EPIC: Explicitly Parallel Instruction Computing
• Departure from the x86 architecture
• Meant to achieve out-of-order performance with in-order HW + compiler smarts
▫ Stop bits to help with code density
▫ Support for control speculation (moving loads above branches)
▫ Support for data speculation (moving loads above stores)
• Details in Appendix G.6
Instruction bundle (VLIW)
• Functional units:
▫ I (integer), M (integer + memory), F (FP), B (branch), L + X (64-bit operands + special inst.)
• Template field:
▫ Maps instructions to functional units
▫ Indicates stops: limitations to ILP
Code example (1/2)
Code example (2/2)

Control speculation
• Can the compiler schedule an independent load above a branch?

  Bne R1, R2, TARGET
  Ld  R3, R4(0)

• What are the problems?
• EPIC provides speculative loads:

  Ld.s  R3, R4(0)
  Bne   R1, R2, TARGET
  Check R4(0)
Data speculation
• Can the compiler schedule an independent load above a store?

  St R5, R6(0)
  Ld R3, R4(0)

• What are the problems?
• EPIC provides "advanced loads" and an ALAT (Advanced Load Address Table):

  Ld.a R3, R4(0)   ; creates an entry in the ALAT
  St   R5, R6(0)   ; looks up the ALAT; if match, jump to fixup code
EPIC conclusions
• The goal of EPIC was to maintain the advantages of VLIW, but achieve the performance of out-of-order.
• Results:
▫ Complicated bundling rules save some space, but make the hardware more complicated
▫ Special hardware and instructions for scheduling loads above stores and branches (new, complicated hardware)
▫ Special hardware to remove branch penalties (predication)
▫ The end result is a machine as complicated as an out-of-order, but now also requiring a super-sophisticated compiler.
Instruction fetching
• Want to issue >1 instruction every cycle
• This means fetching >1 instruction▫ E.g. 4-8 instructions fetched every cycle
• Several problems
▫ Bandwidth / latency
▫ Determining which instructions to fetch
  – Jumps
  – Branches
• Integrated instruction fetch unit
Branch Target Buffer (BTB)
• Predicts the next instruction address and sends it out before decoding the instruction
• PC of branch sent to BTB
• When match is found, Predicted PC is returned
• If branch predicted taken, instruction fetch continues at Predicted PC
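The lookup described above can be sketched as a small direct-mapped table. This is an illustrative model only: the table size, the PC-based index and the PC + 4 fall-through are assumptions, not details from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical direct-mapped BTB mapping a branch PC to its
 * predicted target. Size and indexing are illustrative. */
#define BTB_ENTRIES 256

typedef struct {
    uint32_t tag;    /* PC of the branch */
    uint32_t target; /* predicted next PC if taken */
    int valid;
} btb_entry;

static btb_entry btb[BTB_ENTRIES];

/* Record a taken branch so later fetches of this PC predict its target. */
void btb_update(uint32_t pc, uint32_t target)
{
    btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->tag = pc;
    e->target = target;
    e->valid = 1;
}

/* Predicted next PC: the stored target on a hit, else fall through. */
uint32_t btb_predict(uint32_t pc)
{
    btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    return (e->valid && e->tag == pc) ? e->target : pc + 4;
}
```

On a miss the fetch unit simply continues sequentially; a mispredicted entry is repaired by calling btb_update again once the branch resolves.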
Possible optimizations?
Return Address Predictor
• Small buffer of return addresses acts as a stack
• Caches most recent return addresses
• Call ⇒ Push a return address on stack
• Return ⇒ Pop an address off stack & predict as new PC
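The push/pop behaviour above can be sketched as a tiny circular stack; the depth and the wrap-on-overflow policy are assumptions for illustration (overflow losing the oldest entries is exactly what makes very small buffers mispredict).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative return-address stack (RAS). The depth and the
 * wrap-around overflow policy are assumptions, not from the slides. */
#define RAS_DEPTH 8

static uint32_t ras[RAS_DEPTH];
static unsigned ras_top; /* number of pushes minus pops */

/* Call: push the return address; on overflow the oldest entry is lost. */
void ras_push(uint32_t ret_addr)
{
    ras[ras_top % RAS_DEPTH] = ret_addr;
    ras_top++;
}

/* Return: pop an address and predict it as the next PC. */
uint32_t ras_pop(void)
{
    ras_top--;
    return ras[ras_top % RAS_DEPTH];
}
```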
[Figure: misprediction frequency (0–70%) vs. number of return address buffer entries (0, 1, 2, 4, 8, 16) for the benchmarks go, m88ksim, cc1, compress, xlisp, ijpeg, perl and vortex]
Integrated Instruction Fetch Units
• Recent designs have implemented the fetch stage as a separate, autonomous unit
▫ Multiple-issue in one simple pipeline stage is too complex
• An integrated fetch unit provides:
▫ Branch prediction
▫ Instruction prefetch
▫ Instruction memory access and buffering
Limits to ILP
• Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
• However, such advances, when coupled with realistic hardware, are unlikely to overcome these limits in the near future
• How much ILP is available using existing mechanisms with increasing HW budgets?
Chapter 3
Ideal HW Model
1. Register renaming – infinite virtual registers
⇒ all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal
1&4 eliminates all but RAW
5. Perfect caches; 1-cycle latency for all instructions; unlimited instructions issued per clock cycle
Upper Limit to ILP: Ideal Machine (Figure 3.1)
[Figure 3.1: instructions per clock on the ideal machine – gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1. Integer programs: 18–60 IPC; FP programs: 75–150 IPC]
Instruction window
• Ideal HW needs to know the entire program
• Obviously not practical
▫ Register dependency checking scales quadratically
• Window: The set of instructions examined for simultaneous execution
• How does the size of the window affect IPC?
▫ Too small a window ⇒ can't see whole loops
▫ Too large a window ⇒ hard to implement
More Realistic HW: Window Impact (Figure 3.2)
[Figure 3.2: instructions per clock vs. window size (infinite, 2048, 512, 128, 32) for gcc, espresso, li, fpppp, doduc and tomcatv. Integer programs: 8–63 IPC; FP programs: 9–150 IPC]
Thread Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP explicitly represented by the use of multiple threads of execution that are inherently parallel
• Use multiple instruction streams to improve:
1. Throughput of computers that run many programs
2. Execution time of a single application implemented as a multi-threaded program (parallel program)
Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
▫ Must duplicate the independent state of each thread, e.g., a separate copy of the register file, PC and page table
▫ Memory shared through virtual memory mechanisms
▫ HW support for fast thread switching; much faster than a full process switch (≈ 100s to 1000s of clocks)
• When to switch?
▫ Alternate instructions per thread (fine grain)
▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Fine-Grained Multithreading
• Switches between threads on each instruction
▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads
• CPU must be able to switch threads every clock
• Hides both short and long stalls
▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads▫ Thread ready to execute without stalls will be delayed by
instructions from other threads
• Used on Sun’s Niagara
Coarse-Grained Multithreading
• Switch threads only on costly stalls (e.g. L2 cache miss)
• Advantages
▫ No need for very fast thread-switching
▫ Doesn't slow down a thread, since switches happen only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
▫ A new thread must fill the pipeline before instructions can complete
• ⇒ Better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Can a high-ILP processor also exploit TLP?
▫ Functional units are often idle because of stalls or
dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• ⇒ Simultaneous Multi-threading (SMT)
▫ Intel: Hyper-Threading
Simultaneous Multi-threading
[Figure: functional-unit occupancy per cycle (cycles 1–9) across units M M FX FX FP FP BR CC, first for one thread on 8 units, then for two threads on 8 units. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]
Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
▫ Large set of virtual registers
  – Virtual = not all visible at ISA level
  – Register renaming
▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Multi-threaded categories
[Figure: issue-slot usage over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading; legend: Thread 1–5, idle slot]
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
▫ How to reduce the impact on single-thread performance?
▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in:
▫ Instruction issue - more candidate instructions need to be considered
▫ Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
TDT 4260 – lecture 4 – 2011
• Contents
– Computer architecture introduction
  • Trends
  • Moore's law
  • Amdahl's law
  • Gustafson's law
– Why multiprocessor? Chap 4.1
  • Taxonomy
  • Memory architecture
  • Communication
– Cache coherence Chap 4.2
  • The problem
  • Snooping protocols
Updated lecture plan per 4/2
Date and lecturer – Topic
1: 14 Jan (LN, AI) – Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) – Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) – ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) – Multiprocessors, Chapter 4
5: 11 Feb (MG) – Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN) – Multiprocessors continued
7: 24 Feb (IB) – Memory and cache, cache coherence (Chap. 5)
8: 4 Mar (IB) – Piranha CMP + Interconnection networks
9: 11 Mar (LN) – Multicore architectures (Wiley book chapter) + Hill & Marty on Amdahl and multicore, Fedorova on asymmetric multicore
10: 18 Mar (IB) – Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) – (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) – Wrap-up lecture, remaining stuff
13: 8 Apr – Slack – no lecture planned
Trends
• For technology, costs, use
• Helps predict the future
• Product development time – 2-3 years
– design for the next technology
– Why should an architecture live longer than a product?
Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system – hardware, runtime system, compiler, operating
system, and application
– In networking this is called the “End to End argument”
• Computer architecture is not just about transistors(not at all), individual instructions, or particular implementations– E.g., Original RISC projects replaced complex
instructions with a compiler + simple instructions
Computer Architecture is Design and Analysis
Design
Analysis
Architecture is an iterative process:
• Searching the huge space of possible designs
• At all levels of computer systems
[Diagram: creativity feeds design; cost/performance analysis filters good ideas from mediocre and bad ideas]
TDT4260 Course Focus
Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century.
[Diagram: computer architecture (organization, hardware/software boundary, interface design (ISA)) at the center, surrounded by technology, programming languages, parallelism, operating systems, history, applications, compilers, and measurement & evaluation]
Holistic approach
e.g., to programmability combined with performance
• NTNU principle: teaching based on research; example: PhD project of Alexandru Iordan: TBP (Wool, TBB) – energy-aware task pool implementation
[Diagram: layered stack – multicore, interconnect, memory; operating system & system software; parallel & concurrent programming. Related work: multicore memory systems (Dybdahl PhD, Grannæs PhD, Jahre PhD, M5-sim, pfJudge)]
Moore’s Law: 2X transistors / “year”
• “Cramming More Components onto Integrated Circuits”
– Gordon Moore, Electronics, 1965
• # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
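The doubling rule can be turned into a toy growth model; the function below counts whole doubling periods only, and the starting count is an arbitrary parameter, not a figure from the slides.

```c
#include <assert.h>

/* Transistor count after `months` months if it doubles every n
 * months (whole doubling periods only; parameters are illustrative). */
double moore_transistors(double start, int months, int n)
{
    double t = start;
    int m;
    for (m = n; m <= months; m += n) /* one doubling per n months */
        t *= 2.0;
    return t;
}
```

With N = 24 this gives a 32x increase per decade (five doublings in 120 months).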
Tracking Technology Performance Trends
• 4 critical implementation technologies:
– Disks
– Memory
– Network
– Processors
• Compare bandwidth vs. latency improvements in performance over time
• Bandwidth: number of events per unit time
– E.g., Mbits/second over a network, Mbytes/second from disk
• Latency: elapsed time for a single event
– E.g., one-way network delay in microseconds, average disk access time in milliseconds
Latency Lags Bandwidth (last ~20 years)
[Figure: relative bandwidth improvement (log scale, 100–10,000) vs. relative latency improvement (1–100) for processor, memory, network and disk, with the reference line latency improvement = bandwidth improvement. CPU high, memory low ("Memory Wall")]
• Performance milestones
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
• Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10,000 Mb/s (16x, 1000x)
• Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10,000, 15,000 RPM (8x, 143x)
(Processor latency = typical # of pipeline stages × time per clock cycle)
COST and COTS
• Cost
– to produce one unit
– includes (development cost / # units sold)
– benefit of large volume
• COTS
– commodity off-the-shelf
  • much better performance/price per component
  • strong influence on the selection of components for building supercomputers for more than 20 years
Speedup
• General definition:
Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
– Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
• Superlinear speedup?
Amdahl’s Law (1967) (fixed problem size)
• “If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s”
• Total work in computation
– serial fraction s
– parallel fraction p
– s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
      = (s + p) / [s + (p/n)]
      = 1 / [s + (1 - s)/n]
      = n / [1 + (n - 1)s]
• ”pessimistic and famous”
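The closed form above is easy to evaluate numerically; a minimal sketch:

```c
#include <assert.h>

/* Amdahl's law: speedup on n processors when a fraction s of the
 * computation is inherently serial.  S(n) = 1 / (s + (1 - s)/n). */
double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}
```

Even a 5% serial fraction caps the speedup at 20 no matter how many processors are used, which is why the law is called pessimistic.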
Gustafson’s “law” (1987) (scaled problem size, fixed execution time)
• Total execution time on a parallel computer with n processors is fixed
– serial fraction s’
– parallel fraction p’
– s’ + p’ = 1 (100%)
• S’(n) = Time’(1) / Time’(n)
       = (s’ + p’n) / (s’ + p’)
       = s’ + p’n
       = s’ + (1 - s’)n
       = n + (1 - n)s’
• Reevaluating Amdahl’s Law, John L. Gustafson, CACM, May 1988, pp. 532–533. ”Not a new law, but Amdahl’s law with changed assumptions”
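The scaled-speedup formula can be evaluated the same way; a minimal sketch:

```c
#include <assert.h>

/* Gustafson's law: scaled speedup when the parallel execution time
 * is fixed and s is the serial fraction of that time.
 * S'(n) = s + (1 - s) * n. */
double gustafson_speedup(double s, int n)
{
    return s + (1.0 - s) * n;
}
```

Unlike Amdahl's fixed-size speedup, this grows almost linearly in n: with s = 0.05 and 100 processors the scaled speedup is about 95, not 20.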
How the serial fraction limits speedup
• Amdahl’s law
[Figure: speedup vs. number of processors for several values of the serial fraction]
• Work hard to reduce the serial part of the application
– remember I/O
– think different(ly) (than traditionally or sequentially)
Single/ILP vs. Multi/TLP
• Uniprocessor trends
– Getting too complex
– Speed of light
– Diminishing returns from ILP
• Multiprocessor
– Focus in the textbook: 4-32 CPUs
– Increased performance through parallelism
– Multichip
– Multicore ((Single) Chip Multiprocessors – CMP)
– Cost effective
• Right balance of ILP and TLP is unclear today– Desktop vs. server?
Other Factors – Multiprocessors
• Growth in data-intensive applications
– Databases, file servers, multimedia, …
• Growing interest in servers and server performance
• Increasing desktop performance less important
– Outside of graphics
• Improved understanding of how to use multiprocessors effectively
– Especially in servers, where there is significant natural TLP
• Advantage of leveraging design investment by replication
– Rather than unique design
• Power/cooling issues ⇒ multicore
Multiprocessor – Taxonomy
• Flynn’s taxonomy (1966, 1972)
– Taxonomy = classification
– Widely used, but perhaps a bit coarse
• Single Instruction Single Data (SISD)
– Common uniprocessor
• Single Instruction Multiple Data (SIMD)
– “ = Data Level Parallelism (DLP)”
• Multiple Instruction Single Data (MISD)– Not implemented?– Pipeline / Stream processing / GPU ?
• Multiple Instruction Multiple Data (MIMD)– Used today– “ = Thread Level Parallelism (TLP)”
Flynn’s taxonomy (1/2) – Single/Multiple Instruction/Data Stream
[Diagram: SISD uniprocessor; SIMD with distributed memory; MIMD with shared memory]
Flynn’s taxonomy (2/2), MISD – Single/Multiple Instruction/Data Stream
[Diagram: MISD (software pipeline)]
Advantages of MIMD
• Flexibility
– High single-user performance, multiple programs, multiple threads
– High multiple-user performance
– Combination
• Built using commercial off-the-shelf (COTS) components
– 2 x uniprocessor = multi-CPU
– 2 x uniprocessor core on a single chip = multicore
MIMD: Memory architecture
[Diagram: centralized memory – processors P1…Pn, each with a cache ($), connected through an interconnection network (IN) to shared memory modules; distributed memory – each processor has its own cache and local memory, and the nodes are connected by an interconnection network]
Centralized Memory Multiprocessor
• Also called:
• Symmetric Multiprocessors (SMPs)
• Uniform Memory Access (UMA) architecture
• Shared memory becomes a bottleneck
• Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and by using many memory banks
• Scaling beyond that is hard
Distributed (Shared) Memory Multiprocessor
• Pro: Cost-effective way to scale memory bandwidth
– If most accesses are to local memory
• Pro: Reduces latency of local memory accesses
• Con: Communication becomes more complex
• Pro/Con: Possible to change software to take advantage of memory that is close, but this can also make SW less portable
– Non-Uniform Memory Access (NUMA)
MP (MIMD), cluster of SMPs
[Diagram: two SMP nodes, each with several processors with caches on a node interconnect network sharing memory and I/O, joined by a cluster interconnection network]
• Combination of centralized and distributed
• Like an early version of the kongull cluster
Distributed memory
1. Shared address space
• Logically shared, physically distributed
• Distributed Shared Memory (DSM)
• NUMA architecture
2. Separate address spaces
• Every P-M module is a separate computer
• Multicomputer
• Clusters
• Not a focus in this course
[Diagrams: conceptual model vs. implementation – processor-memory (P-M) pairs connected through a network]
Communication models
• Shared memory
– Centralized or Distributed Shared Memory
– Communication using LOAD/STORE
– Coordinated using traditional OS methods
  • Semaphores, monitors, etc.
– Busy-waiting more acceptable than on a uniprocessor
• Message passing
– Using send (put) and receive (get)
  • Asynchronous / synchronous
– Libraries, standards
  • …, PVM, MPI, …
Limits to parallelism
• We need separate processes and threads!
– Can’t split one thread among CPUs/cores
• Parallel algorithms needed
– A separate field
– Some problems are inherently serial
  • P-complete problems
  – Part of parallel complexity theory
• See minicourse TDT6 – Heterogeneous and green computing
• http://www.idi.ntnu.no/emner/tdt4260/tdt6
• Amdahl’s law
– The serial fraction of the code limits speedup
– Example: a speedup of 80 with 100 processors requires that at most 0.25% of the time is spent on serial code
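The 0.25% figure in the example can be checked by inverting Amdahl's law: from S = n / (1 + (n-1)s), the largest admissible serial fraction is s = (n/S - 1)/(n - 1). A minimal sketch:

```c
#include <assert.h>

/* Largest serial fraction s that still allows speedup S on n
 * processors, from Amdahl's law S = n / (1 + (n - 1)s). */
double max_serial_fraction(double speedup, double n)
{
    return (n / speedup - 1.0) / (n - 1.0);
}
```

For S = 80 and n = 100 this gives 0.25/99 ≈ 0.0025, i.e. about 0.25% serial time, matching the slide.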
SMP: Cache Coherence Problem
[Diagram: three processors P1–P3 with private caches on a bus with memory and I/O devices. Events: (1) P1 reads u = 5, (2) P3 reads u = 5, (3) P3 writes u = 7, (4) P1 reads u, (5) P2 reads u]
• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
– the correct value (if write-through caches)
– the old value (if write-back caches)
• Unacceptable to programs, and frequent!
Enforcing coherence
• Separate caches make multiple copies frequent
– Migration
  • Moved from shared memory to local cache
  • Speeds up access, reduces memory bandwidth requirements
– Replication
  • Several local copies when an item is read by several processors
  • Speeds up access, reduces memory contention
• Need coherence protocols to track shared data
– Directory based
  • Status in a shared location (Chap. 4.4)
– (Bus) snooping
  • Each cache maintains local status
  • All caches monitor a broadcast medium
  • Write invalidate / write update
Snooping: Write invalidate
• Several reads or one write: no change
• Writes require exclusive access
• Writes to shared data: all other cache copies are invalidated
– Invalidate command and address are broadcast
– All caches listen (snoop) and invalidate if necessary
• Read miss:
– Write-through: memory is always up to date
– Write-back: caches listen and any exclusive copy is put on the bus
Snooping: Write update
• Also called write broadcast
• Must know which cache blocks are shared
• Usually write-through
– Write to shared data: broadcast; all caches listen and update their copy (if any)
– Read miss: main memory is up to date
Snooping: Invalidate vs. Update
• Repeated writes to the same address (no reads) require several updates, but only one invalidate
• Invalidates are done at cache-block level, while updates are done on individual words
• The delay from when a word is written until it can be read is shorter for updates
• Invalidate is most common
– Less bus traffic
– Less memory traffic
– Bus and memory bandwidth are the typical bottlenecks
An Example Snoopy Protocol
• Invalidation protocol, write-back cache
• Each cache block is in one state:
– Shared: clean in all caches and up to date in memory; the block can be read
– Exclusive: one cache has the only copy; it is writeable and dirty
– Invalid: the block contains no data
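The three states can be sketched as a tiny transition function. This is a deliberate simplification for illustration: bus signalling, write-backs and local read misses are omitted.

```c
#include <assert.h>

/* Block states from the example protocol. */
typedef enum { INVALID, SHARED, EXCLUSIVE } block_state;

/* Local write: the cache broadcasts an invalidate and gains the
 * only (dirty) copy, whatever state the block was in. */
block_state on_local_write(block_state s)
{
    (void)s;
    return EXCLUSIVE;
}

/* Snooped invalidate caused by another cache's write. */
block_state on_remote_write(block_state s)
{
    (void)s;
    return INVALID;
}

/* Snooped read miss: an Exclusive (dirty) copy supplies the block
 * on the bus and drops to Shared; other states are unchanged. */
block_state on_remote_read(block_state s)
{
    return (s == EXCLUSIVE) ? SHARED : s;
}
```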
Snooping: Invalidation protocol (1/6)
[Diagram: processors 0…N-1 above an interconnection network with main memory and an I/O system. One processor issues read x ⇒ read miss on the bus]
Snooping: Invalidation protocol (2/6)
[The block x is loaded from main memory into that cache in state shared]
Snooping: Invalidation protocol (3/6)
[A second processor issues read x ⇒ read miss on the bus]
Snooping: Invalidation protocol (4/6)
[Both caches now hold x in state shared]
Snooping: Invalidation protocol (5/6)
[One of the processors writes x ⇒ an invalidate is broadcast; the other cache's copy is invalidated]
Snooping: Invalidation protocol (6/6)
[The writing cache holds the new value of x in state exclusive]
Prefetching
Marius Grannæs
Feb 11th, 2011
www.ntnu.no M. Grannæs, Prefetching
About Me
• PhD from NTNU in Computer Architecture in 2010
• “Reducing Memory Latency by Improving Resource Utilization”
• Supervised by Lasse Natvig
• Now working for Energy Micro
• Working on energy profiling, caching and prefetching
• Software development
About Energy Micro
• Fabless semiconductor company
• Founded in 2007 by ex-Chipcon founders
• 50 employees
• Offices around the world
• Designing the world’s most energy-friendly microcontrollers
• Today: EFM32 Gecko
• Next Friday: EFM32 Tiny Gecko (cache)
• May(ish): EFM32 Giant Gecko (cache + prefetching)
• Ambition: 1% market share...
• ...of a $30 bn market.
What is Prefetching?

Prefetching
Prefetching is a technique for predicting future memory accesses and fetching the data into the cache before it is needed.
The Memory Wall
[Figure: relative performance (log scale, 1–100,000) vs. year (1980–2010) for CPU performance and memory performance; the gap grows steadily]
W. Wulf and S. McKee, “Hitting the Memory Wall: Implications of the Obvious”
A Useful Analogy
• An Intel Core i7 can execute 147,600 million instructions per second.
• ⇒ A carpenter can hammer one nail per second.
• DDR3-1600 RAM can perform 65 million transfers per second.
• ⇒ The carpenter must wait 38 minutes per nail.
Solution

Solution outline:
1 You bring an entire box of nails.
2 Keep the box close to the carpenter.
Analysis: Carpenting
How long (on average) does it take to get one nail?
Nail latency
L_Nail = L_Box + p_Box is empty · (L_Shop + L_Traffic)

L_Nail — time to get one nail.
L_Box — time to check and fetch one nail from the box.
p_Box is empty — probability that the box you have is empty.
L_Shop — time to go to the shop (38 minutes).
L_Traffic — time lost due to traffic.
Solution (for computers)
• Faster, but smaller memory closer to the processor.
• Temporal locality
  • If you needed X in the past, you are probably going to need X in the near future.
• Spatial locality
  • If you need X, you probably need X + 1.
⇒ If you need X, put it in the cache, along with everything else close to it (cache line).
Analysis: Caches

System latency
L_System = L_Cache + p_Miss · (L_Main Memory + L_Congestion)

L_System — total system latency.
L_Cache — latency of the cache.
p_Miss — probability of a cache miss.
L_Main Memory — main memory latency.
L_Congestion — latency due to main memory congestion.
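The latency model above can be evaluated directly; the numbers in the sketch below are illustrative, not measured.

```c
#include <assert.h>

/* L_system = L_cache + p_miss * (L_main_memory + L_congestion),
 * all latencies in the same unit (e.g. cycles). */
double system_latency(double l_cache, double p_miss,
                      double l_mem, double l_congestion)
{
    return l_cache + p_miss * (l_mem + l_congestion);
}
```

It makes explicit why both a lower miss probability and less congestion help: each one scales the expensive main-memory term.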
DRAM in perspective
• “Incredibly slow” DRAM has a response time of 15.37 ns.
• Speed of light is 3 · 10^8 m/s.
• Physical distance from processor to DRAM chips is typically 20 cm.

2 · 20 · 10^−3 m / (3 · 10^8 m/s) = 0.13 ns    (1)

• Just 2 orders of magnitude!
• Intel Core i7 – 147,600 million instructions per second.
• Ultimate laptop – 5 · 10^50 operations per second/kg.

Lloyd, Seth, “Ultimate physical limits to computation”
When does caching not work?
The four Cs:
• Cold/Compulsory
  • The data has not been referenced before.
• Capacity
  • The data has been referenced before, but has been thrown out because of the limited size of the cache.
• Conflict
  • The data has been thrown out of a set-associative cache because it would not fit in the set.
• Coherence
  • Another processor (in a multi-processor/core environment) has invalidated the cache line.
We can buy our way out of Capacity and Conflict misses, but not Cold or Coherence misses!
Cache Sizes
[Figure: cache size in kB (log scale, 1–10,000) vs. year (1985–2010) for 80486DX, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, Pentium 4E, Core 2 and Core i7]
Core i7 (Lynnfield) – 2009
[Figure: processor die photo]

Pentium M – 2003
[Figure: processor die photo]
Prefetching
Prefetching increases the performance of caches by predicting what data is needed and fetching that data into the cache before it is referenced. Need to know:
• What to prefetch?
• When to prefetch?
• Where to put the data?
• How do we prefetch? (Mechanism)
Prefetching Terminology

Good Prefetch
A prefetch is classified as Good if the prefetched block is referenced by the application before it is replaced.

Bad Prefetch
A prefetch is classified as Bad if the prefetched block is not referenced by the application before it is replaced.
Accuracy

The accuracy of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

Accuracy = G / (G + B)
Coverage

If a conventional cache has M misses without using any prefetch algorithm, the coverage of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

Coverage = G / M
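Both metrics are simple ratios; a minimal sketch using the counts G, B and M defined above:

```c
#include <assert.h>

/* Accuracy = G / (G + B): fraction of prefetches that were useful.
 * Coverage = G / M: fraction of the original misses removed. */
double prefetch_accuracy(int good, int bad)
{
    return (double)good / (good + bad);
}

double prefetch_coverage(int good, int misses)
{
    return (double)good / misses;
}
```

The two pull in opposite directions: prefetching more aggressively tends to raise coverage but lower accuracy.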
Prefetching

System Latency
L_system = L_cache + p_miss · (L_main memory + L_congestion)

• If a prefetch is good:
  • p_miss is lowered
  • ⇒ L_system decreases
• If a prefetch is bad:
  • p_miss becomes higher because useful data might be replaced
  • L_congestion becomes higher because of useless traffic
  • ⇒ L_system increases
21
Prefetching TechniquesTypes of prefetching:• Software
• Special instructions.• Most modern high performance processors have them.• Very flexible.• Can be good at pointer chasing.• Requires compiler or programmer effort.• Processor executes prefetches instead of computation.• Static (performed at compile-time).
• Hardware• Hybrid
www.ntnu.no M. Grannæs, Prefetching
21
Prefetching Techniques
Types of prefetching:• Software• Hardware
• Dedicated hardware analyzes memory references.• Most modern high performance processors have them.• Fixed functionality.• Requires no effort by the programmer or compiler.• Off-loads prefetching to hardware.• Dynamic (performed at run-time)
• Hybrid
www.ntnu.no M. Grannæs, Prefetching
21
Prefetching Techniques
Types of prefetching:• Software• Hardware• Hybrid
• Dedicated hardware unit.• Hardware unit programmed by software.• Some effort required by the programmer or compiler.
Software Prefetching

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

       MOV   r1, 0           ; acc
       MOV   r0, #0          ; i
Label: LOAD  r2, r0(#data)   ; cache miss! (400 cycles!)
       ADD   r1, r2          ; acc += data[i]
       INC   r0              ; i++
       CMP   r0, #10000      ; i < 10000
       BL    Label           ; branch if less
Software Prefetching II

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

Simple optimization using __builtin_prefetch():

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Why add 10 (and not 1)?
Prefetch distance – memory latency >> computation latency.
Software Prefetching III

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Note:
• data[0] → data[9] will not be prefetched.
• data[10000] → data[10009] will be prefetched, but not used.

Accuracy = G / (G + B) = 9990 / 10000 = 0.999 = 99.9%
Coverage = G / M = 9990 / 10000 = 0.999 = 99.9%
Complex Software

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    if (someFunction(i) == True) {
        acc += data[i];
    }
}

Does prefetching pay off in this case?
• How many times is someFunction(i) true?
• How much memory bus access is performed in someFunction(i)?
• Does power matter?

We have to profile the program to know!
Dynamic Data Structures I

typedef struct node {
    int data;
    struct node *next;
} node_t;

while ((node = node->next) != NULL) {
    acc += node->data;
}
Dynamic Data Structures II

typedef struct node {
    int data;
    struct node *next;
    struct node *jump;
} node_t;

while ((node = node->next) != NULL) {
    __builtin_prefetch(node->jump);
    acc += node->data;
}
Hardware Prefetching
Software prefetching:
• Needs programmer effort to implement
• Prefetch instructions displace computation
• Compile-time
• Very flexible

Hardware prefetching:
• No programmer effort
• Does not displace compute instructions
• Run-time
• Not flexible
Sequential Prefetching

The simplest prefetcher, but surprisingly effective due to spatial locality.

Miss on address X ⇒ fetch X+n, X+n+1, ..., X+n+j

n: prefetch distance
j: prefetch degree
Collectively known as prefetch aggressiveness.
Sequential Prefetching II

[Bar chart: speedup (1×–5×) of sequential prefetching on the SPEC benchmarks libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3]
Reference Prediction Tables
Tien-Fu Chen and Jean-Loup Baer (1995)
• Builds upon sequential prefetching and stride-directed prefetching.
• Observation: non-unit strides occur in many applications
  • 2, 4, 6, 8, 10 (stride 2)
• Observation: each load instruction has a distinct access pattern

Reference Prediction Tables (RPT):
• Table indexed by the load instruction
• Simple state machine
• Stores a single delta of history.
Reference Prediction Tables

[Worked example; each RPT entry holds PC, last address, delta and state]
• Miss at address 1 (PC 100): entry allocated — last addr. 1, no delta, state Initial.
• Miss at address 3: delta 2 recorded — last addr. 3, delta 2, state Training.
• Miss at address 5: delta 2 confirmed — state Prefetch; prefetch address 5 + 2 = 7 is issued.
[Bar chart: speedup (1×–5×) on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3 — Sequential vs. RPT]
Global History Buffer

K. Nesbit, A. Dhodapkar and J. Smith (2004)
• Observation: predicting more complex patterns requires more history
• Observation: a lot of the history in the RPT is very old

Program Counter/Delta Correlation (PC/DC):
• Store all misses in a FIFO called the Global History Buffer (GHB)
• Linked list of all misses from one load instruction
• Traversing the linked list gives a history for that load
Global History Buffer

[Worked example: the index table entry for PC 100 points at the newest GHB entry; the GHB holds the miss addresses 1, 3 and 5, each linked to the previous miss from the same PC. Traversing the links yields the delta buffer (2, 2).]
Delta Correlation

• In the previous example, the delta buffer only contained two values (2, 2).
• Thus it is easy to guess that the next delta is also 2.
• We can then prefetch: current address + delta = 5 + 2 = 7

What if the pattern is repeating, but not regular?
1, 2, 3, 4, 5, 1, 2, 3, 4, 5
Delta Correlation

[Worked example on the miss stream 10, 11, 13, 16, 17, 19, 22 (deltas 1, 2, 3, 1, 2, 3): the most recent delta pair (2, 3) is matched against its earlier occurrence in the history; the deltas that followed it are replayed from the last address 22, predicting prefetches for 23 and 25.]
PC/DC

[Bar chart: speedup (1×–5×) on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3 — Sequential vs. RPT vs. PC/DC]
Data Prefetching Championships

• Organized by JILP
• Held in conjunction with HPCA’09
• Modeled on the earlier branch prediction championships
• Everyone uses the same API (six function calls)
• Same set of benchmarks
• Third party evaluates performance
• 20+ prefetchers submitted

http://www.jilp.org/dpc/
Delta Correlating Prediction Tables
• Our submission to DPC-1
• Observation: GHB pointer chasing is expensive.
• Observation: history doesn’t really get old.
• Observation: history would reach a steady state.
• Observation: deltas are typically small, while the address space is large.
• Table indexed by the PC of the load
• Each entry holds the history of the load in the form of deltas.
• Delta correlation

Entry format: PC | Last Addr. | Last Pref. | delta history (d1 ... dn) | Ptr
Delta Correlating Prefetch Tables

[Worked example for the entry with PC 100 on the miss stream 10, 11, 13, 16, 17, 19, 22: each miss appends a new delta to the entry’s circular delta history (1, 2, 3, 1, 2, 3) and updates the last address; delta correlation on this history then drives the prefetches.]
Delta Correlating Prefetch Tables

[Bar chart: speedup (1×–5×) on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3 — Sequential vs. RPT vs. PC/DC vs. DCPT]
DPC-1 Results

1. Access Map Pattern Matching
2. Global History Buffer - Local Delta Buffer
3. Prefetching based on a Differential Finite Context Machine
4. Delta Correlating Prediction Tables

What did the winning entries do differently?
• AMPM: massive reordering to expose more patterns.
• GHB-LDB and PDFCM: prefetch into the L1.
Access Map Pattern Matching
• Winning entry by Ishii et al.
• Divides memory into hot zones
• Each zone is tracked using a 2-bit vector
• Examines each zone for constant strides
• Ignores temporal information

Lesson
Modern processors and compilers can reorder loads, so the temporal information in the miss stream may be misleading.
Global History Buffer - Local Delta Buffer

• Second place by Dimitrov et al.
• Somewhat similar to DCPT
• Improves PC/DC prefetching by including global correlation and the most common stride
• Prefetches directly into the L1

Lesson
Prefetching into the L1 gives an extra performance boost; so does using the most common stride.
Prefetching based on a Differential Finite Context Machine
• Third place by Ramos et al.
• Table with the most recent history for each load.
• A hash of the history is computed and used to look up the predicted stride in a second table
• Repeat the process to increase prefetching degree/distance
• Separate prefetcher for the L1

Lesson
Use feedback to adjust prefetching degree and distance; prefetch into the L1.
Improving DCPT

Partial Matching: technique for handling reordering, common strides, etc.

L1 Hoisting: technique for handling L1 prefetching.
Partial Matching

• AMPM ignores all temporal information
• Reordering the delta history is very expensive:
  reordering 5 accesses gives 5! = 120 possibilities
• Solution: reduce spatial resolution by ignoring the low bits

Example delta stream:
8, 9, 10, 8, 10, 9 ⇒ (ignore lower 2 bits) 8, 8, 8, 8, 8, 8
L1 Hoisting

• All three top entries had mechanisms for prefetching into the L1
• Problem: pollution
• Solution: use the same highly accurate mechanism to prefetch into the L1.
• In the steady state, only the last predicted delta will be used.
• All other deltas have been prefetched and are either in the L2 or on their way.
• Hoist the first delta from the L2 to the L1 to increase performance.
L1 Hoisting II

Example delta stream:
2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3

Steady state:
Prefetch the last delta into the L2.
Hoist the first delta into the L1.
DCPT-P

[Bar chart: speedup (0–7×) on milc, GemsFDTD, libquantum, leslie3d, lbm and sphinx3 — DCPT-P vs. AMPM, GHB-LDB, PDFCM, RPT and PC/DC]
Interaction with the memory controller

• So far we’ve talked about what to prefetch (the address)
• When and how is equally important
• Modern DRAM is complex
• Modern DRAM controllers are even more complex
• Bandwidth limited
Modern DRAM

• Can have multiple independent memory controllers
• Can have multiple channels per controller
• Typically multiple banks
• Each bank contains several pages (rows) of data (typically 1 KB–8 KB)
• Each accessed page is put in a single page buffer
• Access time to the page buffer is much lower than a full access
The 3D structure of modern DRAM

[Figure series: controllers, channels, banks and pages (rows) of a modern DRAM device]
Example

Suppose a processor requires data at locations X1 and X2, located on the same DRAM page, at times T1 and T2.
There are two separate outcomes:
Case 1:

The requests occur at roughly the same time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Read 2 (T2) enters the memory controller
4. Data X1 is returned from DRAM
5. Data X2 is returned from DRAM
6. The page is closed

Although there are two separate reads, the page is only opened once.
Case 2:

The requests are separated in time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Data X1 is returned from DRAM
4. The page is closed
5. Read 2 (T2) enters the memory controller
6. The page is opened again
7. Data X2 is returned from DRAM
8. The page is closed

The page is opened and closed twice. By prefetching X2 we can increase performance by reducing latency and increasing memory throughput.
When does prefetching pay off?
The break-even point:

Prefetching Accuracy · Cost of Prefetching = Cost of Single Read

What is the cost of prefetching?
• Application dependent
• Less than the cost of a single read, because a prefetch can:
  • utilize open pages
  • reduce latency
  • increase throughput
  • exploit multiple banks
Performance vs. Accuracy

[Scatter plot: prefetch accuracy (0–100%) against IPC improvement (−40% to +60%) for sequential, scheduled region, CZone/Delta correlation and reference prediction tables prefetching, with a threshold line separating harmful from helpful prefetching]
Q&A
Thank you for listening!
TDT 4260 – lecture 17/2
• Contents
  – Cache coherence, Chap 4.2
    • Repetition
    • Snooping protocols
  • SMP performance, Chap 4.3
    – Cache performance
  • Directory-based cache coherence, Chap 4.4
  • Synchronization, Chap 4.5
  • UltraSPARC T1 (Niagara), Chap 4.8
1 Lasse Natvig
Updated lecture plan pr. 17/2
Date and lecturer | Topic
1: 14 Jan (LN, AI) | Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) | Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) | ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) | Multiprocessors, Chapter 4
5: 11 Feb (MG) | Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN, MJ) | Multiprocessors continued // Writing a comp.arch. paper (relevant for miniproject, by MJ)
7: 24 Feb (IB) | Memory and cache, cache coherence (Chap. 5)
8: 3 Mar (IB) | Piranha CMP + Interconnection networks
9: 11 Mar (LN) | Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
10: 18 Mar (IB) | Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) | (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) | Wrap-up lecture, remaining stuff
13: 8 Apr | Slack – no lecture planned
Miniproject groups, updates?

Rank | Prefetcher | Group | Score
1 | rpt64k4_pf | Farfetched | 1.089
2 | rpt_prefetcher_rpt_seq | L2Detour | 1.072
3 | teeest | Group 6 | 1.000
IDI Open, a challenge for you?
• http://events.idi.ntnu.no/open11/
• 2 April, programming contest, informal, fun, pizza, coke (?), party (?), 100–150 people, mostly students, low threshold
• Teams: 3 persons, one PC, Java, C/C++ ?
• Problems: some simple, some tricky
• Our team ”DM-gruppas beskjedne venner” is challenging you students!
  – And we will challenge some of all the ICT companies in Trondheim
SMP: Cache Coherence Problem

[Figure: three processors P1, P2, P3 with private caches on a shared bus to memory and I/O devices; memory initially holds u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again, (5) P2 reads u.]

• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
  – correct value (if write-through caches)
  – old value (if write-back caches)
• Unacceptable to programs, and frequent!
Enforcing coherence (recap)
• Separate caches speed up access
  – Migration
    • Moved from shared memory to local cache
  – Replication
    • Several local copies when an item is read by several processors
• Need coherence protocols to track shared data
  – (Bus) snooping
    • Each cache maintains local status
    • All caches monitor a broadcast medium
    • Write invalidate / write update
State Machine (1/3)

State machine for CPU requests, for each cache block:

[Diagram: states Invalid, Shared (read/only), Exclusive (read/write)]
• Invalid → Shared: CPU read miss — place read miss on bus
• Invalid → Exclusive: CPU write — place write miss on bus
• Shared: CPU read hit stays; CPU read miss — place read miss on bus
• Shared → Exclusive: CPU write — miss ⇒ write miss on bus; hit ⇒ invalidate on bus
• Exclusive: CPU read hit and CPU write hit stay
• Exclusive → Shared: CPU read miss — write back block, place read miss on bus
• Exclusive → Exclusive: CPU write miss — write back cache block, place write miss on bus
State Machine (2/3)

State machine for bus requests, for each cache block:

[Diagram: states Invalid, Shared (read/only), Exclusive (read/write)]
• Shared → Invalid: write miss / invalidate for this block
• Exclusive → Invalid: write miss for this block — write back the block (abort memory access)
• Exclusive → Shared: read miss for this block — write back the block (abort memory access)
State Machine (3/3)

Combined state machine for CPU requests and bus requests, for each cache block:

[Diagram: the CPU-request transitions of (1/3) and the bus-request transitions of (2/3) overlaid on the same three states Invalid, Shared (read/only) and Exclusive (read/write)]
Directory based cache coherence (1/2)

• Large MP systems, lots of CPUs
• Distributed memory preferable
  – Increases memory bandwidth
• Snooping bus with broadcast?
  – A single bus becomes a bottleneck
  – Other ways of communicating needed
• With these, broadcasting is hard/expensive
  – Can avoid broadcast if we know exactly which caches have a copy ⇒ Directory
Directory based cache coherence (2/2)
• Directory knows which blocks are in which cache and their state
• Directory can be partitioned and distributed
• Typical states:
  – Shared
  – Uncached
  – Modified
• Protocol based on messages
• Invalidate and update sent only where needed
  – Avoids broadcast, reduces traffic (Fig 4.19)
SMP performance (shared memory)

• Focus on cache performance
• 3 types of cache misses in a uniprocessor (the 3 C’s)
  – Capacity (too small for working set)
  – Compulsory (cold-start)
  – Conflict (placement strategy)
• Multiprocessors also give coherence misses
  – True sharing
    • Misses because of sharing of data
  – False sharing
    • Misses because of invalidates that would not have happened with cache block size = one word
Example: L3 cache size (fig 4.11)
• AlphaServer 4100
  – 4 x Alpha @ 300 MHz
  – L1: 8 KB I + 8 KB D
  – L2: 96 KB
  – L3: off-chip, 2 MB

[Bar chart: normalized execution time for L3 cache sizes 1, 2, 4 and 8 MB, broken into instruction execution, L2/L3 cache access, memory access, PAL code and idle time]
Example: L3 cache size (fig 4.12)

[Bar chart: memory cycles per instruction (0–3.25) for cache sizes 1, 2, 4 and 8 MB, broken into instruction, capacity/conflict, cold, false sharing and true sharing components]
Example: Increasing parallelism (fig 4.13)

[Bar chart: memory cycles per instruction (0–3) for processor counts 1, 2, 4, 6 and 8, broken into instruction, conflict/capacity, cold, false sharing and true sharing components]
Example: Increased block size (fig 4.14)

[Bar chart: misses per 1,000 instructions (0–16) for block sizes 32, 64, 128 and 256 bytes, broken into instruction, capacity/conflict, cold, false sharing and true sharing components]
2/18/2011
How to Write a Computer Architecture Paper
TDT4260 Computer Architecture
18. February 2011
Magnus Jahre
2nd Branch Prediction Championship
• International competition similar to our prefetching exercise system
• Task: Implement your best possible branch predictor and write a paper about it
• Submission deadline: 15. April 2011
• More info: http://www.jilp.org/jwac-2/
How does pfJudge work?
• Each submitted file is one Kongull job
  – Contains 12 M5 instances since there are 12 CPUs per node
  – Each M5 instance runs a different SPEC 2000 benchmark
• The Kongull job is added to the job queue
  – Status “Running” can mean running or queued, be patient
  – Running a job can take a long time depending on load
  – Kongull is usually able to empty the queue during the night
• We can give you a regular user account on Kongull
  – Remember that Kongull is a shared resource!
  – Always calculate the expected CPU-hour demand of your experiment before submitting
Storage Estimation

• We impose a storage limit of 8 KB on your prefetchers
  – This limit is not checked by the exercise system
• This is realistic: hardware components are usually designed with an area budget in mind
• Estimating storage is simple
  – Table-based prefetcher: add up the bits used in each entry and multiply by the number of entries
HOW TO USE A SIMULATOR
Research Workflow

[Flowchart, abridged: ... → Evaluate Solution on Compute Cluster → Receive PhD (get a real job)]
Why simulate?

• Model of a system
  – Model the interesting parts with high accuracy
  – Model the rest of the system with sufficient accuracy
• “All models are wrong but some are useful” (G. Box, 1979)
• The model does not necessarily have a one-to-one correspondence with the actual hardware
  – Try to model behavior
  – Simplify your code wherever possible
Know your model

• You need to figure out which system is being modeled!
• Pfsys is a help to getting started, but to draw conclusions from your work you need to understand what you are modeling
HOW TO WRITE A PAPER
Find Your Story
• A good computer architecture paper tells a story– All good stories have a bad guy: the problem– All good stories have a hero: the scheme
• Writing a good paper is all about finding and identifying your story
• Note that this story has to be told within the strict structure of a scientific article
Paper Format
• You will be pressed for space
• Try to say things as precisely as possibleYour first write up can be as much as 3x the page limit and it’s still– Your first write-up can be as much as 3x the page limit and it s still easy (possible) to get it under the limit
• Think about your plots/figures
– A good plot/figure gives a lot of information
– Is this figure the best way of conveying this idea?
– Is this plot the best way of visualizing this data?
– Plots/figures need to be area efficient (but readable!)
Typical Paper Outline
• Abstract
• Introduction
• Background/Related Work
• The Scheme (substitute with a descriptive title)
• Methodology
• Results
• Discussion
• Conclusion (with optional further work)
Abstract
• An experienced reader should be able to understand exactly what you have done from only reading the abstract
– This is different from a summary
• Should be short; the maximum varies from 150 to 200 words
• Should include a description of the problem, the solution and the main results
• Typically the last thing you write
Introduction
• Introduces the larger research area that the paper is a part of
• Introduces the problem at hand
• Explains the scheme
• Level of abstraction: “20 000 feet”
Related Work
• Reference the work that other researchers have done that is related to your scheme
• Should be complete (i.e. contain all relevant work)
– Remember: you define the scope of your work
• Can be split into two sections: Background and Related Work
– Background is an informative introduction to the field (often section 2)
– Related work is a very dense section that includes all relevant references (often section n-1)
The Scheme
• Explain your scheme in detail– Choose an informative title
• Trick: Add an informative figure that helps explain your scheme
• If your scheme is complex, an informative example may be in order
Methodology
• Explains your experimental setup
• Should answer the following questions:
– Which simulator did you use?
– How have you extended the simulator?
– Which parameters did you use for your simulations? (aim: reproducibility)
– Which benchmarks did you use?
– Why did you choose these benchmarks?
• Important: should be realistic
• If you are unsure about a parameter, run a simulation to check its impact
Results
• Show that your scheme works
• Compare to other schemes that do the same thing– Hopefully you are better, but you need to compare anyway
• Trick: “Oracle Scheme”
– Uses “perfect” information to create an upper bound on the performance of a class of schemes
– Prefetching: best case is that all L2 accesses are hits
• Sensitivity analysis
– Check the impact of model assumptions on your scheme
Discussion
• Only include this if you need it
• Can be used if:
– You have weaknesses in your model that you have not accounted for
– You tested improvements to your scheme that did not give good enough results to be included in “The Scheme” section
Conclusion
• Repeat the main results of your work
• Remember that the abstract, introduction and conclusion are usually read before the rest of the paper
• Can include Further Work:– Things you thought about doing that you did not have time to do
Thank You
Visit our website: http://research.idi.ntnu.no/multicore/
TDT 4260 – Chapter 5
TLP & Memory Hierarchy
Review on ILP
• What is ILP ?
• Let the compiler find the ILP
▫ Advantages?
▫ Disadvantages?
• Let the HW find the ILP
▫ Advantages?
▫ Disadvantages?
Contents
• Multi-threading Chap 3.5
• Memory hierarchy Chap 5.1
▫ 6 basic cache optimizations
• 11 advanced cache optimizations Chap 5.2
Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
▫ Must duplicate independent state of each thread, e.g., a separate copy of register file, PC and page table
▫ Memory shared through virtual memory mechanisms
▫ HW for fast thread switch; much faster than a full process switch (≈ 100s to 1000s of clocks)
• When to switch?
▫ Alternate instructions per thread (fine grain)
▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Fine-Grained Multithreading
• Switches between threads on each instruction
▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads
• CPU must be able to switch threads every clock
• Hides both short and long stalls▫ Other threads executed when one thread stalls
• But slows down execution of individual threads
▫ A thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun’s Niagara
Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
▫ No need for very fast thread-switching
▫ Doesn’t slow down a thread, since the CPU switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
▫ Since the CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
▫ The new thread must fill the pipeline before instructions can complete
• => Better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a system
• Can a high-ILP processor also exploit TLP?
▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)▫ Intel: Hyper-Threading
Simultaneous Multi-threading
[Figure: issue slots per cycle (1–9) for one thread on a machine with 8 functional units — many slots go unused]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
[Figure: the same 8 units over cycles 1–9 with two threads issuing simultaneously — far fewer idle slots]
Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
▫ Large set of virtual registers
– Virtual = not all visible at ISA level
– Register renaming
▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Multi-threaded categories
[Figure: issue-slot diagrams over time (processor cycles) comparing superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; shading distinguishes threads 1–5 and idle slots]
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
▫ How to reduce the impact on single-thread performance?
▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
▫ Instruction issue – more candidate instructions need to be considered
▫ Instruction completion – choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
Why memory hierarchy? (fig 5.2)
[Figure 5.2: processor vs. memory performance, 1980–2010 (log scale, 1–100,000) — the processor–memory performance gap keeps growing]
Why memory hierarchy?
• Principle of Locality
▫ Spatial Locality
– Addresses near each other are likely referenced close together in time
▫ Temporal Locality
– The same address is likely to be reused in the near future
• Idea: Store recently used elements in fast memories close to the processor
▫ Managed by software or hardware?
Memory hierarchy
We want large, fast and cheap at the same time
[Figure: processor (control + datapath) backed by successively larger memory levels]
Speed: fastest (closest to the processor) … slowest
Capacity: smallest … largest
Cost: most expensive … cheapest
Cache block placement
Block 12 placed in a cache with 8 block frames (main memory has 32 blocks, numbered 0–31):
• Fully associative: block 12 can go anywhere (frames 0–7)
• Direct mapped: block 12 can go only into frame 4 (12 mod 8)
• Set associative (2-way, sets 0–3): block 12 can go anywhere in set 0 (12 mod 4)
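The three placement policies can be checked with a small Python sketch; the frame numbering follows the slide's 8-frame example:

```python
# Which set and which frames may a memory block occupy in a cache?
def placement(block, frames, assoc):
    """Return (set index, list of candidate frames) for a given associativity."""
    num_sets = frames // assoc
    s = block % num_sets
    return s, list(range(s * assoc, (s + 1) * assoc))

print(placement(12, 8, 1))  # direct mapped: set 4, frame [4]      (12 mod 8)
print(placement(12, 8, 2))  # 2-way set assoc.: set 0, frames [0, 1] (12 mod 4)
print(placement(12, 8, 8))  # fully associative: set 0, any frame 0..7
```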
Cache performance
• Miss rate alone is not an accurate measure
• Cache performance is important for CPU performance
• More important with higher clock rates
• Cache design can also affect instructions that don’t access memory!
• Example: A set-associative L1 cache on the critical path requires extra logic which will increase the clock cycle time
• Trade-off: additional hits vs. cycle time reduction
Average access time = Hit time + Miss rate * Miss penalty
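A quick numeric check of the formula above; the numbers are illustrative, not taken from the slides:

```python
# Average memory access time (AMAT), exactly as in the slide's formula.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))  # 6.0 cycles on average
```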
6 Basic Cache Optimizations
Reducing Hit Time
1. Giving Reads Priority over Writes
– Writes in the write buffer can be handled after a newer read if not causing dependency problems
2. Avoiding Address Translation during Cache Indexing
– E.g. use the virtual memory page offset to index the cache
Reducing Miss Penalty
3. Multilevel Caches
– Both small and fast (L1) and large (& slower) (L2)
Reducing Miss Rate
4. Larger Block size (compulsory misses)
5. Larger Cache size (capacity misses)
6. Higher Associativity (conflict misses)
1: Giving Reads Priority over Writes
• Caches typically use a write buffer
▫ CPU writes to cache and write buffer
▫ Cache controller transfers from buffer to RAM
▫ Write buffer usually FIFO with N elements
▫ Works well as long as the buffer does not fill faster than it can be emptied
• Optimization
▫ Handle read misses before write-buffer writes
▫ Must check for conflicts with the write buffer first
[Figure: Processor → Cache → Write Buffer → DRAM]
Virtual memory
• Processes use a large virtual memory
• Virtual addresses are dynamically mapped to physical addresses using HW & SW
• Page, page frame, page fault, translation lookaside buffer (TLB), etc.
[Figure: two processes’ virtual address spaces (0 to 2^n−1) mapped page by page, via address translation, onto the physical address space (0 to 2^m−1)]
2: Avoiding Address Translation during Cache Indexing
• Virtual cache: use virtual addresses in caches
▫ Saves time on the VA -> PA translation
▫ Disadvantages
– Must flush cache on process switch
– Can be avoided by including the PID in the tag
– Alias problem: the OS and a process can have two VAs pointing to the same PA
• Compromise: “virtually indexed, physically tagged”
▫ Use the page offset to index the cache
▫ The page offset is the same for VA and PA
▫ At the same time as data is read from the cache, the VA -> PA translation is done for the tag
▫ Tag comparison using the PA
▫ But: page size restricts cache size
3: Multilevel Caches (1/2)
• Make the cache faster to keep up with the CPU, or larger to reduce misses?
• Why not both?
• Multilevel caches
– Small and fast L1
– Large (and cheaper) L2
3: Multilevel Caches (2/2)
• Local miss rate▫ #cache misses / # cache accesses
• Global miss rate▫ #cache misses / # CPU memory accesses
• L1 cache speed affects CPU clock rate
• L2 cache speed affects only the L1 miss penalty
▫ Can use more complex mapping for L2▫ L2 can be large
Average access time = L1 Hit time + L1 Miss rate * (L2 Hit time + L2 Miss rate * L2 Miss penalty)
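The two-level formula above can be checked numerically as well. Note that the L2 miss rate in the formula is the *local* one (misses per L2 access); the numbers below are illustrative, not from the slides:

```python
# Two-level AMAT with local miss rates, matching the slide's formula.
def amat2(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, l2_penalty):
    return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_penalty)

# Illustrative: the global L2 miss rate would be 0.05 * 0.2 = 1% of accesses.
print(amat2(1, 0.05, 10, 0.2, 200))  # 1 + 0.05*(10 + 0.2*200) = 3.5 cycles
```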
4: Larger Block size
[Figure: miss rate (0%–25%) vs. block size (16–256 bytes) for cache sizes 1K–256K, broken into compulsory, capacity and conflict misses — larger blocks reduce compulsory misses, but for small caches the miss rate eventually rises again]
• Trade-off: 32- and 64-byte blocks are common
5: Larger Cache size
• Simple method
• Square-root rule of thumb (quadrupling the size of the cache will roughly halve the miss rate)
• Disadvantages
▫ Longer hit time
▫ Higher cost
• Most used for L2/L3 caches
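The square-root rule above can be sketched numerically; the sizes and miss rates below are made up for illustration:

```python
from math import sqrt

# Rule of thumb: miss rate scales roughly with 1/sqrt(cache size),
# so quadrupling the cache roughly halves the miss rate.
def scaled_miss_rate(old_rate, old_size, new_size):
    return old_rate * sqrt(old_size / new_size)

print(scaled_miss_rate(0.08, 256, 1024))  # 0.08 -> 0.04 going from 256KB to 1MB
```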
6: Higher Associativity
• Lower miss rate
• Disadvantages
▫ Can increase hit time
▫ Higher cost
• 8-way has performance similar to fully associative
11 Advanced Cache Optimizations
Reducing hit time
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth
4. Pipelined caches
5. Non-blocking caches
6. Multibanked caches
Reducing Miss Penalty
7. Critical word first
8. Merging write buffers
Reducing Miss Rate
9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism
10. Hardware prefetching
11. Compiler prefetching
1: Small and simple caches
• Comparing the address to tag memory takes time
• ⇒ A small cache can help hit time
▫ E.g., L1 caches are the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
▫ Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
• Simple ⇒ direct mapping
▫ Can overlap tag check with data transmission since there is no choice
• Access time estimate for 90 nm using the CACTI model 4.0
▫ Median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Figure: access time (ns) vs. cache size, 16 KB–1 MB, for 1-way, 2-way, 4-way and 8-way caches]
2: Way prediction
• Extra bits are kept in the cache to predict which way (block) in a set the next access will hit
▫ Can retrieve the tag early for comparison
▫ Achieves fast hit even with just one comparator
▫ On a misprediction, several cycles are needed to check the other blocks
3: Trace caches
• Increasingly hard to feed modern superscalar processors with enough instructions
• Trace cache
▫ Stores dynamic instruction sequences rather than ”bytes of data”
▫ Instruction sequence may include branches
– Branch prediction is integrated with the cache
▫ Complex and relatively little used
▫ Used in Pentium 4: Trace cache stores up to 12K micro-ops decoded from x86 instructions (also saves decode time)
4: Pipelined caches
• Pipeline technology applied to cache lookups
▫ Several lookups in processing at once
▫ Results in faster cycle time
▫ Examples: Pentium (1 cycle), Pentium-III (2 cycles), P4 (4 cycles)
▫ L1: Increases the number of pipeline stages needed to execute an instruction
▫ L2/L3: Increases throughput
– Nearly for free, since the hit latency is on the order of 10–20 processor cycles and caches are easy to pipeline
5: Non-blocking caches (1/2)
• A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss
• “hit under miss” reduces the effective miss penalty by working during miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
▫ Requires that the lower-level memory can service multiple concurrent misses
▫ Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
▫ Pentium Pro allows 4 outstanding memory misses
5: Non-Blocking Cache Implementation
• The cache can handle as many concurrent misses as there are MSHRs
• Cache must block when all valid bits (V) are set
• Very common
[Figure: cache with a file of MSHRs, each with a valid bit (V)]
MHA = Miss Handling Architecture
MSHR = Miss Information/Status Holding Register
DMHA = Dynamic Miss Handling Architecture
5: Non-blocking cache performance
6: Multibanked caches
• Divide cache into independent banks that can support simultaneous accesses
▫ E.g.,T1 (“Niagara”) L2 has 4 banks
• Banking works best when accesses naturally spread themselves across banks ⇒ mapping of addresses to banks affects behavior of memory system
• Simple mapping that works well is “sequential interleaving”
▫ Spread block addresses sequentially across banks
▫ E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; …
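Sequential interleaving as described above amounts to a modulo, sketched here with the 4-bank example from the slide:

```python
# Sequential interleaving: spread block addresses across banks round-robin.
def bank_of(block_addr, num_banks=4):
    return block_addr % num_banks

# Blocks 0..7 land in banks 0,1,2,3 and then wrap around.
print([bank_of(b) for b in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```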
7: Critical word first
• Don’t wait for full block before restarting CPU
• Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
• Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
▫ Long blocks are more popular today ⇒ Critical Word First is widely used
8: Merging write buffers
• Write buffer allows processor to continue while waiting to write to memory
▫ If buffer contains modified blocks, the addresses can be checked to see if address of new data matches the address of a valid write buffer entry
▫ If so, new data are combined with that entry
• Multiword writes are more efficient to memory
• The Sun T1 (Niagara) processor, among many others, uses write merging
9: Compiler optimizations
• Instruction order can often be changed without affecting correctness
▫ May reduce conflict misses
▫ Profiling may help the compiler
• Compilers generate instructions grouped in basic blocks
▫ If the start of a basic block is aligned to a cache block, misses will be reduced
– Important for larger cache block sizes
• Data is even easier to move
▫ Lots of different compiler optimizations
10: Hardware prefetching
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Instruction Prefetching
▫ Typically, CPU fetches 2 blocks on a miss: the requested block and the next consecutive block.
▫ Requested block is placed in instruction cache when it returns, and prefetched block is placed into instruction stream buffer
• Data Prefetching
▫ The Pentium 4 can prefetch data into the L2 cache from up to 8 streams
▫ Prefetching is invoked after 2 successive L2 cache misses to a page
[Figure: performance improvement from hardware prefetching on the Pentium 4 — speedups of 1.16 and 1.45 on SPECint2000 (gap, mcf) and between 1.18 and 1.97 on SPECfp2000 (fma3d, wupwise, galgel, facerec, swim, applu, lucas, mgrid, equake)]
11: Compiler prefetching
• Data Prefetch
▫ Load data into register (HP PA-RISC loads)
▫ Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
▫ Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing Prefetch Instructions takes time
▫ Is cost of prefetch issues < savings in reduced misses?
Cache Coherency
• Consider the following case: I have two processors that are sharing address X
• Both cores read address X
• Address X is brought from memory into the caches of both processors
• Now, one of the processors writes to address X and changes the value.
• What happens? How does the other processor get notified that address X has changed?
Two types of cache coherence schemes
• Snooping
▫ Broadcast writes, so all copies in all caches will be properly invalidated or updated.
• Directory
▫ In a structure, keep track of which cores are caching each address.
▫ When a write occurs, query the directory and properly handle any other cached copies.
TDT 4260 – Appendix E
Interconnection Networks
Contents
• Introduction App E.1
• Two devices App E.2
• Multiple devices App E.3
• Topology App E.4
• Routing, arbitration, switching App E.5
Conceptual overview
• Basic network technology assumed known
• Motivation
▫ Increased importance
– System-to-system connections
– Intra-system connections
▫ Increased demands
– Bandwidth, latency, reliability, ...
▫ ⇒ Vital part of system design
Motivation
Types of networks
Number of devices and distance:
• OCN – On-chip network
▫ Functional units, register files, caches, …
▫ Also known as: Network on Chip (NoC)
• SAN – System/storage area network
▫ Multiprocessors and multicomputers, storage
• LAN – Local area network
• WAN – Wide area network
• Trend: Switches replace buses
E.2: Connecting two devices
Destination implicit
Software to Send and Receive
• SW Send steps
1: Application copies data to OS buffer
2: OS calculates checksum, starts timer
3: OS sends data to network interface HW and says start
• SW Receive steps
3: OS copies data from network interface HW to OS buffer
2: OS calculates checksum, if matches send ACK; if not, deletes message (sender resends when timer expires)
1: If OK, OS copies data to user address space and signals application to continue
• Sequence of steps for SW: protocol
Network media
• Twisted Pair: copper, 1 mm thick, twisted to avoid antenna effect (telephone)
• Coaxial Cable: used by cable companies; high BW, good noise immunity
▫ Copper core, insulator, braided outer conductor, plastic covering
• Fiber Optics: light; 3 parts — cable, light source, light detector
▫ Multimode fiber disperses the light (LED); single mode carries a single wave (laser)
▫ Transmitter: LED or laser diode; receiver: photodiode
[Figure: light from the source kept inside the silica core by total internal reflection at the air boundary]
Basic Network Structure and Functions: Media and Form Factor
[Figure: media type vs. distance (0.01 m to >1,000 m) for OCNs, SANs, LANs and WANs — metal layers and printed circuit boards on chip; Myrinet and InfiniBand connectors; Cat5E twisted pair and Ethernet; coaxial cables and fiber optics]
Packet latency
[Figure: sender-to-receiver timeline — sender overhead (processor busy), time of flight, transmission time (size/bandwidth), receiver overhead (processor busy); transport latency spans time of flight plus transmission time]
Total Latency = Sender Overhead + Time of Flight + Message Size / Bandwidth + Receiver Overhead
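The total-latency formula can be evaluated directly. The units (seconds, bits, bits/second) and the example numbers are my choice for illustration, not the slide's:

```python
# Total packet latency, term by term as in the slide's formula.
def total_latency(sender_oh, time_of_flight, size, bandwidth, receiver_oh):
    return sender_oh + time_of_flight + size / bandwidth + receiver_oh

# Illustrative: 1us overheads, 5us flight, 8 kbit packet on a 1 Gbit/s link.
lat = total_latency(1e-6, 5e-6, 8000, 1e9, 1e-6)
print(round(lat * 1e6, 3), "us")  # 15.0 us
```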
E.3: Connecting multiple devices (1/3)
[Figure: shared media (Ethernet) — all nodes on one bus — vs. switched media (CM-5, ATM) — nodes attached to a switch]
• New issues
▫ Topology
– What paths are possible for packets?
▫ Routing
– Which of the possible paths are allowable (valid) for packets?
▫ Arbitration
– When are paths available for packets?
▫ Switching
– How are paths allocated to packets?
E.3: Connecting multiple devices (2/3)
• Two types of topology
▫ Shared media
▫ Switched media
• Shared media (bus)
▫ Arbitration
– Carrier sensing
– Collision detection
▫ Routing is simple
– Only one possible path
Connecting multiple devices (3/3)
• Switched media
▫ “Point-to-point” connections
▫ Routing for each packet
▫ Arbitration for each connection
• Comparison
▫ Much higher aggregate BW in a switched network than in a shared media network
▫ Shared media is cheaper
▫ Distributed arbitration is simpler for switched media
E.4: Interconnection Topologies
• One switch or bus can connect a limited number of devices
▫ Complexity, cost, technology, …
• Interconnected switches are needed for larger networks
• Topology: connection structure
▫ What paths are possible for packets?
▫ All pairs of devices must have path(s) available
• A network is partitioned by a set of links if their removal disconnects the graph
▫ Bisection bandwidth
▫ Important for performance
Crossbar
• Common topology for connecting CPUs and I/O units
• Also used for interconnecting CPUs
• Fast and expensive (O(N²))
• Non-blocking
[Figure: crossbar connecting processors (P), caches (C), memories (M) and I/O units]
[Figure: 8×8 Omega network, sources 000–111 to destinations 000–111, built from 2x2 switches with four settings: straight, crossover, upper broadcast, lower broadcast]
Omega network
• Example of multistage network
• Usually log₂ n stages for n inputs – cost O(n log n)
• Can block
Linear Arrays and Rings
• Linear array = 1D grid
• 2D grid
• Torus has wrap-around connections
• CRAY with 3D torus
[Figure: node = switch + processor with cache (P$), memory controller and network interface (NI), memory, external I/O]
• Distributed switched networks
• Node = switch + 1-n end nodes
Trees
• Diameter and average distance are logarithmic
▫ k-ary tree, height d = log_k N
▫ address = d-vector of radix-k coordinates describing the path down from the root
• Fixed number of connections per node (i.e. fixed degree)
• Bisection bandwidth = 1 near the root
E.5: Routing, Arbitration, Switching
• Routing▫ Which of the possible paths are allowable for packets?
▫ Set of operations needed to compute a valid path
▫ Executed at source, intermediate, or even at destination nodes
• Arbitration▫ When are paths available for packets?
▫ Resolves packets requesting the same resources at the same time
▫ For every arbitration, there is a winner and possibly many losers
– Losers are buffered (lossless) or dropped on overflow (lossy)
• Switching▫ How are paths allocated to packets?
▫ The winning packet (from arbitration) proceeds towards destination
▫ Paths can be established one fragment at a time or in their entirety
Routing
• Shared Media
▫ Broadcast to everyone
• Switched Media needs real routing. Options:
▫ Source-based routing: message specifies path to the destination (changes of direction)
▫ Virtual Circuit: circuit established from source to destination, message picks the circuit to follow
▫ Destination-based routing: message specifies destination, switch must pick the path
– Deterministic: always follow the same path
– Adaptive: pick different paths to avoid congestion, failures
– Randomized routing: pick between several good paths to balance network load
Routing mechanism
• Need to select an output port for each input packet
▫ And fast…
• Simple arithmetic in regular topologies
▫ Ex: ∆x, ∆y routing in a grid (first ∆x, then ∆y)
– west (−x) if ∆x < 0
– east (+x) if ∆x > 0
– south (−y) if ∆x = 0, ∆y < 0
– north (+y) if ∆x = 0, ∆y > 0
• Unidirectional links are sufficient for a torus (+x, +y)
• Dimension-order routing
▫ Reduce the relative address of each dimension in order to avoid deadlock
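The ∆x/∆y rule above can be sketched in a few lines; the coordinate convention and direction names are illustrative, not prescribed by the slides:

```python
# Dimension-order (XY) routing in a 2D grid: reduce delta-x to zero first,
# then delta-y. The fixed dimension order is what avoids deadlock.
def next_hop(cur, dst):
    """Return the output direction for a packet at `cur` headed for `dst`."""
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "arrived"

print(next_hop((0, 0), (2, 1)))  # east  (correct x first)
print(next_hop((2, 0), (2, 1)))  # north (then y)
```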
Deadlock
• How can it arise?
▫ Necessary conditions:
– shared resources
– incrementally allocated
– non-preemptible
• How do you handle it?
▫ Constrain how channel resources are allocated (deadlock avoidance)
▫ Add a mechanism that detects likely deadlocks and fixes them (deadlock recovery)
[Figure: 4×4 grid of routers, TRC (0,0)–(3,3), with a cyclic channel dependency marked — a deadlocked configuration]
Arbitration (1/2)
• Several simultaneous requests to shared resource
• Ideal: Maximize usage of network resources
• Problem: Starvation
▫ Fairness needed
• Figure: Two phase arb.
▫ Request, Grant
▫ Poor usage
Arbitration (2/2)
• Three phases
• Multiple requests
• Better usage
• But: Increased latency
Switching
• Allocating paths for packets
• Two techniques:
▫ Circuit switching (connection oriented)
– Communication channel allocated before the first packet
– Packet headers don’t need routing info
– Wastes bandwidth
▫ Packet switching (connectionless)
– Each packet handled independently
– Can’t guarantee response time
– Two types – next slide
Switching: Store & Forward vs. Cut-Through Routing
• Cut-through (on blocking)
▫ Virtual cut-through (spools the rest of the packet into a buffer)
▫ Wormhole (buffers only a few flits, leaves the tail along the route)
[Figure: time-space diagram of a 4-flit packet (flits 3-2-1-0) travelling from source to destination over 3 hops — store & forward waits for the whole packet at each hop, while cut-through forwards flits as they arrive]
Piranha: Designing a Scalable CMP-based System for Commercial Workloads
Luiz André Barroso, Western Research Laboratory
April 27, 2001 Asilomar Microcomputer Workshop
What is Piranha?
• A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever-increasing processor complexity and system design/verification cycles
Importance of Commercial Applications
• Total server market size in 1999: ~$55-60B
– technical applications: less than $6B
– commercial applications: ~$40B
[Figure: Worldwide Server Customer Spending (IDC 1999) — Infrastructure 29%, Business processing 22%, Decision support 14%, Software development 14%, Collaborative 12%, Scientific & engineering 6%, Other 3%]
Price Structure of Servers
• IBM eServer 680 (220K tpmC; $43/tpmC)
– 24 CPUs, 96GB DRAM, 18 TB disk
– $9M price tag
• Compaq ProLiant ML370 (32K tpmC; $12/tpmC)
– 4 CPUs, 8GB DRAM, 2TB disk
– $240K price tag
• Software maintenance/management costs even higher (up to $100M)
• Storage prices dominate (50%-70% in customer installations)
• Price of expensive CPUs/memory system is amortized
[Figure: normalized breakdown of HW cost (base, CPU, DRAM, I/O) for the IBM eServer 680 and Compaq ProLiant ML570]
Price per component:
System | $/CPU | $/MB DRAM | $/GB Disk
IBM eServer 680 | $65,417 | $9 | $359
Compaq ProLiant ML570 | $6,048 | $4 | $64
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary
Studies of Commercial Workloads
• Collaboration with Kourosh Gharachorloo (Compaq WRL)
– ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion)
– ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
– ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve)
– HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese)
– ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)
Studies of Commercial Workloads: summary
• Memory system is the main bottleneck
– astronomically high CPI
– dominated by memory stall times
– instruction stalls as important as data stalls
– fast/large L2 caches are critical
• Very poor Instruction Level Parallelism (ILP)
– frequent hard-to-predict branches
– large L1 miss ratios
– Ld-Ld dependencies
– disappointing gains from wide-issue out-of-order techniques!
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary
Increasing Complexity of Processor Designs
• Pushing limits of instruction-level parallelism
– multiple instruction issue
– speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size
• Yielding diminishing returns in performance
Processor (SGI MIPS) | Year Shipped | Transistor Count (millions) | Design Team Size | Design Time (months) | Verification Team Size (% of total)
R2000 | 1985 | 0.10 | 20 | 15 | 15%
R4000 | 1991 | 1.40 | 55 | 24 | 20%
R10000 | 1996 | 6.80 | >100 | 36 | >35%
courtesy: John Hennessy, IEEE Computer, 32(8)
Exploiting Higher Levels of Integration
• Lower latency, higher bandwidth
• Reuse of an existing CPU core addresses complexity issues
[Figure: Alpha 21364 — a single chip with a 1GHz 21264 CPU, 64KB I$ and 64KB D$, 1.5MB L2$, memory controllers, coherence engine, network interface and I/O]
• Incrementally scalable, glueless multiprocessing
[Figure: grid of 21364 nodes with memory and I/O, connected directly to one another]
Exploiting Parallelism in Commercial Apps
[Figure: Chip Multiprocessing (CMP) — multiple CPUs with private I$/D$ sharing L2$, coherence, memory controllers, network and I/O on one chip. Example: IBM Power4]
[Figure: Simultaneous Multithreading (SMT) — issue slots over time filled from threads 1–4. Example: Alpha 21464]
• SMT is superior in single-thread performance
• CMP addresses complexity by using simpler cores
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
– Architecture
– Performance
• Design Methodology
• Summary
Piranha Project
• Explore chip multiprocessing for scalable servers
• Focus on parallel commercial workloads
• Small team, modest investment, short design time
• Address complexity by using:
– simple processor cores
– standard ASIC methodology
Give up on ILP, embrace TLP
Piranha Team Members
Research
– Luiz André Barroso (WRL)
– Kourosh Gharachorloo (WRL)
– David Lowell (WRL)
– Joel McCormack (WRL)
– Mosur Ravishankar (WRL)
– Rob Stets (WRL)
– Yuan Yu (SRC)
NonStop Hardware Development / ASIC Design Center
– Tom Heynemann
– Dan Joyce
– Harland Maxwell
– Harold Miller
– Sanjay Singh
– Scott Smith
– Jeff Sprouse
– … several contractors
Former Contributors
– Robert McNamara, Basem Nayfeh, Andreas Nowatzyk, Joan Pendleton, Shaz Qadeer, Brian Robinson, Barton Sano, Daniel Scales, Ben Verghese
Piranha Processing Node
• Alpha core: 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): µprogrammed, 1K µinstr., even/odd interleaving
• System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Figure: single chip with 8 CPUs, each with I$/D$, connected through the ICS to L2$ banks, memory controllers, home/remote protocol engines (HE/RE) and a router]
Piranha I/O Node
[Diagram: one CPU with I$ and D$, an L2$, a memory controller, the protocol engines (HE, RE), and a PCI-X interface, connected through the ICS to a router with 2 links @ 8 GB/s.]
• I/O node is a full-fledged member of the system interconnect
  – CPU indistinguishable from Processing Node CPUs
  – participates in global coherence protocol
Example Configuration
[Diagram: six processing nodes (P) and two I/O nodes (P-I/O) connected in a ring.]
• Arbitrary topologies
• Match the ratio of Processing to I/O nodes to application requirements
L2 Cache and Intra-Node Coherence
• No inclusion between L1s and L2 cache
  – total L1 capacity equals L2 capacity
  – L2 misses go directly to L1
  – L2 filled by L1 replacements
• L2 keeps track of all lines in the chip
  – sends Invalidates, Forwards
  – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization
  – cooperates with Protocol Engines to enforce system-wide coherence
Inter-Node Coherence Protocol
• 'Stealing' ECC bits for the memory directory
  – data/ECC/directory layout: 8×(64+8), 4×(128+9+7), 2×(256+10+22), 1×(512+11+53)
• Directory: 2b state + 40b sharing info
• Dual representation: limited pointer + coarse vector (2b state + 20b sharer info)
• "Cruise Missile" Invalidations (CMI)
  – limit fan-out/fan-in serialization with the coarse vector
• Several new protocol optimizations
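The dual representation can be sketched in C: with 20 bits of sharing info and 10-bit node IDs, two exact pointers fit; on overflow the entry switches to a coarse vector where each bit covers a group of nodes. This is a toy illustration — the field widths, group size and conversion policy here are assumptions, not Piranha's actual encoding:

```c
#include <stdint.h>

#define NUM_NODES  1024
#define INFO_BITS  20
#define NODE_BITS  10                                    /* log2(NUM_NODES)      */
#define MAX_PTRS   (INFO_BITS / NODE_BITS)               /* 2 exact pointers fit */
#define GROUP_SIZE ((NUM_NODES + INFO_BITS - 1) / INFO_BITS)

/* Hypothetical directory entry mirroring "2b state + 20b sharing info":
 * few sharers -> exact node pointers packed into info;
 * many sharers -> coarse bit vector, one bit per group of nodes. */
typedef struct {
    uint8_t  coarse;   /* which representation is in use (part of the state) */
    uint32_t info;     /* 20 bits of sharing information                     */
    int      nptrs;    /* pointers stored while in limited-pointer mode      */
} dir_entry;

static void add_sharer(dir_entry *e, int node)
{
    if (!e->coarse && e->nptrs < MAX_PTRS) {
        /* Limited-pointer mode: record the exact node ID. */
        e->info |= (uint32_t)node << (e->nptrs * NODE_BITS);
        e->nptrs++;
    } else if (!e->coarse) {
        /* Overflow: convert the exact pointers into a coarse vector. */
        uint32_t v = 0;
        for (int i = 0; i < e->nptrs; i++)
            v |= 1u << (((e->info >> (i * NODE_BITS)) & (NUM_NODES - 1)) / GROUP_SIZE);
        e->info   = v | (1u << (node / GROUP_SIZE));
        e->coarse = 1;
    } else {
        /* Already coarse: just set the group bit (imprecise but safe). */
        e->info |= 1u << (node / GROUP_SIZE);
    }
}
```

The trade-off this encodes: exact pointers invalidate precisely, while the coarse vector over-approximates the sharer set, so some invalidations are sent to nodes that never cached the line.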
Simulated Architectures
Single-Chip Piranha Performance
[Chart: normalized execution time (broken into CPU, L2 hit, L2 miss) for OLTP and DSS on four designs — P1 (500 MHz, 1-issue): 233 / 350; INO (1 GHz, 1-issue): 145 / 191; OOO (1 GHz, 4-issue): 100 / 100; P8 (500 MHz, 1-issue): 34 / 44.]
• Piranha's performance margin: 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes the memory system
Single-Chip Performance (Cont.)
• Near-linear scalability
  – low memory latencies
  – effectiveness of highly associative L2 and non-inclusive caching
[Charts: speedup vs. number of cores (near-linear up to 8 cores at 500 MHz, 1-issue), and normalized breakdown of L1 misses (L2 hit / L2 forward / L2 miss) for P1, P2, P4 and P8.]
Potential of a Full-Custom Piranha
• 5x margin over OOO for OLTP and DSS
• Full-custom design benefits substantially from the boost in core speed
[Chart: normalized execution time (CPU, L2 hit, L2 miss) for OLTP and DSS — OOO (1 GHz, 4-issue): 100 / 100; P8 (500 MHz, 1-issue): 34 / 43; P8F full-custom (1.25 GHz, 1-issue): 20 / 19.]
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary
Managing Complexity in the Architecture
• Use of many simpler logic modules
  – shorter design
  – easier verification
  – only short wires*
  – faster synthesis
  – simpler chip-level layout
• Simplify intra-chip communication
  – all traffic goes through the ICS (no backdoors)
• Use of microprogrammed protocol engines
• Adoption of large VM pages
• Implement a subset of the Alpha ISA
  – no VAX floating point, no multimedia instructions, etc.
Methodology Challenges
• Isolated sub-module testing
  – need to create robust bus functional models (BFM)
  – sub-modules' behavior is highly inter-dependent
  – not feasible with a small team
• System-level (integrated) testing
  – much easier to create tests
  – only one BFM at the processor interface
  – simpler to assert correct operation
  – Verilog simulation is too slow for comprehensive testing
Our Approach
• Design in stylized C++ (synthesizable RTL level)
  – use mostly system-level, semi-random testing
  – simulations in C++ (faster & cheaper than Verilog); simulation speed ~1000 clocks/second
  – employ directed tests to fill test coverage gaps
• Automatic C++ to Verilog translation
  – single design database
  – reduce translation errors
  – faster turnaround of design changes
  – risk: untested methodology
• Using industry-standard synthesis tools
• IBM ASIC process (Cu11)
Piranha Methodology: Overview
[Diagram of the tool flow:]
• C++ RTL Models: cycle-accurate and "synthesizable"
• cxx (C++ compiler) builds PS1, a fast C++ logic simulator
• CLevel: C++-to-Verilog translator; the Verilog models are machine-translated from the C++ models
• PS1V: can co-simulate the C++ and Verilog module versions and check correspondence
• Physical design: leverages industry-standard Verilog-based tools
Summary
• CMP architectures are inevitable in the near future
• Piranha investigates an extreme point in CMP design: many simple cores
• Piranha has a large architectural advantage over complex single-core designs (> 3x) for database applications
• Piranha methodology enables faster design turnaround
• Key to Piranha is application focus: one-size-fits-all solutions may soon be infeasible
Reference
• Papers on commercial workload performance & Piranha: research.compaq.com/wrl/projects/Database
TDT 4260 – lecture 11/3 - 2011
• Miniproject status, update, presentation
• Synchronization, Textbook Chap 4.5
  – And a short note on BSP (with excellent timing …)
• Short presentation of NUTS, NTNU Test Satellite System http://nuts.iet.ntnu.no/
• UltraSPARC T1 (Niagara), Chap 4.8
• And more on multicores

1 Lasse Natvig
Miniproject – after the first deadline

Implementing 1 existing prefetcher | Comparison of 2 or more existing prefetchers  | Improving on existing prefetcher
Sequential prefetcher              | RPT and DCPT                                  | Improving sequential prefetcher
RPT prefetcher                     | Sequential (tagged or adaptive), RPT and DCPT | Improving DCPT
Miniproject – after the first deadline
• Feedback
  – RPT and DCPT are popular choices; the report should properly motivate each group's choice of prefetcher (the motivation should not be: "The code was easily available")
  – Several groups work on similar methods
    • "find your story"
  – too much focus on getting the highest result in the PfJudge ranking; as stated in section 2.3 of the guidelines, the miniproject will be evaluated based on the following criteria:
    • good use of language
    • clarity of the problem statement
    • overall document structure
    • depth of understanding of the field of prefetching
    • quality of presentation
Miniproject presentations
• Friday 15/4 at 1415-1700 (max)
• OK for all?
  – No … we are working on finding a time schedule that is OK for all

IDI Open, a challenge for you?
Synchronization
• Important concept
  – Synchronize access to shared resources
  – Order events from cooperating processes correctly
• Smaller MP systems
  – Implemented by uninterrupted instruction(s) atomically accessing a value
  – Requires special hardware support
  – Simplifies construction of OS / parallel apps
• Larger MP systems: Appendix H (not in course)
Atomic exchange (swap)
• Swaps the value in a register for the value in memory
  – Mem = 0 means not locked, Mem = 1 means locked
  – How does this work?
• Register <= 1 ; processor wants to lock
• Exchange(Register, Mem)
  – If Register = 0: Success
    • Mem was 0 (was unlocked)
    • Mem is now 1 (now locked)
  – If Register = 1: Fail
    • Mem was 1 (was locked)
    • Mem is now 1 (still locked)
• Exchange must be atomic!
Implementing atomic exchange (1/2)
• One alternative: Load Linked (LL) and Store Conditional (SC)
  – Used in sequence
    • If the memory location accessed by LL changes, SC fails
    • If there is a context switch between LL and SC, SC fails
  – Implemented using a special link register
    • Contains the address used in LL
    • Reset if the matching cache block is invalidated or if we get an interrupt
    • SC checks if the link register contains the same address. If so, we have atomic execution of LL & SC
Implementing atomic exchange (2/2)
• Example code EXCH (R4, 0(R1)):

try:    MOV   R3, R4      ; move exchange value
        LL    R2, 0(R1)   ; load linked
        SC    R3, 0(R1)   ; store conditional
        BEQZ  R3, try     ; branch if SC failed
        MOV   R4, R2      ; put loaded value in R4

• This can now be used to implement e.g. spin locks:

        DADDUI R2, R0, #1 ; R0 always = 0
lockit: EXCH  R2, 0(R1)   ; atomic exchange
        BNEZ  R2, lockit  ; already locked?
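The same spin lock can be sketched in portable C11, where atomic_exchange plays the role of EXCH. This is a sketch, not textbook code; the names acquire, release and lock_word are illustrative:

```c
#include <stdatomic.h>

/* 0 = unlocked, 1 = locked -- the same convention as on the slide. */
static atomic_int lock_word = 0;

void acquire(atomic_int *l)
{
    /* atomic_exchange returns the OLD value: getting 0 back means the
     * lock was free and we now own it; getting 1 back means it was
     * already held, so spin and retry (the BNEZ loop above). */
    while (atomic_exchange(l, 1) == 1) {
        /* spin */
    }
}

void release(atomic_int *l)
{
    atomic_store(l, 0);   /* hand the lock back */
}
```

On LL/SC machines the compiler lowers atomic_exchange to exactly the LL/SC retry loop shown above.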
Barrier sync. in BSP
• The BSP model
  – Leslie G. Valiant, A bridging model for parallel computation, [CACM 1990]
  – Computations organised in supersteps
  – Algorithms adapt to the compute platform, represented through 4 parameters
  – Helps the combination of portability & performance
http://www.seas.harvard.edu/news-events/press-releases/valiant_turing
Multicore
• Important and early example: UltraSPARC T1
• Motivation (see lecture 1)
  – In all market segments from mobile phones to supercomputers
  – End of Moore's law for single-core
  – The power wall
  – The memory wall
  – The bandwidth problem
  – ILP limitations
  – The complexity wall

Why multicores?
Chip Multithreading: Opportunities and challenges
• Paper by Spracklen & Abraham, HPCA-11 (2005) [SA05]
• CMT processors = Chip Multi-Threaded processors
• A spectrum of processor architectures
  – Uniprocessors with SMT (one core)
  – (pure) Chip Multiprocessors (CMP) (one thread per core)
  – Combination of SMT and CMP (they call it CMT)
• Best suited to server workloads (with high TLP)
Off-chip Bandwidth
• A bottleneck
• Bandwidth is increasing, but so is latency [Patt04]
• Need more than 100 in-flight requests to fully utilize the available bandwidth
Sharing processor resources
• SMT
  – Hardware strand
    • "HW for storing the state of a thread of execution"
    • Several strands can share resources within the core, such as execution resources
  – This improves utilization of processor resources
  – Reduces an application's sensitivity to off-chip misses
  – Switching between threads can be very efficient
• (pure) CMP
  – Multiple cores can share chip resources such as the memory controller, off-chip bandwidth and the L2 cache
  – No sharing of HW resources between strands within a core
• Combination (CMT)
1st generation CMT
• 2 cores per chip
• Cores derived from earlier uniprocessor designs
• Cores do not share any resources, except off-chip data paths
• Examples: Sun's Gemini, Sun's UltraSPARC IV (Jaguar), AMD dual-core Opteron, Intel dual-core Itanium (Montecito), Intel dual-core Xeon (Paxville, server)
2nd generation CMT
• 2 or more cores per chip
• Cores still derived from earlier uniprocessor designs
• Cores now share the L2 cache
  – Speeds inter-core communication
  – Advantageous as most commercial applications have significant instruction footprints
• Examples: Sun's UltraSPARC IV+, IBM's Power 4/5
3rd generation CMT
• CMT processors are best designed from the ground up, optimized for a CMT design point
  – Lower power consumption
• Multiple cores per chip
• Examples:
  – Sun's Niagara (T1)
    • 8 cores, each 4-way SMT
    • Each core single-issue, short pipeline
    • Shared 3 MB L2 cache
  – IBM's Power-5
    • 2 cores, each 2-way SMT

Multicore generations (?)
CMT/Multicore design space
• Number of cores
  – Multiple simple or few complex?
    • Recent paper by Hill & Marty …
      – See http://www.youtube.com/watch?v=KfgWmQpzD74
  – Heterogeneous cores
    • Serial fraction of a parallel application
      – Remember Amdahl's law
    • One powerful core for single-threaded applications
• Resource sharing
  – L2 cache! (and L3)
    • (Terminology: LL = Last Level cache)
  – Floating point units
  – New, more expensive resources (amortized over multiple cores)
    • Shadow tags, more advanced cache techniques, HW accelerators: cryptographic, OS functions (e.g. memcopy), XML parsing, compression
  – Your innovation!!!
CMT/Multicore challenges
• Multiple threads (strands) share resources
  – Maximize overall performance
    • Good resource utilization
    • Avoid "starvation" (units without work to do)
  – Cores must be "good neighbours"
    • Fairness, research by Magnus Jahre
    • See http://research.idi.ntnu.no/multicore/pub
• Prefetching
  – Aggressive prefetching is OK in a single-thread system, since the entire system is idle on a miss
  – CMT/Multicore requires more careful prefetching
    • A prefetch operation may take resources used by other threads
  – See research by Marius Grannæs (same link as above)
• Speculative operations
  – OK if using idle resources (delay until the resource is idle)
  – Needs the same care as prefetching / seldom power efficient
UltraSPARC T1 ("Niagara")
• Target: Commercial server applications
  – High thread-level parallelism (TLP)
    • Large numbers of parallel client requests
  – Low instruction-level parallelism (ILP)
    • High cache miss rates
    • Many unpredictable branches
• Power, cooling, and space are major concerns for data centers
• Metric: (Performance / Watt) / Sq. Ft.
• Approach: Multicore, fine-grain multithreading, simple pipeline, small L1 caches, shared L2

T1 processor – "logical" overview
1.2 GHz at 72W typical, 79W peak power consumption
T1 Architecture
• Also ships with 6 or 4 processors
T1 pipeline / 4 threads
• Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
• Shared units:
  – L1 cache, L2 cache
  – TLB
  – Exec. units
  – pipe registers
• Separate units:
  – PC
  – instruction buffer
  – reg file
  – store buffer
Miss Rates: L2 Cache Size, Block Size (fig. 4.27)
[Chart: T1 L2 miss rate (0%–2.5%) for TPC-C and SPECJBB across six cache configurations: 1.5 MB/32B, 1.5 MB/64B, 3 MB/32B, 3 MB/64B, 6 MB/32B, 6 MB/64B.]
Miss Latency: L2 Cache Size, Block Size (fig. 4.28)
[Chart: T1 L2 miss latency (0–200 cycles) for TPC-C and SPECJBB across the same configurations: 1.5 MB/32B, 1.5 MB/64B, 3 MB/32B, 3 MB/64B, 6 MB/32B, 6 MB/64B.]
CPI Breakdown of Performance

Benchmark  | Per-thread CPI | Per-core CPI | Effective CPI for 8 cores | Effective IPC for 8 cores
TPC-C      | 7.20           | 1.80         | 0.23                      | 4.4
SPECJBB    | 5.60           | 1.40         | 0.18                      | 5.7
SPECWeb99  | 6.60           | 1.65         | 0.21                      | 4.8
Average thread status (fig 4.30)
Not Ready Breakdown (fig 4.31)
[Chart: fraction of cycles not ready (0%–100%) for TPC-C, SPECJBB and SPECWeb99, broken down into L2 miss, L1 D miss, L1 I miss, pipeline delay and Other.]
• Other = ?
  – TPC-C: store buffer full is the largest contributor
  – SPEC-JBB: atomic instructions are the largest contributor
  – SPECWeb99: both factors contribute
Performance Relative to Pentium D
[Chart: performance relative to Pentium D (0–6.5) for Power5+, Opteron and Sun T1 on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05 and TPC-like workloads.]
Performance/mm2, Performance/Watt
[Chart: efficiency normalized to Pentium D (0–5.5) for Power5+, Opteron and Sun T1, showing SPECIntRate, SPECFPRate, SPECJBB05 and TPC-C, each per mm^2 and per Watt.]
Cache Coherency and Memory Models

Review
● Does pipelining help instruction latency?
● Does pipelining help instruction throughput?
● What is Instruction Level Parallelism?
● What are the advantages of OoO machines?
● What are the disadvantages of OoO machines?
● What are the advantages of VLIW?
● What are the disadvantages of VLIW?
● What is an example of Data Spatial Locality?
● What is an example of Data Temporal Locality?
● What is an example of Instruction Spatial Locality?
● What is an example of Instruction Temporal Locality?
● What is a TLB?
● What is a packet switched network?
Memory Models (Memory Consistency)
Memory Model: The system supports a given model if operations on memory follow specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.
Huh??????
Sequential Consistency?
Simple Case
● Consider a simple two-processor system
● The two processors are coherent
● Programs running in parallel may communicate via memory addresses
● Special hardware is required in order to enable communication via memory addresses
● Shared memory addresses are the standard form of communication for parallel programming
[Diagram: CPU 0 and CPU 1 connected through an interconnect to memory.]
Simple Case
● CPU 0 wants to send a data word to CPU 1
● What does the code look like?
● Code on CPU 0 writes a value to an address
● Code on CPU 1 reads the address to get the new value
Simple Case

int shared_flag = 0;
int shared_value = 0;

void sender_thread()
{
    shared_value = 42;
    shared_flag = 1;
}

void receiver_thread()
{
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}

Global variables are shared when using pthreads. This means all threads within this process may access these variables.

The sender writes to the shared data, then sets a shared data flag that the receiver is polling.

The receiver is polling on the flag. When the flag is no longer zero, the receiver reads shared_value and prints it out.

Any Problems???
Simple CMP Cache Coherency
[Diagram: four CPUs (0–3), each with a private L1, connected through an interconnect to four L2 banks, each with a directory.]
● Four-core machine supporting cache coherency
● Each core has a local L1 data and instruction cache
● The L2 cache is shared amongst all cores, and physically distributed into 4 disparate banks
● The interconnect sends memory requests and responses back and forth between the caches
The Coherency Problem
[Diagram: CPU 0 issues Ld R1, X.]
● Misses in cache
● Goes to "home" L2 (home often determined by a hash of the address)
● If it misses at the home L2, read the data from memory
● Deposit the data in both the home L2 and the local L1

Mem(X) is now in both the L2 and ONE L1 cache
The Coherency Problem
[Diagram: CPU 3 issues Ld R2, X while CPU 0 still holds a copy of X.]
● CPU 3 reads the same address
● Miss in L1
● Sends request to L2
● Hits in L2
● Data is placed in the L1 cache for CPU 3
The Coherency Problem
[Diagram: CPU 0 issues Store R2, X while CPU 3 has a copy of X in its L1.]
● CPU 0 now STORES to address X

What happens?????

Special hardware is needed in order to either update or invalidate the data in CPU 3's cache.

● For this example, we will assume a directory-based invalidate protocol, with write-through L1 caches
The Coherency Problem
[Diagram: the store propagates to the home L2 bank; the directory entry for X first lists sharers 0 and 3, then only 0.]
● The store updates the local L1 and writes through to the L2
● At the L2, the directory is inspected, showing CPU 3 is sharing the line
● The data in CPU 3's cache is invalidated
● The L2 cache is updated with the new value
● The system is now "coherent"
● Note that CPU 3 was removed from the directory
Ordering
[Diagram: CPU 0 issues Store R1, X and Store R2, Y; the two stores travel separately through the interconnect toward different L2 banks.]
● Our protocol relies on stores writing through to the L2 cache
● If the stores are to different addresses, there are multiple points within the system where the stores may be reordered

The second store (to Y) leaves the network first! The stores are written to the shared L2 out-of-order (Y first, then X)!!!

The interconnect is not the only cause of out-of-order behavior:
● The processor core may issue instructions out-of-order (remember out-of-order machines??)
● The L2 pipeline may also reorder requests to different addresses
L2 Pipeline Ordering
[Diagram: requests arrive from the network into a pipeline — Resource Allocation and Conflict Detection → L2 Tag Access → L2 Data Access → Coherence Control — with a Retry FIFO feeding back into the front.]
● Two memory requests arrive on the network
● Requests are serviced in order
● Conflicts are sent to the retry FIFO
● The network is given priority
● The requests are now executing in a different order!
Simple Case (revisited)

int shared_flag = 0;
int shared_value = 0;

void sender_thread()
{
    shared_value = 42;
    shared_flag = 1;
}

void receiver_thread()
{
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited)

CPU 0:  shared_value = 42;        CPU 3:  while (shared_flag == 0) { }
        shared_flag = 1;                  new_value = shared_value;

[Diagram walkthrough on the four-core machine:]
● The receiver is spinning on shared_flag; shared_value has its reset value of 0
● The store to shared_value writes through the L1 and enters the network
● The store to shared_flag writes through the L1; both stores are now sitting in the network
● The store to shared_flag is the first to leave the network: shared_flag is updated at its home L2, and the coherence protocol invalidates the copy in CPU 3's cache
● The receiver, still polling, now misses in its cache and sends a request to the L2
● The response comes back: the flag is set, so it is time to read shared_value
● Note that the write to shared_value is still sitting in the network!
● The receiver reads shared_value and gets the stale value 0
● The write of 42 to shared_value finally escapes the network, but it is TOO LATE!

Our code doesn't always work! WTF???

The architecture needs to expose ordering properties to the programmer, so that the programmer may write correct code. This is called the "Memory Model".
Sequential Consistency
Hardware GUARANTEES that all memory operations are ordered globally.
● Benefits
  ● Simplifies programming (our initial code would have worked)

● Costs
  ● Hard to implement micro-architecturally
  ● Can hurt performance
  ● Hard to verify
Weak Consistency
Loads and stores to different addresses may be re-ordered
● Benefits
  ● Much easier to implement and build
  ● Higher performing
  ● Easy to verify

● Costs
  ● More complicated for the programmer
  ● Requires special “ordering” instructions for synchronization
Instructions for Weak Memory Models
● Write Barrier
  ● Don't issue a write until all preceding writes have completed

● Read Barrier
  ● Don't issue a read until all preceding reads have completed

● Memory Barrier
  ● Don't issue a memory operation until all preceding memory operations have completed

● Etc., etc.
Simple Case (write barrier)

int shared_flag = 0;
int shared_value = 0;

void sender_thread()              /* runs on CPU 0 */
{
    shared_value = 42;
    __write_barrier();
    shared_flag = 1;
}

void receiver_thread()            /* runs on CPU 1 */
{
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited, with write barrier)

CPU 0 (sender):                      CPU 3 (receiver):
    shared_value = 42;                   while (shared_flag == 0) { }
    __write_barrier();                   new_value = shared_value;
    shared_flag = 1;

[Slide animation, same machine as before. The frames step through the following sequence:]

1. The receiver is spinning on “shared_flag”; “shared_value” still holds its reset value of 0.
2. The store to “shared_value” writes through CPU 0's L1.
3. The write barrier prevents issue of “shared_flag = 1” until “shared_value = 42” is complete. Completion is tracked via acknowledgments, so the store to “shared_flag” is blocked.
4. The write to “shared_value” eventually leaves the network and is acknowledged; the store to “shared_flag” is still blocked until the acknowledgment arrives.
5. The barrier is now complete! The store to “shared_flag” writes through L1 and leaves the network.
6. “shared_flag” is updated, and the coherence protocol invalidates the copy in CPU 3's cache.
7. The receiver that is polling now misses in the cache and sends a request to L2.
8. The response comes back. The flag is now set, so it is time to read “shared_value”, which now holds 42.

Correct Code!!!
What about reads.....
Weak or Strong?
● The academic community pushed hard for sequential consistency:

“Multiprocessors Should Support Simple Memory Consistency Models”, Mark Hill, IEEE Computer, August 1998

WRONG!!!

Most new architectures support relaxed memory models (ARM, IA-64, TILE, etc.). They are much easier to implement and verify. This is not a programming issue, because the complexity is hidden behind a library, and 99.9% of programmers don't have to worry about these issues!
Break Problem

You are one of P recently arrested prisoners. The warden makes the following announcement:
"You may meet together today and plan a strategy, but after today you will be in isolated cells and have no communication with one another. I have set up a "switch room" which contains a light switch, which is either on or off. The switch is not connected to anything. Every now and then, I will select one prisoner at random to enter the "switch room". This prisoner may throw the switch (from on to off, or vice-versa), or may leave the switch unchanged. Nobody else will ever enter this room. Each prisoner will visit the switch room arbitrarily often. More precisely, for any N, eventually each of you will visit the switch room at least N times. At any time, any of you may declare: "we have all visited the switch room at least once." If the claim is correct, I will set you free. If the claim is incorrect, I will feed all of you to the sharks."
Devise a winning strategy when you know that the initial state of the switch is off. Hint: not all prisoners need to do the same thing.
TDT4260
Introduction to Green Computing
Asymmetric Multicore Processors
Alexandru Iordan
Introduction to Green Computing
• What do we mean by Green Computing?
• Why Green Computing?
• Measuring “greenness”
• Research into energy consumption reduction
What do we mean by Green Computing?
What do we mean by Green Computing?
The green computing movement is a multifaceted global effort to reduce energy consumption and to promote sustainable development in the IT world.
[Patrick Kurp, “Green Computing”, Communications of the ACM, 2008]
Why Green Computing?
• Heat dissipation problems
• High energy bills
• Growing environmental impact
Measuring “greenness”
• Non-standard metrics
  – Energy (Joules)
  – Power (Watts)
  – Energy-per-instruction ( Joules / No. instructions )
  – Energy-delay^N product ( Joules * seconds^N )
  – Performance^N / Watt ( (No. instructions / second)^N / Watt )

• Standard metrics
  – Data centers: Power Usage Effectiveness (PUE) metric (The Green Grid consortium)
  – Servers: ssj_ops / Watt metric (SPEC consortium)
Research into energy consumption reduction
“Maximizing Power Efficiency with Asymmetric Multicore Systems”
Fedorova et al., Communications of the ACM, 2009
• Outline
– Asymmetric multicore processors
– Scheduling for parallel and serial applications
– Scheduling for CPU- and memory-intensive applications
Asymmetric multicore processors
• What makes a multicore asymmetric?
  – a few powerful cores (high clock freq., complex pipelines, OoO execution)
  – many simple cores (low clock freq., simple pipeline, low power requirement)

• Homogeneous-ISA AMP
  – the same binary code can run on both types of cores

• Heterogeneous-ISA AMP
  – code compiled separately for each type of core
  – examples: IBM Cell, Intel Larrabee
Efficient utilization of AMPs
• Efficient mapping of threads/workloads
  – parallel applications
    • serial part → complex cores
    • scalable parallel part → simple cores
  – microarchitectural characteristics of workloads
    • CPU-intensive applications → complex cores
    • memory-intensive applications → simple cores
Sequential vs. parallel characteristics
• Sequential programs
  – high degree of ILP
  – can utilize features of a complex core (super-scalar pipeline, OoO execution, complex branch prediction)

• Parallel programs
  – high number of parallel threads/tasks (compensates for low ILP and masks memory delays)

• Having both complex and simple cores gives AMPs applicability to a wider range of applications
Parallelism-aware scheduling
• Goal: improve overall system efficiency (not the performance of a particular application)
• Idea: assign sequential applications/phases to run on the complex cores
• Does NOT provide fairness
Challenges of PA scheduling
• Detecting serial and parallel phases
  – limited scalability of threads can yield wrong solutions

• Thread migration overhead
  – migration across memory domains is expensive
  – the scheduler must be topology-aware
“Heterogeneity”-aware scheduling
• Goal: improve overall system efficiency
• Idea:
  – CPU-intensive applications/phases → complex cores
  – memory-intensive applications/phases → simple cores

• Inherently unfair
Challenges of HA scheduling
• Classifying threads/phases as CPU- or memory-bound
  – two approaches presented: direct measurement and modeling

• Long execution time (direct measurement approach) or need for offline information (modeling approach)
Summary
• Green Computing focuses on improving energy-efficiency and sustainable development in the IT world
• AMPs promise higher energy-efficiency than symmetric processors
• Schedulers must be designed to take advantage of the asymmetric hardware
References
• Kirk W. Cameron, “The Road to Greener IT Pastures”, IEEE Computer, 2009
• Dan Herrick and Mark Ritschard, “Greening Your Computing Technology, the Near and Far Perspectives”, Proceedings of the 37th ACM SIGUCCS, 2009
• Luiz A. Barroso, “The Price of Performance”, ACM Queue, 2005
NTNU HPC Infrastructure
IBM AIX Power5+, CentOS AMD Istanbul

Jørn Amundsen
IDI / NTNU IT
2011-03-25
www.ntnu.no Jørn Amundsen, NTNU IT
Contents
1 Njord Power5+ hardware
2 Kongull AMD Istanbul hardware
3 Resource Managers
4 Documentation
Power5+ hardware
Cache and memory
Chip layout
System level
Cache and memory
• 16 x 64-bit-word cache lines (32 in L3)
• Hardware cache-line prefetch on loads
• Reads from memory are written into L2
• External L3 acts as a victim cache for L2
• L2 and L3 are shared between cores
• L1 is write-through
• Cache coherence is maintained system-wide at the L2 level
• 4K page size by default; the kernel supports 64K and 16M pages
Chip design

[Figure: Power5+ chip block diagram. Each of the two cores has decode & schedule logic, 64-bit registers (32 GPR, 32 FPR) and execution units (2 FXU, 2 FPU, 1 BXU, 1 CRL, 2 LSU), backed by a 64K 2-way L1 I-cache and a 32K 4-way L1 D-cache. The cores share a 1.92M 10-way L2 cache, connected through the memory controller to a 36M 12-way external L3 cache and 16-128 GB of DDR2 main memory. The switch fabric provides 35.2 GB/s; the memory interface 25.6 GB/s.]
SMT

• In a concrete application, the processor core might be idle 50-80% of the time, waiting for memory
• An obvious solution would be to let another thread execute while our thread is waiting for memory
• This is known as hyper-threading in the Intel/AMD world, and Simultaneous Multithreading (SMT) with IBM
• SMT is supported in hardware throughout the processor core
• SMT is more efficient than hyper-threading, with less context-switch overhead
• Power5 and Power6 support 1 thread/core or SMT with 2 threads/core, while the latest Power7 supports 4 threads/core
• SMT is enabled or disabled dynamically on a node with the (privileged) command smtctl
SMT (2)
• SMT is beneficial if you are doing a lot of memory references, and your application performance is memory-bound
• Enabling SMT doubles the number of MPI tasks per node, from 16 to 32. This requires your application to be sufficiently scalable.
• SMT is only available in user space with batch processing, by adding the structured comment string:

#@ requirements = ( Feature == "SMT" )
Chip module packaging
• 4 chips and 4 L3 caches are HW integrated onto an MCM
• 90.25 cm², 89 layers of metal
The system level
• On a p575 system, a node is 2 MCMs / 8 chips / 16 1.9 GHz cores

• The Njord system is
  - 2 x 16-way 32 GiB login nodes
  - 4 x 16-way 16 GiB I/O nodes (used with GPFS)
  - 186 x 16-way 32 GiB compute nodes
  - 6 x 16-way 128 GiB compute nodes

• GPFS parallel file system, 33 TiB fiber disks + 62 TiB SATA disks

• Interconnect
  - IBM Federation, a multistage crossbar network providing 2 GiB/s bidirectional bandwidth and 5 µs latency for system-wide MPI
GPFS
• An important feature of an HPC system is the capability of moving large amounts of data from or to memory, across nodes, and from or to permanent storage
• In this respect a high-quality, high-performance global file system is essential
• GPFS is a robust parallel FS geared at high-bandwidth I/O, used extensively in HPC and in the database industry
• Disk access is ≈ 1000 times slower than memory access, hence the key factors for performance are
  - spreading (striping) files across many disk units
  - using memory to cache files
  - hiding latencies in software
GPFS and parallel I/O (2)
• High transfer rates are achieved by distributing file blocks round-robin across a large number of disk units, up to thousands of disks
• On Njord, the GPFS block size and stripe unit is 1 MB
• In addition to multiple disks servicing file I/O, multiple threads may read, write or update (R+W) a file simultaneously
• GPFS uses multiple I/O servers (4 dedicated nodes on Njord) working in parallel for performance, while maintaining file and file-metadata consistency
• High performance comes at a cost: although GPFS can handle directories with millions of files, it is usually best to use fewer and larger files, and to access files in larger chunks
File buffering
• The kernel does read-aheads and write-behinds of file blocks
• The kernel applies heuristics to I/O to discover sequential and strided forward and backward reads
• The disadvantage is memory copying of all data
• Can be bypassed with DIRECT_IO, which can be useful with large (MB-sized) I/O, utilizing application I/O patterns

[Figure: I/O path from the application buffer in user space, through the kernel file-system buffer, to the disk subsystem.]
AMD Istanbul hardware
Cache and memory
System level
Cache and memory
• 6 x 128 KiB L1 cache
• 6 x 512 KiB L2 cache
• 1 x 6 MiB L3 cache
• 24 or 48 GiB DDR3 RAM
The system level

• A node is 2 chips / 12 2.4 GHz cores
• The Kongull system is
  - 1 x 12-way 24 GiB login node
  - 4 x 12-way 24 GiB I/O nodes (used with GPFS)
  - 52 x 12-way 24 GiB compute nodes
  - 44 x 12-way 48 GiB compute nodes
• Nodes compute-0-0 – compute-0-39 and compute-1-0 – compute-1-11 are 24 GiB @ 800 MHz, while compute-1-12 – compute-1-15 and compute-2-0 – compute-2-39 are 48 GiB @ 667 MHz bus frequency

• GPFS parallel file system, 73 TiB

• Interconnect
  - A fat tree implemented with HP ProCurve switches: 1 Gb from node to rack switch, then 10 Gb from the rack switch to the top-level switch. Bandwidth and latency are left as a programming exercise.
Resource Managers
Resource Managers
Njord classes
Kongull queues
Resource Managers

• Need efficient (and fair) utilization of the large pool of resources
• This is the domain of queueing (batch) systems, or resource managers
• A resource manager administers the execution of (computational) jobs and provides resource accounting across users and accounts
• This includes distribution of parallel (OpenMP/MPI) threads/processes across physical cores and gang scheduling of parallel execution
• Jobs are Unix shell scripts with batch-system keywords embedded within structured comments
• Both Njord and Kongull employ a series of queues (classes) administering various sets of possibly overlapping nodes with possibly different priorities
• IBM LoadLeveler on Njord, Torque (a development of OpenPBS) on Kongull
Njord job class overview
class      min-max nodes   max nodes/job   max runtime   description
forecast   1-180           180             unlimited     top-priority class dedicated to forecast jobs
bigmem     1-6             4               7 days        high-priority 115 GB memory class
large      4-180           128             21 days       high-priority class for jobs of 64 processors or more
normal     1-52            42              21 days       default class
express    1-186           4               1 hour        high-priority class for debugging and test runs
small      1/2             1/2             14 days       low-priority class for serial or small SMP jobs
optimist   1-186           48              unlimited     checkpoint-restart jobs
Njord job class overview (2)
• Forecast is the highest-priority queue; it suspends everything else

• Beware: node memory (except bigmem) is split in 2, to guarantee available memory for forecast jobs

• A C-R job runs at the very lowest priority; any other job will terminate and requeue an optimist-queue job if not enough nodes are available
• Optimist-class jobs need an internal checkpoint-restart mechanism
• AIX LoadLeveler imposes node job memory limits; e.g., jobs oversubscribing available node memory are aborted with an email
LoadLeveler sample jobscript
# @ job_name = hybrid_job
# @ account_no = ntnuXXX
# @ job_type = parallel
# @ node = 3
# @ tasks_per_node = 8
# @ class = normal
# @ ConsumableCpus(2) ConsumableMemory(1664mb)
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ queue

export OMP_NUM_THREADS=2
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER/test
if [ ! -d $w ]; then mkdir -p $w; fi
cd $w
$HOME/a.out
llq -w $LOADL_STEP_ID

exit 0
LoadLeveler sample C-R email (1/2)
Date: Mon, 21 Mar 2011 18:31:37 +0100
From: [email protected]
To: [email protected]
Subject: z2rank_s_5

From: LoadLeveler

LoadLeveler Job Step: f05n02io.791345.0
Executable: /home/ntnu/joern/run/z2rank/logs/skipped/z2rank_s_5.job
Executable arguments:
State for machine: f14n06
LoadL_starter: The program, z2rank_s_5.job, exited normally and returned an exit code of 0.
State for machine: f09n06
State for machine: f13n04
State for machine: f14n04
State for machine: f08n06
State for machine: f12n06
State for machine: f15n07
State for machine: f18n04
LoadLeveler sample C-R email (2/2)
This job step was dispatched to run 18 time(s).
This job step was rejected by Starter 0 time(s).
Submitted at: Mon Mar 21 10:02:56 2011
Started at:   Mon Mar 21 18:16:59 2011
Exited at:    Mon Mar 21 18:31:37 2011

Real Time:             0 08:28:41
Job Step User Time:   16 06:34:29
Job Step System Time:  0 00:21:15
Total Job Step Time:  16 06:55:44

Starter User Time:     0 00:00:19
Starter System Time:   0 00:00:09
Total Starter Time:    0 00:00:28
Kongull job queue overview
class      min-max nodes   max nodes/job   max runtime   description
default    1-52            52              35 days       default queue except IPT, SFI IO and Sintef Petroleum
express    1-96            96              1 hour        high-priority queue for debugging and test runs
bigmem     1-44            44              7 days        default queue for IPT, SFI IO and Sintef Petroleum
optimist   1-96            48              28 days       checkpoint-restart jobs

• Oversubscribing node physical memory crashes the node

• This might happen if you do not specify the following in your job script:

#PBS -lnodes=1:ppn=12

• If all nodes are not reserved, the batch system will attempt to share nodes by default
Documentation
Njord User Guide
http://docs.notur.no/ntnu/njord-ibm-power-5

Notur load stats
http://www.notur.no/hardware/status/

Kongull support wiki
http://hpc-support.idi.ntnu.no/

Kongull load stats
http://kongull.hpc.ntnu.no/ganglia/
TDT4260 Computer Architecture
Mini-Project Guidelines
Alexandru Ciprian [email protected]
January 10, 2011
1 Introduction
The Mini-Project accounts for 20% of the final grade in TDT4260 Computer Architecture. Your task is to develop and evaluate a prefetcher using the M5 simulator. M5 is currently one of the most popular simulators for computer architecture research and has a rich feature set. Consequently, it is a very complex piece of software. To make your task easier, we have created a simple interface to the memory system that you can use to develop your prefetcher. Furthermore, you can evaluate your prefetchers by submitting your code via a web interface. This web interface runs your code on the Kongull cluster with the default simulator setup. It is also possible to experiment with other parameters, but then you will have to run the simulator yourself. The web interface, the modified M5 simulator and more documentation can be found at http://dm-ark.idi.ntnu.no/.
The Mini-Project is carried out in groups of 2 to 4 students. In some cases we will allow students to work alone. You will be graded based on both a written paper and a short oral presentation.
Make sure you clearly cite the source of information, data and figures. Failure to do so is regarded as cheating and is handled according to NTNU guidelines. If you have any questions, send an e-mail to teaching assistant Alexandru Ciprian Iordan ([email protected]).
1.1 Mini-Project Goals
The Mini-Project has the following goals:
• Many computer architecture topics are best analyzed by experiments and/or detailed studies. The Mini-Project should provide training in such exercises.

• Writing about a topic often increases the understanding of it. Consequently, we require that the result of the Mini-Project is a scientific paper.
2 Practical Guidelines
2.1 Time Schedule and Deadlines
The Mini-Project schedule is shown in Table 1. If these deadlines collide with deadlines in other subjects, we suggest that you consider handing in the Mini-Project earlier than the deadline. If you miss the final deadline, this will reduce the maximum score you can be awarded.
Deadline                        Description
Friday 21. January              List of group members delivered to Alexandru Ciprian Iordan ([email protected]) by e-mail
Friday 4. March                 Short status report and an outline of the final report delivered to Alexandru Ciprian Iordan ([email protected]) by e-mail
Friday 8. April 12:00 (noon)    Final paper deadline. Deliver the paper through It's Learning. Detailed report layout requirements can be found in section 2.2.
Week 15 (11. - 15. April)       Compulsory 10-minute oral presentations

Table 1: Mini-Project Deadlines
2.2 Paper Layout
The paper must follow the IEEE Transactions style guidelines available here:
http://www.ieee.org/publications_standards/publications/authors/authors_journals.html#sect2
Both Latex and Word templates are available, but we recommend that you use Latex. The paper must use a maximum of 8 pages. Failure to comply with these requirements will reduce the maximum score you can be awarded.
In addition, we will deduct points if:
• The paper does not have a proper scientific structure. All reports must contain the following sections: Abstract, Introduction, Related Work or Background, Prefetcher Description, Methodology, Results, Discussion and Conclusion. You may rename the “Prefetcher Description” section to a more descriptive title. Acknowledgements and Author biographies are optional.

• Citations are not used correctly. If you use a figure that somebody else has made, a citation must appear in the figure text.

• Plagiarism is detected. NTNU has acquired an automated system that checks for plagiarism. We may run this system on your papers, so make sure you write all text yourself.
2.3 Evaluation
The Mini-Project accounts for 20% of the total grade in TDT4260 Computer Architecture. Within the Mini-Project, the report counts 80% and the oral presentation 20%.
The report grade will be based on the following criteria:
• Language and use of figures
• Clarity of the problem statement
• Overall document structure
• Depth of understanding of the field of computer architecture
• Depth of understanding of the investigated problem
The oral presentation grade will be based on following criteria:
• Presentation structure
• Quality and clarity of the slides
• Presentation style
• If you use more than the provided time, you will lose points.
M5 simulator system
TDT4260 Computer Architecture
User documentation
Last modified: November 23, 2010
Contents
1 Introduction 2
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Chapter outlines . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Installing and running M5 4
2.1 Download . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 VirtualBox disk image . . . . . . . . . . . . . . . . . . 5
2.3 Build . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4.1 CPU2000 benchmark tests . . . . . . . . . . . . . . . . 6
2.4.2 Running M5 with custom test programs . . . . . . . . 7
2.5 Submitting the prefetcher for benchmarking . . . . . . . . . . 8
3 The prefetcher interface 9
3.1 Memory model . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Interface specification . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Using the interface . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.1 Example prefetcher . . . . . . . . . . . . . . . . . . . . 13
4 Statistics 14
5 Debugging the prefetcher 16
5.1 m5.debug and trace flags . . . . . . . . . . . . . . . . . . . . . 16
5.2 GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.3 Valgrind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 1
Introduction
You are now going to write your own hardware prefetcher, using a modified version of M5, an open-source hardware simulator system. This modified version presents a simplified interface to M5's cache, allowing you to concentrate on a specific part of the memory hierarchy: a prefetcher for the second-level (L2) cache.
1.1 Overview
This documentation covers the following:
• Installing and running the simulator
• Machine model and memory hierarchy
• Prefetcher interface specification
• Using the interface
• Testing and debugging the prefetcher on your local machine
• Submitting the prefetcher for benchmarking
• Statistics
1.2 Chapter outlines
The first chapter gives a short introduction, and contains an outline of the documentation.
The second chapter starts with the basics: how to install the M5 simulator. There are two possible ways to install and use it. The first is as a stand-alone VirtualBox disk image, which requires the installation of VirtualBox. This is the best option for those who use Windows as their operating system of choice. For Linux enthusiasts, there is also the option of downloading a tarball and installing a few required software packages.

The chapter then continues to walk you through the necessary steps to get M5 up and running: building from source, running with command-line options that enable prefetching, running local benchmarks, compiling and running custom test programs, and finally, how to submit your prefetcher for testing on a computing cluster.

The third chapter gives an overview of the simulated system, and describes its memory model. There is also a detailed specification of the prefetcher interface, and tips on how to use it when writing your own prefetcher. It includes a very simple example prefetcher with extensive comments.

The fourth chapter contains definitions of the statistics used to quantitatively measure prefetchers.

The fifth chapter gives details on how to debug prefetchers using advanced tools such as GDB and Valgrind, and how to use trace flags to get detailed debug printouts.
Chapter 2
Installing and running M5
2.1 Download
Download the modified M5 simulator from the PfJudgeβ website.
2.2 Installation
2.2.1 Linux
Software requirements (specific Debian/Ubuntu packages mentioned in parentheses):
• g++ >= 3.4.6
• Python and libpython >= 2.4 (python and python-dev)
• Scons > 0.98.1 (scons)
• SWIG >= 1.3.31 (swig)
• zlib (zlib1g-dev)
• m4 (m4)
To install all required packages in one go, issue instructions to apt-get:
sudo apt-get install g++ python-dev scons swig zlib1g-dev m4
The simulator framework comes packaged as a gzipped tarball. Start the adventure by unpacking with tar xvzf framework.tar.gz. This will create a directory named framework.
2.2.2 VirtualBox disk image
If you do not have convenient access to a Linux machine, you can download a virtual machine with M5 preconfigured. You can run the virtual machine with VirtualBox, which can be downloaded from http://www.virtualbox.org.
The virtual machine is available as a zip archive from the PfJudgeβ website. After unpacking the archive, you can import the virtual machine into VirtualBox by selecting “Import Appliance” in the file menu and opening “Prefetcher framework.ovf”.
2.3 Build
M5 uses the scons build system:

scons -j2 ./build/ALPHA_SE/m5.opt

builds the optimized version of the M5 binaries.
-j2 specifies that the build process should build two targets in parallel. This is a useful option to cut down on compile time if your machine has several processors or cores.
The included build script compile.sh encapsulates the necessary build commands and options.
2.4 Run
Before running M5, it is necessary to specify the architecture and parameters for the simulated system. This is a nontrivial task in itself. Fortunately there is an easy way: use the included example Python script for running M5 in syscall emulation mode, m5/configs/example/se.py. When using a prefetcher with M5, this script needs some extra options, described in Table 2.1.
For an overview of all possible options to se.py, do
./build/ALPHA_SE/m5.opt configs/example/se.py --help
When combining all these options, the command line will look something like this:
./build/ALPHA_SE/m5.opt configs/example/se.py --detailed
--caches --l2cache --l2size=1MB --prefetcher=policy=proxy
--prefetcher=on_access=True
This command will run se.py with a default program, which prints out “Hello, world!” and exits. To run something more complicated, use the
Option                        Description
--detailed                    Detailed timing simulation
--caches                      Use caches
--l2cache                     Use level two cache
--l2size=1MB                  Level two cache size
--prefetcher=policy=proxy     Use the C-style prefetcher interface
--prefetcher=on_access=True   Have the cache notify the prefetcher on all
                              accesses, both hits and misses
--cmd                         The program (an Alpha binary) to run

Table 2.1: Basic se.py command line options.
--cmd option to specify another program. See subsection 2.4.2 about cross-compiling binaries for the Alpha architecture. Another possibility is to run a benchmark program, as described in the next section.
2.4.1 CPU2000 benchmark tests
The test_prefetcher.py script can be used to evaluate the performance of your prefetcher against the SPEC CPU2000 benchmarks. It runs a selected suite of CPU2000 tests with your prefetcher, and compares the results to some reference prefetchers.
The per-test statistics that M5 generates are written to output/<testname-prefetcher>/stats.txt. The statistics most relevant for hardware prefetching are then filtered and aggregated to a stats.txt file in the framework base directory.
See chapter 4 for an explanation of the reported statistics.
Since programs often do some initialization and setup on startup, a sample from the start of a program run is unlikely to be representative for the whole program. It is therefore desirable to begin the performance tests after the program has been running for some time. To save simulation time, M5 can resume a program state from a previously stored checkpoint. The prefetcher framework comes with checkpoints for the CPU2000 benchmarks taken after 10^9 instructions.
It is often useful to run a specific test to reproduce a bug. To run the CPU2000 tests outside of test_prefetcher.py, you will need to set the M5_CPU2000 environment variable. If this is set incorrectly, M5 will give the error message “Unable to find workload”. To export this as a shell variable, do
export M5_CPU2000=lib/cpu2000
Near the top of test_prefetcher.py there is a commented-out call to dry_run(). If this is uncommented, test_prefetcher.py will print the command line it would use to run each test. This will typically look like this:
m5/build/ALPHA_SE/m5.opt --remote-gdb-port=0 -re
--outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
This uses some additional command line options; these are explained in Table 2.2.
Option                      Description
--bench=ammp                Run one of the SPEC CPU2000 benchmarks.
--checkpoint-dir=lib/cp     The directory where program checkpoints are stored.
--at-instruction            Restore at an instruction count.
--checkpoint-restore=n      The instruction count to restore at.
--standard-switch           Warm up caches with a simple CPU model, then switch
                            to an advanced model to gather statistics.
--warmup-insts=n            Number of instructions to run warmup for.
--max-inst=n                Exit after running this number of instructions.

Table 2.2: Advanced se.py command line options.
2.4.2 Running M5 with custom test programs
If you wish to run your self-written test programs with M5, it is necessary to cross-compile them for the Alpha architecture. The easiest way to achieve this is to download the precompiled compiler binaries provided by crosstool from the M5 website. Install the one that fits your host machine best (32 or 64 bit version). When cross-compiling your test program, you must use the -static option to enforce static linkage.
To run the cross-compiled Alpha binary with M5, pass it to the script with the --cmd option. Example:
./build/ALPHA_SE/m5.opt configs/example/se.py --detailed
--caches --l2cache --l2size=512kB --prefetcher=policy=proxy
--prefetcher=on_access=True --cmd /path/to/testprogram
2.5 Submitting the prefetcher for benchmarking
First of all, you need a user account on the PfJudgeβ web pages. The teaching assistant in TDT4260 Computer Architecture will create one for you. You must also be assigned to a group to submit prefetcher code or view earlier submissions.
Sign in with your username and password, then click “Submit prefetcher” in the menu. Select your prefetcher file, and optionally give the submission a name. This is the name that will be shown in the highscore list, so choose with care. If no name is given, it defaults to the name of the uploaded file. If you check “Email on complete”, you will receive an email when the results are ready. This could take some time, depending on the cluster’s current workload.
When you click “Submit”, a job will be sent to the Kongull cluster, which then compiles your prefetcher and runs it with a subset of the CPU2000 tests. You are then shown the “View submissions” page, with a list of all your submissions, the most recent at the top.
When the prefetcher is uploaded, the status is “Uploaded”. As soon as it is sent to the cluster, it changes to “Compiling”. If it compiles successfully, the status will be “Running”. If your prefetcher does not compile, the status will be “Compile error”; check “Compilation output”, found under the detailed view.
When the results are ready, the status will be “Completed”, and a score will be given. The highest scoring prefetcher for each group is listed on the highscore list, found under “Top prefetchers” in the menu. Click on the prefetcher name to go to a more detailed view, with per-test output and statistics.
If the prefetcher crashes on some or all tests, the status will be “Runtime error”. To locate the failed tests, check the detailed view. You can take a look at the output from the failed tests by clicking on the “output” link found after each test statistic.
To allow easier exploration of different prefetcher configurations, it is possible to submit several prefetchers at once, bundled into a zipped file. Each .cc file in the archive is submitted independently for testing on the cluster. The submission is named after the compressed source file, possibly prefixed with the name specified in the submission form.
There is a limit of 50 prefetchers per archive.
Chapter 3
The prefetcher interface
3.1 Memory model
The simulated architecture is loosely based on the DEC Alpha Tsunami system, specifically the Alpha 21264 microprocessor. This is a superscalar, out-of-order (OoO) CPU which can reorder a large number of instructions and do speculative execution.
The L1 cache is split into a 32kB instruction cache and a 64kB data cache. Each cache block is 64B. The L2 cache size is 1MB, also with a cache block size of 64B. The L2 prefetcher is notified on every access to the L2 cache, both hits and misses. There is no prefetching for the L1 cache.
The memory bus runs at 400MHz, is 64 bits wide, and has a latency of 30ns.
3.2 Interface specification
The interface the prefetcher will use is defined in a header file located at prefetcher/interface.hh. To use the prefetcher interface, you should include interface.hh by putting the line #include "interface.hh" at the top of your source file.
#define             Value     Description
BLOCK_SIZE          64        Size of cache blocks (cache lines) in bytes
MAX_QUEUE_SIZE      100       Maximum number of pending prefetch requests
MAX_PHYS_MEM_SIZE   2^28 − 1  The largest possible physical memory address

Table 3.1: Interface #defines.
NOTE: All interface functions that take an address as a parameter block-align the address before issuing requests to the cache.
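The alignment described in the note can be sketched as follows. This is a minimal illustration, not code from the framework: the Addr typedef and the block_align helper are stand-ins defined here so the snippet is self-contained (the real Addr type and BLOCK_SIZE come from interface.hh).

```cpp
#include <cstdint>

typedef uint64_t Addr;              // stand-in for M5's Addr type
static const Addr BLOCK_SIZE = 64;  // cache block size from Table 3.1

// Block-align an address the way the interface functions do before
// issuing requests to the cache (assumes BLOCK_SIZE is a power of two).
static Addr block_align(Addr addr)
{
    return addr & ~(BLOCK_SIZE - 1);
}
```

Because all interface functions align addresses like this, two addresses inside the same 64B block refer to the same cache line as far as the cache is concerned.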
Function                               Description
void prefetch_init(void)               Called before any memory access to let the
                                       prefetcher initialize its data structures
void prefetch_access(AccessStat stat)  Notifies the prefetcher about a cache access
void prefetch_complete(Addr addr)      Notifies the prefetcher about a prefetch load
                                       that has just completed

Table 3.2: Functions called by the simulator.
Function                            Description
void issue_prefetch(Addr addr)      Called by the prefetcher to initiate a prefetch
int get_prefetch_bit(Addr addr)     Is the prefetch bit set for addr?
int set_prefetch_bit(Addr addr)     Set the prefetch bit for addr
int clear_prefetch_bit(Addr addr)   Clear the prefetch bit for addr
int in_cache(Addr addr)             Is addr currently in the L2 cache?
int in_mshr_queue(Addr addr)        Is there a prefetch request for addr in the
                                    MSHR (miss status holding register) queue?
int current_queue_size(void)        Returns the number of queued prefetch requests
void DPRINTF(trace, format, ...)    Macro to print debug information. trace is a
                                    trace flag (HWPrefetch), and format is a
                                    printf format string.

Table 3.3: Functions callable from the user-defined prefetcher.
AccessStat member   Description
Addr pc             The address of the instruction that caused the access
                    (program counter)
Addr mem_addr       The memory address that was requested
Tick time           The simulator time cycle when the request was sent
int miss            Whether this demand access was a cache hit or miss

Table 3.4: AccessStat members.
The prefetcher must implement the three functions prefetch_init, prefetch_access and prefetch_complete. The implementation may be empty.
The function prefetch_init(void) is called at the start of the simulation to allow the prefetcher to initialize any data structures it will need.
When the L2 cache is accessed by the CPU (through the L1 cache), the function void prefetch_access(AccessStat stat) is called with an argument (AccessStat stat) that gives various information about the access.
When the prefetcher decides to issue a prefetch request, it should call issue_prefetch(Addr addr), which queues up a prefetch request for the block containing addr.
When a cache block that was requested by issue_prefetch arrives from memory, prefetch_complete is called with the address of the completed request as parameter.
Prefetches issued by issue_prefetch(Addr addr) go into a prefetch request queue. The cache will issue requests from the queue when it is not fetching data for the CPU. This queue has a fixed size (available as MAX_QUEUE_SIZE), and when it gets full, the oldest entry is evicted. If you want to check the current size of this queue, use the function current_queue_size(void).
3.3 Using the interface
Start by studying interface.hh. This is the only M5-specific header file you need to include in your source file. You might want to include standard header files for things like printing debug information and memory allocation. Have a look at the supplied example prefetcher (a very simple sequential prefetcher) to see what it does.
If your prefetcher needs to initialize something, prefetch_init is the place to do so. If not, just leave the implementation empty.
You will need to implement the prefetch_access function, which the cache calls when accessed by the CPU. This function takes an argument, AccessStat stat, which supplies information from the cache: the address of the executing instruction that accessed the cache, what memory address was accessed, the cycle tick number, and whether the access was a cache miss. The block size is available as BLOCK_SIZE. Note that you probably will not need all of this information for a specific prefetching algorithm.
If your algorithm decides to issue a prefetch request, it must call the issue_prefetch function with the address to prefetch from as argument. The cache block containing this address is then added to the prefetch request queue. This queue has a fixed limit of MAX_QUEUE_SIZE pending prefetch requests. Unless your prefetcher is using a high degree of prefetching, the number of outstanding prefetches will stay well below this limit.
Every time the cache has loaded a block requested by the prefetcher, prefetch_complete is called with the address of the loaded block.
Other functionality available through the interface includes the functions for getting, setting and clearing the prefetch bit. Each cache block has one such tag bit. You are free to use this bit as you see fit in your algorithms. Note that this bit is not automatically set when a block has been prefetched; it has to be set manually by calling set_prefetch_bit. Calling set_prefetch_bit on an address that is not in cache has no effect, and get_prefetch_bit on an address that is not in cache will always return false.
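As a sketch of one possible use of the tag bit: mark blocks as they arrive from a prefetch, and count the first demand hit on each marked block as a good prefetch. The map-backed get/set/clear functions below are hypothetical stand-ins for the interface functions of the same names, included only so the sketch is self-contained, and on_demand_access is an invented name standing for the relevant fragment of prefetch_access.

```cpp
#include <cstdint>
#include <map>

/* Hypothetical stand-ins for the interface's prefetch-bit functions,
 * backed by a simple map so the sketch compiles on its own. */
typedef uint64_t Addr;
static std::map<Addr, int> bit_stub;
static int  get_prefetch_bit(Addr a)   { return bit_stub.count(a) ? bit_stub[a] : 0; }
static void set_prefetch_bit(Addr a)   { bit_stub[a] = 1; }
static void clear_prefetch_bit(Addr a) { bit_stub[a] = 0; }

static long good_prefetches = 0;

/* Called when a block requested by the prefetcher has been loaded:
 * tag it so later demand accesses can recognize it. */
void prefetch_complete(Addr addr)
{
    set_prefetch_bit(addr);
}

/* Fragment of prefetch_access: count the first demand access that
 * touches a tagged block, then clear the bit so it counts once. */
void on_demand_access(Addr addr)
{
    if (get_prefetch_bit(addr)) {
        good_prefetches++;
        clear_prefetch_bit(addr);
    }
}
```

This kind of bookkeeping mirrors the good-prefetch statistic defined in chapter 4, but nothing in the framework requires the bit to be used this way.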
When you are ready to write code for your prefetching algorithm of choice, put it in prefetcher/prefetcher.cc. When you have several prefetchers, you may want to make prefetcher.cc a symlink.
The prefetcher is statically compiled into M5. After prefetcher.cc has been changed, recompile with ./compile.sh. No options are needed.
3.3.1 Example prefetcher
/*
* A sample prefetcher which does sequential one-block lookahead.
* This means that the prefetcher fetches the next block _after_ the one that
* was just accessed. It also ignores requests to blocks already in the cache.
*/
#include "interface.hh"
void prefetch_init(void)
{
/* Called before any calls to prefetch_access. */
/* This is the place to initialize data structures. */
DPRINTF(HWPrefetch, "Initialized sequential-on-access prefetcher\n");
}
void prefetch_access(AccessStat stat)
{
/* pf_addr is now an address within the _next_ cache block */
Addr pf_addr = stat.mem_addr + BLOCK_SIZE;
/*
* Issue a prefetch request if a demand miss occurred,
* and the block is not already in cache.
*/
if (stat.miss && !in_cache(pf_addr)) {
issue_prefetch(pf_addr);
}
}
void prefetch_complete(Addr addr) {
/*
* Called when a block requested by the prefetcher has been loaded.
*/
}
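A possible variation on the example above, shown here only as a sketch: issue several sequential blocks per miss (a higher degree of prefetching) and consult current_queue_size to avoid flooding the request queue. The stub section at the top is a hypothetical stand-in for interface.hh so the sketch compiles on its own; in the framework you would include interface.hh instead, and DEGREE is a tuning parameter invented for this example.

```cpp
#include <cstdint>
#include <set>

/* --- hypothetical stand-ins for interface.hh, for illustration only --- */
typedef uint64_t Addr;
typedef uint64_t Tick;
struct AccessStat { Addr pc; Addr mem_addr; Tick time; int miss; };
static const int BLOCK_SIZE = 64;
static const int MAX_QUEUE_SIZE = 100;
static std::set<Addr> queue_stub, cache_stub;
static Addr align(Addr a) { return a & ~(Addr)(BLOCK_SIZE - 1); }
static int  in_cache(Addr a)         { return cache_stub.count(align(a)); }
static int  current_queue_size(void) { return (int)queue_stub.size(); }
static void issue_prefetch(Addr a)   { queue_stub.insert(align(a)); }
/* --------------------------------------------------------------------- */

static const int DEGREE = 4; /* blocks to prefetch per demand miss */

void prefetch_access(AccessStat stat)
{
    if (!stat.miss)
        return;
    for (int i = 1; i <= DEGREE; i++) {
        Addr pf_addr = stat.mem_addr + i * BLOCK_SIZE;
        /* Stay below the queue limit and skip blocks already cached. */
        if (current_queue_size() >= MAX_QUEUE_SIZE)
            break;
        if (!in_cache(pf_addr))
            issue_prefetch(pf_addr);
    }
}
```

A higher degree can raise coverage at the cost of accuracy, which is exactly the trade-off the statistics in chapter 4 let you measure.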
Chapter 4
Statistics
This chapter gives an overview of the statistics by which your prefetcher ismeasured and ranked.
IPC Instructions per cycle. Since we are using a superscalar architecture, IPC rates > 1 are possible.
Speedup A commonly used proxy for overall performance when running benchmark test suites.
speedup = (execution time without prefetcher) / (execution time with prefetcher)
        = (IPC with prefetcher) / (IPC without prefetcher)
Good prefetch The prefetched block is referenced by the application be-fore it is replaced.
Bad prefetch The prefetched block is replaced without being referenced.
Accuracy The fraction of the issued prefetches that were useful:

acc = good prefetches / total prefetches
Coverage How many of the potential candidates for prefetches were actu-ally identified by the prefetcher?
cov = good prefetches / cache misses without prefetching
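The two ratios can be computed directly from raw counters, as in this small sketch (PfStats and its field names are invented for illustration; the actual numbers come from the stats.txt files M5 produces):

```cpp
// Accuracy and coverage from raw prefetch counters (illustrative only).
struct PfStats {
    long good;          // good prefetches
    long total;         // total prefetches issued
    long misses_no_pf;  // cache misses without prefetching
};

double accuracy(PfStats s) { return (double)s.good / (double)s.total; }
double coverage(PfStats s) { return (double)s.good / (double)s.misses_no_pf; }
```

Note that the two can pull in opposite directions: issuing more prefetches tends to raise coverage while lowering accuracy.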
Identified Number of prefetches generated and queued by the prefetcher.
Issued Number of prefetches issued by the cache controller. This can be significantly less than the number of identified prefetches, due to duplicate prefetches already found in the prefetch queue, duplicate prefetches found in the MSHR queue, and prefetches dropped due to a full prefetch queue.
Misses Total number of L2 cache misses.
Degree of prefetching Number of blocks fetched from memory in a single prefetch request.
Harmonic mean A kind of average used to aggregate each benchmark speedup score into a final average speedup.
Havg = n / (1/x1 + 1/x2 + ... + 1/xn) = n / (Σ_{i=1..n} 1/xi)
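A minimal sketch of this aggregation (harmonic_mean is a name chosen here for illustration, not part of the framework):

```cpp
#include <cstddef>

// Harmonic mean of n positive values, as used to aggregate the
// per-benchmark speedups into one final score.
double harmonic_mean(const double *x, size_t n)
{
    double inv_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        inv_sum += 1.0 / x[i];
    return (double)n / inv_sum;
}
```

Unlike the arithmetic mean, the harmonic mean is dominated by the smallest values, so one benchmark with a severe slowdown drags the final score down sharply.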
Chapter 5
Debugging the prefetcher
5.1 m5.debug and trace flags
When debugging M5, it is best to use binaries built with debugging support (m5.debug), instead of the standard build (m5.opt). So let us start by recompiling M5 to be better suited to debugging:

scons -j2 ./build/ALPHA_SE/m5.debug
To see in detail what’s going on inside M5, one can enable trace flags, which selectively enable output from specific parts of M5. The most useful flag when debugging a prefetcher is HWPrefetch. Pass the option --trace-flags=HWPrefetch to M5:
./build/ALPHA_SE/m5.debug --trace-flags=HWPrefetch [...]
Warning: this can produce a lot of output! It might be better to redirect stdout to a file when running with --trace-flags enabled.
5.2 GDB
The GNU Project Debugger gdb can be used to inspect the state of the simulator while running, and to investigate the cause of a crash. Pass GDB the executable you want to debug when starting it.
gdb --args m5/build/ALPHA_SE/m5.debug --remote-gdb-port=0
-re --outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
You can then use the run command to start the executable.
Some useful GDB commands:
run <args> Restart the executable with the given command line arguments.
run Restart the executable with the same arguments as last time.
where Show stack trace.
up Move up one stack frame.

down Move down one stack frame.
print <expr> Print the value of an expression.
help Get help for commands.
quit Exit GDB.
GDB has many other useful features; for more information, consult the GDB User Manual at http://sourceware.org/gdb/current/onlinedocs/gdb/.
5.3 Valgrind
Valgrind is a very useful tool for memory debugging and memory leak detection. If your prefetcher causes M5 to crash or behave strangely, it is useful to run it under Valgrind and see if it reports any potential problems.
By default, M5 uses a custom memory allocator instead of malloc. This will not work with Valgrind, which tracks memory by replacing malloc with its own allocator. Fortunately, M5 can be recompiled with NO_FAST_ALLOC=True to use normal malloc:

scons NO_FAST_ALLOC=True ./m5/build/ALPHA_SE/m5.debug
To avoid spurious warnings by Valgrind, it can be fed a file with warning suppressions. To run M5 under Valgrind, use
valgrind --suppressions=lib/valgrind.suppressions
./m5/build/ALPHA_SE/m5.debug [...]
Note that everything runs much slower under Valgrind.
Page 1 of 5
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Course responsible: Professor Lasse Natvig
Quality assurance of the exam: PhD Jon Olav Hauglid
Contact person during exam: Magnus Jahre
Deadline for examination results: 23rd of June 2009.
EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Tuesday 2nd of June 2009
Time: 0900 - 1300
Supporting materials: No written or handwritten examination support materials are permitted. A specified, simple calculator is permitted.
By answering in short sentences, it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise.
The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.
Exercise 1) Instruction level parallelism (Max 10 points)
a) (Max 5 points) What is the difference between (true) data dependencies and name
dependencies? Which of the two presents the most serious problem? Explain why such
dependencies will not always result in a data hazard.
Solution sketch:
True data dependency: One instruction reads what an earlier one has written (data flows) (RAW).

Name dependency: Two instructions use the same register or memory location, but there is no flow of data between them. One instruction writes what an earlier one has read (WAR) or written (WAW). (No data flow.)
True data dependency is the most serious problem, as name dependencies can be prevented by
register renaming. Also, many pipelines are designed so that name-dependencies will not cause a
hazard.
A dependency between two instructions will only result in a data hazard if the instructions are
close enough together and the processor executes them out of order.
b) (Max 5 points) Explain why loop unrolling can improve performance. Are there any potential
downsides to using loop unrolling?
Solution sketch:
Loop unrolling can improve performance by reducing the loop overhead (e.g. loop overhead
instructions executed every 4th element, rather than for each). It also makes it possible for
scheduling techniques to further improve instruction order as instructions for different elements
(iterations) now can be interchanged. Downsides include increased code size which may lead to
more cache misses and increased number of registers used.
Exercise 2) Multithreading (Max 15 points)
a) (Max 5 points) What are the differences between fine-grained and coarse-grained
multithreading?
Solution sketch:
Fine-grained: Switch between threads after each instruction. Coarse-grained: Switch on costly
stalls (cache miss).
b) (Max 5 points) Can techniques for instruction level parallelism (ILP) and thread level parallelism
(TLP) be used simultaneously? Why/why not?
Solution sketch:
ILP and TLP can be used simultaneously. TLP looks at parallelism between different threads,
while ILP looks at parallelism inside a single instruction stream/thread.
c) (Max 5 points) Assume that you are asked to redesign a processor from single threaded to
simultaneous multithreading (SMT). How would that change the requirements for the caches?
(I.e., what would you look at to ensure that the caches would not degrade performance when
moving to SMT)
Solution sketch:
Several threads executing at once will lead to increased cache traffic and more cache conflicts.
Techniques that could help: Increased cache size, more cache ports/banks, higher associativity,
non-blocking caches.
Exercise 3) Multiprocessors (Max 15 points)
a) (Max 5 points) Give a short example illustrating the cache coherence problem for
multiprocessors.
Solution sketch:
See Figure 4.3 on page 206 of the text book. (A reads X, B reads X, A stores X, B now has
inconsistent value for X).
b) (Max 5 points) Why does bus snooping scale badly with number of processors? Discuss how
cache block size could influence the choice between write invalidate and write update.
Solution sketch:
Bus snooping relies on a common bus where information is broadcast. As the number of devices increases, this common medium becomes a bottleneck.
Invalidates are done at cache block level, while updates are done on individual words. False sharing coherence misses only appear when using write invalidate with block sizes larger than one word. So as cache block size increases, the number of false sharing coherence misses will increase, thereby making write update increasingly more appealing.
c) (Max 5 points) What makes the architecture of UltraSPARC T1 (“Niagara”) different from most
other processor architectures?
Solution sketch:
High focus on TLP, low focus on ILP. Poor single thread performance, but great multithread
performance. Thread switch on any stall. Short pipeline, in-order, no branch prediction.
Exercise 4) Memory, vector processors and networks (Max 15 points)
a) (Max 5 points) Briefly describe 5 different optimizations of cache performance.
Solution sketch:
(1 point per optimization.) 6 techniques are listed on page 291 in the textbook, 11 more in Section 5.2 on page 293.
b) (Max 5 points) What makes vector processors fast at executing a vector operation?
Solution sketch:
A vector operation can be executed with a single instruction, reducing code size and improving cache utilization. Further, the single instruction has no loop overhead and no control dependencies, which a scalar processor would have. Hazard checks can also be done per vector, rather than per element. A vector processor also contains a deep pipeline especially designed for vector operations.
c) (Max 5 points) Discuss how the number of devices to be connected influences the choice of
topology.
Solution sketch:
This is a classic example of performance vs. cost. Different topologies scale differently with
respect to performance or cost as the number of devices grows. Crossbar scales performance
well, but cost badly. Ring or bus scale performance badly, but cost well.
Exercise 5) Multicore architectures and programming (Max 25 points)
a) (Max 6 points) Explain briefly the research method called design space exploration (DSE). When
doing DSE, explain how a cache sensitive application can be made processor bound, and how it
can be made bandwidth bound.
Solution sketch:
(Lecture 10, slide 4) DSE is to try out different points in an n-dimensional space of possible designs, where n is the number of different main design parameters, such as #cores, core types (in-order vs. OoO etc.), cache size, etc. A cache sensitive application can become processor bound by increasing the cache size, and it can be made bandwidth bound by decreasing it.
b) (Max 5 points) In connection with GPU-programming (shader programming), David Blythe uses
the concept ”computational coherence”. Explain it briefly.
LF: See lecture 10, slide 36, and possibly the paper.
c) (Max 8 points) Give an overview of the architecture of the Cell processor.
Solution sketch:
Not all details of this figure are expected, only the main elements.
* One main processor (Power architecture, called PPE = Power Processing Element) – this acts as a host (master) processor. (Power arch., 64 bit, in-order two-issue superscalar, SMT (simultaneous multithreading). Has a vector media extension (VMX). (Kahle figure 2))

* 8 identical SIMD processors (called SPE = Synergistic Processing Element), each of which consists of an SPU processing element (Synergistic Processor Unit) and local storage (LS, 256 KB SRAM --- not cache). On-chip memory controller + bus interface. (Can operate on integers in different formats: 8, 16 and 32 bit, and floating point numbers in 32 and 64 bit. (64-bit floats in a later version.))
* Interconnect is a ring bus (Element Interconnect Bus, EIB), connecting the PPE + 8 SPEs; two unidirectional busses in each direction. Worst case latency is half the distance; it can support up to three simultaneous transfers.

* Highly programmable DMA controller.
d) (Max 6 points) The Cell design team made several design decisions that were motivated by a wish
to make it easier to develop programs with predictable (more deterministic) processing time
(performance). Describe two of these.
Solution sketch:
1) They discarded the common out-of-order execution in the Power processor and developed a simpler in-order processor.
2) The local store memory (LS) in the SPE processing elements does not use HW cache-coherency snooping protocols, to avoid the indeterminate nature of cache misses. The programmer handles memory in a more explicit way.
3) Also, the large number of registers (128) may help make the processing more deterministic wrt. execution time.
4) Extensive timers and counters (probably performance counters) (that may be used by the
SW/programmer to monitor/adjust/control performance)
…---oooOOOooo---…
Page 1 of 4
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Contact person for questions regarding exam exercises: Name: Lasse Natvig. Phone: 906 44 580
EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Monday 26th of May 2008
Time: 0900 – 1300
Solution sketches in blue text
Supporting materials: No handwritten or printed materials allowed; a simple specified calculator is allowed. By answering in short sentences, it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise. The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.

Exercise 1) Parallel Architecture (Max 25 points)

a) (Max 5 points) The feature size of integrated circuits is now often 65 nanometres or smaller, and it is still decreasing. Explain briefly how the number of transistors on a chip and the wire delay change with shrinking feature size.

The number of transistors can be 4 times larger when the feature size is halved. However, the wire delay does not improve (scales poorly). (The textbook page 17 gives more details, but we here ask for the main trends.)

b) (Max 5 points) In a cache coherent multiprocessor, the concepts migration and replication of shared data items are central. Explain both concepts briefly, and also how they influence the latency of access to shared data and the bandwidth demand on the shared memory.

Migration means that data move to a place closer to the requesting/accessing unit. Replication just means storing several copies. Having a local copy in general means faster access, and it is harmless to have several copies of read-only data. (Textbook page 207)

c) (Max 5 points) Explain briefly how a write buffer can be used in cache systems to increase performance. Explain also what “write merging” is in this context.

The main purpose of the write buffer is to temporarily store data that are evicted from the cache so new data can reuse the cache space as fast as possible, i.e. to avoid waiting for the latency of the memory one level further away from the processor. If more writes are to the same cache block (address), these writes can be combined, resulting in reduced traffic towards the next memory level. (Textbook page 300) ((Also slides 11-6-3)) // Grading: 3 points for write-buffer understanding and 2 for write merging.

d) (Max 5 points) Sketch a figure that shows how a hypercube with 16 nodes is built by combining two smaller hypercubes. Compare the hypercube topology with the 2-dimensional mesh topology with respect to connectivity and node cost (number of links/ports per node).

(Figure E-14 c) A mesh has a fixed degree of connectivity and becomes slower in general when the number of nodes is increased, since the number of hops needed for reaching another node on average is increasing. For a hypercube it is the other way around: the connectivity increases for larger networks, so the communication time does not increase much, but the node cost does also increase. When going to a larger network, increasing the
dimension, every node must be extended with a new port, and this is a drawback when it comes to building computers using such networks.

e) (Max 5 points) When messages are sent between nodes in a multiprocessor, two possible strategies are source routing and distributed routing. Explain the difference between these two.

For source routing, the entire routing path is precomputed by the source (possibly by table lookup) and placed in the packet header. This usually consists of the output port or ports supplied for each switch along the predetermined path from the source to the destination, which can be stripped off by the routing control mechanism at each switch. An additional bit field can be included in the header to signify whether adaptive routing is allowed (i.e., that any one of the supplied output ports can be used). For distributed routing, the routing information usually consists of the destination address. This is used by the routing control mechanism in each switch along the path to determine the next output port, either by computing it using a finite-state machine or by looking it up in a local routing table (i.e., forwarding table). (Textbook page E-48)

Exercise 2) Parallel processing (Max 15 points)

a) (Max 5 points) Explain briefly the main difference between a VLIW processor and a dynamically scheduled superscalar processor. Include the role of the compiler in your explanation.

For VLIW, parallel execution of several operations is scheduled (analysed and planned) at compile time and assembled into very long/broad instructions. (Such work done at compile time is often called static.) In a dynamically scheduled superscalar processor, dependency and resource analysis are done at run time (dynamically) to find opportunities to do operations in parallel. (Textbook page 114 -> and VLIW paper)

b) (Max 5 points) What function has the vector mask register in a vector processor?

If you want to update just some subset of the elements in a vector register, i.e. to implement IF A[i] != 0 THEN A[i] = A[i] – B[i] for (i=0..n) in a simple way, this can be done by setting the vector mask register to 1 only for the elements with A[i] != 0. In this way, the vector instruction A = A - B can be performed without testing every element explicitly.

c) (Max 5 points) Explain briefly the principle of vector chaining in vector processors.

The execution of instructions using several/different functional and memory pipelines can be chained together directly or by using vector registers. The chaining forms one longer pipeline. (This is the technique of forwarding (used in processors, as in Tomasulo's algorithm) extended to vector registers.) (Textbook F-23) ((Slides lecture 9, slide 20)) – should be checked

Exercise 3) Multicore processors (Max 20 points)

a) (Max 5 points) In the paper Chip Multithreading: Opportunities and Challenges by Spracklen & Abraham, the concept Chip Multithreaded processor (CMT) is described. The authors describe three generations of CMT processors. Describe each of these briefly. Make simple drawings if you like.

1st generation: typically 2 cores per chip; every core is a traditional processor core, with no shared resources except the off-chip bandwidth. 2nd generation: shared L2 cache, but still traditional processor cores. 3rd generation: as 2nd generation, but the cores are now custom-made for being used in a CMP, and might also use simultaneous multithreading (SMT). (This description is a bit “biased” and colored by the background of the authors (at Sun Microsystems), who were involved in the design of Niagara 1 and 2 (T1).) // Fig. 1 in the paper, and slides // Was a sub-exercise in May 2007.

b) (Max 5 points) Outline the main architecture in SUN's T1 (Niagara) multicore processor. Describe the placement of L1 and L2 cache, as well as how the L1 caches are kept coherent.
Fig. 4.24 on page 250 in the textbook shows 8 cores, each with its own L1 cache (described in the text), 4 L2 cache banks, each with a channel to external memory, one FPU unit, and a crossbar as interconnect. Coherence is maintained by a directory associated with each L2 cache, which knows which L1 caches have a copy of data in the L2 cache. (Textbook pages 249-250, also lectures)

c) (Max 6 points) In the paper Exploring the Design Space of Future CMPs the authors perform a design space exploration where several main architectural parameters are varied, assuming a fixed total chip area of 400 mm². Outline the approach by explaining the following figure:
Technology-independent area models are found empirically; core area and cache area are measured in cache-byte-equivalents (CBE). The approach is to study the relative costs in area versus the associated performance gains, i.e. to maximize performance per unit area for future technology generations. With smaller feature sizes, the available area for cache banks and processing cores increases. Table 3 displays die area in terms of CBE, and the PIN and POUT columns show how many of each type of processor, with 32KB separate L1 instruction and data caches, could be implemented on the chip if no L2 cache area were required (PIN is a simple in-order-execution processor, POUT is a larger out-of-order-execution processor). For reference, the area is also given in lambda-squared, where lambda equals one half of the feature size. The primary goal of the paper is to determine the best balance between per-processor cache area, area consumed by different processor organizations, and the number of cores on a single die. (LF; new question / medium/difficult / slides 1-6 and 2-3)

d) (Max 4 points) Explain the argument of the authors of the paper Exploring the Design Space of Future CMPs that in the future we may have chips with useless area that performs no other function than as a placeholder for pin area.

As applications become bandwidth bound and global wire delays increase, an interesting scenario may arise. It is likely that monolithic caches cannot be grown past a certain point in 50 or 35 nm technologies, since wire delays will make them too slow. It is also likely that, given a ceiling on cache size, off-chip bandwidth will limit the number of cores. Thus, there may be useless area on the chip which cannot be used for cache or processing logic, and which performs no function other than as a placeholder for pin area. That area might instead be used for compression engines, or for intelligent controllers that manage the caches and memory channels.
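The fixed-area trade-off described above can be made concrete with a toy sweep. In the following Python sketch all numbers (die budget, per-core areas, performance weights, the L2 benefit factor) are illustrative assumptions, not values from the paper or its Table 3; it merely shows the shape of the exploration: enumerate (core type, core count) configurations under a fixed CBE budget, let leftover area become L2 cache, and keep the configuration with the best modelled performance.

```python
# Toy design-space sweep under a fixed die area budget measured in
# cache-byte-equivalents (CBE). All numbers are illustrative assumptions,
# NOT values from the "Exploring the Design Space of Future CMPs" paper.

DIE_AREA_CBE = 4_000_000          # total die budget in CBE (assumed)
CORES = {
    "P_in":  {"area": 200_000, "perf": 1.0},   # simple in-order core + 32KB L1s
    "P_out": {"area": 600_000, "perf": 1.8},   # larger out-of-order core + 32KB L1s
}

def sweep():
    """Return the best (perf, core_type, core_count, l2_cbe) configuration."""
    best = None
    for name, c in CORES.items():
        max_n = DIE_AREA_CBE // c["area"]
        for n in range(1, max_n + 1):
            l2_cbe = DIE_AREA_CBE - n * c["area"]   # leftover area becomes L2
            # Crude model: linear in cores, small bonus for more L2 cache.
            perf = n * c["perf"] * (1 + 0.1 * (l2_cbe / DIE_AREA_CBE))
            cfg = (perf, name, n, l2_cbe)
            if best is None or cfg > best:
                best = cfg
    return best

perf, core, n, l2 = sweep()
print(f"best: {n} x {core}, L2 = {l2} CBE, perf = {perf:.2f}")
```

With these made-up numbers the many small in-order cores win; a steeper L2 bonus or a higher out-of-order performance weight shifts the balance, which is exactly the sensitivity the paper explores with real area and performance models.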
(From lecture 8, slide 6 on page 4)

Exercise 4) Research prototypes (Max 20 points)

a) (Max 5 points) Sketch a figure of the main system structure of the Manchester Dataflow Machine (MDM). Include the following units: Matching Unit, Token Queue, I/O Switch, Instruction Store, Overflow Unit and Processing Unit. Show also how these are connected.

See figure 5 in the paper, and the slides. The Overflow Unit is coupled to the Matching Unit, in parallel.
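The interplay between the Matching Unit and its overflow store (see question b below in the original exam) can be illustrated with a toy model. This Python sketch is a simplification with invented capacities and data structures, not the actual MDM microarchitecture: a token carrying one operand waits, under a tag, for its partner in a small fast matching store, and spills to a larger, slower overflow store when the fast store is full.

```python
# Toy model of the MDM matching idea: tokens wait for their partner operand
# in a small fast store and spill to a slower overflow store when it is full.
# Capacities and data structures are illustrative assumptions.

class MatchingUnit:
    def __init__(self, capacity=2):
        self.capacity = capacity      # fast associative store (small)
        self.waiting = {}             # tag -> operand awaiting its partner
        self.overflow = {}            # slow, large overflow store (SW-managed in MDM)

    def arrive(self, tag, value):
        """Return a matched operand pair, or None if the token must wait."""
        if tag in self.waiting:                    # partner found in fast store
            return (self.waiting.pop(tag), value)
        if tag in self.overflow:                   # partner found in overflow store
            return (self.overflow.pop(tag), value)
        if len(self.waiting) < self.capacity:      # room in fast store: wait here
            self.waiting[tag] = value
        else:                                      # fast store full: spill over
            self.overflow[tag] = value
        return None

mu = MatchingUnit(capacity=2)
assert mu.arrive("a", 1) is None       # waits in fast store
assert mu.arrive("b", 2) is None       # waits in fast store (now full)
assert mu.arrive("c", 3) is None       # spills to overflow
assert mu.arrive("c", 4) == (3, 4)     # matched out of overflow
assert mu.arrive("a", 5) == (1, 5)     # matched out of fast store
```

A matched pair would then flow on to the Instruction Store and a Processing Unit; the point of the sketch is only that the overflow path is functionally transparent but slower.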
b) (Max 5 points) What was the function of the overflow unit in MDM? Explain very briefly how it was implemented.

If an operand does not find its corresponding operand in the Matching Unit (MU), and there is no space in the MU to store it (while waiting for the other operand), the operand is stored in the overflow store. This is a separate and much slower subsystem with much larger storage capacity. It is composed of a separate overflow bus, memory and a microcoded processor; in other words, a SW solution. See also figure 7 in the paper.

c) (Max 5 points) In the paper The Stanford FLASH Multiprocessor by Kuskin et al., the FLASH computer is described. FLASH is an abbreviation for FLexible Architecture for SHared memory. What kind of flexibility was the main goal of the project?

Flexibility in the programming paradigm: the choice between distributed shared memory (DSM), i.e. cache-coherent shared memory, and message passing, but also other alternative ways of communicating between the nodes could be explored.

d) (Max 5 points) Outline the main architecture of a node in a FLASH system. What was the most central design choice to achieve this flexibility?

Fig. 2.1 explains much of this. The PEs are interconnected in a mesh. The most central design choice was the MAGIC unit, a specially designed node controller. All memory accesses go through it, and it can, for example, realise a cache-coherence protocol. Every node is identical. The whole computer has a single address space, but the memory is physically distributed.
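The essence of MAGIC, that protocol behaviour is software running in the node controller rather than fixed hardware, can be sketched as a dispatch layer. The handler names and interfaces in this Python toy are invented for illustration and are not the actual FLASH protocol code; the point is only that the same memory-access path can run a coherence protocol, message passing, or any other loaded handler.

```python
# Toy sketch of the MAGIC idea: a programmable node controller through which
# every memory access passes. Handlers and interfaces are invented here.

class MagicController:
    def __init__(self):
        self.handlers = {}                 # protocol name -> handler function

    def load_protocol(self, name, handler):
        """The flexibility: protocols are software loaded into the controller."""
        self.handlers[name] = handler

    def access(self, protocol, addr, memory):
        # Every memory access is routed through the currently loaded handler.
        return self.handlers[protocol](addr, memory)

def dsm_read(addr, memory):
    # Cache-coherent shared-memory style read (coherence actions elided).
    return memory.get(addr, 0)

def msg_read(addr, memory):
    # Message-passing style: same hardware path, different handler.
    return ("MSG", memory.get(addr, 0))

magic = MagicController()
magic.load_protocol("dsm", dsm_read)
magic.load_protocol("msg", msg_read)

mem = {0x10: 42}
assert magic.access("dsm", 0x10, mem) == 42
assert magic.access("msg", 0x10, mem) == ("MSG", 42)
```

In the real machine the handlers run on MAGIC's embedded protocol processor and manipulate directory state and network messages, but the dispatch structure is the design choice the question asks for.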
---oooOOOooo---
[Figure: MDM system structure - Input and Output connect through the I/O Switch to a ring of Token Queue -> Matching Unit -> Instruction Store -> Processing Unit (P0...P19).]