III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0) Dezső Sima, 2007.

90
III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0) Dezső Sima, 2007

Transcript of III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0) Dezső Sima, 2007.

Page 1: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

III. Multicore Processors (1)

Dezső Sima

Spring 2007

(Ver. 2.0) Dezső Sima, 2007

Page 2: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Foreword

In a relative short time interval of about two years multicore processors became the standard. As multiple cores provide a substantially higher processing performance than their predecessors, their memory and I/O subsystems need also be largely enhanced to avoid processing bottlenecks. Consequently, while discussing multicore processors the focus of interest shifts from their microarchitecture to their macroarchitecture incorporating their cache hierarchy, the way how memory and I/O controllers are attached and how on-chip interconnects needed are implemented.

Accordingly, the first nine Sections are devoted to the design space of the macroachitecture of multicore processors. Particular Sections deal with the main dimensions that span the design space and identify how occuring concepts became implemented in major multicore lines.

Section 10 is a huge repository, providing relevant materials and facts published in connection with major multicore lines. In a concrete course appropriate parts of the repository can be selected according to the scope and attendance of the course. Each part of the repository concludes with a list of available literature to allow a more detailed study of a multicore line or processor of interest.

Page 3: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

2 Overview of MCPs•

5 Layout of L2 caches•

9 Basic alternatives of the macro-architecture of MCPs•

7 Layout of the memory and I/O architecture•

Overview

6 Layout of L3 caches•

1 Introduction•

3 The design space of the macroarchitecture of MCPs•

8 Implementation of the on-chip interconnections•

4 Layout of the cores•

10 Multicore repository•

Page 4: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

1. Introduction

Page 5: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 1.1: Processor power density trends

Source: D. Yen: Chip Multithreading Processors Enable Reliable High Throughput Computing http://www.irps.org/05-43rd/IRPS_Keynote_Yen.pdf

1. Introduction (1)

Page 6: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 1.2: Single-stream performance vs. cost

Source: Marr T.T. et al. „Hyper-Threading Technology Architecture and MicroarchitectureIntel Technology Journal, Vol. 06, Issue 01, Febr 14, 2002, pp. 4-16

1. Introduction (2)

Page 7: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

1. Introduction (3)

Figure 1.3: Historical evolution of transistor counts, clock speed, power and ILP

Page 8: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

2. Overview of MCPs

Page 9: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.1: Intel’s dual-core mobile processor lines

8CST

QCMT

QCST

DCMT

DCST

2005

1Q 2Q2006

1Q 2Q 3Q 4Q3Q 4Q

Core Duo

1/06

(Yonah Duo)

151 mtrs./31 W65 nm/90 mm2

Core 2 Duo

7/06

281 mtrs./34 W

(Merom)T5000/T7000

65 nm/143 mm2

T2000

2. Overview of MCPs (1)

Page 10: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.2: Intel’s multi-core processor lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

230 mtrs./95 W

Pentium D 800(Smithfield)

5/05

90 nm/2*103 mm2

Core 2 Duo E6000/X6800(Conroe)

291 mtrs./65/75 W

7/06

65 nm/143 mm2

Pentium D 900(Presler)

2*188 mtrs./130 W65 nm/2*81 mm2

1/06

Core 2 Extrem QX6700(Kentsfield)

2*291 mtrs./130 W

11/06

65 nm/2*143 mm2

Pentium EE 955/965(Presler)

2*188 mtrs./130 W

1/06

65 nm/2*81 mm2

2-way MT/core

Pentium EE 840(Smithfield)

230 mtrs./130 W

5/05

90 nm/2*103 mm2

2-way MT/core

2. Overview of MCPs (2)

Page 11: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.3: Intel’s dual-core Xeon UP-line

8CST

QCMT

QCST

DCMT

DCST

20061Q 2Q

20071Q 2Q 3Q 4Q3Q 4Q

Xeon 3000

9/06

291 mtrs./65 W 65 nm/143 mm2

(Conroe)

2. Overview of MCPs (3)

Page 12: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.4.: Intel’s dual-core Xeon DP-lines

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q

20061Q 2Q 3Q 4Q3Q 4Q

Xeon DP 2.8

10/05

2*169 mtrs./135 W 90 nm/2*135 mm2

(Paxville DP)

Xeon 5300

11/06

2*291 mtrs./80/120 W 65 nm/2*143 mm2

(Clovertown)

Xeon 5100

6/06

291 mtrs./65/80 W 65 nm/143 mm2

(Woodcrest)

2-way MT/core2*188 mtrs./95/130 W

Xeon 5000

6/06

65 nm/2*81 mm2

(Dempsey)

2-way MT/core

2. Overview of MCPs (4)

Page 13: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.5.: Intel’s dual-core Xeon MP-lines

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q

20061Q 2Q 3Q 4Q3Q 4Q

Xeon 7000

11/05

2*169 mtrs./95/150 W 90 nm/2*135 mm2

(Paxville MP)

2-way MT/core1328 mtrs./95/150 W

Xeon 7100

8/06

65 nm/435 mm2

(Tulsa)

2-way MT/core

2. Overview of MCPs (5)

Page 14: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.6.: Intel’s dual-core EPIC-based server line

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q

20061Q 2Q 3Q 4Q3Q 4Q

2-way MT/core

9x00(Montecito)

1720 mtrs./104 W

7/06

90 nm/596 mm2

2. Overview of MCPs (6)

Page 15: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.7.: AMD’s multi-core desktop processor lines

8CST

Athlon64 QC(Barcelona)

2007

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

3800/4200/4600(Manchester S. E2)

154 mtrs./89/110 W

5/05

90 nm/2*147 mm2

3800-4800(Toledo S. E6)

233 mtrs./89/110 W

5/05

90 nm/199 mm2

4000-5000(Brisbane S. G1)

154 mtrs./65 W

12/06

65 nm/126 mm2

3600-5600(Windsor S. F2/F3)

154/227 mtrs./65/89 W

5/06

90 nm/183/230 mm2

20071Q 2Q 3Q 4Q

65 nm/291 mm2

2. Overview of MCPs (7)

Page 16: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.8.: AMD’s dual-core high-end desktop/entry level server Athlon 64 FX line

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

Athlon 64 FX(Toledo/Windsor)

233/227 mtrs./110/125 W

1/06

90 nm/199/230 mm2

2. Overview of MCPs (8)

Page 17: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.9.: AMD’s dual-core Opteron UP-lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

233 mtrs./110 W

Opteron 100

8/05

90 nm/199 mm2

(Denmark)

Opteron 1000

8/06

227 mtrs./103/125 W 90 nm/230 mm2

(Santa Ana)

2. Overview of MCPs (9)

Page 18: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.10.: AMD’s dual-core Opteron DP-lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

227 mtrs./95/110 W

Opteron 2000

8/06

90 nm/230 mm2

(Santa Rosa)Opteron 200

3/05

233 mtrs./95 W 90 nm/199 mm2

(Italy)

2. Overview of MCPs (10)

Page 19: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.11.: AMD’s dual-core Opteron MP-lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

243 mtrs./95/119 W

Opteron 8000

8/06

90 nm/220 mm2

(Santa Rosa)Opteron 800

4/05

233 mtrs./95 W 90 nm/199 mm2

(Egypt)

2. Overview of MCPs (11)

Page 20: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.12.: IBM’s multi-core server lines

POWER4 POWER4+

POWER5 POWER5+

Cell BE

POWER6

8CST

174 mtrs./115/125 W

10/01 11/02

QCMT

QCST

DCMT

DCST

2001 2002

3Q 4Q 3Q 4Q

~~ ~~

5/04

2004

1Q 2Q

~~

10/05

276 mtrs./70 W

3Q 4Q

2006

234 mtrs./95 W

20061Q 2Q 3Q 4Q

2007

750 mtrs./~100W

20071Q 2Q

180 nm/412 mm2

90 nm/230 mm2

90 nm/221 mm2

65 nm/341 mm2

276 mtrs./80W (est.)

130 nm/389 mm2

184 mtrs./70 W130 nm/380 mm2

2-way MT/core 2-way MT/core

(PPE:2-way MT)

2-way MT/core

(SSEs: no MT)

2. Overview of MCPs (12)

Page 21: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.13.: Sun’s and Fujitsu’s multi-core server lines

8CST

QCMT

QCST

DCMT

DCST

20041Q 2Q

20051Q 2Q 3Q 4Q

UltraSPARC IV+

9/05

295 mtrs./90 W

(Panther)

20071Q 2Q 3Q 4Q

20081Q 2Q

66 mtrs./108 W

UltraSPARC IV

7/04

(Jaguar)

~~3Q

8CMTUltraSPARC T1

11/2005

279 mtrs./63 W

(Niagara)UltraSPARC T2

2007

(Niagara II)

APL SPARC64 VI

2007

540 mtrs./120 W

(Olympus)

APL SPARC64 VII

2008

(Jupiter)

130 nm/352 mm2 90 nm/335 mm2

90 nm/421 mm2

65 nm/464 mm2

65 nm/342 mm290 nm/379 mm2

80 mtrs./32 W

Gemini

4/04

130 nm/206 mm2

3Q

4-way MT/core 8-way MT/core

2-way MT/core

4Q

~120 W2-way MT/core

72 W (est.)

2. Overview of MCPs (13)

Page 22: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.14.: HP’s PA-8x00 server line

8CST

QCMT

QCST

DCMT

DCST

20041Q 2Q

20051Q 2Q 3Q 4Q

PA 8800

2/2004

300 mtrs./55 W

(Mako)PA 8900

5/05

317 mtrs.

(Shortfin)

3Q 4Q

130 nm/366 mm2 130 nm/366 mm2

2. Overview of MCPs (14)

Page 23: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 2.15.: RMI’s multi-core XLR-line (scalar RISC)

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q 3Q 4Q

20061Q 2Q 3Q 4Q

8CMT

XLR 5xx

5/05

333 mtrs./10-50 W90 nm/~220 mm2

4-way MT/core

2. Overview of MCPs (15)

Page 24: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

3. Design space of the macro architecture of MCPs

Page 25: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Building blocks of multicore processors (MC)

Cores•

L2 cache(s) (L2)•

FSB controller (FSB c.)•

Bus controller (B. c.) •

L3 cache(s), if available (L3)•

Memory controller (M. c.)•

Interconnection network (IN)•

L2 I L2 D

L3

L2 I L2 D

L3

IN

FSB

FSB c.

C C

or

3. Design space of the macro architecture of MCPs (1)

Page 26: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Layout of L2 caches

Layout of the cores

Layout of the I/O andmemory architecture

Macro architecture of multi-core (MC) processors

Layout of L3 caches (if available)

3. Design space of the macro architecture of MCPs (2)

Layout of the interconnections

Page 27: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

4. Layout of the cores

Page 28: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Layout of L2 caches

Layout of the cores

Layout of the I/O andmemory architecture

Layout of L3 caches (if available)

5. Layout of the cores (1)

Layout of the on-chip interconnections

Macro architecture of multi-core (MC) processors

Page 29: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Physical implementation

Basic layout Advanced features

Layout of the cores

4. Layout of the cores (2)

Page 30: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

4. Layout of the cores (3)

HMC(Heterogeneous MC)

Basic layout

SMC(Symmetrical MC)

Cell BE (1 PPE+8 SPE)All other MCs

Page 31: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

4. Layout of the cores (4)

Virus protection

Advanced features of the cores

Remote management

Power/clock speed management

Data protection

Support ofvirtualisation

Details on the next slide!

Page 32: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

4. Layout of the cores (5)

12

Pentium M-based line: Core Duo T2000Prescott based desktop lines:Pentium D 8xx, EE 840

3

Xeon lines: Paxwille DP, MP

Cedar Mill based desktop lines: Pentium D 9xx, EE 955,965

4 Core basd mobile line: T5xxx/T7xxx (Merom)desktop line: Core 2 Duo E6xx, X6xxx, QX67xx

Xeon lines: 5000 (Dempsey), 7100 (Tulsa)

Xeon lines: 3000, 5100 (Woodcrest), 5300 (Clovertown)

Intel

Netburst

Prescott-based2

Cedar Mill-based3

Core-based4

Itanium 2

AMD

Athlon 64 XR

Athlon 64 FX

Opteron

x00 lines

x000 lines

Pentium M-based1

ED-bit

P Vanderpool (partly)

La Grande AMT2

Nx-bit Cool 'n' Quiet Pacifica (partly)

Presidio

ED-bit EIST 5/02

ED-bit

ED-bit

EIST (partly)

EIST (partly)

EIST (partly) Vanderpool (partly)

DBS Silvervale

Nx-bit

Nx-bit

Nx-bit

Cool 'n' Quiet

PowerNow!

PowerNow!

Pacifica

AMD - V

AMD - V

Power/clock speed management

Support of virtualisation

Data protection

Remote management

Virus protection

Table 4.1 : Comparison of advanced features provided by Intel’s and AMD’s MC lines. For a brief introduction of the related Intel techniques see the appropriate Intel processor descriptions

Page 33: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

EIST: Enhanced Intel SpeedStep Technology

First implemented in Intel’s mobile and server platforms,

It allows the system to dynamically adjust processor voltage and core frequency,to decrease average power consumption and thus average heat production.

This technology is similar to AMD’s Cool’n’Quiet and PowerNow! techniques. For more information see http://en.wikipedia.org/wiki/SpeedStep

The ED bit with OS support is now a widely used technique to prevent certain classes of malicious buffer overflow attacks. It is used e.g. in Intel’s Itanium, Pentium M, Pentium 4 Prescottand subsequent processors as well as in AMD’s x86-64 line (designated as the NX bit (No Execute bit).

In buffer overflow attacks a malicious worm inserts its code into another program’s data spaceand creates a flood of code that overwhelms the processor,allowing the worm to propagate itself to the network, and to other computers.

The Execute Disable bit allows the processor to classify areas in memory by where application code can be executed and where it cannot. Actually, the ED bit is implementedas the 63. bit of the Page Table Entry. If its value is 1, code cannot be executed from that page.

When a malicious worm attempts to insert code in the buffer,the processor disables code execution, preventing damage and worm propagation. To became active the ED feature need to be supported by the OS.For more information see: http://en.wikipedia.org/wiki/XD_Bit

ED: Execute Disable bit (often designated as the NX bit (No Execute bit))

Brief description of the advanced features pointed out while using Intel’s notations (1)

4. Layout of the cores (6)

Page 34: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Virtualisation techniques allow a platform to run multiple operating systems and applications in independent partitions. The hardver support improves the performance and robustness of traditional software-based virtualization solutions. Intel introduced hardware virtualisation support first in its Itanium line (called the Silvervale tecnique)followed by its Presler based and subsequent processors (designated as the Vanderpool technique).

AMD uses this technique in its recent lines - dubbed as Pacifica - since Mai 2006. For more informationsee http://en.wikipedia.org/wiki/Virtualization_Technology

VT: Virtualization Technology

It is a Trusted Execution Technology to protect users from software-based attacks aimed at stealing vitaldata, initiated by Intel. It includes a set of hardware enhancements intended to provide multiple separatedexecution environments.

For more information see Intel® Trusted Execution TechnologyPreliminary Architecture Specification at http://www.intel.com/technology/security/ orhttp://en.wikipedia.org/wiki/LaGrande_Technology

Intel’s Active Management Technology is a feature of its vPro technology. It allows to managethe computer system remotely, even if the computer is switched off or the hard drive has failed. By means of this technology system administrators can e.g. remotely download software updates or fix and hail system problems. It can also be utilized to protect the network by proactively blocking incoming threats.

For more information see Intel Active Management Technology at http://www3.intel.com/cd/network/communications/emea/eng/203372.htm

La Grande Technology

AMT2

4. Layout of the cores (7)

Brief description of the advanced features pointed out while using Intel’s notations (2)

Page 35: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Multi-chip implementation

Physical implementation

Monolithic implementation

Paxville MP

Pentium D 9xx

Paxville DP

Kentsfield

Yonah Duo

Pentium D 8xx

Tulsa

Montecito

AMD, IBM, Sun

Dempsey

Clovertown

Fujitsu, HP andRMI

Woodcrest

Merom Conroe

Pentium EE 840

All DCs of

Pentium EE 955/965

4. Layout of the cores (8)

Page 36: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

5. Layout of L2 caches

Page 37: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Layout of L2 caches

Layout of the cores

Layout of the I/O andmemory architecture

Layout of L3 caches (if available)

5. Layout of L2 caches (1)

Layout of the on-chip interconnections

Macro architecture of multi-core (MC) processors

Page 38: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the cores

Banking policy

Layout of L2 caches

Use by instructions/data

Integration of L2 caches to the proc. chip

5. Layout of L2 caches (2)

Page 39: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Shared L2 cache for all cores

Allocation of L2 caches to the cores

Private L2 cache for each core

Athlon 64 X2 (2005)

IN (Xbar)

Memory

System Request Queue

B. c. M. c.

HT-bus

L2 L2

Core Core

Core Duo (2006)

Core 2 Duo (2006)

L2 contr.

Core

L2

Core

FSB c.

FSB

Examples:

5. Layout of L2 caches (3)

Page 40: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Shared L2 cache for all cores

Allocation of L2 caches to the cores

Private L2 cache for each core

POWER4 (2001)

Montecito (2006)

UltraSPARC IV (2004)

Athlon 64 X2 and AMD's Opteron lines

(2005/2006)

POWER5 (2005)

Core Duo (Yonah),Core 2 Duo (Core)based lines (2006)

UltraSPARC T1 (2005)UltraSPARC T2 (2005)

POWER6 (2007)

PA 8800 (2004)PA 8900 (2005)

Smithfield, Presler,Irwindale and Cedar Millbased lines (2005/2006)

SPARC64 VI (2007)SPARC64 VII (2008)

Cell BE (2006)

XLR (2005)

UltraSPARC IV+ (2005)

5. Layout of L2 caches (4)

Page 41: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the cores

Banking policy

Attaching L2 caches to MCPs

Use by instructions/data

Integration of L2 caches to the proc. chip

5. Layout of L2 caches (5)

Page 42: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Lines replaced (victimized) in the L1 are written to L2.

References to data missing in L1 but available in L2 initiate reloading the pertaining cache line to L1.Reloaded data is deleted from L2

L2 operates usually as a write back cache (only modified data that is replaced in the L2 is written back to the memory),

Unmodified data that is replaced in L2 is deleted.

L1

M.c.

Replaced (victimised)lines

Modified data

(low traffic)

Lines missing in L1 are reloaded and deleted from L2

L2

Memory

Data missing in L1/L2

(high traffic)

L1

L2

Memory

Exclusive L2

Inclusion policy of L2 caches

Inclusive L2

(Victim cache)

5. Layout of L2 caches (6)

Page 43: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Figure 5.1: Implementation of exclusive L2 caches

Source: Zheng, Y., Davis, B.T., Jordan, M.: “ Performance evaluation of exclusive cache hierarchies”, 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),

2004, pp. 89-96.

5. Layout of L2 caches (7)

Page 44: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Exclusive L2

Inclusion policy of L2 caches

Inclusive L2

Most implementations Athlon 64X2 (2005)Dual-core Opteron lines (2005)

Expected trend

5. Layout of L2 caches (8)

Page 45: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the cores

Banking policy

Attaching L2 caches to MCPs

Use by instructions/data

Integration of L2 caches to the proc. chip

5. Layout of L2 caches (9)

Page 46: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Unified instr./data cache(s)

Use by instructions/data

Split instr./data cache(s)

Expected trend

Typical implementation Montecito (2006)

5. Layout of L2 caches (10)

Page 47: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the cores

Banking policy

Attaching L2 caches to MCPs

Use by instructions/data

Integration of L2 caches to the proc. chip

5. Layout of L2 caches (11)

Page 48: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Single-banked implementation

Banking policy

Multi-banked implementation

Typical implementation

5. Layout of L2 caches (12)

Page 49: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

UltraSPARC T1 (2005) (Niagara)(8 cores/4xL2 banks)

Core

X-bar

Core

L2 L2

Memory

M. c.

Memory

M. c.

The 128-byte long L2 cache lines are hashed acrossthe 3 modules. Hashing is performed by modulo 3arithmetric applied on a large number of real address bits.

The four L2 modules are interleaved at 64-byte blocks.

Mapping of addresses to the banks:

Addr.067

0 21

Modulo 3∑0

256

64

128

196

Mapping of addresses to the banks:

POWER4 (2001)POWER5 (2005)

Core

X-bar

Core

L2 L2L2

Fabric Bu SContr.IN (Fabric Bus Contr.)

L3 tags/contr.B. c.

GX bus

Figure 5.2: Alternatives of mapping addresses to memory modules

Mapping of addresses to memory modules

5. Layout of L2 caches (13)

Page 50: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the cores

Banking policy

Attaching L2 caches to MCPs

Use by instructions/data

Integration of L2 caches to the proc. chip

5. Layout of L2 caches (14)

Page 51: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Partial integration of L2

Integration to the processor chip

Full integration of L2

All other lines considered

UltraSPARC IV (2004) UltraSPARC V (2005)

Expected trend

PA 8800 (2004)PA 8900 (2005)

On-chip L2 tags/control,off-chip data

Entire L2 on-chip

5. Layout of L2 caches (15)

Page 52: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

6. Layout of L3 caches

Page 53: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Layout of L2 caches

Layout of the cores

Layout of the I/O andmemory architecture

Layout of L3 caches (if available)

6. Layout of L3 caches (1)

Layout of the interconnections

Macro architecture of multi-core (MC) processors

Page 54: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the L2 cache(s)

Banking policy

Layout of L3 caches

Use by instructions/data

Integration of L3 caches to the proc. chip

6. Layout of L3 caches (2)

Page 55: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Shared L3 cache for all L2s

Allocation of L3 caches to the L2 caches

Private L3 cache for each L2

Montecito (2006)

UltraSPARC IV+ (2004)

POWER5 (2005) POWER4 (2001)

Athlon 64 X2–Barcelona (2007)

6. Layout of L3 caches (3)

Page 56: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the L2 cache(s)

Banking policy

Layout of L3 caches

Use by instructions/data

Integration of L3 caches to the proc. chip

6. Layout of L3 caches (4)

Page 57: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Lines replaced (victimized) in the L2 are written to L3.

References to data missing in L2 but available in L3 initiate reloading the pertaining cache line to L2.Reloaded data is deleted from L3

L3 operates usually as a write back cache (only modified data that is replaced in the L3 is written back to the memory),

Unmodified data that is replaced in L3 is deleted.

L2

M.c.

Replaced (victimised)lines

Modified data

(low traffic)

Lines missing in L2 are reloaded

and deleted from L3

L3

Memory

Data missing in L2/L3

(high traffic)

L2

L3

Memory

Exclusive L3

Inclusion policy of L3 caches

Inclusive L3

(Victimcache)

6. Layout of L3 caches (5)

Page 58: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Exclusive L3

Inclusion policy of L3 caches

Inclusive L3

Montecito (2006)

UltraSPARC IV+ (2005)

POWER4 (2001) POWER5 (2005)

Athlon 64 X2–Barcelona (2007)

Expected trend

6. Layout of L3 caches (6)

Page 59: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the L2 cache(s)

Banking policy

Layout of L3 caches

Use by instructions/data

Integration of L3 caches to the proc. chip

6. Layout of L3 caches (7)

Page 60: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Unified instr./data cache(s)

Use by instructions/data

Split instr./data caches

All multicore processorsunveiled until now hold

both instructions and data

6. Layout of L3 caches (8)

Page 61: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the L2 cache(s)

Banking policy

Layout of L3 caches

Use by instructions/data

Integration of L3 caches to the proc. chip

6. Layout of L3 caches (9)

Page 62: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Single-banked implementation

Banking policy

Multi-banked implementation

All multicore processorsunveiled until now are multi-banked

6. Layout of L3 caches (10)

Page 63: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Inclusion policy

Allocation to the L2 cache(s)

Banking policy

Layout of L3 caches

Use by instructions/data

Integration of L3 caches to the proc. chip

6. Layout of L3 caches (11)

Page 64: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

L3 partially integrated

Integration to the processor chip

L3 fully integrated

POWER4 (2001)

UltraSPARC IV+ (2005)

POWER5 (2005)

Montecito (2006)

Expected trend

POWER6 (2007)

On-chip L3 tags/control,off-chip data

Entire L3 on-chip

6. Layout of L3 caches (12)

Page 65: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Examples of partially integrated L3 caches:

Core

L3 tags/contr.

L3 data

Interconn. network

M. c..

Memory

B. c.

Fire Planebus

Core

L2

Fabric Bus Contr.

L2

L2

L2

Memory

M. c.

L3 tags/contr.

L3 tags/contr.

L3 tags/contr.

L3 data

L3 data

L3 data

POWER5 (2005): UltraSPARC IV+ (2005):

6. Layout of L3 caches (13)

Page 66: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

6. Layout of L3 caches (14)

Figure 6.1: Cache architecture of MC processors

(Not differentiating between inclusive/exclusive alternatives in the figure, but denotingit in the examples)

L2 private

L2 partially integrated,no L3

Cache architecture

L2 shared

L2 private

L2 shared

L2 fully integrated,no L3

L2 private

L2 shared

L2 fully integrated, L3 partially integrated

L2 private

L2 shared

L2 fully integrated,L3 fully integrated

L3 private

L3 shared

L3private

L3shared

L3 private

L3shared

L3 private

L3shared

PA-8800

UltraSPARC IV

Montecito

POWER6

Core Duo(Yonah)

Core 2 Duo based proc-s

Cell BEUltraSPARC T1/T2

SPARC 64 VI/VIIXLR

POWER4POWER5(excl.)

UltraSPARC IV+(excl.)

PA-8900

Smithfield/Presler based proc.s

Athlon 64X2 (excl.)Opteron dual core

(excl.)

Gemini

Page 67: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

6. Layout of L3 caches (15)

Examples for cache architectures (1)

L2 partially integrated, privateno L3

UltraSPARC IV (2004)

Core

Interconn. network

M. c.

Memory

B. c.

Fire Planebus

Core

L2 data L2 data

L2 tags/contr. L2 tags/contr.

L2 partially integrated, sharedno L3

PA-8800 (2004) PA-8900 (2005)

L2 data

CoreL2 tags/contr.Core

FSB c.

FSB

Page 68: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

UltraSPARC T1 (2005)

L2L2 M. c.

B. c.

L2

L2

L2Core 7

M. c.

M. c.

M. c.

Core 0

X

b

a

r

Memory

Memory

Memory

Memory

JBus

L2 fully integrated, privateno L3

L2 fully integrated, sharedno L3

Athlon 64 X2 (2005)

IN (Xbar)

Memory

System Request Queue

B. c. M. c.

HT-bus

L2 L2

Core Core

Core Duo (2006)

Core 2 Duo (2006)

L2 contr.

Core

L2

Core

FSB c.

FSB

Examples for cache architectures (2)

6. Layout of L3 caches (16)

Page 69: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

6. Layout of L3 caches (17)

UltraSPARC IV+ (2005)

Core

L3 tags/contr.

L3 data

Interconn. network

M. c.

Memory

B. c.

Fire Planebus

Core

L2

POWER4 (2001)

IN (Fabric Bus Contr.)

M. c.

Memory

B. c. L3 tagscontr.

L3 data

L2 L2 L2

GX-bus

Chip-to-chip/

Mem.-to-Mem.interconn.

L2 fully integrated, sharedL3 partially integrated

Examples for cache architectures (3)

Page 70: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Core

L2 I L2 D

L3

Core

L2 I L2 D

L3

FSB c.

FSB

Montecito (2006)

L2 fully integrated, private,L3 fully integrated

Examples for cache architectures (4)

6. Layout of L3 caches (18)

Page 71: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture

Page 72: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Layout of L2 caches

Layout of the cores

Layout of the I/O andmemory architecture

Layout of L3 caches (if available)

7. Layout of memory and I/O architecture (1)

Layout of the interconnections

Macro architecture of multi-core (MC) processors

Page 73: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture (2)

The concept of connection

Layout of the I/O and memory connection

Point of attachmentPhysical implementations

of the M. c.

Page 74: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture (3)

Connection via the FSB

The concept of connection

Connection via the M. c and B. c

Used typically in connection with off-chip M. c.-s

Used in connection with on-chip M. c.-s

Examples:

Core

L2 I L2 D

L3

Core

L2 I L2 D

L3

FSB c.

FSB

Core 2 Duo Montecito

IN

Core

L2

Core

FSB c.

FSB

L2L2 M. c..

B. c..

L2

L2

L2Core 7

M. c..

M. c..

M. c..

Core 0

X

b

a

r

Memory

Memory

Memory

Memory

JBusUltraSPARC T1

Page 75: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture (4)

The highest cache level (via an IN)

The point of attachment

The IN connecting the two highest cache levels

L2 (private/shared)

L3 (private/shared)

The IN connecting the cores and a shared L2

The IN connecting a shared L2 and a shared L3

L2

IN

L2 L3

IN

L3

IN1

L2 L2

CC

IN

L3 L3 L3

L2 L2 L2

(2-level caches) (3-level caches) (2-level caches) (3-level caches)

Examples:

Page 76: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture (5)

Data missing in L2/L3

(high traffic)

Memory

L2

M.c.

Replaced (victimised)lines

Modified data(low traffic)

Lines missing in L2 are reloaded and deleted from L3

L3

Memory

L2

IN

L2

L3 L3

M.c.

L3 L3

M.c.

Memory

Memory

L3

L2

Montecito (2006)POWER4 (2001)

UltraSPARC IV+ (2004)POWER5 (2004)

Exclusive L3

Inclusion policy of L3 caches

Inclusive L3

Cache inclusivity vs. memory attachment

Page 77: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture (6)

The highest cache level (via an IN)

The point of attachment

The IN connecting the two highest cache levels

L2 (private/shared)

L3 (private/shared)

The IN connecting the cores and a shared L2

The IN connecting a shared L2 and a shared L3

L2

IN

L2 L3

IN

L3

IN1

L2 L2

CC

IN

L3 L3 L3

L2 L2 L2

(2-level caches) (3-level caches) (2-level caches) (3-level caches)

The M. c is connected usually in this way if the highest level cache is exclusive.

The M. c is connected usually in this way if the highest level cache is inclusive.

Examples:

Page 78: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture (7)

Core

L2 I L2 D

L3

Core

L2 I L2 D

L3

FSB c.

FSB

Montecito (2006)

IN (Xbar)

Memory

System Request Queue

B. c. M. c.

HT-bus

L2 L2

Core Core

L2 contr.

Core

L2

Core

FSB c.

FSB

Athlon 64 X2 (2005)Core 2 Duo (2006)

Examples for attaching memory and I/O via the highest cache level

In case of a two-level cache hierarchy In case of a three-level cache hierarchy

Page 79: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

7. Layout of memory and I/O architecture (8)

UltraSPARC T1 (2005) UltraSPARC IV+ (2005)

Examples for attaching memory and I/O via the interconnection network connecting the two highest levels of the cache hierarcy

In case of a two-level cache hierarchy In case of a three-level cache hierarchy

Core

L3 tags/contr.

L3 data

Interconn. network

M. c.

Memory

B. c.

Fire Planebus

Core

L2

L2L2 M. c.

B. c.

L2

L2

L2Core 7

M. c.

M. c.

M. c.

Core 0

X

b

a

r

Memory

Memory

Memory

Memory

JBus

(exclusive L3)

Page 80: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Off-chip memory controller On-chip memory controller

Integration of the memory controller to the processor chip

POWER4 (2001)

UltraSPARC IV+ (2005)

POWER5 (2005)

Montecito (2006?)

UltraSPARC T1 (2005)

UltraSPARC IV (2004)

Athlon 64 X2 (2005)Presler (2005)

Smithfield (2005)

PA-8800 (2004)

PA-8900 (2005)

Core (2006)

Yonah Duo (2006)

Expected trend

7. Layout of memory and I/O architecture (9)

Longer access times,but provides independency of

memory technology

Shortens access times,but causes dependency of

memory technology and speed

Page 81: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

8. Layout of the on-chip interconnections

Page 82: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

Layout of L2 caches

Layout of the cores

Layout of the I/O andmemory architecture

Layout of L3 caches (if available)

1. Layout of interconnections (1)

Layout of the interconnections

Macro architecture of multi-core (MC) processors

Page 83: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

8. Layout of interconnections (2)

IN1

L2 L2

C CL2

IN

L2

L3 L3

L2/L3

IN

L2/L3

B. c. M. c. FSB c.

I/O bus FSBMemory

ChipModule

Chip

Chip

Chipor

Between private or shared L2 modules

and shared L3 modules

Between cores and shared L2 modules

Between private or shared L2/L3 modules and the B. c./M. c. or

alternatively the FSB c.

On-chip interconnections

Chip-to-chip and module-to-module

interconnects

Page 84: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

8. Layout of interconnections (3)

Arbiter/Multi-port cache implementations Crossbar

Implementation of interconnections

Ring

For a small number ofsources/destinations

For a larger number ofsources/destinations

Cell BE (2006)XRI (2005)

eg. to connect dual-coresto shared L2 caches

UltraSPARC T1 (2005)UltraSPARC T2 (2007)

Quantitative aspects, such as the number of sources/destinations or bandwidth requirement affect which implementation alternative is the most beneficial.

Page 85: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

9. Basic alternatives of macro-architectures of MC processors

Page 86: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

9. Basic alternatives of macro-architectures of MC processors (1)

Basic dimensions spanning the design space

• Kind of related cache architecture

• Layout of the memory and I/O connection

• Layout of the IN(s) involved

• Cache inclusivity affects the layout of the memory connectionInterrelationships between particular dimensions

Page 87: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

9. Basic alternatives of macro-architectures of MC processors (2)

Considering a selected part of the design space of the macroarchitecture of MC processors

Assumptions:

• L2 is supposed to be partially integrated on the chip (i.e. only the tags are on the chip). This is not indicated in the design space tree but denoted in the examples. • The kind of the implementation of the involved INs is not considered.

• In case of an off-chip M. controller the use of an FSB is assumed.

• A two-level cache hierarchy is considered (no L3).

Page 88: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

9. Basic alternatives of macro-architectures of MC processors (3)

Off-chip M. c. On-chip M. c.

Design space of the macroarchitecture of MC processors (in case of a two-level cache hierarchy)

L2 private L2 shared L2 private L2 shared

Part of the design-space specifying the point of attachment

for the memory and I/O

Page 89: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

9. Basic alternatives of macro-architectures of MC processors (4)

Core 2 Duo based processors (2006)(2 cores)

SPARC64 VI (2007)(2 cores)

SPARC64 VII (2008)(4 cores)

Smithfield/Presler based processors (2005/2006)(2 cores)

PA-8800 (2004)(2 cores, L2 data off-chip)

FSB c.

FSB

IN1

L2 L2

IN2

L2

CC C

FSB c.

IN

L2 L2 L2

FSB

CC C

L2 private L2 shared

M . c. off-chip

Design space of the macroarchitecture of MC processors (2-level hierarchy) (1)

Page 90: III. Multicore Processors (1) Dezső Sima Spring 2007 (Ver. 2.0)  Dezső Sima, 2007.

9. Basic alternatives of macro-architectures of MC processors (5)

Design space of the macroarchitecture of MC processors (2-level hierarchy) (2)

M. c. on-chip

L2 privateL2 shared

Attaching both I/O and M. c. to L2 via IN2

Attaching M. c. to L2 via IN2 and I/O to IN1

Attaching both I/O and M.c. via IN1

Attaching I/O. to L2 via IN2 and M.c. to IN1

UltraSPARC T1 (2005) (8 cores)

Cell BE (2006)(8 cores)

Opteron dual core (2 cores),(2005)

UltraSPARC IV(2 cores, L2 data off-chip), (2004)

Athlon 64 X2 (2005)(2 cores)

IN

L2 L2

B. c.

I/O-bus Memory

M. c.

CC

IN1

L2

B. c.

I/O-bus Memory

M. c.

IN2

L2

CC

I/O-bus

IN1

L2

B. c.

Memory

M. c.

IN2

L2

CC

IN1

L2

B. c.

I/O-bus

Memory

M. c.

L2

CC

IN1

L2

M. c.

Memory

I/O-bus

B. c.

IN2

L2

CC