Multi-Core Processors
Sriram Vajapeyam
[Affiliation: Freelance Researcher, Feb–Dec 2010]
April 20, 2010
Outline
• The Big Picture: An Example Multi-Core
• Why Multi-Core?
• Applications for Multi-Core?
• Architecture:
  – Cores
  – Caches
  – Interconnect, etc.
• Putting it all together: IBM Power7, Tilera TILE64
Example Multi-Core: IBM POWER7
• 8 Processor Cores
• Shared L3 (32MB)
• Interconnect
• Memory Controllers
• SMP Coherence Support
32 threads; 32+ MB cache; 100GB/s off-chip memory bandwidth
• Off-chip links: GX+ (chip-to-chip), MCM-to-MCM, SMP
[Hot Chips, Aug. 2009]
Industry-Wide Trend…
Category                        | Processor      | # Cores x SMT = Threads | On-Chip Cache (last-level) | On-Chip Interconnect
General-Purpose/Servers/Supers  | IBM Power7     | 8 x 4-way = 32          | 32MB                       | Hi-Perf
                                | Intel Core i7  | 8 x 2-way = 16          | 8MB                        | Point-to-Point
                                | SUN Niagara T2 | 8 x 8-way = 64          | 4MB                        | Crossbar
                                | AMD Phenom     | 4 x 1-way = 4           | 6MB                        | Point-to-Point
Graphics                        | NVidia G200    | 240 x 1-way = 240       | 16KB x 30                  | --
DSP/Network/Multimedia/Embedded | TI OMAP 4430   | 3 x 1-way = 3           | --                         | Bus
                                | Tilera TILE64  | 64 x 1-way = 64         | 4MB                        | Mesh
The First Multi-Core: IBM POWER4 Processor (2000/2001)
Multi-Cores
Why Multi-Cores?
Why Multi-Cores?
• Power – Energy – Temperature Wall (e.g., Intel Core i7: 130W; AMD Phenom: 140W)
  o Dynamic power ∝ Voltage² × Frequency × Activity
    ! Frequency is a function of Voltage
    ! BUT, as Voltage reduces, V-threshold also needs to reduce
  o Short-circuit power ∝ Voltage × Frequency × Activity × I-short
  o Leakage power ∝ 1 / e^(V-threshold) – a problem at low voltages
  o Many slower cores are better than a few faster cores, since frequency and voltage both reduce (see the sketch below)
• Instruction-Level-Parallelism (ILP) Wall
• Design & Verification Complexity (Power7: 1.2B transistors)
No Better Ideas Yet ?!
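A back-of-the-envelope illustration of the dynamic-power bullet above, as a minimal sketch; the 0.8x voltage/frequency operating point is an assumed illustration, not measured data:

```c
/* Sketch of the dynamic-power argument: P_dynamic ~ V^2 * f * Activity,
 * and f roughly tracks V, so scaling V and f down together and adding
 * cores buys throughput almost for free in power terms. */
#include <stdio.h>

static double dyn_power(double v, double f, double activity) {
    return v * v * f * activity;            /* proportional, arbitrary units */
}

int main(void) {
    double one_fast = dyn_power(1.0, 1.0, 1.0);        /* nominal V and f */
    double two_slow = 2.0 * dyn_power(0.8, 0.8, 1.0);  /* V, f scaled together */

    printf("1 fast core : power 1.00x, peak throughput 1.00x\n");
    printf("2 slow cores: power %.2fx, peak throughput 1.60x\n",
           two_slow / one_fast);            /* ~1.02x power for 1.6x throughput */
    return 0;
}
```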
Processor Power Management
Dynamic Power
• Multiple slower cores
• Speed/complexity of each core tailored to its workloads
• Clock gating, etc. – but leakage power continues
Leakage Power
• Power gating: functional units are switched off
• Various prediction algorithms decide when to switch off (see the sketch below)
Thermal Hotspots, Total Energy
• As important as power itself: power density (hotspots) and the power integral (energy)
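A minimal sketch of one common power-gating policy hinted at above: gate a unit after a fixed idle timeout. The threshold, unit model, and usage pattern are illustrative assumptions, not Power7's actual algorithm:

```c
/* Timeout-based power gating: switch a functional unit off after it has
 * been idle for N consecutive cycles, to cut leakage power. */
#include <stdbool.h>
#include <stdio.h>

#define IDLE_TIMEOUT 16         /* assumed idle cycles before gating */

typedef struct {
    int  idle_cycles;
    bool gated;
} unit_t;

/* Called once per cycle with whether the unit was used this cycle. */
static void power_gate_tick(unit_t *u, bool used, int cycle) {
    if (used) {
        if (u->gated) printf("cycle %d: unit woken up\n", cycle);
        u->idle_cycles = 0;
        u->gated = false;       /* real hardware also pays a wakeup delay */
    } else if (!u->gated && ++u->idle_cycles >= IDLE_TIMEOUT) {
        u->gated = true;        /* switch off to cut leakage power */
        printf("cycle %d: unit power-gated\n", cycle);
    }
}

int main(void) {
    unit_t fpu = {0, false};
    for (int cycle = 0; cycle < 40; cycle++)
        power_gate_tick(&fpu, /*used=*/cycle < 10 || cycle == 35, cycle);
    return 0;
}
```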
Power Mgmt. not covered in this talk
Multi-Cores
Applications?
Multi-Core Applications?
• Grandest Challenge in Computer Science Today!
[paraphrase:] Problems triggered by the halt of processor improvements and the onset of multi-cores are the biggest research problem in all of Computer Science today!
• John Hennessy (ACM/IEEE Eckert-Mauchly Award winner)
• Fran Allen (ACM Turing Award winner)
Parallelising Software (for Multi-Core)
• Easy:
  o Add 10M pairs of numbers
  o Service 100 different web-page read requests
  o Search a large database
• Moderate (i.e., has sequential components):
  o Sort a large set (terabytes) of numbers
  o Sum a large set of numbers (see the sketch below)
• Difficult:
  o Complex decision-based computation ("spaghetti code")
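A minimal sketch of the "sum a large set of numbers" case, assuming OpenMP is available (compile with gcc -fopenmp); the per-thread partial sums run fully in parallel, and only the final combine is sequential:

```c
/* Parallel sum with a small sequential component (the reduction). */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long n = 10000000;
    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++) a[i] = 1.0;

    double sum = 0.0;
    /* Each thread sums a private chunk; the reduction clause combines
     * the partial sums at the end: the sequential component. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);   /* 10000000 */
    free(a);
    return 0;
}
```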
Possible New Benchmarks
• The Landscape of Parallel Computing Research: A View From Berkeley [Patterson et al.]
Applications
1. What are the applications?
2. What are common kernels of the applications?
Architecture and Hardware
3. What are the HW building blocks?
4. How to connect them?
Programming Model and Systems Software
5. How to describe applications and kernels?
6. How to program the hardware?
Evaluation
7. How to measure success?
Possible New Benchmarks
• Dwarf Mine (from UC Berkeley) – Kernels
List of Dwarfs
1. Dense Linear Algebra (see the sketch below)
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines
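To make the dwarfs concrete, here is the simplest possible instance of the Dense Linear Algebra dwarf, a naive matrix multiply; sizes and values are illustrative:

```c
/* Dense Linear Algebra dwarf: naive matrix multiply. Kernels like this
 * have abundant, regular parallelism (every C[i][j] is independent),
 * which is why the dwarfs are proposed as multi-core benchmarks. */
#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N], B[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    for (int i = 0; i < N; i++)          /* each (i, j) pair could run */
        for (int j = 0; j < N; j++)      /* on a different core/thread */
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %.0f\n", C[0][0]); /* N * 1.0 * 2.0 = 8 */
    return 0;
}
```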
Multi-Core Design
Multi-Core Design Questions
• How powerful should each core be?
• How much cache should each core have?
• How do different cores/threads communicate?
• How do different cores/threads synchronize? (see the sketch below)
• How much off-chip memory bandwidth/latency is needed?
• How much multi-chip SMP support?
• How can a task be partitioned across multiple cores?
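A minimal sketch of the synchronization question: a spinlock built from a C11 atomic test-and-set, the kind of primitive a multi-core's ISA and coherence protocol must support. This illustrates the primitive, not any particular processor's implementation (compile with -pthread):

```c
/* Two threads updating one counter, serialized by a test-and-set spinlock. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        while (atomic_flag_test_and_set(&lock))
            ;                            /* spin until the lock is free */
        counter++;                       /* critical section */
        atomic_flag_clear(&lock);        /* release */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 2000000, thanks to the lock */
    return 0;
}
```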
Multi-Core Design
Core Micro-Architecture
[Source: Pradip Bose, IBM Virtual Low Power Seminar, August 2009]
Power-Performance Efficient Pipeline Depth
[Figure: pipeline depth vs. performance, marking the power-performance-optimal and the performance-optimal depths. Sources: V. Srinivasan et al., MICRO-2002; V. Zyuban et al., IEEE TC, Aug. 2004; Pradip Bose, IBM Virtual Low Power Seminar, August 2009]
Workload impact: TPCC Trace
[Figure: the same pipeline-depth analysis for a TPC-C trace, again marking the power-performance-optimal and the performance-optimal depths. Sources: V. Srinivasan et al., MICRO-2002; V. Zyuban et al., IEEE TC, Aug. 2004; Pradip Bose, IBM Virtual Low Power Seminar, August 2009]
Simple vs. Complex Cores in a Chip
• For a given power budget, higher throughput is achieved by multiple simple cores, on both SMP workloads and independent threads.
• A complex core provides much higher single-thread performance; scaling up a simple core by reducing FO4 and/or raising Vdd does not achieve this level of performance.
• It may be worthwhile to have multiple heterogeneous cores on chip.
• The appropriate design point depends on the workload being supported.
[Source: Zyuban et al. 2004, presented by T. Agerwala, keynote at ISCA 2004]
Example Core: SUN Niagara-2 (a simple core)
• 8-Stage Integer Pipeline:
  o Fetch, Cache, Pick, Decode, Execute, Memory, Bypass, Write-Back
  o Floating-Point: Execute takes 6 stages instead of 1
  o Memory-Dependence: 3 clocks
• Branch Prediction: NONE (default: not-taken)
• Instruction Issue: 1 instruction/cycle per thread
• Threads: 2-way SMT, 8-way active thread pool
Simple, Throughput-oriented Cores
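A minimal sketch of the idea behind such a throughput-oriented core's Pick stage: each cycle, issue from the next ready thread, round-robin, hiding stalls with thread-level parallelism instead of branch prediction or out-of-order logic. The ready pattern and policy details are illustrative assumptions, not Niagara-2's exact logic:

```c
/* Round-robin thread selection among ready threads, one issue per cycle. */
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 8

int main(void) {
    bool ready[NTHREADS] = {true, true, false, true,   /* threads 2 and 5    */
                            true, false, true, true};  /* stalled (cache miss) */
    int last = NTHREADS - 1;

    for (int cycle = 0; cycle < 8; cycle++) {
        for (int i = 1; i <= NTHREADS; i++) {          /* scan round-robin */
            int t = (last + i) % NTHREADS;
            if (ready[t]) {
                printf("cycle %d: issue from thread %d\n", cycle, t);
                last = t;
                break;
            }
        }
    }
    return 0;
}
```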
Another Example: IBM Power6 Core (hi-perf core)
• Instruction Issue: 8-way fetch/decode; 64-entry instruction buffer per thread; 7-way dispatch
• Integer Pipeline: 11/16 stages
  o Floating-Point: 6-stage Execute
• Branch Prediction: 16K-entry BHT, 2 bits per entry
• Threads: 2-way SMT
High-Performance, Minimally Out-of-Order Core
IBM Power6 Core (contd.)
Industry-Wide Trend…
Category                        | Processor      | # Cores x SMT = Threads | Complexity
General-Purpose/Servers/Supers  | IBM Power6     | 2 x 2-way = 4           | 7-way Dispatch
                                | Intel Core i7  | 8 x 2-way = 16          | 4-way OoO
                                | SUN Niagara T2 | 8 x 8-way = 64          | 1-way In-Order
                                | AMD Phenom     | 4 x 1-way = 4           | 3-way OoO
Graphics                        | NVidia G200    | 240 x 1-way = 240       | 1-way In-Order
DSP/Network/Multimedia/Embedded | TI OMAP 4430   | 3 x 1-way = 3           | 3-way OoO + 8-way VLIW
                                | Tilera TILE64  | 64 x 1-way = 64         | 3-way VLIW
Multi-Core Design
Caches & On-Chip Interconnect
Multi-Core On-Chip RAM
• Caches vs. Local Memory:
  o IBM Cell: local memory per core (SPE) – a programming challenge
  o Most general-purpose processors: caches, not local memory
• Caches:
  o Shared, coherent caches vs. private caches
  o Physically unified vs. physically distributed
• Workload impact:
  o Throughput vs. shared-memory workloads (see the false-sharing sketch below)
  o Last-level capacity misses can dominate for some applications
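A minimal sketch of one way cache organization interacts with shared-memory workloads: false sharing. Two threads' adjacent counters land on the same (assumed 64-byte) cache line, so coherent caches ping-pong the line between cores; the padding shown avoids it (compile with -pthread):

```c
/* Two threads update their own counters; padding keeps each counter on
 * its own cache line so the coherence protocol never ping-pongs a line. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000

struct { long count; char pad[64 - sizeof(long)]; } counters[2]; /* padded */

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].count++;   /* without pad[], both counters would share
                                   one line: false sharing */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].count, counters[1].count);
    return 0;
}
```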
Multi-Core On-Chip RAM
IBM Power Series
• IBM Power5: shared 2MB on-chip L2; off-chip victim L3
• IBM Power6: private 4MB L2s; 32MB off-chip victim L3
• IBM Power7: on-chip L3, logically shared but physically distributed; not just a victim cache
Tilera TILE64: L2 logically shared(?), physically distributed
Multi-Core On-Chip Interconnect
• Typically crossbar or multistage
  o > 10 clocks to traverse from a core to the shared cache
• Area, power concerns:
  o Area equivalent of ~5 cores?
  o Power equivalent of ~2 cores?
• Examples:
  o Crossbar – SUN Niagara
  o Mesh – Tilera (see the hop-count sketch below)
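A minimal sketch of mesh latency under XY (dimension-ordered) routing, where hop count is simply Manhattan distance; the 8x8 size matches TILE64, but the routing details are assumptions for illustration:

```c
/* Hop count between two tiles in an SIDE x SIDE mesh with XY routing. */
#include <stdio.h>
#include <stdlib.h>

#define SIDE 8   /* 8 x 8 = 64 tiles, as in the TILE64 */

static int hops(int src, int dst) {
    int sx = src % SIDE, sy = src / SIDE;
    int dx = dst % SIDE, dy = dst / SIDE;
    return abs(sx - dx) + abs(sy - dy);   /* route X first, then Y */
}

int main(void) {
    printf("corner-to-corner: %d hops\n", hops(0, SIDE * SIDE - 1)); /* 14 */
    printf("neighbors       : %d hops\n", hops(0, 1));               /* 1  */
    return 0;
}
```

Worst-case distance grows as 2*(sqrt(N)-1) hops for N tiles, which is why mesh latency, unlike a crossbar's, depends on placement.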
Multi-Core Arch: Cache, Interconnect..
Processor      | # Cores x SMT = Threads | On-Chip Cache (last-level) | On-Chip Interconnect
IBM Power7     | 8 x 4-way = 32          | 32MB                       | Hi-Perf
Intel Core i7  | 8 x 2-way = 16          | 8MB                        | Point-to-Point
SUN Niagara T2 | 8 x 8-way = 64          | 4MB                        | Crossbar
AMD Phenom     | 4 x 1-way = 4           | 6MB                        | Point-to-Point
NVidia G200    | 240 x 1-way = 240       | 16KB x 30                  | --
TI OMAP 4430   | 3 x 1-way = 3           | --                         | Bus
Tilera TILE64  | 64 x 1-way = 64         | 4MB                        | Mesh
Putting it all together..
Example Multi-Cores:
• Tilera TILE-64
• IBM Power Series
Tilera TILE-64 [Ack: Tilera website]
Tilera TILE-64
• 8 x 8 grid of identical, general-purpose processor cores (tiles)
• 3-way VLIW pipeline for instruction-level parallelism
• 5 MB of on-chip cache
• Up to 443 billion operations per second (BOPS)
• 31 Tbps of on-chip mesh interconnect
• 500MHz – 866MHz operating frequency
• 15 – 22W @ 700MHz with all cores active
• Idle tiles can be put into a low-power sleep mode
• Four DDR2 memory controllers with optional ECC
• iLib™ APIs for efficient inter-tile communication
• ANSI standard C/C++ compiler
Tilera TILE-64 Core
IBM POWER5 / POWER6
• 2 Processor Cores
• 2-way SMT
• Private L2 (4 MB)
• Off-chip Victim L3
• L3 Controller
• Memory Controllers
• SMP Coherence Support
• Off-chip links: GX+ (chip-to-chip), MCM-to-MCM, SMP
IBM POWER7
• 8 Processor Cores
• Shared L3 (32MB)
• Interconnect
• Memory Controllers
• SMP Coherence Support
32 Threads; 32+ MB Cache; 100GB/s off-chip Memory-BW
Access latency ratios – L1: 1x, L2: 4x, local L3: 12x, remote L3: 60x, Memory: 180x (see the AMAT sketch below)
• Off-chip links: GX+ (chip-to-chip), MCM-to-MCM, SMP
[Hot Chips, Aug. 2009]
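A minimal sketch of what those latency ratios imply for average memory access time (AMAT = hit time + miss rate × next-level AMAT). Only the 1x/4x/12x/60x/180x ratios come from the slide; the per-level hit rates are invented for illustration:

```c
/* Fold the memory hierarchy inward to compute AMAT in units of L1 latency. */
#include <stdio.h>

int main(void) {
    double lat[] = {1, 4, 12, 60, 180};        /* L1, L2, local L3, L3, memory */
    double hit[] = {0.90, 0.80, 0.50, 0.70};   /* assumed per-level hit rates  */

    double amat = lat[4];                      /* memory always "hits" */
    for (int i = 3; i >= 0; i--)               /* innermost level last  */
        amat = lat[i] + (1.0 - hit[i]) * amat;

    printf("AMAT = %.1fx L1 latency\n", amat); /* ~2.8x with these rates */
    return 0;
}
```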
Summary
Multi-Cores are industry-wide dominant
o Core Count: trends from 10s to 100s to 1000s
o Core Complexity: Simple, In-Order to Hi-Perf, OoO
o Threads: SMT
o Caches: Multi-MB Embedded DRAM
o Networks: Buses, Crossbars, Point-to-Point
Architecture, Programming Grand Challenge!
Multi-Core Performance Modeling
Multi-Core Integrated Modeling Challenges
1. Many cores to simulate!
   o Trend: 10s to 100s to 1000s of cores!
   o The simulator itself needs to be parallelized/distributed
     ! single-thread simulation is 1000x–100,000x slower than native execution
2. Fine component-granularity modeling is needed:
   o Energy/power management is done at the level of individual functional units, or at finer grain
   o IBM Power7: independent switch-off capability for individual execution units, threads, and cores
Parallel Simulation Challenges
Threaded Applications
1. Memory References
   o Potentially, every memory reference conflicts with some other thread – a synchronization point for the parallel simulator
   o Memory references are frequent: at least ~20% of all instructions
2. Execution-Driven (vs. trace-driven) simulation:
   o Traces may be too long to store and re-process
   o Traces may not capture timing aspects of threaded workloads, e.g., OS effects
Example Parallel, Distr. Sim.: MIT Graphite
• Parallel (within a multicore) and distributed (across chips)
• Modular: each target core has its own simulator thread
• Synchronization: various slack schemes
• Accuracy: not cycle-accurate
  o Combines modeling, direct execution, loose synchronization, etc.
  o Error less than ~2%
• Performance: scales with the number of host cores
Graphite Architecture
Graphite: Synchronization Models
• Lax: synchronize only at application synchronization, messages, and thread events
  o Error: < 10%; indicates performance trends
  o Performance: best runtime and scalability
• LaxBarrier: synchronize at fixed "quanta"
  o Error: almost none
  o Performance: poor runtime, OK scaling
• LaxP2P: randomized point-to-point synchronization (see the sketch below)
  o Error: < 2%; performance: almost as good as Lax
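A minimal sketch of the LaxP2P idea: each simulator thread advances its own target clock and occasionally stalls if it has run too far ahead of a randomly chosen peer. All constants and structure here are illustrative assumptions; see the Graphite paper for the real mechanism (compile with -pthread):

```c
/* LaxP2P-style slack synchronization among per-core simulator threads. */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NCORES 4
#define SLACK  1000      /* max cycles a thread may lead a random peer */
#define END    100000

static atomic_long clocks[NCORES];

static void *sim_core(void *arg) {
    long me = (long)arg;
    unsigned seed = (unsigned)me;
    while (atomic_load(&clocks[me]) < END) {
        atomic_fetch_add(&clocks[me], 1);             /* "simulate" one cycle */
        if (atomic_load(&clocks[me]) % 256 == 0) {    /* periodic peer check */
            long peer = rand_r(&seed) % NCORES;
            while (atomic_load(&clocks[me]) -
                   atomic_load(&clocks[peer]) > SLACK)
                sched_yield();           /* too far ahead: let peer catch up */
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t[NCORES];
    for (long i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, sim_core, (void *)i);
    for (int i = 0; i < NCORES; i++)
        pthread_join(t[i], NULL);
    printf("all cores reached cycle %d\n", END);
    return 0;
}
```

The slowest thread never waits, so the scheme makes progress; error stays bounded because no thread can lead any sampled peer by more than SLACK cycles.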
Summary
• Multi-Cores are industry-wide dominant
  o Core Count: trends from 10s to 100s to 1000s
  o Core Complexity: simple in-order to high-performance out-of-order
  o Threads: SMT
  o Caches: multi-MB, embedded DRAM
  o Networks: buses, crossbars, point-to-point
• Modeling needs to leverage existing multi-cores
  o Sequential simulation is too slow: 1000x–100,000x slowdown
  o Fine-granularity modeling is useful
References
• G. Blake, R. Dreslinski, T. Mudge, "A Survey of Multicore Processors", IEEE Signal Processing Magazine, Nov. 2009
• J. E. Miller, et al., "Graphite: A Distributed, Parallel Simulator for Multicores", MIT CSAIL Technical Report 2009-056, Nov. 2009
• K. Asanovic, et al., "A View of the Parallel Computing Landscape", Communications of the ACM, Oct. 2009