Multi-Core Processors
Sriram Vajapeyam
[Affiliation: Freelance Researcher, Feb–Dec 2010]
April 20, 2010
Outline
• The Big Picture: An Example Multi-Core
• Why Multi-Core?
• Applications for Multi-Core?
• Architecture:
  – Cores
  – Caches
  – Interconnect, etc.
• Putting it all together: IBM Power7, Tilera TILE64
Example Multi-Core: IBM POWER7
• 8 Processor Cores
• Shared L3 (32MB)
• Interconnect
• Memory Controllers
• SMP Coherence Support
32 threads; 32+ MB cache; 100GB/s off-chip memory bandwidth
• Off-chip links: GX+ (chip-to-chip), MCM-to-MCM, SMP
[Hot Chips, Aug. 2009]
Industry-Wide Trend…
Category                        | Processor      | # Cores x SMT = Threads | On-Chip Cache (last-level) | On-Chip Interconnect
General-Purpose/Servers/Supers  | IBM Power7     | 8 x 4-way = 32          | 32MB                       | Hi-Perf
                                | Intel Core i7  | 8 x 2-way = 16          | 8MB                        | Point-to-Point
                                | SUN Niagara T2 | 8 x 8-way = 64          | 4MB                        | Crossbar
                                | AMD Phenom     | 4 x 1-way = 4           | 6MB                        | Point-to-Point
Graphics                        | NVidia G200    | 240 x 1-way = 240       | 16KB x 30                  | --
DSP/Network/Multimedia/Embedded | TI OMAP 4430   | 3 x 1-way = 3           | --                         | Bus
                                | Tilera TILE64  | 64 x 1-way = 64         | 4MB                        | Mesh
The First Multi-Core: IBM POWER4 Processor (2000/2001)
Multi-Cores
Why Multi-Cores?
Why Multi-Cores?
• Power – Energy – Temperature Wall (e.g., Intel Core i7: 130W; AMD Phenom: 140W)
  o Dynamic power ∝ Voltage² × Frequency × Activity
    ! Frequency is a function of Voltage
    ! BUT, as Voltage reduces, V-threshold also needs to reduce
  o Short-circuit power ∝ Voltage × Frequency × Activity × I-short
  o Leakage power ∝ 1 / e^(V-threshold) – a problem at low voltages
  o Many slower cores are better than a few faster cores, since frequency and voltage both reduce (see the sketch below)
• Instruction-Level-Parallelism (ILP) Wall
• Design & Verification Complexity (Power7: 1.2B transistors)
No Better Ideas Yet ?!
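A back-of-the-envelope illustration of the dynamic-power bullet above, as a minimal sketch; the 0.8x voltage/frequency operating point is an assumed illustration, not measured data:

```c
/* Sketch of the dynamic-power argument: P_dynamic ~ V^2 * f * Activity,
 * and f roughly tracks V, so scaling V and f down together and adding
 * cores buys throughput almost for free in power terms. */
#include <stdio.h>

static double dyn_power(double v, double f, double activity) {
    return v * v * f * activity;            /* proportional, arbitrary units */
}

int main(void) {
    double one_fast = dyn_power(1.0, 1.0, 1.0);        /* nominal V and f */
    double two_slow = 2.0 * dyn_power(0.8, 0.8, 1.0);  /* V, f scaled together */

    printf("1 fast core : power 1.00x, peak throughput 1.00x\n");
    printf("2 slow cores: power %.2fx, peak throughput 1.60x\n",
           two_slow / one_fast);            /* ~1.02x power for 1.6x throughput */
    return 0;
}
```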
Processor Power Management
Dynamic Power
• Multiple slower cores
• Speed/complexity of each core tailored to its workloads
• Clock gating, etc. – but leakage power continues
Leakage Power
• Power gating: functional units are switched off
• Various prediction algorithms decide when to switch off (see the sketch below)
Thermal Hotspots, Total Energy
• As important as power itself: power density (hotspots) and the power integral (energy)
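A minimal sketch of one common power-gating policy hinted at above: gate a unit after a fixed idle timeout. The threshold, unit model, and usage pattern are illustrative assumptions, not Power7's actual algorithm:

```c
/* Timeout-based power gating: switch a functional unit off after it has
 * been idle for N consecutive cycles, to cut leakage power. */
#include <stdbool.h>
#include <stdio.h>

#define IDLE_TIMEOUT 16         /* assumed idle cycles before gating */

typedef struct {
    int  idle_cycles;
    bool gated;
} unit_t;

/* Called once per cycle with whether the unit was used this cycle. */
static void power_gate_tick(unit_t *u, bool used, int cycle) {
    if (used) {
        if (u->gated) printf("cycle %d: unit woken up\n", cycle);
        u->idle_cycles = 0;
        u->gated = false;       /* real hardware also pays a wakeup delay */
    } else if (!u->gated && ++u->idle_cycles >= IDLE_TIMEOUT) {
        u->gated = true;        /* switch off to cut leakage power */
        printf("cycle %d: unit power-gated\n", cycle);
    }
}

int main(void) {
    unit_t fpu = {0, false};
    for (int cycle = 0; cycle < 40; cycle++)
        power_gate_tick(&fpu, /*used=*/cycle < 10 || cycle == 35, cycle);
    return 0;
}
```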
Power Mgmt. not covered in this talk
Multi-Cores
Applications?
Multi-Core Applications?
• Grandest Challenge in Computer Science Today!
[paraphrase:] Problems triggered by the halt of processor improvements and the onset of multi-cores are the biggest research problem in all of Computer Science today!
• John Hennessy (ACM/IEEE Eckert-Mauchly Award winner)
• Fran Allen (ACM Turing Award winner)
Parallelising Software (for Multi-Core)
• Easy:
  o Add 10M pairs of numbers
  o Service 100 different web-page read requests
  o Search a large database
• Moderate (i.e., has sequential components):
  o Sort a large set (terabytes) of numbers
  o Sum a large set of numbers (see the sketch below)
• Difficult:
  o Complex decision-based computation ("spaghetti code")
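A minimal sketch of the "sum a large set of numbers" case, assuming OpenMP is available (compile with gcc -fopenmp); the per-thread partial sums run fully in parallel, and only the final combine is sequential:

```c
/* Parallel sum with a small sequential component (the reduction). */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long n = 10000000;
    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++) a[i] = 1.0;

    double sum = 0.0;
    /* Each thread sums a private chunk; the reduction clause combines
     * the partial sums at the end: the sequential component. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);   /* 10000000 */
    free(a);
    return 0;
}
```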
Possible New Benchmarks
• The Landscape of Parallel Computing Research: A View From Berkeley [Patterson et al.]
Applications
1. What are the applications?
2. What are common kernels of the applications?
Architecture and Hardware
3. What are the HW building blocks?
4. How to connect them?
Programming Model and Systems Software
5. How to describe applications and kernels?
6. How to program the hardware?
Evaluation
7. How to measure success?
Possible New Benchmarks
• Dwarf Mine (from UC Berkeley) – Kernels
List of Dwarfs
1. Dense Linear Algebra (see the sketch below)
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines
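To make the dwarfs concrete, here is the simplest possible instance of the Dense Linear Algebra dwarf, a naive matrix multiply; sizes and values are illustrative:

```c
/* Dense Linear Algebra dwarf: naive matrix multiply. Kernels like this
 * have abundant, regular parallelism (every C[i][j] is independent),
 * which is why the dwarfs are proposed as multi-core benchmarks. */
#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N], B[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    for (int i = 0; i < N; i++)          /* each (i, j) pair could run */
        for (int j = 0; j < N; j++)      /* on a different core/thread */
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %.0f\n", C[0][0]); /* N * 1.0 * 2.0 = 8 */
    return 0;
}
```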
Multi-Core Design
Multi-Core Design Questions
• How powerful should each core be?
• How much cache should each core have?
• How do different cores/threads communicate?
• How do different cores/threads synchronize? (see the sketch below)
• How much off-chip memory bandwidth/latency is needed?
• How much multi-chip SMP support?
• How can a task be partitioned across multiple cores?
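A minimal sketch of the synchronization question: a spinlock built from a C11 atomic test-and-set, the kind of primitive a multi-core's ISA and coherence protocol must support. This illustrates the primitive, not any particular processor's implementation (compile with -pthread):

```c
/* Two threads updating one counter, serialized by a test-and-set spinlock. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        while (atomic_flag_test_and_set(&lock))
            ;                            /* spin until the lock is free */
        counter++;                       /* critical section */
        atomic_flag_clear(&lock);        /* release */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 2000000, thanks to the lock */
    return 0;
}
```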
Multi-Core Design
Core Micro-Architecture
[Source: Pradip Bose, IBM Virtual Low Power Seminar, August 2009]
Power-Performance Efficient Pipeline Depth
[Figure: pipeline depth vs. performance, marking the power-performance-optimal and the performance-optimal depths. Sources: V. Srinivasan et al., MICRO-2002; V. Zyuban et al., IEEE TC, Aug. 2004; Pradip Bose, IBM Virtual Low Power Seminar, August 2009]
Workload impact: TPCC Trace
[Figure: the same pipeline-depth analysis for a TPC-C trace, again marking the power-performance-optimal and the performance-optimal depths. Sources: V. Srinivasan et al., MICRO-2002; V. Zyuban et al., IEEE TC, Aug. 2004; Pradip Bose, IBM Virtual Low Power Seminar, August 2009]
Simple vs. Complex Cores in a Chip
• For a given power budget, higher throughput is achieved by multiple simple cores, on both SMP workloads and independent threads.
• A complex core provides much higher single-thread performance; scaling up a simple core by reducing FO4 and/or raising Vdd does not achieve this level of performance.
• It may be worthwhile to have multiple heterogeneous cores on chip.
• The appropriate design point depends on the workload being supported.
[Source: Zyuban et al. 2004, presented by T. Agerwala, keynote at ISCA 2004]
Example Core: SUN Niagara-2 (a simple core)
• 8-Stage Integer Pipeline:
  o Fetch, Cache, Pick, Decode, Execute, Memory, Bypass, Write-Back
  o Floating-Point: Execute takes 6 stages instead of 1
  o Memory-Dependence: 3 clocks
• Branch Prediction: NONE (default: not-taken)
• Instruction Issue: 1 instruction/cycle per thread
• Threads: 2-way SMT, 8-way active thread pool
Simple, Throughput-oriented Cores
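A minimal sketch of the idea behind such a throughput-oriented core's Pick stage: each cycle, issue from the next ready thread, round-robin, hiding stalls with thread-level parallelism instead of branch prediction or out-of-order logic. The ready pattern and policy details are illustrative assumptions, not Niagara-2's exact logic:

```c
/* Round-robin thread selection among ready threads, one issue per cycle. */
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 8

int main(void) {
    bool ready[NTHREADS] = {true, true, false, true,   /* threads 2 and 5    */
                            true, false, true, true};  /* stalled (cache miss) */
    int last = NTHREADS - 1;

    for (int cycle = 0; cycle < 8; cycle++) {
        for (int i = 1; i <= NTHREADS; i++) {          /* scan round-robin */
            int t = (last + i) % NTHREADS;
            if (ready[t]) {
                printf("cycle %d: issue from thread %d\n", cycle, t);
                last = t;
                break;
            }
        }
    }
    return 0;
}
```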
Another Example: IBM Power6 Core (hi-perf core)
• Instruction Issue: 8-way fetch/decode; 64-entry instruction buffer per thread; 7-way dispatch
• Integer Pipeline: 11/16 stages
  o Floating-Point: 6-stage Execute
• Branch Prediction: 16K-entry BHT, 2 bits per entry
• Threads: 2-way SMT
High-Performance, Minimally Out-of-Order Core
IBM Power6 Core (contd.)
Industry-Wide Trend…
Category                        | Processor      | # Cores x SMT = Threads | Complexity
General-Purpose/Servers/Supers  | IBM Power6     | 2 x 2-way = 4           | 7-way Dispatch
                                | Intel Core i7  | 8 x 2-way = 16          | 4-way OoO
                                | SUN Niagara T2 | 8 x 8-way = 64          | 1-way In-Order
                                | AMD Phenom     | 4 x 1-way = 4           | 3-way OoO
Graphics                        | NVidia G200    | 240 x 1-way = 240       | 1-way In-Order
DSP/Network/Multimedia/Embedded | TI OMAP 4430   | 3 x 1-way = 3           | 3-way OoO + 8-way VLIW
                                | Tilera TILE64  | 64 x 1-way = 64         | 3-way VLIW
Multi-Core Design
Caches & On-Chip Interconnect
Multi-Core On-Chip RAM
• Caches vs. Local Memory:
  o IBM Cell: local memory per core (SPE) – a programming challenge
  o Most general-purpose processors: caches, not local memory
• Caches:
  o Shared, coherent caches vs. private caches
  o Physically unified vs. physically distributed
• Workload impact:
  o Throughput vs. shared-memory workloads (see the false-sharing sketch below)
  o Last-level capacity misses can dominate for some applications
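A minimal sketch of one way cache organization interacts with shared-memory workloads: false sharing. Two threads' adjacent counters land on the same (assumed 64-byte) cache line, so coherent caches ping-pong the line between cores; the padding shown avoids it (compile with -pthread):

```c
/* Two threads update their own counters; padding keeps each counter on
 * its own cache line so the coherence protocol never ping-pongs a line. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000

struct { long count; char pad[64 - sizeof(long)]; } counters[2]; /* padded */

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].count++;   /* without pad[], both counters would share
                                   one line: false sharing */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].count, counters[1].count);
    return 0;
}
```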
Multi-Core On-Chip RAM
IBM Power Series
• IBM Power5: shared 2MB on-chip L2; off-chip victim L3
• IBM Power6: private 4MB L2s; 32MB off-chip victim L3
• IBM Power7: on-chip L3, logically shared but physically distributed; not just a victim cache
Tilera TILE64: L2 logically shared(?), physically distributed
Multi-Core On-Chip Interconnect
• Typically crossbar or multistage
  o > 10 clocks to traverse from a core to the shared cache
• Area, power concerns:
  o Area equivalent of ~5 cores?
  o Power equivalent of ~2 cores?
• Examples:
  o Crossbar – SUN Niagara
  o Mesh – Tilera (see the hop-count sketch below)
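A minimal sketch of mesh latency under XY (dimension-ordered) routing, where hop count is simply Manhattan distance; the 8x8 size matches TILE64, but the routing details are assumptions for illustration:

```c
/* Hop count between two tiles in an SIDE x SIDE mesh with XY routing. */
#include <stdio.h>
#include <stdlib.h>

#define SIDE 8   /* 8 x 8 = 64 tiles, as in the TILE64 */

static int hops(int src, int dst) {
    int sx = src % SIDE, sy = src / SIDE;
    int dx = dst % SIDE, dy = dst / SIDE;
    return abs(sx - dx) + abs(sy - dy);   /* route X first, then Y */
}

int main(void) {
    printf("corner-to-corner: %d hops\n", hops(0, SIDE * SIDE - 1)); /* 14 */
    printf("neighbors       : %d hops\n", hops(0, 1));               /* 1  */
    return 0;
}
```

Worst-case distance grows as 2*(sqrt(N)-1) hops for N tiles, which is why mesh latency, unlike a crossbar's, depends on placement.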
Multi-Core Arch: Cache, Interconnect..
Processor      | # Cores x SMT = Threads | On-Chip Cache (last-level) | On-Chip Interconnect
IBM Power7     | 8 x 4-way = 32          | 32MB                       | Hi-Perf
Intel Core i7  | 8 x 2-way = 16          | 8MB                        | Point-to-Point
SUN Niagara T2 | 8 x 8-way = 64          | 4MB                        | Crossbar
AMD Phenom     | 4 x 1-way = 4           | 6MB                        | Point-to-Point
NVidia G200    | 240 x 1-way = 240       | 16KB x 30                  | --
TI OMAP 4430   | 3 x 1-way = 3           | --                         | Bus
Tilera TILE64  | 64 x 1-way = 64         | 4MB                        | Mesh
Putting it all together..
Example Multi-Cores:
• Tilera TILE-64
• IBM Power Series
Tilera TILE-64 [Ack: Tilera website]
Tilera TILE-64
• 8 x 8 grid of identical, general-purpose processor cores (tiles)
• 3-way VLIW pipeline for instruction-level parallelism
• 5 MB of on-chip cache
• Up to 443 billion operations per second (BOPS)
• 31 Tbps of on-chip mesh interconnect
• 500MHz – 866MHz operating frequency
• 15 – 22W @ 700MHz with all cores active
• Idle tiles can be put into a low-power sleep mode
• Four DDR2 memory controllers with optional ECC
• iLib™ APIs for efficient inter-tile communication
• ANSI standard C/C++ compiler
Tilera TILE-64 Core
IBM POWER5 / POWER6
• 2 Processor Cores
• 2-way SMT
• Private L2 (4 MB)
• Off-chip Victim L3
• L3 Controller
• Memory Controllers
• SMP Coherence Support
• Off-chip links: GX+ (chip-to-chip), MCM-to-MCM, SMP
IBM POWER7
• 8 Processor Cores
• Shared L3 (32MB)
• Interconnect
• Memory Controllers
• SMP Coherence Support
32 Threads; 32+ MB Cache; 100GB/s off-chip Memory-BW
Access latency ratios – L1: 1x, L2: 4x, local L3: 12x, remote L3: 60x, Memory: 180x (see the AMAT sketch below)
• Off-chip links: GX+ (chip-to-chip), MCM-to-MCM, SMP
[Hot Chips, Aug. 2009]
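A minimal sketch of what those latency ratios imply for average memory access time (AMAT = hit time + miss rate × next-level AMAT). Only the 1x/4x/12x/60x/180x ratios come from the slide; the per-level hit rates are invented for illustration:

```c
/* Fold the memory hierarchy inward to compute AMAT in units of L1 latency. */
#include <stdio.h>

int main(void) {
    double lat[] = {1, 4, 12, 60, 180};        /* L1, L2, local L3, L3, memory */
    double hit[] = {0.90, 0.80, 0.50, 0.70};   /* assumed per-level hit rates  */

    double amat = lat[4];                      /* memory always "hits" */
    for (int i = 3; i >= 0; i--)               /* innermost level last  */
        amat = lat[i] + (1.0 - hit[i]) * amat;

    printf("AMAT = %.1fx L1 latency\n", amat); /* ~2.8x with these rates */
    return 0;
}
```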
Summary
Multi-Cores are industry-wide dominant
o Core Count: trends from 10s to 100s to 1000s
o Core Complexity: Simple, In-Order to Hi-Perf, OoO
o Threads: SMT
o Caches: Multi-MB Embedded DRAM
o Networks: Buses, Crossbars, Point-to-Point
Architecture, Programming Grand Challenge!
Multi-Core Performance Modeling
Multi-Core Integrated Modeling Challenges
1. Many cores to simulate!
   o Trend: 10s to 100s to 1000s of cores!
   o The simulator itself needs to be parallelized/distributed
     ! single-thread simulation is 1000x–100,000x slower than native execution
2. Fine component-granularity modeling is needed:
   o Energy/power management is done at the level of individual functional units, or at finer grain
   o IBM Power7: independent switch-off capability for individual execution units, threads, and cores
Parallel Simulation Challenges
Threaded Applications
1. Memory References
   o Potentially, every memory reference conflicts with some other thread – a synchronization point for the parallel simulator
   o Memory references are frequent: at least ~20% of all instructions
2. Execution-Driven (vs. trace-driven) simulation:
   o Traces may be too long to store and re-process
   o Traces may not capture timing aspects of threaded workloads, e.g., OS effects
Example Parallel, Distr. Sim.: MIT Graphite
• Parallel (within a multicore) and distributed (across chips)
• Modular: each target core has its own simulator thread
• Synchronization: various slack schemes
• Accuracy: not cycle-accurate
  o Combines modeling, direct execution, loose synchronization, etc.
  o Error less than ~2%
• Performance: scales with the number of host cores
Graphite Architecture
Graphite: Synchronization Models
• Lax: synchronize only at application synchronization, messages, and thread events
  o Error: < 10%; indicates performance trends
  o Performance: best runtime and scalability
• LaxBarrier: synchronize at fixed "quanta"
  o Error: almost none
  o Performance: poor runtime, OK scaling
• LaxP2P: randomized point-to-point synchronization (see the sketch below)
  o Error: < 2%; performance: almost as good as Lax
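A minimal sketch of the LaxP2P idea: each simulator thread advances its own target clock and occasionally stalls if it has run too far ahead of a randomly chosen peer. All constants and structure here are illustrative assumptions; see the Graphite paper for the real mechanism (compile with -pthread):

```c
/* LaxP2P-style slack synchronization among per-core simulator threads. */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NCORES 4
#define SLACK  1000      /* max cycles a thread may lead a random peer */
#define END    100000

static atomic_long clocks[NCORES];

static void *sim_core(void *arg) {
    long me = (long)arg;
    unsigned seed = (unsigned)me;
    while (atomic_load(&clocks[me]) < END) {
        atomic_fetch_add(&clocks[me], 1);             /* "simulate" one cycle */
        if (atomic_load(&clocks[me]) % 256 == 0) {    /* periodic peer check */
            long peer = rand_r(&seed) % NCORES;
            while (atomic_load(&clocks[me]) -
                   atomic_load(&clocks[peer]) > SLACK)
                sched_yield();           /* too far ahead: let peer catch up */
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t[NCORES];
    for (long i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, sim_core, (void *)i);
    for (int i = 0; i < NCORES; i++)
        pthread_join(t[i], NULL);
    printf("all cores reached cycle %d\n", END);
    return 0;
}
```

The slowest thread never waits, so the scheme makes progress; error stays bounded because no thread can lead any sampled peer by more than SLACK cycles.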
Summary
• Multi-Cores are industry-wide dominant
  o Core Count: trends from 10s to 100s to 1000s
  o Core Complexity: simple in-order to high-performance out-of-order
  o Threads: SMT
  o Caches: multi-MB, embedded DRAM
  o Networks: buses, crossbars, point-to-point
• Modeling needs to leverage existing multi-cores
  o Sequential simulation is too slow: 1000x–100,000x slowdown
  o Fine-granularity modeling is useful
References
• G. Blake, R. Dreslinski, T. Mudge, "A Survey of Multicore Processors", IEEE Signal Processing Magazine, Nov. 2009
• J. E. Miller, et al., "Graphite: A Distributed, Parallel Simulator for Multicores", MIT CSAIL Technical Report 2009-056, Nov. 2009
• K. Asanovic, et al., "A View of the Parallel Computing Landscape", Communications of the ACM, Oct. 2009