
Can we Systematically Evaluate and Exploit Heterogeneous Accelerators?

A 10x10 Perspective

Andrew A. Chien, Dept of Computer Science, University of Chicago

MCS, Argonne National Laboratory

SAAHPC Keynote, July 11, 2012

Outline
•  The Future is Heterogeneous
•  Accelerators in Perspective
•  Towards Systematic Accelerator Evaluation
•  10x10: Systematic Heterogeneous Architecture
•  Summary and Futures

July 11, 2012 © Andrew A. Chien, 2012


The Future is Heterogeneous


Heterogeneous Supercomputers
•  Tianhe-1 (NUDT, Nov 2010)
   o  5 PF
   o  14,336 Xeons + 7,168 Teslas
•  Titan (ORNL, fall 2012)
   o  19K AMD CPUs + 960 GPUs
   o  Grow to 20 PF in fall?
   o  ~20 PF / 2 TF => 10K Nvidia GPUs (Kepler?)
•  Blue Waters (NCSA, late 2012)
   o  11.5 PF, 1.5 PB
   o  49K AMD CPUs (380K cores)
   o  3K Nvidia GPUs (Kepler)


Heterogeneity Dominates

•  Heterogeneity is growing dramatically – on single chips, in systems, and in high-volume deployment
   o  Sandy Bridge / Fusion / Denver, Tegra 2 / OMAP / A5
•  Platforms heterogeneous in architecture and implementation will be the dominant computing platform of the future

[Figure: by 2015, smart phones, laptops and tablets, desktops, and servers move from homogeneous, to some heterogeneity (2x), to extreme heterogeneity (10x)]

Exploding Diversity

•  Highly competitive markets, many without a dominant leader
•  Smart phone market: highly fragmented – and diverse
•  Laptop and tablet market: fragmenting?

[Figure: 2015 vendor landscape – smart phones: Marvell (RIM), TI OMAP, Apple, Qualcomm, Mediatek; laptops and tablets: Intel, AMD, Apple, Nvidia, Qualcomm, TI, …]


Accelerators in Perspective


Accelerators in HPC Systems

•  Waning Moore’s Law
   o  Energy-limited, data-movement limited [Borkar & Chien, CACM May 2011]
•  Base vs. Base+Acc vs. Ratioed
   o  Performance, coupling, capacity
•  Cost: compute chips, total energy, compute/cu. ft., price
•  DIFF: delivery in whole compute chips


[Diagrams: base nodes (CPU/APU only); base + accelerator (CPU + GPU over PCI); ratioed configurations (CPU with multiple GPUs over PCI)]


How Accelerators Deliver Performance

•  Location: path-oriented accelerators (flow and offload), NIC offload, PIM
•  Special resources: high-performance memory (e.g. GDDR, Convey)
•  Customization: specialized logic and dense packaging/coupling
•  Assumption: regular, replicated organization, scaled to thousands or millions
•  Challenges: programming, specialization, integration


Programming
•  Porting effort? (software architecture, algorithms)
•  Performance attainable?
•  The fast road?
•  ... or the road to nowhere?
•  ... How long is good enough?


Programming

•  Critical: Avoiding Disaster!!

Specialization
•  “Everyone uses only 10% of the functionality; the only trouble is it’s a different 10% for everyone.”
•  (image, character, graphics, floating point) Embedded, smartphone, laptop, server processors
•  (parallel) Multithreaded applications?
•  (floating point) DOE scientific applications, mini-apps, and PETSc – 20–30% of operation count
•  Architect: what to specialize and how to expose?
•  Software architect: what abstractions? (datatype, representation, movement) What interfaces and partitions?
•  => see 10x10


Integration
•  Future programming is about orchestrating data movement, not operations. Data movement dominates energy consumption.
   – DARPA Exascale Software report (2009)
•  Parallel computing – horizontal, internode
•  Exascale computing – vertical and horizontal, internode and intranode (memory and accelerator hierarchy)
•  Not just about computing, but the relationship of compute units to memory and to each other (and to the network)


Accelerator Integration

•  Shared Nothing, Asymmetric

•  Shared Memory, Symmetric

•  Shared Memory, Internal Customization, Symmetric


[Diagrams: shared-nothing asymmetric (CPU + Acc over PCI); shared-memory symmetric (CPU/Acc chips); shared memory with internal customization, symmetric]


What’s a Programmer to do?

Towards Systematic Accelerator Evaluation


OmniBench: Systematic Evaluation of Accelerators

•  Objective: neutral evaluation of performance
•  Idea: benchmark with codes designed for an accelerator... but not “exactly this” accelerator
•  Software complexity is the key driver
•  “The community can tune for 1, but not for dozens”


Omnibench Experiment
•  Challenging kernel programs
   o  SGEMM, SpMV, BFS, FFT
•  Standard interface – OpenCL 1.2
   o  Simple model: CPU + ACC (a minimal timing sketch follows this list)
•  Range of heterogeneous platforms
   o  Vary compute capabilities
   o  Vary special resources
   o  Vary integration and memory hierarchy approaches
   o  Range of cost/power levels
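The slides do not include the Omnibench source, but the measurement they describe – time an OpenCL kernel with and without counting host-device transfers – can be sketched roughly as below. This is a minimal illustration, not the Omnibench harness; the kernel, matrix size, and timing scheme are my own.

import time
import numpy as np
import pyopencl as cl

N = 1024
a = np.random.rand(N, N).astype(np.float32)
b = np.random.rand(N, N).astype(np.float32)
c = np.empty_like(a)

ctx = cl.create_some_context()      # choose the CPU or accelerator device here
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
__kernel void sgemm_naive(__global const float *A, __global const float *B,
                          __global float *C, const int n) {
    int i = get_global_id(0), j = get_global_id(1);
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[i*n + k] * B[k*n + j];
    C[i*n + j] = acc;
}
"""
prg = cl.Program(ctx, src).build()

t0 = time.perf_counter()
d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)   # host -> device
d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
d_c = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)
t1 = time.perf_counter()
prg.sgemm_naive(queue, (N, N), None, d_a, d_b, d_c, np.int32(N))   # kernel only
queue.finish()
t2 = time.perf_counter()
cl.enqueue_copy(queue, c, d_c)                                     # device -> host
queue.finish()
t3 = time.perf_counter()

flops = 2.0 * N ** 3
print("kernel-only GFLOP/s:     %.1f" % (flops / (t2 - t1) / 1e9))
print("incl. transfers GFLOP/s: %.1f" % (flops / (t3 - t0) / 1e9))

Comparing the two rates is what exposes the integration effects highlighted on the later slides.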


Heterogeneous CPU-GPU Systems

•  Ivy Bridge: Intel Core i5 3570K
   o  4 CPU cores (3.4 GHz)
   o  64 graphics cores (1.15 GHz)
   o  6 MB LLC, shared by CPU and integrated graphics
   o  Dual-channel DDR3 memory, 25.6 GB/s
   o  77 W, 22 nm, 216 mm^2, 1.4 billion transistors
•  APU: AMD A8-3850
   o  4 CPU cores (2.9 GHz)
   o  400 graphics cores (0.6 GHz)
   o  4 MB LLC, dedicated to the CPU cores
   o  Dual-channel DDR3 memory, 29.9 GB/s
   o  100 W, 32 nm, 228 mm^2, 1.45 billion transistors
•  Tesla: NVIDIA Tesla C2075
   o  448 cores = 14 multiprocessors x 32 (1.15 GHz)
   o  768 KB LLC, 64 KB shared memory/multiprocessor
   o  Private GDDR5 memory, 144 GB/s
   o  225 W, 40 nm, 520 mm^2, 3 billion transistors
   o  CPU-GPU link: PCI Express x16 Gen 2, 8 GB/s

[Diagrams: integrated CPU/ACC chips vs. discrete CPU + GPU over PCI]


One-sided Performance (SGEMM)


Simple Performance (SGEMM)


Self-normalized Accessible Performance (SGEMM)

Self-normalized (integration)

•  Integration (data transfer and computation)

•  Fraction of “Peak” Performance

•  Fraction of Accessible Performance

Relative Accessible Performance (SGEMM)


•  Same terms
•  Normalized to performance of the fastest accelerator (sketched below)
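A hedged reading of these three normalizations, written as plain functions; the names and argument conventions are mine, not from the talk.

def fraction_of_peak(measured_gflops, peak_gflops):
    # how much of the device's theoretical peak the kernel reaches
    return measured_gflops / peak_gflops

def fraction_of_accessible(with_transfer_gflops, kernel_only_gflops):
    # how much of the kernel-only rate survives once data movement is counted
    return with_transfer_gflops / kernel_only_gflops

def relative_accessible(accessible_gflops_by_device):
    # normalize every device's accessible performance to the fastest accelerator
    best = max(accessible_gflops_by_device.values())
    return {dev: g / best for dev, g in accessible_gflops_by_device.items()}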


Self-normalized Accessible Performance (BFS)

Highlighting Integration

Relative Accessible Performance (BFS)


Observations
•  Data home location has significant impact on accessible performance, and should be captured in benchmarking
•  Organizational differences are not highlighted by compute-intensive applications, but are exposed clearly by memory-intensive ones
•  Data movement management is problematic in current integrated CPU-GPU systems (sw/hw)
•  Performance of discrete accelerators dominates on compute-intensive, but not on memory-intensive workloads (even w/o equal chip resources)


Related and Future Work
•  Related work
   o  Accelerator benchmarking: CUDA, OpenCL benchmarks (Rodinia, SHOC, ...)
   o  Extensive performance modeling
•  Future work
   o  Additional platforms – configs, types, variations
   o  Improved software (always): drivers, memory hierarchy, compilers
   o  Higher-level software interfaces: beyond OpenCL? OpenACC, ??
   o  Larger systems: larger nodes (e.g. 2 hybrid vs. CPU+GPU), parallel (multi-node) systems


10x10: Systematic Heterogeneity


Three Paths Forward

[Figure: performance vs. time – Dennard scaling gives way to energy-limited scaling; three paths forward: Big Core (10’s), Small Core (100’s), Heterogeneous (incl. Hybrid)]
[Borkar and Chien, “Technology Scaling Creates New Landscape for Computer Architecture”, Communications of the ACM, May 2011]


Path #3: Customize, Scale Up

•  Customize: a collection of custom tools forms a core
   o  Each designed for a narrow domain – high performance and energy efficient
   o  Tool domains complement each other to cover the general-purpose space
•  Separation maximizes energy efficiency
   o  Layout density, isolation
   o  Exercise one/few tools at a time
•  Challenges: programmability, code portability, design effort, architecture, Si utilization


Examples: SoC & Integrated GPU

•  Apple’s A5

•  Nvidia’s Tegra 2 and 3

•  Intel Ivy Bridge

•  AMD Fusion (Ontario)


What’s WRONG with these chips?

Not very programmable...


10x10 Framework Enables Systematic Exploitation of Heterogeneity

[Table: micro-engine workload coverage, micro-engine energy efficiency, and overall workload energy efficiency compared across tight clusters, loose clusters, and no clusters (general-purpose)]


10x10 = Federated Heterogeneity

[Diagram: a traditional core (single CPU with L1 instruction and data caches) vs. a 10x10 core (a basic RISC CPU plus micro-engines #2–#6, one marked <tbd>, each with its own I-cache, all sharing an L1 data cache)]


Traditional Optimization: 90/10 Paradigm

•  Workloads: analyze and derive common cases (90%)
•  Invent architecture features and implementation optimizations with broad impact (90%)
•  Improve performance by adding optimizations
•  Aggregation and efficiency: 8080 had 80 instructions => Sandy Bridge has 500+ instructions
[Diagram: workloads => abstracted “common” cases (“ILP”, “reuse locality”, “linear access”, “bit-field opns”, “branch patterns”) => optimizations (“pipelining”, “superscalar”, “caches & blocks”, “multimedia”, “branch pred”)]
Amdahl’s Law; H&P’s Computer Architecture: A Quantitative Approach

10x10 Optimization Paradigm

•  Identify 10 application clusters, compute structures, datatypes (focus on 10 distinct bins)
•  Optimize the architecture and implementation of each separately (improve energy-delay product by 10-100x)
•  Compose together, sharing memory hierarchy and interconnect (preserve the benefits of customization)
[Flow: “Workload” (7 idioms, 29 SPEC, 13 dwarves, 11 NPB) => factor into 10 bins => micro-engine per bin => compose]


The Big Picture: 10x10
•  Spectrum of energy efficiency vs. programmability
•  ASICs, SoC, GPU, parallel CPU, CPU
•  Where are we going? Overlap, dominate?
•  The answer is deeply a hardware and software question
   o  Waning days of Moore’s Law, end of Moore’s Law, success of near-threshold and device-scaling heroics
   o  Software translation technology for cross-compilation, transformation and optimization, higher-level programming

[Figure: energy efficiency (EE, Ops/J @ fixed process) vs. programmability/portability – Core, + features, + M-core, + GPU + M-core, GPU, SoC / IP Accel, ASIC; 10x10 aims toward the ideal compute chip]


10x10 Workload Clustering
•  Challenges
   o  How to cluster? (try LOTS of things)
   o  How many for good coverage?
   o  How much benefit?
•  Broad set of workloads (34 total, varied)
   o  UHPC Challenge Problems (5) “Super”
      •  Streaming sensor, chess, graph, MD, shock hydro
      •  DARPA “Extreme Computing”
   o  PARSEC (12) “PC”
      •  Data mining, vision, financial, genetic, physics, …
   o  Embedded benchmarks (10) “Mobile, IoT”
      •  Image, crypto, coding, signal processing
   o  BioBench (7) “Data mining”
      •  Alignment, assembly, phylogeny, database search


What Characteristics Matter?

•  Where the time goes
   o  Focus on important sections – >90% coverage from each application
•  Architecturally significant features
   o  Cluster based on like requirements
   o  Supports sharing of customization
•  Two feature vectors (see the clustering sketch below)
   o  Low resolution: (Datatype x Size)
   o  High resolution: (Datatype x Operation x Size)

[Flow: codes/benchmarks => dynamic profiles (loops, operations, memory) => vector clustering => clustered regions]
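The clustering step itself is not spelled out on the slides; a rough sketch, assuming a k-means-style grouping of per-region instruction-mix vectors. The bins and numbers below are illustrative, not the paper's data.

import numpy as np

# one row per hot code region: fraction of dynamic instructions falling in each
# (Datatype x Size) bin -- the low-resolution feature vector; columns here are
# illustrative, e.g. INT 4B, FLT 8B, REG_XFER 8B
features = np.array([
    [0.70, 0.05, 0.25],   # integer-dominated region
    [0.10, 0.80, 0.10],   # double-precision FP region
    [0.65, 0.10, 0.25],   # another integer-heavy region
    [0.15, 0.75, 0.10],   # another FP region
])

def kmeans(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign each region to its nearest centroid
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of its members
        centers = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

print(kmeans(features, k=2))  # regions with similar mixes share a cluster and
                              # hence could share a customized micro-engine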

Low Res Clusters (8)

[Chart: cluster composition over code regions; legend (Operation x Datatype): BR 8B, INT 8B, REG_XFER 8B, INT 4B, REG_XFER 4B, REG_XFER 16B, FLT 8B, FLT 16B, OTHER 4B, FLT 4B, OTHER <1B, OTHER 2B, INT 16B, OTHER 8B, INT 1B, INT 2B, REG_XFER 1B, REG_XFER 2B]


•  Width is “hot region” count; ordered by dynamic weight
•  Legend is Operation x Datatype
•  #1 integer; #2-5 FP single, double, vector
•  Much simpler, cleaner clustering...
•  8 clusters (100%)


Low Res Clusters (32)

[Chart: cluster composition over code regions; same Operation x Datatype legend as above]


•  #1 => #1, 2, 3, 5, 8 integer split by size; #4, 6, 7, ... FP
•  Very similar clusters (tight)...
•  8 clusters (70%), 16 clusters (85%), 32 clusters (100%)

Low Res Clusters (128)

[Chart: cluster composition over code regions; same Operation x Datatype legend as above]


•  Essentially homogeneous clusters (very tight)...
•  8 clusters (50%), 16 clusters (70%), 32 clusters (80%), 128 clusters (100%)


Cluster Insights
•  Clusters draw from across the workloads – not in any obvious “application domain” structure
•  Clusters reflect a wide variety of different computational needs that correlate with architecture structure
   o  Call- and branch-intensive
   o  32-bit integer oriented
   o  Bit/byte oriented
   o  Mixed 32- and 64-bit oriented
   o  Single-precision floating point
   o  … and so on…
•  Clusters separate cleanly (overpartition); ample opportunities for customization and energy efficiency


Benefit Models
•  Specialization: fraction of instructions unneeded
•  Interpolation from Nehalem to DP-float energy
•  (a sketch of the model shapes follows the plot below)


[Plot: energy efficiency improvement (x), 0-30, vs. fraction of unimplemented opcodes, 0-1, for square-root, linear, quadratic, and cubic benefit models]
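A hedged reconstruction of the four model shapes in the plot (my assumption, not the talk's exact formula): improvement grows from 1x when nothing is removed toward some ceiling E_MAX as the fraction f of unimplemented opcodes approaches 1, with square-root, linear, quadratic, or cubic shape.

E_MAX = 30.0   # illustrative ceiling, roughly the top of the plotted axis

def benefit(f, shape="linear"):
    # f: fraction of opcodes the micro-engine does not implement (0..1)
    curve = {"sqrt": f ** 0.5, "linear": f, "quadratic": f ** 2, "cubic": f ** 3}[shape]
    return 1.0 + (E_MAX - 1.0) * curve

for shape in ("sqrt", "linear", "quadratic", "cubic"):
    print(shape, [round(benefit(f, shape), 1) for f in (0.0, 0.5, 0.9)])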


Weighted Benefit vs. Benefit Model

[Plot: average benefit (x), 0-30, under the square-root, linear, quadratic, and cubic benefit models]


Weighted Benefit (linear) vs. # Cores

[Plot: weighted benefit (x), 0-12, vs. number of cores (1-64), under the linear benefit model, for high-resolution (hr) and low-resolution (lr) clusterings with 8, 16, 32, 64, and 128 clusters]
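How the per-cluster benefits roll up into the weighted benefit plotted here is not given on the slides; one Amdahl-style reading (my assumption): with n micro-engines, the n highest-weight clusters get their specialized benefit and the rest run at 1x, so the overall benefit is the reciprocal of the weighted energy. The weights and benefits below are illustrative.

def overall_benefit(cluster_weights, cluster_benefits, n_engines):
    # cluster_weights: fraction of dynamic execution per cluster (sums to 1)
    # cluster_benefits: per-cluster efficiency improvement (x) when customized
    ranked = sorted(zip(cluster_weights, cluster_benefits), reverse=True)
    energy = sum(w / b if i < n_engines else w
                 for i, (w, b) in enumerate(ranked))
    return 1.0 / energy

weights  = [0.30, 0.25, 0.20, 0.15, 0.10]   # illustrative cluster weights
benefits = [8.0, 6.0, 5.0, 4.0, 3.0]        # illustrative per-cluster benefits
for n in (1, 3, 5):
    print(n, "engines:", round(overall_benefit(weights, benefits, n), 2), "x")

Under this reading the curve saturates as engines are added, which matches the qualitative shape of the plot.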


Related Work
•  System on Chip (SoC, SoP, 3D, etc.) [CE products]
   o  Rapid system integration, not architectural design. Less stable; discontinuous change, partitioned software.
   o  6 months to α-silicon, 6 months to product in market
•  CPU + reconfigurable hardware (FPGAs, LUTs, adders, etc.)
   o  Convey HC-1, Sankaralingam11, Xilinx Zynq [EPP]
   o  Advantages: flexibility
   o  Disadvantages: lose customized implementation, speed, energy efficiency
•  Hybrid computing (CPU-GPU, APU, GenX…)
   o  Advantages: silicon today
   o  Disadvantages: programmability, 1-way hetero, cost, energy efficiency
•  Low-level programmability and heterogeneity
   o  QSCores/GreenDroid: super instructions; Khan11 [morphing], Wu11 [VM-based, single ISA]
   o  Advantage: doesn’t require much software support
   o  Disadvantage: local impact
•  Build “chip generators”, not chips
   o  Horowitz; customization and closed systems
   o  Custom for everything: programmability?


Summary and Perspective
•  Heterogeneity is endemic, and a basic source of efficiency (local)
•  We need integrated assessment – programmable, usable, delivered performance (demand it)
•  Omnibench – uniform, systematic assessment of accessible performance for a diverse accelerator future
•  10x10 – federated heterogeneous architecture based on systematic optimization of energy efficiency (major benefit)
•  Prepare wisely for a heterogeneous future!


More Information
•  Papers
   o  The Future of Microprocessors. Communications of the ACM 54(5): 67-77 (2011). [Borkar & Chien]
   o  10x10: A General-Purpose Architectural Approach to Heterogeneity and Energy Efficiency. Procedia CS 4: 1987-1996 (2011). [Chien, Snavely, Gahagan]
   o  10x10: Taming Heterogeneity for General-Purpose Architecture. 2nd Workshop on New Directions in Computer Architecture, June 2011, held at ISCA-38. [Chien]
   o  Systematic Evaluation of Workload Clustering for Designing Heterogeneous, General-Purpose Architectures. UChicago Tech Report, 2012. [A. Guha, A. Chien]
   o  An Empirical Foundation for Heterogeneity: Clustering Applications by Computation and Memory Behavior. UChicago Tech Report, 2011. [A. Guha, P. Cicotti, A. Snavely, A. Chien]
•  Acknowledgements
   o  Apala Guha, Yao Zhang, Mark Sinclair
   o  Allan Snavely, Pietro Cicotti, Mark Gahagan
   o  Insightful feedback from Shekhar Borkar (Intel) and Bill Harrod (DARPA)
   o  Supported by the National Science Foundation under NSF Grant OCI-1057921 and DARPA MTO


Questions?
