
Can we Systematically Evaluate and Exploit Heterogeneous Accelerators?

A 10x10 Perspective

Andrew A. Chien, Dept of Computer Science, University of Chicago

MCS, Argonne National Laboratory

SAAHPC Keynote, July 11, 2012

Outline
•  The Future is Heterogeneous
•  Accelerators in Perspective
•  Towards Systematic Accelerator Evaluation
•  10x10: Systematic Heterogeneous Architecture
•  Summary and Futures

July 11, 2012 © Andrew A. Chien, 2012


The Future is Heterogeneous


Heterogeneous Supercomputers
•  Tianhe-1 (NUDT, Nov 2010)
   o  5 PF
   o  14,336 Xeons + 7,168 Teslas
•  Titan (ORNL, fall 2012)
   o  19K AMD CPUs + 960 GPUs
   o  Grow to 20 PF in fall?
   o  ~20 PF / 2 TF => 10K Nvidia GPUs (Kepler?)
•  Blue Waters (NCSA, late 2012)
   o  11.5 PF, 1.5 PB
   o  49K AMD CPUs (380K cores)
   o  3K Nvidia GPUs (Kepler)


Heterogeneity Dominates

•  Heterogeneity is growing dramatically – on single chips, in systems, and in high-volume deployment
   o  Sandy Bridge / Fusion / Denver, Tegra 2 / OMAP / A5
•  Platforms heterogeneous in architecture and implementation will be the dominant computing platform of the future

[Figure: by 2015, smart phones, laptops and tablets, desktops, and servers move from homogeneous, to some heterogeneity (2x), to extreme heterogeneity (10x)]

Exploding Diversity

•  Highly competitive markets, many without a dominant leader
•  Smart phone market: highly fragmented – and diverse
•  Laptop and tablet market: fragmenting?

[Figure: 2015 vendor landscape – smart phones: Marvell (RIM), TI OMAP, Apple, Qualcomm, Mediatek; laptops and tablets: Intel, AMD, Apple, Nvidia, Qualcomm, TI, …]


Accelerators in Perspective


Accelerators in HPC Systems

•  Waning Moore’s Law
   o  Energy-limited, data-movement limited [Borkar & Chien, CACM May 2011]
•  Base vs. Base+Acc vs. Ratioed
   o  Performance, coupling, capacity
•  Cost: compute chips, total energy, compute/cu. ft., price
•  DIFF: delivery in whole compute chips


[Diagrams: base nodes (CPU/APU only); base + accelerator (CPU + GPU over PCI); ratioed configurations (CPU with multiple GPUs over PCI)]


How Accelerators Deliver Performance

•  Location: path-oriented accelerators (flow and offload), NIC offload, PIM
•  Special resources: high-performance memory (e.g. GDDR, Convey)
•  Customization: specialized logic and dense packaging/coupling
•  Assumption: regular, replicated organization, scaled to thousands or millions
•  Challenges: programming, specialization, integration


Programming
•  Porting effort? (software architecture, algorithms)
•  Performance attainable?
•  The fast road?
•  ... or the road to nowhere?
•  ... How long is good enough?


Programming

•  Critical: Avoiding Disaster!!

Specialization
•  “Everyone uses only 10% of the functionality; the only trouble is it’s a different 10% for everyone.”
•  (image, character, graphics, floating point) Embedded, smartphone, laptop, server processors
•  (parallel) Multithreaded applications?
•  (floating point) DOE scientific applications, mini-apps, and PETSc – 20–30% of operation count
•  Architect: what to specialize and how to expose?
•  Software architect: what abstractions? (datatype, representation, movement) What interfaces and partitions?
•  => see 10x10


Integration
•  Future programming is about orchestrating data movement, not operations. Data movement dominates energy consumption.
   – DARPA Exascale Software report (2009)
•  Parallel computing – horizontal, internode
•  Exascale computing – vertical and horizontal, internode and intranode (memory and accelerator hierarchy)
•  Not just about computing, but the relationship of compute units to memory and to each other (and to the network)


Accelerator Integration

•  Shared Nothing, Asymmetric

•  Shared Memory, Symmetric

•  Shared Memory, Internal Customization, Symmetric


[Diagrams: shared-nothing asymmetric (CPU + Acc over PCI); shared-memory symmetric (CPU/Acc chips); shared memory with internal customization, symmetric]


What’s a Programmer to do?

Towards Systematic Accelerator Evaluation


OmniBench: Systematic Evaluation of Accelerators

•  Objective: neutral evaluation of performance
•  Idea: benchmark with codes designed for an accelerator... but not “exactly this” accelerator
•  Software complexity is the key driver
•  “The community can tune for 1, but not for dozens”


Omnibench Experiment
•  Challenging kernel programs
   o  SGEMM, SpMV, BFS, FFT
•  Standard interface – OpenCL 1.2
   o  Simple model: CPU + ACC (a minimal timing sketch follows this list)
•  Range of heterogeneous platforms
   o  Vary compute capabilities
   o  Vary special resources
   o  Vary integration and memory hierarchy approaches
   o  Range of cost/power levels
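The slides do not include the Omnibench source, but the measurement they describe – time an OpenCL kernel with and without counting host-device transfers – can be sketched roughly as below. This is a minimal illustration, not the Omnibench harness; the kernel, matrix size, and timing scheme are my own.

import time
import numpy as np
import pyopencl as cl

N = 1024
a = np.random.rand(N, N).astype(np.float32)
b = np.random.rand(N, N).astype(np.float32)
c = np.empty_like(a)

ctx = cl.create_some_context()      # choose the CPU or accelerator device here
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
__kernel void sgemm_naive(__global const float *A, __global const float *B,
                          __global float *C, const int n) {
    int i = get_global_id(0), j = get_global_id(1);
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[i*n + k] * B[k*n + j];
    C[i*n + j] = acc;
}
"""
prg = cl.Program(ctx, src).build()

t0 = time.perf_counter()
d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)   # host -> device
d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
d_c = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)
t1 = time.perf_counter()
prg.sgemm_naive(queue, (N, N), None, d_a, d_b, d_c, np.int32(N))   # kernel only
queue.finish()
t2 = time.perf_counter()
cl.enqueue_copy(queue, c, d_c)                                     # device -> host
queue.finish()
t3 = time.perf_counter()

flops = 2.0 * N ** 3
print("kernel-only GFLOP/s:     %.1f" % (flops / (t2 - t1) / 1e9))
print("incl. transfers GFLOP/s: %.1f" % (flops / (t3 - t0) / 1e9))

Comparing the two rates is what exposes the integration effects highlighted on the later slides.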


Heterogeneous CPU-GPU Systems

•  Ivy Bridge: Intel Core i5 3570K
   o  4 CPU cores (3.4 GHz)
   o  64 graphics cores (1.15 GHz)
   o  6 MB LLC, shared by CPU and integrated graphics
   o  Dual-channel DDR3 memory, 25.6 GB/s
   o  77 W, 22 nm, 216 mm^2, 1.4 billion transistors
•  APU: AMD A8-3850
   o  4 CPU cores (2.9 GHz)
   o  400 graphics cores (0.6 GHz)
   o  4 MB LLC, dedicated to the CPU cores
   o  Dual-channel DDR3 memory, 29.9 GB/s
   o  100 W, 32 nm, 228 mm^2, 1.45 billion transistors
•  Tesla: NVIDIA Tesla C2075
   o  448 cores = 14 multiprocessors x 32 (1.15 GHz)
   o  768 KB LLC, 64 KB shared memory/multiprocessor
   o  Private GDDR5 memory, 144 GB/s
   o  225 W, 40 nm, 520 mm^2, 3 billion transistors
   o  CPU-GPU link: PCI Express x16 Gen 2, 8 GB/s

[Diagrams: integrated CPU/ACC chips vs. discrete CPU + GPU over PCI]


One-sided Performance (SGEMM)


Simple Performance (SGEMM)


Self-normalized Accessible Performance (SGEMM)

Self-normalized (integration)

•  Integration (data transfer and computation)

•  Fraction of “Peak” Performance

•  Fraction of Accessible Performance

Relative Accessible Performance (SGEMM)


•  Same terms
•  Normalized to performance of the fastest accelerator (sketched below)
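A hedged reading of these three normalizations, written as plain functions; the names and argument conventions are mine, not from the talk.

def fraction_of_peak(measured_gflops, peak_gflops):
    # how much of the device's theoretical peak the kernel reaches
    return measured_gflops / peak_gflops

def fraction_of_accessible(with_transfer_gflops, kernel_only_gflops):
    # how much of the kernel-only rate survives once data movement is counted
    return with_transfer_gflops / kernel_only_gflops

def relative_accessible(accessible_gflops_by_device):
    # normalize every device's accessible performance to the fastest accelerator
    best = max(accessible_gflops_by_device.values())
    return {dev: g / best for dev, g in accessible_gflops_by_device.items()}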


Self-normalized Accessible Performance (BFS)

Highlighting Integration

Relative Accessible Performance (BFS)


Observations
•  Data home location has significant impact on accessible performance, and should be captured in benchmarking
•  Organizational differences are not highlighted by compute-intensive applications, but are exposed clearly by memory-intensive ones
•  Data movement management is problematic in current integrated CPU-GPU systems (sw/hw)
•  Performance of discrete accelerators dominates on compute-intensive, but not on memory-intensive workloads (even w/o equal chip resources)


Related and Future Work
•  Related work
   o  Accelerator benchmarking: CUDA, OpenCL benchmarks (Rodinia, SHOC, ...)
   o  Extensive performance modeling
•  Future work
   o  Additional platforms – configs, types, variations
   o  Improved software (always): drivers, memory hierarchy, compilers
   o  Higher-level software interfaces: beyond OpenCL? OpenACC, ??
   o  Larger systems: larger nodes (e.g. 2 hybrid vs. CPU+GPU), parallel (multi-node) systems


10x10: Systematic Heterogeneity


Three Paths Forward

[Figure: performance vs. time – Dennard scaling gives way to energy-limited scaling; three paths forward: Big Core (10’s), Small Core (100’s), Heterogeneous (incl. Hybrid)]
[Borkar and Chien, “Technology Scaling Creates New Landscape for Computer Architecture”, Communications of the ACM, May 2011]


Path #3: Customize, Scale Up

•  Customize: a collection of custom tools forms a core
   o  Each designed for a narrow domain – high performance and energy efficient
   o  Tool domains complement each other to cover the general-purpose space
•  Separation maximizes energy efficiency
   o  Layout density, isolation
   o  Exercise one/few tools at a time
•  Challenges: programmability, code portability, design effort, architecture, Si utilization


Examples: SoC & Integrated GPU

•  Apple’s A5

•  Nvidia’s Tegra 2 and 3

•  Intel Ivy Bridge

•  AMD Fusion (Ontario)


What’s WRONG with these chips?

Not very programmable...


10x10 Framework Enables Systematic Exploitation of Heterogeneity

[Table: micro-engine workload coverage, micro-engine energy efficiency, and overall workload energy efficiency compared across tight clusters, loose clusters, and no clusters (general-purpose)]


10x10 = Federated Heterogeneity

[Diagram: a traditional core (single CPU with L1 instruction and data caches) vs. a 10x10 core (a basic RISC CPU plus micro-engines #2–#6, one marked <tbd>, each with its own I-cache, all sharing an L1 data cache)]


Traditional Optimization: 90/10 Paradigm

•  Workloads: analyze and derive common cases (90%)
•  Invent architecture features and implementation optimizations with broad impact (90%)
•  Improve performance by adding optimizations
•  Aggregation and efficiency: 8080 had 80 instructions => Sandy Bridge has 500+ instructions
[Diagram: workloads => abstracted “common” cases (“ILP”, “reuse locality”, “linear access”, “bit-field opns”, “branch patterns”) => optimizations (“pipelining”, “superscalar”, “caches & blocks”, “multimedia”, “branch pred”)]
Amdahl’s Law; H&P’s Computer Architecture: A Quantitative Approach

10x10 Optimization Paradigm

•  Identify 10 application clusters, compute structures, datatypes (focus on 10 distinct bins)
•  Optimize the architecture and implementation of each separately (improve energy-delay product by 10-100x)
•  Compose together, sharing memory hierarchy and interconnect (preserve the benefits of customization)
[Flow: “Workload” (7 idioms, 29 SPEC, 13 dwarves, 11 NPB) => factor into 10 bins => micro-engine per bin => compose]


The Big Picture: 10x10
•  Spectrum of energy efficiency vs. programmability
•  ASICs, SoC, GPU, parallel CPU, CPU
•  Where are we going? Overlap, dominate?
•  The answer is deeply a hardware and software question
   o  Waning days of Moore’s Law, end of Moore’s Law, success of near-threshold and device-scaling heroics
   o  Software translation technology for cross-compilation, transformation and optimization, higher-level programming

[Figure: energy efficiency (EE, Ops/J @ fixed process) vs. programmability/portability – Core, + features, + M-core, + GPU + M-core, GPU, SoC / IP Accel, ASIC; 10x10 aims toward the ideal compute chip]


10x10 Workload Clustering
•  Challenges
   o  How to cluster? (try LOTS of things)
   o  How many for good coverage?
   o  How much benefit?
•  Broad set of workloads (34 total, varied)
   o  UHPC Challenge Problems (5) “Super”
      •  Streaming sensor, chess, graph, MD, shock hydro
      •  DARPA “Extreme Computing”
   o  PARSEC (12) “PC”
      •  Data mining, vision, financial, genetic, physics, …
   o  Embedded benchmarks (10) “Mobile, IoT”
      •  Image, crypto, coding, signal processing
   o  BioBench (7) “Data mining”
      •  Alignment, assembly, phylogeny, database search


What Characteristics Matter?

•  Where the time goes
   o  Focus on important sections – >90% coverage from each application
•  Architecturally significant features
   o  Cluster based on like requirements
   o  Supports sharing of customization
•  Two feature vectors (see the clustering sketch below)
   o  Low resolution: (Datatype x Size)
   o  High resolution: (Datatype x Operation x Size)

[Flow: codes/benchmarks => dynamic profiles (loops, operations, memory) => vector clustering => clustered regions]
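The clustering step itself is not spelled out on the slides; a rough sketch, assuming a k-means-style grouping of per-region instruction-mix vectors. The bins and numbers below are illustrative, not the paper's data.

import numpy as np

# one row per hot code region: fraction of dynamic instructions falling in each
# (Datatype x Size) bin -- the low-resolution feature vector; columns here are
# illustrative, e.g. INT 4B, FLT 8B, REG_XFER 8B
features = np.array([
    [0.70, 0.05, 0.25],   # integer-dominated region
    [0.10, 0.80, 0.10],   # double-precision FP region
    [0.65, 0.10, 0.25],   # another integer-heavy region
    [0.15, 0.75, 0.10],   # another FP region
])

def kmeans(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign each region to its nearest centroid
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of its members
        centers = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

print(kmeans(features, k=2))  # regions with similar mixes share a cluster and
                              # hence could share a customized micro-engine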

Low Res Clusters (8)

[Chart: cluster composition over code regions; legend (Operation x Datatype): BR 8B, INT 8B, REG_XFER 8B, INT 4B, REG_XFER 4B, REG_XFER 16B, FLT 8B, FLT 16B, OTHER 4B, FLT 4B, OTHER <1B, OTHER 2B, INT 16B, OTHER 8B, INT 1B, INT 2B, REG_XFER 1B, REG_XFER 2B]


•  Width is “hot region” count; ordered by dynamic weight
•  Legend is Operation x Datatype
•  #1 integer; #2-5 FP single, double, vector
•  Much simpler, cleaner clustering...
•  8 clusters (100%)


Low Res Clusters (32)

[Chart: cluster composition over code regions; same Operation x Datatype legend as above]


•  #1 => #1, 2, 3, 5, 8 integer split by size; #4, 6, 7, ... FP
•  Very similar clusters (tight)...
•  8 clusters (70%), 16 clusters (85%), 32 clusters (100%)

Low Res Clusters (128)

[Chart: cluster composition over code regions; same Operation x Datatype legend as above]


•  Essentially homogeneous clusters (very tight)...
•  8 clusters (50%), 16 clusters (70%), 32 clusters (80%), 128 clusters (100%)


Cluster Insights
•  Clusters draw from across the workloads – not in any obvious “application domain” structure
•  Clusters reflect a wide variety of different computational needs that correlate with architecture structure
   o  Call- and branch-intensive
   o  32-bit integer oriented
   o  Bit/byte oriented
   o  Mixed 32- and 64-bit oriented
   o  Single-precision floating point
   o  … and so on…
•  Clusters separate cleanly (overpartition); ample opportunities for customization and energy efficiency


Benefit Models
•  Specialization: fraction of instructions unneeded
•  Interpolation from Nehalem to DP-float energy
•  (a sketch of the model shapes follows the plot below)


[Plot: energy efficiency improvement (x), 0-30, vs. fraction of unimplemented opcodes, 0-1, for square-root, linear, quadratic, and cubic benefit models]
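A hedged reconstruction of the four model shapes in the plot (my assumption, not the talk's exact formula): improvement grows from 1x when nothing is removed toward some ceiling E_MAX as the fraction f of unimplemented opcodes approaches 1, with square-root, linear, quadratic, or cubic shape.

E_MAX = 30.0   # illustrative ceiling, roughly the top of the plotted axis

def benefit(f, shape="linear"):
    # f: fraction of opcodes the micro-engine does not implement (0..1)
    curve = {"sqrt": f ** 0.5, "linear": f, "quadratic": f ** 2, "cubic": f ** 3}[shape]
    return 1.0 + (E_MAX - 1.0) * curve

for shape in ("sqrt", "linear", "quadratic", "cubic"):
    print(shape, [round(benefit(f, shape), 1) for f in (0.0, 0.5, 0.9)])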


Weighted Benefit vs. Benefit Model

[Plot: average benefit (x), 0-30, under the square-root, linear, quadratic, and cubic benefit models]


Weighted Benefit (linear) vs. # Cores

[Plot: weighted benefit (x), 0-12, vs. number of cores (1-64), under the linear benefit model, for high-resolution (hr) and low-resolution (lr) clusterings with 8, 16, 32, 64, and 128 clusters]
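How the per-cluster benefits roll up into the weighted benefit plotted here is not given on the slides; one Amdahl-style reading (my assumption): with n micro-engines, the n highest-weight clusters get their specialized benefit and the rest run at 1x, so the overall benefit is the reciprocal of the weighted energy. The weights and benefits below are illustrative.

def overall_benefit(cluster_weights, cluster_benefits, n_engines):
    # cluster_weights: fraction of dynamic execution per cluster (sums to 1)
    # cluster_benefits: per-cluster efficiency improvement (x) when customized
    ranked = sorted(zip(cluster_weights, cluster_benefits), reverse=True)
    energy = sum(w / b if i < n_engines else w
                 for i, (w, b) in enumerate(ranked))
    return 1.0 / energy

weights  = [0.30, 0.25, 0.20, 0.15, 0.10]   # illustrative cluster weights
benefits = [8.0, 6.0, 5.0, 4.0, 3.0]        # illustrative per-cluster benefits
for n in (1, 3, 5):
    print(n, "engines:", round(overall_benefit(weights, benefits, n), 2), "x")

Under this reading the curve saturates as engines are added, which matches the qualitative shape of the plot.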


Related Work
•  System on Chip (SoC, SoP, 3D, etc.) [CE products]
   o  Rapid system integration, not architectural design. Less stable; discontinuous change, partitioned software.
   o  6 months to α-silicon, 6 months to product in market
•  CPU + reconfigurable hardware (FPGAs, LUTs, adders, etc.)
   o  Convey HC-1, Sankaralingam11, Xilinx Zynq [EPP]
   o  Advantages: flexibility
   o  Disadvantages: lose customized implementation, speed, energy efficiency
•  Hybrid computing (CPU-GPU, APU, GenX…)
   o  Advantages: silicon today
   o  Disadvantages: programmability, 1-way hetero, cost, energy efficiency
•  Low-level programmability and heterogeneity
   o  QSCores/GreenDroid: super instructions; Khan11 [morphing], Wu11 [VM-based, single ISA]
   o  Advantage: doesn’t require much software support
   o  Disadvantage: local impact
•  Build “chip generators”, not chips
   o  Horowitz; customization and closed systems
   o  Custom for everything: programmability?


Summary and Perspective
•  Heterogeneity is endemic, and a basic source of efficiency (local)
•  We need integrated assessment – programmable, usable, delivered performance (demand it)
•  Omnibench – uniform, systematic assessment of accessible performance for a diverse accelerator future
•  10x10 – federated heterogeneous architecture based on systematic optimization of energy efficiency (major benefit)
•  Prepare wisely for a heterogeneous future!


More Information
•  Papers
   o  The Future of Microprocessors. Communications of the ACM 54(5): 67-77 (2011). [Borkar & Chien]
   o  10x10: A General-Purpose Architectural Approach to Heterogeneity and Energy Efficiency. Procedia CS 4: 1987-1996 (2011). [Chien, Snavely, Gahagan]
   o  10x10: Taming Heterogeneity for General-Purpose Architecture. 2nd Workshop on New Directions in Computer Architecture, June 2011, held at ISCA-38. [Chien]
   o  Systematic Evaluation of Workload Clustering for Designing Heterogeneous, General-Purpose Architectures. UChicago Tech Report, 2012. [A. Guha, A. Chien]
   o  An Empirical Foundation for Heterogeneity: Clustering Applications by Computation and Memory Behavior. UChicago Tech Report, 2011. [A. Guha, P. Cicotti, A. Snavely, A. Chien]
•  Acknowledgements
   o  Apala Guha, Yao Zhang, Mark Sinclair
   o  Allan Snavely, Pietro Cicotti, Mark Gahagan
   o  Insightful feedback from Shekhar Borkar (Intel) and Bill Harrod (DARPA)
   o  Supported by the National Science Foundation under NSF Grant OCI-1057921 and DARPA MTO


Questions?
