

Can we Systematically Evaluate and Exploit Heterogeneous Accelerators?

A 10x10 Perspective

Andrew A. Chien, Dept. of Computer Science, University of Chicago

MCS, Argonne National Laboratory

SAAHPC Keynote, July 11, 2012

Outline
•  The Future is Heterogeneous
•  Accelerators in Perspective
•  Towards Systematic Accelerator Evaluation
•  10x10: Systematic Heterogeneous Architecture
•  Summary and Futures



The Future is Heterogeneous


Heterogeneous Supercomputers

•  Tianhe-1 (NUDT, Nov 2010)
   o  5PF
   o  14,336 Xeons + 7,168 Teslas

•  Titan (ORNL, fall 2012)
   o  19K AMD CPUs + 960 GPUs
   o  Grow to 20PF in fall?
   o  ~20PF / 2TF => 10K Nvidia GPUs (Kepler?)

•  Blue Waters (NCSA, late 2012)
   o  11.5PF, 1.5PB
   o  49K AMD CPUs (380K cores)
   o  3K Nvidia GPUs (Kepler)


Heterogeneity Dominates

•  Heterogeneity is growing dramatically – on single chips, in systems, and in high-volume deployment
   o  Sandy Bridge / Fusion / Denver, Tegra 2 / OMAP / A5

•  Heterogeneity in architecture and implementation is the dominant computing platform of the future

[Figure: projected 2015 landscape – smart phones, laptops and tablets, desktops, and servers range from homogeneous through some heterogeneity (2x) to extreme heterogeneity (10x)]

Exploding Diversity

•  Highly competitive markets, many without a dominant leader
•  Smart phone market: highly fragmented – and diverse
•  Laptop and tablet market: fragmenting?

[Figure: 2015 vendor landscape – smart phones: Marvell (RIM), TI OMAP, Apple, Qualcomm, Mediatek; laptops and tablets: Intel, AMD, Apple, Nvidia, Qualcomm, TI, …]


Accelerators in Perspective


Accelerators in HPC Systems

•  Waning Moore's Law
   o  Energy-limited, data-movement-limited [Borkar & Chien, CACM May 2011]

•  Base vs. Base+Acc vs. Ratioed
   o  Performance, Coupling, Capacity

•  Cost: compute chips, total energy, compute/cu. ft., price

•  DIFF: Delivery in Whole Compute Chips


[Diagram: node organizations – integrated CPU/APU chips, multi-CPU nodes, and CPU nodes with one or more discrete GPUs attached over PCI]


How Accelerators Deliver Performance

•  Location: Path-oriented accelerators (flow and offload), NIC offload, PIM

•  Special Resources: high-performance memory (e.g., GDDR, Convey)

•  Customization: specialized logic and dense packaging/coupling

•  Assumption: regular, replicated organization
•  Scaled to thousands or millions

•  Challenges: Programming, Specialization, Integration


Programming

•  Porting effort? (software architecture, algorithms)
•  Performance attainable?
•  The fast road?
•  ...or the road to nowhere?
•  ...How long is good enough?


Programming

•  Critical: Avoiding Disaster!!

Specialization

•  "Everyone uses only 10% of the functionality; the only trouble is it's a different 10% for everyone"

•  (image, character, graphics, floating point) Embedded, smartphone, laptop, server processors

•  (parallel) Multithreaded applications?
•  (floating point) DOE scientific applications, mini-apps, and PETSc – 20-30% of operation count

•  Architect: What to specialize and how to expose it?
•  Software architect: What abstractions? (datatype, representation, movement) What interfaces and partitions?

•  => see 10x10


Integration

•  Future programming is about orchestrating data movement, not operations. Data movement dominates energy consumption.
   – DARPA Exascale Software report (2009)

•  Parallel computing – horizontal, internode
•  Exascale computing – vertical and horizontal, internode and intranode (memory and accelerator hierarchy)

•  Not just about computing, but about the relationship of compute to memory, to other compute, and to the network


Accelerator Integration

•  Shared Nothing, Asymmetric

•  Shared Memory, Symmetric

•  Shared Memory, Internal Customization, Symmetric


[Diagram: integration options – integrated CPU/Acc chips, symmetric CPU+CPU nodes, and a CPU with a discrete accelerator attached over PCI]


What's a Programmer to Do?

Towards Systematic Accelerator Evaluation


OmniBench: Systematic Evaluation of Accelerators

•  Objective: Neutral evaluation of performance

•  Idea: Benchmark with codes designed for an accelerator... but not for "exactly this" accelerator

•  Software complexity is the key driver
•  "The community can tune for 1, but not for dozens"


Omnibench Experiment

•  Challenging kernel programs
   o  SGEMM, SpMV, BFS, FFT

•  Standard interface – OpenCL 1.2
   o  Simple model: CPU + ACC (see the timing sketch below)

•  Range of heterogeneous platforms
   o  Vary compute capabilities
   o  Vary special resources
   o  Vary integration and memory hierarchy approaches
   o  Range of cost/power levels
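The talk shows no host code, but the measurement model (OpenCL 1.2, simple CPU + ACC, transfers vs. kernel time) is concrete enough to sketch. Below is a minimal, hypothetical pyopencl harness – a vector-add kernel standing in for SGEMM/SpMV/BFS/FFT – that uses OpenCL event profiling to time data transfers and kernel execution separately; all names and sizes are illustrative assumptions, not code from the study.

```python
import numpy as np
import pyopencl as cl

KERNEL = """
__kernel void vadd(__global const float *a, __global const float *b, __global float *c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
"""

def main():
    ctx = cl.create_some_context()
    # Profiling-enabled queue so each event carries start/end timestamps.
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    n = 1 << 22                       # illustrative problem size
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    c = np.empty_like(a)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY, a.nbytes)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY, b.nbytes)
    c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)

    prog = cl.Program(ctx, KERNEL).build()

    # Host -> device transfers (timed separately from the kernel).
    xfer_in = [cl.enqueue_copy(queue, a_buf, a),
               cl.enqueue_copy(queue, b_buf, b)]
    # Kernel execution on the accelerator.
    run = prog.vadd(queue, (n,), None, a_buf, b_buf, c_buf)
    # Device -> host transfer of the result.
    xfer_out = cl.enqueue_copy(queue, c, c_buf)
    queue.finish()

    ns = 1e-9
    t_in = sum(e.profile.end - e.profile.start for e in xfer_in) * ns
    t_k = (run.profile.end - run.profile.start) * ns
    t_out = (xfer_out.profile.end - xfer_out.profile.start) * ns

    print(f"transfer in: {t_in:.4f}s  kernel: {t_k:.4f}s  transfer out: {t_out:.4f}s")
    print(f"kernel-only vs. end-to-end: {t_k:.4f}s vs. {t_in + t_k + t_out:.4f}s")

if __name__ == "__main__":
    main()
```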


Heterogeneous CPU-GPU Systems

•  IvyBridge: Intel Core i5 3570K
   o  4 CPU cores (3.4 GHz)
   o  64 graphics cores (1.15 GHz)
   o  6 MB LLC, shared by CPU and integrated graphics
   o  Dual-channel DDR3 memory, 25.6 GB/s
   o  77 W, 22 nm, 216 mm^2, 1.4 billion transistors

•  APU: AMD A8-3850
   o  4 CPU cores (2.9 GHz)
   o  400 graphics cores (0.6 GHz)
   o  4 MB LLC, dedicated to the CPU cores
   o  Dual-channel DDR3 memory, 29.9 GB/s
   o  100 W, 32 nm, 228 mm^2, 1.45 billion transistors

•  Tesla: NVIDIA Tesla C2075
   o  448 cores = 14 multiprocessors * 32 (1.15 GHz)
   o  768 KB LLC, 64 KB shared memory/multiprocessor
   o  Private GDDR5 memory, 144 GB/s
   o  225 W, 40 nm, 520 mm^2, 3 billion transistors
   o  CPU-GPU link: PCI-Express x16 Gen 2, 8 GB/s

[Diagram: IvyBridge and the APU integrate CPU and accelerator on one chip; the Tesla is a discrete accelerator attached to the CPU over PCI]


One-sided Performance (SGEMM)


Simple Performance (SGEMM)


Self-normalized Accessible Performance (SGEMM)

Self-normalized (integration)

•  Integration (data transfer and computation)

•  Fraction of “Peak” Performance

•  Fraction of Accessible Performance

Relative Accessible Performance (SGEMM)


•  Same terms
•  Normalized to performance of the fastest accelerator (see the metrics sketch below)
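The slides name these normalizations without defining them; the sketch below is one plausible reading, with placeholder timings and peaks (assumptions, not the study's data): "simple" performance counts only kernel time, "accessible" performance also charges data transfers, fractions are taken against each device's own datasheet peak, and "relative" divides by the fastest accelerator measured.

```python
def gflops(flop, seconds):
    """Throughput in GFLOP/s for a run that performs `flop` operations."""
    return flop / seconds / 1e9

def normalize(flop, measurements, peaks):
    """flop: operations per run (2*n^3 for SGEMM on n x n matrices).
    measurements: {device: (kernel_seconds, transfer_seconds)}.
    peaks: {device: datasheet peak GFLOP/s}."""
    simple = {d: gflops(flop, k) for d, (k, t) in measurements.items()}
    accessible = {d: gflops(flop, k + t) for d, (k, t) in measurements.items()}
    fastest = max(accessible.values())
    return {
        d: {
            "simple GF/s": simple[d],
            "accessible GF/s": accessible[d],
            "fraction of peak": accessible[d] / peaks[d],
            "fraction of accessible": accessible[d] / simple[d],  # one possible reading
            "relative accessible": accessible[d] / fastest,
        }
        for d in measurements
    }

# Illustrative placeholder timings and peaks only -- NOT data from the talk.
n = 4096
report = normalize(
    2 * n ** 3,
    {"IvyBridge": (1.2, 0.05), "APU": (1.5, 0.08), "Tesla": (0.25, 0.30)},
    {"IvyBridge": 150.0, "APU": 480.0, "Tesla": 1030.0},
)
for device, metrics in report.items():
    print(device, {k: round(v, 2) for k, v in metrics.items()})
```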


Self-normalized Accessible Performance (BFS)

Highlighting Integration

Relative Accessible Performance (BFS)


Observations

•  Data home location has a significant impact on accessible performance and should be captured in benchmarking

•  Organizational differences are not highlighted by compute-intensive applications, but are exposed clearly by memory-intensive ones

•  Data movement management is problematic in current integrated CPU-GPU systems (sw/hw)

•  Performance of discrete accelerators dominates on compute-intensive, but not on memory-intensive workloads (even w/o equal chip resources)


Related and Future Work

•  Related Work
   o  Accelerator benchmarking: CUDA, OpenCL benchmarks (Rodinia, SHOC, ...)
   o  Extensive performance modeling

•  Future Work
   o  Additional platforms – configs, types, variations
   o  Improved software (always): drivers, memory hierarchy, compilers
   o  Higher-level software interfaces: beyond OpenCL? OpenACC? ...
   o  Larger systems: larger nodes (e.g., 2 hybrid vs. CPU+GPU), parallel (multi-node) systems


10x10: Systematic Heterogeneity


Three Paths Forward

[Figure: performance vs. time; as Dennard scaling gives way to energy-limited scaling, three paths forward emerge – big cores (10's), small cores (100's), and heterogeneous (incl. hybrid)]

[Borkar and Chien, "Technology Scaling creates New Landscape for Computer Architecture", Communications of the ACM, May 2011]


Path #3: Customize, Scale Up

•  Customize: a collection of custom tools forms a core
   o  Designed for a narrow domain, high performance and energy efficient
   o  Tool domains complement each other to cover the general-purpose space

•  Separation maximizes energy efficiency
   o  Layout density, isolation
   o  Exercise one/few tools at a time

•  Challenges: Programmability, Code Portability, Design Effort, Architecture, Si Utilization


Examples: SoC & Integrated GPU

•  Apple’s A5

•  Nvidia’s Tegra 2 and 3

•  Intel Ivy Bridge

•  AMD Fusion (Ontario)


What’s WRONG with these chips?

Not very programmable...


10x10 Framework Enables Systematic Exploitation of Heterogeneity

[Figure: trade-off across tight clusters, loose clusters, and no clusters (general-purpose) between micro-engine workload coverage and micro-engine energy efficiency, which together determine overall workload energy efficiency (combined in the sketch below)]
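The slide relates micro-engine workload coverage and micro-engine energy efficiency to overall workload energy efficiency without giving a formula. One simple way to combine them – an assumption, not a formula from the talk – is a coverage-weighted harmonic mean, since energy per operation adds:

```python
def overall_efficiency(coverage, efficiency, fallback_eff=1.0):
    """Combine per-micro-engine workload coverage with per-micro-engine energy
    efficiency into an overall workload energy efficiency (ops/J).

    coverage[i]   -- fraction of dynamic operations handled by micro-engine i
    efficiency[i] -- ops/J of micro-engine i on the operations it covers
    fallback_eff  -- ops/J of the general-purpose engine that runs the rest

    Energy per op is the coverage-weighted sum of 1/efficiency, so the overall
    efficiency is a weighted harmonic mean (an Amdahl-style limit applies).
    """
    covered = sum(coverage)
    assert covered <= 1.0 + 1e-9
    energy_per_op = sum(c / e for c, e in zip(coverage, efficiency))
    energy_per_op += (1.0 - covered) / fallback_eff
    return 1.0 / energy_per_op

# Illustrative placeholders (not measurements): three micro-engines covering
# 80% of the workload at 10-30x the baseline efficiency of 1 op/J.
print(overall_efficiency([0.4, 0.3, 0.1], [10.0, 20.0, 30.0]))  # ~3.9x overall
```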


10x10 = Federated Heterogeneity

Traditional Core vs. 10x10 Core

[Figure: a traditional core pairs an L1 instruction cache with an L1 data cache; a 10x10 core federates a basic RISC CPU with micro-engines #2-#6 (one marked <tbd>), each with its own I-cache, all sharing one L1 data cache]


Traditional Optimization: 90/10 Paradigm

•  Workloads: analyze and derive common cases (90%)
•  Invent architectural features and implementation optimizations with broad impact (90%)
•  Improve performance by adding optimizations
•  Aggregation and efficiency: 8080 ~80 instructions => Sandy Bridge 500+ instructions

[Figure: workloads are abstracted into "common" cases – "ILP", "reuse locality", "linear access", "bit-field opns", "branch patterns" – which drive optimizations – "pipelining", "superscalar", "caches & blocks", "multimedia", "branch prediction"]

Amdahl's Law, H&P's Comp Arch: A Quantitative Approach

10x10 Optimization Paradigm

•  Identify 10 application clusters; compute structures; datatypes (focus on 10 distinct bins)

•  Optimize the architecture and implementation of each separately (improve energy-delay product by 10-100x)

•  Compose together sharing memory hierarchy and interconnect (preserve the benefits of customization)

[Flow: "Workload" (7 idiom, 29 SPEC, 13 dwarves, 11 NPB) => factor into 10 bins => compose a micro-engine per bin]


The Big Picture: 10x10

•  Spectrum of energy efficiency vs. programmability
•  ASICs, SoC, GPU, parallel CPU, CPU

•  Where are we going? Overlap, dominate?
•  The answer is deeply a hardware and software question
   o  Waning days of Moore's Law, end of Moore's Law, success of near-threshold and device scaling heroics
   o  Software translation technology for cross-compilation, transformation and optimization, higher-level programming

[Figure: energy efficiency (EE, ops/J at a fixed process) vs. programmability/portability; the spectrum runs from cores (+features, +M-core, +GPU+M-core) through GPUs and SoC/IP accelerators to ASICs, with the "ideal compute chip" and 10x10 marked]

10x10 Workload Clustering

•  Challenges
   o  How to cluster? (try LOTS of things)
   o  How many for good coverage?
   o  How much benefit?

•  Broad Set of Workloads (34 total, varied)
   o  UHPC Challenge Problems (5) "Super"
      •  Streaming sensor, chess, graph, md, shock hydro
      •  DARPA "Extreme Computing"
   o  PARSEC (12) "PC"
      •  Data mining, vision, financial, genetic, physics, …
   o  Embedded Benchmarks (10) "Mobile, IoT"
      •  Image, crypto, coding, signal processing
   o  Biobench (7) "Data mining"
      •  Alignment, assembly, phylogeny, database search


What Characteristics Matter?

•  Where the time goes
   o  Focus on important sections – >90% coverage from each application

•  Architecturally Significant Features
   o  Cluster based on like requirements
   o  Supports sharing of customization

•  Two Feature Vectors (see the clustering sketch below)
   o  Low-resolution: (Datatype x Size)
   o  High-resolution: (Datatype x Operation x Size)

[Flow: codes and benchmarks (loops, operations, memory) => dynamic profiles => vector clustering => clustered regions]
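A minimal sketch of the flow above, assuming the low-resolution (Datatype x Size) feature vector and k-means as the clustering algorithm (the talk does not name one); the datatype/size axes loosely follow the legends on the next slides, and the profile format, scikit-learn dependency, and cluster counts are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical axes for the low-resolution (Datatype x Size) feature vector,
# loosely based on the cluster-figure legends (BR, INT, FLT, REG_XFER, OTHER).
DATATYPES = ["INT", "FLT", "BR", "REG_XFER", "OTHER"]
SIZES = ["1B", "2B", "4B", "8B", "16B"]
FEATURES = [(d, s) for d in DATATYPES for s in SIZES]

def low_res_vector(profile):
    """profile: {(datatype, size): dynamic op count} for one hot code region."""
    v = np.array([profile.get(f, 0) for f in FEATURES], dtype=float)
    total = v.sum()
    # Normalize so regions of different dynamic length are comparable.
    return v / total if total else v

def cluster_regions(profiles, k=8):
    """Cluster hot regions into k bins; try k = 8, 16, 32, 128 as in the talk."""
    X = np.vstack([low_res_vector(p) for p in profiles])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_
```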

Low Res Clusters (8)

[Figure: cluster composition over code regions; legend entries are Operation x Datatype x Size combinations – BR 8B; INT 1B, 2B, 4B, 8B, 16B; FLT 4B, 8B, 16B; REG_XFER 1B, 2B, 4B, 8B, 16B; OTHER <1B, 2B, 4B, 8B]

•  Width is "hot region" count; ordered by dynamic weight
•  Legend is Operation x Datatype
•  #1 Integer; #2-5 FP single, double, vector
•  Much simpler, cleaner clustering...
•  8 clusters (100%)


Low Res Clusters (32)

[Figure: same Operation x Datatype legend over code regions, now with 32 clusters]

•  #1 => #1, 2, 3, 5, 8: integer split by size; #4, 6, 7, ...: FP
•  Very similar clusters (tight)...
•  8 clusters (70%), 16 clusters (85%), 32 clusters (100%)

Low Res Clusters (128)

[Figure: same Operation x Datatype legend over code regions, now with 128 clusters]

•  Essentially homogeneous clusters (very tight)...
•  8 clusters (50%), 16 clusters (70%), 32 clusters (80%), 128 clusters (100%)


Cluster Insights

•  Clusters draw from across the workloads – not in any obvious "application domain" structure.

•  Clusters reflect a wide variety of different computational needs that correlate with architecture structure
   o  Call- and branch-intensive
   o  32-bit integer oriented
   o  Bit/byte oriented
   o  Mixed 32- and 64-bit oriented
   o  Single-precision floating point
   o  … and so on…

•  Clusters separate cleanly (overpartition); ample opportunities for customization and energy efficiency.


Benefit Models

•  Specialization: fraction of instructions unneeded
•  Interpolation from Nehalem to DP-float energy (see the model sketch below the plot)

[Plot: energy-efficiency improvement (x), up to ~30x, vs. fraction of unimplemented opcodes (0 to 1), under square-root, linear, quadratic, and cubic benefit models]
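The exact interpolation behind these curves is not given. The sketch below assumes per-operation energy falls from a full-core (Nehalem-like) value toward a bare DP-float-unit value as a power (1/2, 1, 2, 3) of the fraction of unimplemented opcodes, which yields square-root/linear/quadratic/cubic curves saturating near 30x; both endpoint constants are illustrative assumptions.

```python
import numpy as np

# Illustrative endpoint energies (assumed, not from the talk):
E_CORE = 1.0        # per-op energy of a full general-purpose core (normalized)
E_FLOAT = 1 / 30.0  # per-op energy of a bare DP-float unit, ~30x more efficient

def benefit(frac_unimplemented, exponent):
    """Energy-efficiency improvement for a micro-engine that drops
    `frac_unimplemented` of the ISA.  Per-op energy is assumed to interpolate
    from E_CORE down to E_FLOAT as frac**exponent -- giving the
    square-root/linear/quadratic/cubic curves."""
    f = np.asarray(frac_unimplemented, dtype=float)
    energy = E_CORE - (E_CORE - E_FLOAT) * f ** exponent
    return E_CORE / energy

MODELS = {"square-root": 0.5, "linear": 1.0, "quadratic": 2.0, "cubic": 3.0}

f = np.linspace(0.0, 1.0, 11)
for name, p in MODELS.items():
    print(name, np.round(benefit(f, p), 1))   # all models reach 30x at f = 1
```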


Weighted Benefit vs. Benefit Model

[Plot: average benefit (x), 0 to ~30x, across the clustered workload under the square-root, linear, quadratic, and cubic benefit models]

Weighted Benefit (linear) vs. # Cores

[Plot: weighted benefit (x), roughly 0-12x, under the linear benefit model as the number of micro-engine cores grows from 1 to 64, for high-resolution (hr) and low-resolution (lr) clusterings with 8, 16, 32, 64, and 128 clusters]


Related Work

•  System on Chip (SoC, SoP, 3D, etc.) [CE products]
   o  Rapid system integration, not architectural design. Less stable; discontinuous change, partitioned software.
   o  6 months to α-silicon, 6 months to product in market

•  CPU + reconfigurable hardware (FPGAs, LUTs, adders, etc.)
   o  Convey HC-1, Sankaralingam11, Xilinx Zynq [EPP]
   o  Advantages: flexibility
   o  Disadvantages: lose customized implementation, speed, energy efficiency

•  Hybrid Computing (CPU-GPU, APU, GenX…)
   o  Advantages: silicon today
   o  Disadvantages: programmability, 1-way hetero, cost, energy efficiency

•  Low-level Programmability and Heterogeneity
   o  QSCores/GreenDroid: super instructions; Khan11 [Morphing], Wu11 [VM-based, single ISA]
   o  Advantage: doesn't require much software support
   o  Disadvantage: local impact

•  Build "Chip Generators", not Chips
   o  Horowitz; customization and closed systems
   o  Custom for everything: programmability?


Summary and Perspective

•  Heterogeneity is endemic, and a basic source of efficiency (local)
•  We need integrated assessment – programmable, usable, delivered performance (demand it)
•  OmniBench – uniform, systematic assessment of accessible performance for a diverse accelerator future
•  10x10 – federated heterogeneous architecture based on systematic optimization of energy efficiency (major benefit)
•  Prepare wisely for a heterogeneous future!


More Information

•  Papers
   o  The Future of Microprocessors. Communications of the ACM 54(5): 67-77 (2011). [Borkar & Chien]
   o  10x10: A General-purpose Architectural Approach to Heterogeneity and Energy Efficiency. Procedia CS 4: 1987-1996 (2011). [Chien, Snavely, Gahagan]
   o  10x10: Taming Heterogeneity for General-purpose Architecture. 2nd Workshop on New Directions in Computer Architecture, June 2011, held at ISCA-38. [Chien]
   o  Systematic Evaluation of Workload Clustering for Designing Heterogeneous, General-Purpose Architectures. UChicago Tech Report, 2012. [A. Guha, A. Chien]
   o  An Empirical Foundation for Heterogeneity: Clustering Applications by Computation and Memory Behavior. UChicago Tech Report, 2011. [A. Guha, P. Cicotti, A. Snavely, and A. Chien]

•  Acknowledgements
   o  Apala Guha, Yao Zhang, Mark Sinclair
   o  Allan Snavely, Pietro Cicotti, Mark Gahagan
   o  Insightful feedback from Shekhar Borkar (Intel) and Bill Harrod (DARPA)
   o  Supported by the National Science Foundation under NSF Grant OCI-1057921 and DARPA MTO


Questions?
