Transcript of: PetaScale Execution Time Analysis / Architecture/VLSI Chip Floorplan / Monarch Chip Overview


PetaScale Execution Time Analysis

Architecture/VLSI Chip Floorplan

Monarch Chip Overview

Computational Sciences Division
Bob Lucas – Director

Poster Participants: Jeff Draper, Mary Hall, Jacqueline Chame, Pedro Diniz, Jeff Sondeen, Spundun Bhatt, Tim Barrett

USC Viterbi School of Engineering

System: four boards with eight PIM chips

LD on PIMs in IA64 Host

App/Sys Prototype

Automatic Performance Tuning

Model Guided Empirical Optimization

ECO: Combining models and guided empirical search for memory hierarchy optimization

Authors: Pedro Diniz, Jeremy Abramson, Tejus Krishna Contact: [email protected]

Performance Expectation

Objective: Evaluate link discovery (LD) algorithms on Godiva hardware.

Hypothesis: LD algorithms are data-intensive and highly parallel
• Largely read-only data
• Irregular memory accesses lead to poor cache performance
• PIM technology would yield a performance improvement

Expected Results
• Parallel PIM implementations of LD computations
• Performance comparisons with the Itanium-2 host
• Analysis of software/hardware scalability requirements
• Analysis of programming complexity

Results of Scalability Analysis
Raw Performance Measurements
PIMs for KNOWLEDGE DISCOVERY

in collaboration with Hans Chalupsky & Jafar Adibi, USC ISI

Tools Organization and Rationale

Code Isolator
Model Guided Empirical Optimization Results

• IBM Cu-08 90nm CMOS
• Clock 333 MHz
• 64 GOPS/GFLOPS
• Power 3-6 GFLOPS/W
• 12 Arithmetic Clusters
  – 96 ALUs (32-bit integer/float)
• 31 Memory Clusters
  – 256W x 32 bits each (128KB)
• 6 RISC processors
• 12 MBytes eDRAM
• 2 memory interfaces (8 GB/s BW)
• 2 RapidIO (x4 serial) interfaces
• 17 DIFL ports (2.6 GB/s each)
• On-chip quad ring (40 GB/s)
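(A quick sanity check on the peak-throughput figure, under our own assumption that each of the 96 ALUs can retire one multiply-accumulate, i.e., two operations, per cycle:

  96 ALUs × 333 MHz × 2 ops/cycle ≈ 64 GOPS.)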

DIFL = Differential Inter FPCA Link

[Chip floorplan diagram: a grid of eDRAM macros with embedded RISC processors (P), two memory interfaces, the CM and ROM port, DIFL ports around the periphery, two RapidIO (RIO) ports, and DI/DO]

MONARCH Project

• MOrphable Networked ARCHitecture (MONARCH)
  – DARPA-funded collaboration between USC, Raytheon, Mercury, IBM, Georgia Tech
• Combines two radically different computing paradigms
  – Conventional thread-level parallel programming model
    • RISC processor with extensions
    • WideWord (MMX-like) unit formed through morphing
    • Useful for complex code sets containing data-dependent control flow decisions
  – Stream programming model (dataflow stream operation)
    • Field Programmable Compute Array (FPCA)
    • Useful for predictable operations on large data streams, e.g., pre-filtering of sensor data
    • Achieves the highest data throughput

[Floorplan layout diagram: AC RISC and AC NWW tiles, eDRAM macros, PBUFs, ICs, and an array of MCs, ringed by HSS blocks and PLLs]

Status:
- Currently in fab
- First silicon expected 4Q06
- Prototype boards/modules expected 1Q06

ASIC Area Breakdown – Full MONARCH Chip
Based on IBM's max die size of 352 sq mm (18.76 mm on a side)

Total Active Cu-08 Cells = 280,054,413

~100M Gate Equivalents

• Low-level binary instrumentation is too expensive
  – Takes time, precluding observation of real runs
  – Generates lots of data, forcing the use of sampling techniques
• Approach: synergistic combination of compiler static analysis and dynamic run-time data extraction
  – Static analysis uncovers some program behavior information and identifies data to be extracted at run-time
  – Instruments the source code to extract the missing data at run-time
• Advantages:
  – Much faster than the binary instrumentation approach
  – Can relate observed metrics to the source-level program

[Tool-flow diagram: C/Fortran source code enters the Open64 front-end (gccfe, fef90); source-level static analysis extracts basic blocks, loops & bounds, array refs & stride info, symbolic address ranges, workloads (int, fp, load), and locality metrics info, deciding what to instrument and emitting static-info data files. Source code instrumentation plus the Open64 tools (whirl2c, whirl2f) turn the Whirl B files into instrumented C/Fortran source; the target-architecture compiler (gcc, f90) links it against the instrumentation library to produce the application executable. Execution on the application inputs yields the application outputs and dynamic-info data files, which feed off-line analysis of the same quantities via analysis files.]
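To make the instrumentation step concrete, here is a minimal sketch of what an inserted probe could look like for one loop (the basic-block number, counter names, and the loop itself are hypothetical, not actual tool output): static analysis cannot resolve the address range of an indirect access, so the pass records it at run time.

#include <stdio.h>
#include <stdint.h>

/* run-time record for one instrumented basic block (hypothetical bb17) */
static long      bb17_loads = 0;
static uintptr_t bb17_lo = UINTPTR_MAX, bb17_hi = 0;

static void record_access(const void *addr) {
    uintptr_t p = (uintptr_t)addr;
    if (p < bb17_lo) bb17_lo = p;      /* track observed address range */
    if (p > bb17_hi) bb17_hi = p;
    bb17_loads++;                      /* workload: dynamic load count */
}

double instrumented_loop(const double *a, const int *idx, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        record_access(&a[idx[i]]);     /* probe inserted by the tool */
        sum += a[idx[i]];              /* original statement, unchanged */
    }
    /* dumped to the dynamic-info data files at program exit */
    fprintf(stderr, "bb17: %ld loads, range [0x%lx, 0x%lx]\n",
            bb17_loads, (unsigned long)bb17_lo, (unsigned long)bb17_hi);
    return sum;
}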

• Goal: Derive performance expectations from source code for different architectures
  – What should the performance be, and why?
  – What is limiting the performance?
    • Data dependences
    • Architecture limitations
• Approach: Use data-flow analysis & scheduling techniques
  – Extract the DFG from the high-level source code
  – Make assumptions about the memory hierarchy
  – Compute an as-soon-as-possible (ASAP) schedule (a small sketch follows below)
  – Vary the number and implementation features of units
    • Load/store units
    • Functional units
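A minimal sketch of the ASAP computation over a dataflow graph, ignoring resource limits (the six-operation DFG, its latencies, and its edges are illustrative, not taken from UMT2K):

#include <stdio.h>

/* hypothetical 6-op DFG, ops listed in topological order:
   0: load a   1: load b   2: fmul a*b   3: load c   4: fadd +c   5: store */
enum { N = 6 };
static const int latency[N] = { 1, 1, 2, 1, 2, 1 };
static const int npred[N]   = { 0, 0, 2, 0, 2, 1 };
static const int pred[N][2] = { {0},{0},{0,1},{0},{2,3},{4} };

int main(void) {
    int asap[N], span = 0;
    for (int i = 0; i < N; i++) {
        asap[i] = 0;
        for (int j = 0; j < npred[i]; j++) {
            int p = pred[i][j];
            int t = asap[p] + latency[p];   /* earliest start after producer */
            if (t > asap[i]) asap[i] = t;
        }
        if (asap[i] + latency[i] > span) span = asap[i] + latency[i];
    }
    for (int i = 0; i < N; i++)
        printf("op %d starts at cycle %d\n", i, asap[i]);
    printf("dataflow-limited schedule length: %d cycles\n", span);
    return 0;
}

Resource constraints are then layered on top: with k ALUs and m load/store units, no cycle may issue more than k compute ops or m memory ops, which is exactly what the charts below vary.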

Compiler Approach to Performance Expectation

Architectural Exploration Results for UMT2K

[Two charts: cycles vs. number of load/store units (1 to 5), one curve per ALU count (1 ALU to 5 ALUs); left panel: no unrolling of inner loop (y-axis 0 to 1200 cycles); right panel: inner loop unrolled 4x (y-axis 0 to 2500 cycles)]

• Code:

  – Inner loop of the angular loop in the snswp3D procedure
  – 272 operations: 4 FP divides (non-pipelined), 41 FP multiplies, 95 integer ops, 84 loads/stores, 22 integer multiplies
• Analysis:
  – Compute-bound: adding more load/store units won't help (a back-of-the-envelope check follows below)
  – Not cost-effective to have more than 2 ALUs (non-unrolled) or 4 ALUs (4x unrolled)
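The compute-bound claim can be read directly off the operation counts with a simple resource lower bound (our own back-of-the-envelope split of the counts above):

  cycles ≥ max( ceil(non-memory ops / #ALUs), ceil(loads+stores / #LSUs) )

Of the 272 operations, 84 are loads/stores, leaving roughly 188 for the ALUs. Even a single load/store unit needs only 84 cycles, while two ALUs already need ceil(188/2) = 94, and the four non-pipelined FP divides plus the dataflow critical path push the compute side higher still. The memory bound is never the binding one, so extra load/store units cannot help.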

Authors: Chun Chen, YoonJu L. Nelson, Jacqueline Chame, Mary Hall Contact: [email protected]

Authors: Jacqueline Chame, Mary Hall, Spundun Bhatt, Tim Barrett Contact: [email protected]

Authors: Jeff Draper, Jeff Sondeen, Sumit Mediratta, Rashed Bhatti, TJ Kwon, Tim Barrett, et al. Contact: [email protected]

Model-guided compiler optimization
• static models of architecture, profitability
Empirical optimization
• empirical data guide optimization decisions
• self-tuning libraries such as ATLAS, PhiPAC, FFTW and SPIRAL
Exploit complementary strengths of both approaches
• compiler models prune unprofitable solutions
• empirical data provide an accurate measure of optimization impact

[Framework diagram — Phase 1: analysis/models and transformation modules take the application code and an architecture specification through code variant generation, producing a set of parameterized code variants plus constraints on unbound parameters. Phase 2: a search engine, with performance-monitoring support in the execution environment, takes an optimized code variant and a representative input data set and emits the optimized code.]

[Chart: matrix multiply on SGI R10K – ECO vs. ATLAS BLAS, vendor BLAS, and the native compiler]
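The two-phase split can be pictured with a toy version of the matrix-multiply case (a sketch under our own simplifications; the variant shape, the search range, and the timing harness are illustrative, not ECO's actual output): phase 1 leaves the tile size TS unbound, and phase 2 times the model-pruned candidates empirically.

#include <stdio.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];

/* phase-1 output: one code variant with an unbound tile-size parameter TS */
static void mm_tiled(int TS) {
    for (int ii = 0; ii < N; ii += TS)
      for (int kk = 0; kk < N; kk += TS)
        for (int jj = 0; jj < N; jj += TS)
          for (int i = ii; i < ii + TS && i < N; i++)
            for (int k = kk; k < kk + TS && k < N; k++)
              for (int j = jj; j < jj + TS && j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}

int main(void) {
    int best_ts = 0; double best = 1e30;
    /* phase-2 search engine: empirically time each candidate tile size */
    for (int ts = 8; ts <= 128; ts *= 2) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) C[i][j] = 0.0;
        clock_t t0 = clock();
        mm_tiled(ts);
        double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("TS=%3d: %.3f s\n", ts, s);
        if (s < best) { best = s; best_ts = ts; }
    }
    printf("selected variant: TS=%d\n", best_ts);
    return 0;
}

In the real system the search engine honors the constraints attached to each variant by the models, rather than sweeping a fixed power-of-two range.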

Targeting multimedia extension architectures (Superword-Level Parallelism, SLP)

[Framework diagram, SLP instance — Phase 1: analysis/models and transformation modules take the application code and architecture specification and emit parameterized code variants (optimized for caches/TLB, with unroll&jam to expose SLP) plus constraints on unbound parameters:
• select loop order
• cache and TLB optimizations
• unroll&jam loops with SLP and spatial reuse
Phase 2: code variant generation on the unrolled code, driven by the empirical search engine with performance monitoring in the execution environment, yields the optimized code plus a representative input data set:
• pack isomorphic operations (a packed-SSE sketch follows below)
• align operands
• register optimizations: superword replacement, register packing
• low-level optimizations]
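As an illustration of the phase-2 packing step (a hand-written sketch, not compiler output; the axpy-style loop body is hypothetical), four isomorphic scalar operations exposed by unroll&jam collapse into single superword operations on SSE:

#include <xmmintrin.h>   /* SSE intrinsics */

/* after unroll&jam exposes four isomorphic statements */
void axpy4_scalar(float *x, float *y) {
    y[0] = y[0] + 2.0f * x[0];
    y[1] = y[1] + 2.0f * x[1];
    y[2] = y[2] + 2.0f * x[2];
    y[3] = y[3] + 2.0f * x[3];
}

/* packed form: operands assumed 16-byte aligned (the "align operands" step) */
void axpy4_slp(float *x, float *y) {
    __m128 vx = _mm_load_ps(x);              /* pack 4 loads into one */
    __m128 vy = _mm_load_ps(y);
    __m128 c  = _mm_set1_ps(2.0f);
    vy = _mm_add_ps(vy, _mm_mul_ps(c, vx));  /* 4 mul-adds in 2 superword ops */
    _mm_store_ps(y, vy);                     /* superword store */
}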

Results for Intel SSE (PPC AltiVec in process)

PPC AltiVec

[Pie chart for the ASIC area breakdown above: Reserved 25%; 6xeDRAM+BIST+Wrapper 17%; 31xMC 14%; 17xFD Hybrid DIFL + PBUS DMA 10%; 6xAC-RISC 8%; Decaps 6%; 6xAC-No_WW 5%; 2xDDR 4%; 12xPBUF 3%; System 3%; 10xANBI (IOC) 2%; eFuses 1%; 2xXPIRX (as Serial RapidIO) 1%; Serial RapidIO (Mercury) 1%; DT Decaps 0%; 1xROM Port 0%]

                           Program    Energy Loop   Angle Loop
Size (LOC)                 232K       150           1.3K
Execution Time (hh:mm:ss)  41:02:05   00:00:12      00:10:00
#Args.                     –          16            50
Input Data (Bytes)         0.57M      61.69M        442.84M

UMT2K Summary
• Develop a "benchmark" of a computation kernel from a large application
  – Performance behavior equivalent to the full application
  – Programmer and/or compiler tool
• Support Model-guided Empirical Optimization (ECO project)
  – Increase machine and programmer efficiencies
• Develop tool support for automatic performance tuning
  – Locality optimizations
  – Shared-memory parallel optimizations

MUTUAL INFORMATION

                                             Clock     Execution Time   Cycles   Instructions Per Cycle
Itanium-2                                    900 MHz   5.5 ms           4.9M     1.588
Single PIM (superword, compiler+hand tuned)  140 MHz   32.1 ms          4M       n/a

(18% fewer cycles on the PIM)

GRAPH CLUSTERING

                                             Clock     Execution Time   Cycles   Instructions Per Cycle
Itanium-2                                    900 MHz   0.26 ms          233K     0.806
Single PIM (scalar, compiler)                140 MHz   1.11 ms          155K     n/a

(33% fewer cycles on the PIM)

Assume the same clock on PIM and Itanium-2:

  Speedup using 1 PIM = IT2 cycles / PIM cycles
                      = 1.225 for MI, 1.503 for GC (1.008 for 2 PIMs)

Now normalize by the IPC of the scaled data, since PIM behavior is consistent across data sets:

  Speedup = IT2 cycles × (IPC_test / IPC_scaled) / PIM cycles
          = 1.316 for MI, 2.611 for GC (1.75 for 2 PIMs)
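As a check, the first formula reproduces directly from the tables above: 4.9M / 4M = 1.225 for MI, and 233K / 155K ≈ 1.503 for GC. The IPC-normalized figures additionally use the IPCs measured on the test and scaled data sets, which are not listed here; working backwards, 1.316 / 1.225 implies IPC_test / IPC_scaled ≈ 1.07 for MI, and 2.611 / 1.503 ≈ 1.74 for GC.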

Original Program (the code fragment to be executed is outlined into a function):

void main() {
  ...
  OutlineFunc(<InputParameters>);   /* call replaces the fragment */
  ...
}

void OutlineFunc(<InputParameters>) {
  /* code fragment to be executed */
}

StoreInitialDataValues and CaptureMachineState run in the original program at the call site.

Isolated Program (must be: 1. compilable, 2. executable, 3. in the captured machine state):

<InputParameters> = SetInitialDataValues;
SetMachineState;
OutlineFunc(<InputParameters>);   /* isolated code */
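A minimal concrete rendering of this scheme (the kernel, array sizes, and file name are hypothetical; real machine-state capture, e.g., warming the caches, is elided): the isolated program restores the captured data and calls the outlined fragment.

#include <stdio.h>
#include <stdlib.h>

#define N 1024
static double a[N], b[N];

/* the outlined code fragment, exactly as in the original program */
static void OutlineFunc(double *a, double *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += 0.5 * b[i];
}

/* restore the data captured at the original call site by StoreInitialDataValues */
static void SetInitialDataValues(void) {
    FILE *f = fopen("isolated_input.dat", "rb");
    if (!f || fread(a, sizeof a, 1, f) != 1 || fread(b, sizeof b, 1, f) != 1) {
        fprintf(stderr, "missing captured state\n");
        exit(1);
    }
    fclose(f);
}

int main(void) {
    SetInitialDataValues();   /* recreate the original data state */
    OutlineFunc(a, b, N);     /* isolated code: compilable and executable */
    return 0;
}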