
PetaScale Execution Time Analysis

Architecture/VLSI Chip Floorplan

Monarch Chip Overview

Computational Sciences Division
Bob Lucas – Director

Poster Participants: Jeff Draper, Mary Hall, Jacqueline Chame, Pedro Diniz, Jeff Sondeen, Spundun Bhatt, Tim Barrett

USC Viterbi School of Engineering

System: four boards with eight PIM chips

LD on PIMs in IA64 Host

App/Sys Prototype

Automatic Performance Tuning

Model Guided Empirical Optimization

ECO: Combining models and guided empirical search for memory hierarchy optimization

Authors: Pedro Diniz, Jeremy Abramson, Tejus Krishna Contact: [email protected]

Performance Expectation

Objective: Evaluate link discovery (LD) algorithms on Godiva hardware.

Hypothesis: LD algorithms are data-intensive and highly parallel
• Largely read-only data
• Irregular memory accesses, hence poor cache performance
• PIM technology would yield a performance improvement

Expected Results
• Parallel PIM implementations of LD computations
• Performance comparisons with the Itanium-2 host
• Analysis of software/hardware scalability requirements
• Analysis of programming complexity
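To make the hypothesis concrete, here is a minimal sketch (not the project's code; all names and the data layout are illustrative) of the kind of kernel an LD workload contains: a pointwise mutual information pass over a sparse co-occurrence table, read-only and dominated by irregular gather loads.

    #include <math.h>
    #include <stddef.h>

    typedef struct {
        int row, col;     /* entity pair: irregular indices */
        double count;     /* observed co-occurrence count   */
    } Entry;

    /* PMI(x,y) = log( p(x,y) / (p(x) * p(y)) ) for each observed pair.
     * The table is read-only; px[row] and py[col] are scattered gather
     * loads, exactly the access pattern that defeats caches. */
    void pmi(const Entry *pairs, size_t n,
             const double *px, const double *py,
             double total, double *out)
    {
        for (size_t i = 0; i < n; i++) {
            double pxy = pairs[i].count / total;
            out[i] = log(pxy / (px[pairs[i].row] * py[pairs[i].col]));
        }
    }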

Results of Scalability Analysis

Raw Performance Measurements

PIMs for Knowledge Discovery

in collaboration with Hans Chalupsky & Jafar Adibi, USC ISI

Tools Organization and Rationale

Code Isolator

Model-Guided Empirical Optimization Results

• IBM Cu-08 90nm CMOS
• Clock 333 MHz
• 64 GOPS/GFLOPS
• Power 3-6 GFLOPS/W
• 12 Arithmetic Clusters
  – 96 ALUs (32-bit integer/float)
• 31 Memory Clusters
  – 256W x 32 bits each (128KB)
• 6 RISC processors
• 12 MBytes eDRAM
• 2 memory interfaces (8 GB/s BW)
• 2 RapidIO (x4 serial) interfaces
• 17 DIFL ports (2.6 GB/s ea)
• On-chip quad ring (40 GB/s)

DIFL = Differential Inter-FPCA Link
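A consistency check on the headline rate (our arithmetic, not stated on the poster): 96 ALUs at 333 MHz deliver about 32 Gop/s, so the 64 GOPS figure implies two operations per ALU per cycle, e.g. a multiply-accumulate counted as two operations.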

[Figure: Monarch chip block diagram showing eDRAM banks with RISC processors (P), two memory interfaces, a ROM port, CM, two RapidIO (RIO) interfaces, DI/DO, and DIFL/PBUS ports around the periphery.]

MONARCH Project

• MOrphable Networked ARCHitecture (MONARCH)
  – DARPA-funded collaboration between USC, Raytheon, Mercury, IBM, Georgia Tech
• Combines two radically different computing paradigms
  – Conventional thread-level parallel programming model
    • RISC processor with extensions
    • WideWord (MMX-like) unit formed through morphing
    • Useful for complex code sets containing data-dependent control flow decisions
  – Stream programming model (dataflow stream operation)
    • Field Programmable Compute Array (FPCA)
    • Useful for predictable operations on large data streams, e.g., pre-filtering of sensor data
    • Achieves the highest data throughput

[Figure: MONARCH chip floorplan showing 6 AC-RISC and 6 AC-NoWW arithmetic clusters, 6 eDRAM banks, 12 PBUFs, interconnect (IC) blocks, 31 memory clusters (MC), HSS serial I/O macros, and PLLs.]

Status - currently in fab

- First silicon expected 4Q06

- Prototype boards/modules expected 1Q06

ASIC Area Breakdown: Full MONARCH Chip
Based on IBM's max die size of 352 sq mm (18.76 mm on a side)
Total Active Cu-08 Cells = 280,054,413 (~100M gate equivalents)

• Low-level binary instrumentation is too expensive
  – Takes time, thus precluding observation of real runs
  – Generates lots of data, thus forcing the use of sampling techniques
• Approach: synergistic combination of compiler static analysis and dynamic run-time data extraction
  – Static analysis uncovers some program behavior information and identifies the data to be extracted at run-time
  – The source code is instrumented to extract the missing data at run-time
• Advantages
  – Much faster than the binary instrumentation approach
  – Can relate observed metrics to the source-level program
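A hypothetical illustration of the idea (the probe functions below are invented, not the project's instrumentation library): statically known facts go to data files, and probes are inserted only for values the compiler cannot determine, such as a data-dependent loop bound or the address range of an irregular reference.

    #include <stdio.h>

    /* Stub probes: the real instrumentation library would append to the
     * dynamic-info data files instead of printing. */
    static void probe_loop_bound(int loop_id, long bound)
    {
        printf("loop %d: bound %ld\n", loop_id, bound);
    }

    static void probe_addr_range(int ref_id, const void *lo, const void *hi)
    {
        printf("ref %d: range [%p, %p]\n", ref_id, lo, hi);
    }

    /* Instrumented version of a kernel whose trip count and access
     * pattern are unknown at compile time. */
    void kernel(double *a, const int *idx, int n)
    {
        probe_loop_bound(0, n);                 /* dynamic trip count     */
        probe_addr_range(0, &a[0], &a[n - 1]);  /* symbolic address range */
        for (int i = 0; i < n; i++)
            a[idx[i]] += 1.0;   /* irregular reference: stride not static */
    }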

[Tool-flow diagram, redrawn as text:]

Source Code (C/Fortran) → Open64 Front-End (gccfe, fef90) → Whirl B files

Whirl B files → Static Analysis (source level): basic blocks; loops & bounds; array refs & stride info; symbolic address ranges; workloads (int, fp, load, ...); locality metrics info → Static Info data files (text files) and "what to instrument"

"What to instrument" → Source Code Instrumentation → Whirl B files → Open64 tools (whirl2c, whirl2f) → Instrumented Source Code (C/Fortran)

Instrumented Source Code + Instrumentation Library → Target Arch Compiler (gcc, f90, gf) → Application Executable

Application Executable + Application Inputs → Execution → Application Outputs + Dynamic Info data files

Static + Dynamic Info data files → Off-Line Analysis: basic blocks; loops & bounds; array refs & stride info; symbolic address ranges; workloads (int, fp, load); locality metrics info → Analysis files

Compiler Approach to Performance Expectation

• Goal: derive performance expectations from source code for different architectures
  – What should the performance be, and why?
  – What is limiting the performance?
    • Data dependences
    • Architecture limitations
• Approach: use data-flow analysis & scheduling techniques (see the sketch below)
  – Extract the DFG from the high-level source code
  – Make assumptions about the memory hierarchy
  – Compute an as-soon-as-possible (ASAP) schedule
  – Vary the number and implementation features of units
    • Load/store units
    • Functional units
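A minimal sketch of the ASAP step, assuming the DFG has already been extracted and topologically ordered; the five-node graph, latencies, and edges below are invented for illustration, and the real analysis additionally models unit counts and memory-hierarchy assumptions.

    #include <stdio.h>

    #define NOPS 5

    /* toy DFG: ld, ld, fmul(0,1), add(2), fdiv(3); fdiv is long-latency */
    static const int latency[NOPS] = {2, 2, 4, 1, 20};
    static const int npred[NOPS]   = {0, 0, 2, 1, 1};
    static const int pred[NOPS][2] = {{0, 0}, {0, 0}, {0, 1}, {2, 0}, {3, 0}};

    int main(void)
    {
        int start[NOPS], span = 0;
        /* ASAP: each op starts when its last predecessor finishes */
        for (int v = 0; v < NOPS; v++) {
            start[v] = 0;
            for (int p = 0; p < npred[v]; p++) {
                int fin = start[pred[v][p]] + latency[pred[v][p]];
                if (fin > start[v]) start[v] = fin;
            }
            if (start[v] + latency[v] > span) span = start[v] + latency[v];
        }
        printf("ASAP schedule length: %d cycles\n", span);  /* prints 27 */
        return 0;
    }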

Architectural Exploration Results for UMT2K

[Figure: two plots of cycles vs. number of load/store units (1-5), one curve per ALU count (1-5 ALUs); one panel with no unrolling of the inner loop, one with the inner loop unrolled 4x.]

• Code:

  – Inner loop of the angular loop in the snswp3D procedure
  – 272 operations: 4 FP divides (non-pipelined), 41 FP multiplies, 95 integer ops, 84 loads/stores, 22 integer multiplies
• Analysis:
  – Compute-bound: adding more load/store units won't help
  – Not cost-effective to have more than 2 ALUs (non-unrolled) or 4 ALUs (4x unrolled)
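A rough sanity check on the compute-bound claim (our arithmetic, under the simplest bound that ignores dependences): cycles >= max(ceil(compute_ops / num_ALUs), ceil(mem_ops / num_LSUs)). With 84 of the 272 operations being loads/stores, a single load/store unit already needs only 84 cycles, while the remaining ~188 compute operations on 2 ALUs need at least 94 cycles, so additional load/store units cannot reduce the bound.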

Authors: Chun Chen, YoonJu L. Nelson, Jacqueline Chame, Mary Hall Contact: [email protected]

Authors: Jacqueline Chame, Mary Hall, Spundun Bhatt, Tim Barrett Contact: [email protected]

Authors: Jeff Draper, Jeff Sondeen, Sumit Mediratta, Rashed Bhatti, TJ Kwon, Tim Barrett, et al. Contact: [email protected]

• Model-guided compiler optimization
  – static models of the architecture and of profitability
• Empirical optimization
  – empirical data guide optimization decisions
  – self-tuning libraries such as ATLAS, PHiPAC, FFTW and SPIRAL
• Exploit the complementary strengths of both approaches
  – compiler models prune unprofitable solutions
  – empirical data provide an accurate measure of optimization impact

[Framework diagram, redrawn as text:]

Phase 1: analysis/models plus transformation modules take the application code and an architecture specification through code variant generation, yielding a set of parameterized code variants plus constraints on the unbound parameters.

Phase 2: a search engine, backed by performance monitoring support in the execution environment, runs each optimized code variant on a representative input data set and emits the optimized code.
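A sketch of what the phase-2 search could look like for a single unbound parameter; the tile-size parameter, the 32 KB cache constraint, and compile_and_run() are stand-ins invented for this example.

    #include <stdio.h>

    /* Stub harness: in reality this builds one code variant and times it. */
    static double compile_and_run(int tile)
    {
        double miss = (tile < 40) ? (40 - tile) : 0.5 * (tile - 40);
        return 1.0 + 0.01 * miss;   /* pretend tile = 40 is the sweet spot */
    }

    int main(void)
    {
        int best_tile = 0;
        double best_time = 1e30;
        /* models have already pruned the space: only tiles whose three
         * working arrays fit in a 32KB L1 survive as candidates */
        for (int tile = 8; 3 * tile * tile * sizeof(double) <= 32 * 1024; tile += 8) {
            double t = compile_and_run(tile);    /* empirical measurement */
            if (t < best_time) { best_time = t; best_tile = tile; }
        }
        printf("selected tile = %d (time %.3f)\n", best_tile, best_time);
        return 0;
    }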

[Figure: matrix multiply on SGI R10K, comparing ECO against ATLAS BLAS, vendor BLAS, and the native compiler.]

Targeting multimedia extension architectures (Superword-Level Parallelism, SLP)

[Framework diagram, redrawn as text:]

Phase 1: analysis/models plus transformation modules take the application code and architecture specification through code variant generation, which
• selects the loop order
• applies cache and TLB optimizations
• unroll&jams loops with SLP and spatial reuse
yielding parameterized code variants (optimized for caches/TLB, with unroll&jam exposing SLP) plus constraints on the unbound parameters.

Phase 2: an empirical search engine with performance monitoring in the execution environment runs the optimized code on a representative input data set; on the unrolled code it
• packs isomorphic operations
• aligns operands
• applies register optimizations: superword replacement, register packing
• applies low-level optimizations
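For concreteness, a hand-written sketch of the end product on Intel SSE (our example, not actual ECO output): after unrolling by four, the isomorphic scalar multiply-adds pack into single superword operations.

    #include <xmmintrin.h>

    void saxpy4(float *restrict y, const float *restrict x, float a, int n)
    {
        __m128 va = _mm_set1_ps(a);
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            /* four isomorphic multiply-adds packed into superword ops */
            __m128 vx = _mm_loadu_ps(&x[i]);
            __m128 vy = _mm_loadu_ps(&y[i]);
            _mm_storeu_ps(&y[i], _mm_add_ps(vy, _mm_mul_ps(va, vx)));
        }
        for (; i < n; i++)          /* scalar remainder */
            y[i] += a * x[i];
    }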

Results for Intel SSE. In process: PPC AltiVec.

[Pie chart: ASIC area breakdown by block, percent of die]
• 2x DDR: 4%
• 17x FD Hybrid DIFL + PBUS DMA: 10%
• DT Decaps: 0%
• 6x AC-RISC: 8%
• 6x AC-No_WW: 5%
• 12x PBUF: 3%
• 1x ROM Port: 0%
• Decaps: 6%
• System: 3%
• Reserved: 25%
• eFuses: 1%
• 2x XPIRX (as Serial RapidIO): 1%
• Serial RapidIO (Mercury): 1%
• 31x MC: 14%
• 6x eDRAM+BIST+Wrapper: 17%
• 10x ANBI (IOC): 2%

UMT2K Summary

                           Program    Energy Loop   Angle Loop
Size (LOC)                 232K       150           1.3K
Execution Time (hh:mm:ss)  41:02:05   00:00:12      00:10:00
#Args                      -          16            50
Input Data (Bytes)         0.57M      61.69M        442.84M

• Develop a "benchmark" of a computation kernel from a large application
  – Performance behavior equivalent to the full application
  – Built by the programmer and/or a compiler tool
• Support model-guided empirical optimization (the ECO project)
  – Increase machine and programmer efficiencies
• Develop tool support for automatic performance tuning
  – Locality optimizations
  – Shared-memory parallel optimizations

MUTUAL INFORMATION

                                 Clock     Execution Time  Cycles  IPC
Itanium-2                        900 MHz   5.5 ms          4.9M    1.588
Single PIM (superword,
compiler + hand tuned)           140 MHz   32.1 ms         4M      n/a

(18% fewer cycles on the PIM)

GRAPH CLUSTERING

                                 Clock     Execution Time  Cycles  IPC
Itanium-2                        900 MHz   0.26 ms         233K    0.806
Single PIM (scalar, compiler)    140 MHz   1.11 ms         155K    n/a

(33% fewer cycles on the PIM)

Assume the same clock on the PIM and the Itanium-2:

  Speedup using 1 PIM = IT2 cycles / PIM cycles
                      = 1.225 for MI, 1.503 for GC (1.008 for 2 PIMs)

Now normalize by the IPC of the scaled data, since PIM behavior is consistent across data sets:

  Speedup = IT2 cycles * (IPC_test / IPC_scaled) / PIM cycles
          = 1.316 for MI, 2.611 for GC (1.75 for 2 PIMs)
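As a check, the unnormalized figures can be recomputed from the tables above: 4.9M / 4M = 1.225 for MI and 233K / 155K ≈ 1.503 for GC. The normalized figures then imply IPC ratios of roughly 1.316 / 1.225 ≈ 1.07 for MI and 2.611 / 1.503 ≈ 1.74 for GC; the underlying IPC_test and IPC_scaled values are not given on the poster.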

Original program: the code fragment to be executed in isolation is outlined into its own procedure:

  void main() {
      ...
      OutlineFunc(<InputParameters>);
      ...
  }

  void OutlineFunc(<InputParameters>) {
      /* code fragment to be executed */
  }

Isolated program: the isolated code must be
1. Compilable
2. Executable: StoreInitialDataValues while running the full program, then <InputParameters> = SetInitialDataValues in the isolated program
3. Faithful to machine state: CaptureMachineState / SetMachineState
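A minimal sketch of what a generated isolated program might look like; the dump-file format, file name, and the prefix-sum fragment are invented for illustration, and restoring machine state (step 3) is reduced to a comment.

    #include <stdio.h>
    #include <stdlib.h>

    /* Replay the input data captured by StoreInitialDataValues during a
     * run of the full application (hypothetical format: n, then data). */
    static double *SetInitialDataValues(const char *dump, int *n)
    {
        FILE *f = fopen(dump, "rb");
        if (!f) { perror(dump); exit(1); }
        if (fread(n, sizeof *n, 1, f) != 1) exit(1);
        double *a = malloc((size_t)*n * sizeof *a);
        if (fread(a, sizeof *a, (size_t)*n, f) != (size_t)*n) exit(1);
        fclose(f);
        return a;
    }

    /* The outlined code fragment under study (invented example). */
    static void OutlineFunc(double *a, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] += a[i - 1];
    }

    int main(void)
    {
        int n;
        double *a = SetInitialDataValues("fragment.dump", &n);
        /* SetMachineState would go here, e.g. warming caches to match
           the state captured at the original call site */
        OutlineFunc(a, n);
        printf("checksum %g\n", a[n - 1]);
        free(a);
        return 0;
    }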