Parallel Performance Wizard: Introduction

13
06 April 2006 Parallel Performance Wizard: Introduction Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant Mr. Adam Leko, Sr. Research Assistant Mr. Bryan Golden, Research Assistant Mr. Hans Sherburne, Research Assistant Mr. Max Billingsley, Research Assistant Mr. Josh Hartman, Undergraduate Volunteer HCS Research Laboratory University of Florida

description

Parallel Performance Wizard: Introduction. Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant Mr. Adam Leko, Sr. Research Assistant Mr. Bryan Golden, Research Assistant Mr. Hans Sherburne, Research Assistant Mr. Max Billingsley, Research Assistant - PowerPoint PPT Presentation

Transcript of Parallel Performance Wizard: Introduction

Page 1: Parallel Performance Wizard: Introduction

06 April 2006

Parallel Performance Wizard:

Introduction

Professor Alan D. George, Principal InvestigatorMr. Hung-Hsun Su, Sr. Research Assistant

Mr. Adam Leko, Sr. Research AssistantMr. Bryan Golden, Research Assistant

Mr. Hans Sherburne, Research AssistantMr. Max Billingsley, Research Assistant

Mr. Josh Hartman, Undergraduate Volunteer

HCS Research LaboratoryUniversity of Florida

Page 2: Parallel Performance Wizard: Introduction

206 April 2006

Outline

Motivations and Objectives

Background

Framework & Key Features

Phase II Tasks & Schedules

Today’s Schedule

Page 3: Parallel Performance Wizard: Introduction

306 April 2006

Motivations and Objectives Motivations

UPC/SHMEM program does not yield the expected performance. Why?

Due to complexity of parallel computing, difficult to determine without tools for performance analysis and optimization

Discouraging for users, new & old; few options for shared-memory computing in UPC and SHMEM communities

Objectives Research topics relating to performance analysis Develop framework for a performance analysis tool Design with both performance and user productivity in mind Develop a performance analysis tool for UPC and SHMEM

Page 4: Parallel Performance Wizard: Introduction

406 April 2006

Need for Performance Analysis Performance analysis of

sequential applications can be challenging

Performance analysis of explicitly communicating parallel applications is significantly more difficult Mainly due to increase in

number of processing nodes

Performance analysis of Implicitly communicating parallel applications is even more difficult Non-blocking, one-sided

communication is tricky to track and analyze accurately

Sequential programming

Processing node

Explicit parallel programming (ex: MPI)

Processing node

...Processing

node

Implicit parallel programming (ex: UPC, SHMEM)

Processing node

...Processing

node

Page 5: Parallel Performance Wizard: Introduction

506 April 2006

Background - SHMEM

SHared MEMory library Based on SPMD model Available for C / Fortran Available for servers and clusters Easier to program than MPI Hybrid programming model

Traits of message passing Explicit communication, replication and synchronization Need to give remote data location (processing element ID)

Traits of shared memory Provides logically shared memory system view Non-blocking, one-sided communication

Lower latency, higher bandwidth communication PSHMEM available for some implementations

Int xfloat y

Remotely Accessible

Memory

Private Memory

Int xfloat y

Private Memory

SAME ADDRESS

P.E. X P.E. Y

Remotely Accessible

Memory

Page 6: Parallel Performance Wizard: Introduction

606 April 2006

Background - UPC Unified Parallel C (UPC) Partitioned GAS parallel programming language Common and familiar syntax and semantics for parallel C with simple

extensions to ANSI C Many implementations

Open source: Berkeley UPC, Michigan UPC, GCC-UPC Proprietary: HP-UPC, Cray-UPC

Easier to program than MPI, software more scalable With hand-tuning, UPC performance compares favorably with MPI

Page 7: Parallel Performance Wizard: Introduction

706 April 2006

Background – Performance Analysis Three general performance analysis approaches

Analytical modeling Mostly predictive methods Could also be used in conjunction with experimental performance

measurement Pros: easy to use, fast, can be performed without running the program Cons: usually not very accurate

Simulation Pros: allow performance estimation of program with various system

architectures Cons: slow, not generally applicable for regular UPC/SHMEM users

Experimental performance measurement Strategy used by most modern performance analysis tools (PATs) Uses actual event measurement to perform analysis Pros: most accurate Cons: can be time-consuming

PAT = Performance Analysis ToolPAT = Performance Analysis Tool

Page 8: Parallel Performance Wizard: Introduction

806 April 2006

Background - Experimental Performance Measurement Stages Instrumentation – user-assisted or automatic insertion of instrumentation

code Measurement – actual measuring stage Analysis – data analysis toward bottleneck detection & resolution Presentation – display of analyzed data to user, deals directly with user Optimization – process of finding and resolving bottlenecks

Instrumentation Measurement Analysis

PresentationOptimization

Analytical model Simulation

Page 9: Parallel Performance Wizard: Introduction

906 April 2006

FrameworkInstrumentation

Module

Input interface

Sourceinstrumentation

manager

Binaryinstrumentation

manager

Measurement units Event file managerLow-level event read/

write

High-level event read/write

Analysis/data manager

Analysis units (bottleneck detection)

Simulator

Basic event data

Composite event data

Analysis data

STORAGE

Performance prediction

Performance bottleneck resolution

Code transformation

Visualization manager

Measurement Module

Analysis Module

Presentation Module

Optimization Module

Partially done with project

Completely done with project

Not part of project

Page 10: Parallel Performance Wizard: Introduction

1006 April 2006

Key Features Semi-automatic source-level instrumentation as default

Only P module and part of I module are visible to user

PAPI will be used

Tracing mode as default with profiling support

Post-mortem data processing and analysis

Analyses: load balancing, scalability, memory system

Visualizations: timeline display, speedup chart, call-tree graph, communication volume graph, memory access graph, profiling table

Page 11: Parallel Performance Wizard: Introduction

1106 April 2006

Tasks & ScheduleID Task Name

2005 2006 2007

Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

Develop detailed design

Develop detailed design for I, M, and A modules

Design and evaluate high-level user interface (P module)

Develop standard set of test cases

Construct, test, and integrate I&M modules

Construct and test I module for open platforms

Construct and test M module for open platforms

Integrate I with M modules and test for open platforms

Construct and test I module for proprietary platforms

Construct and test M unit for proprietary platforms

Integrate I with M units and test for proprietary platforms

Construct, test, and integrate A module

Construct and test A module for open platforms

Integrate A with IM modules and test for open platforms

Construct and test A module for proprietary platforms

Integrate A with IM modules and test for proprietary platforms

Construct, test, and integrate P module

Code and test P module

Integrate P with IMA modules and test

Conduct beta tests and refine tool design and implementation

Beta test #1 – functionality + usability

Beta test #2 – functionality + usability + productivity with controlled case studies

Beta test #3 – functionality + usability + productivity with uncontrolled case studies

Design and implement advanced features

Preliminary research in performance optimization via prediction

Page 12: Parallel Performance Wizard: Introduction

1206 April 2006

Discussion Topic: Target PlatformsOur current platform list; changes needed? Open

Quadrics SHMEM on Opterons + RHEL4 (qsnet) Berkeley UPC on Opterons + RHEL4 (iba)

Proprietary Cray UPC on X1E (src. inst) Cray SHMEM on X1E

Page 13: Parallel Performance Wizard: Introduction

1306 April 2006

Today’s Schedule

09:00 – 09:30 AM Project overview09:30 – 10:15 AM Instrumentation (I) module presentation10:15 – 10:30 AM BREAK10:30 – 11:15 AM Measurement (M) module presentation11:15 – 11:45 AM I&M-modules demo11:45 – 13:00 PM LUNCH13:00 – 13:45 PM Analysis (A) module presentation13:45 – 14:00 PM A-module demo14:00 – 14:45 PM Presentation (P) module presentation14:45 – 15:00 PM P-module demo15:00 – 15:15 PM BREAK15:15 – 16:00 PM Wrap-up & planning discussion