Transcript of: Metrics for the Office of Science HPC Centers (Jonathan Carter, NERSC User Group Meeting, June 12, 2006)

Page 1

Metrics for the Office of Science HPC Centers

Jonathan Carter
User Services Group Lead

jtcarter@lbl.gov

NERSC User Group Meeting
June 12, 2006

Page 2

Goals

• Informational
  – Metrics Panel
  – Draft proposal
• Solicit Feedback
  – Are proposed metrics reasonable?
  – Fine tuning ‘capability job’ metrics

Page 3

Office of Science “Metrics Panel”

• ASCR has asked a panel for recommendations about metrics

• Panel is headed by Gordon Bell from Microsoft
• Its goals:
  – performance measurement and assessment at Office of Science (SC) HPC facilities
  – appropriateness and comprehensiveness of the measures
  – science accomplishments and their effects on SC’s science programs
  – provide input for the Office of Management and Budget (OMB)
  – evaluation of ASCR progress towards the long-term goals specified in the OMB Program Assessment Rating Tool (PART)
• NERSC, ORNL and ANL have provided input

Page 4

Current OMB PART Metrics

1. Acquisitions should exceed planned cost and schedule by no more than 10%.

This metric is reasonable.

2. 40% of the computational time is used by jobs with a concurrency of 1/8 or more of the maximum usable compute CPUs.

Meeting this metric has had both positive and negative effects: it motivated increased scaling of user codes, but it is not related to the quantity, quality, or productivity of the science. (A small sketch of how this metric could be computed appears after this list.)

3. Every year several selected science applications are expected to increase efficiency by at least 50%.

This metric was motivated by the desire to increase the percentage of peak performance achieved by large science applications, a goal that now has less merit. It should be replaced by a scaling metric.
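To make the current metric #2 concrete, here is a minimal sketch of how the 40% figure could be checked from job accounting records; the record fields, the example numbers, and the machine size are hypothetical, not part of the slides.

```python
# Sketch: evaluate PART metric #2 (40% of computational time from jobs with a
# concurrency of 1/8 or more of the maximum usable compute CPUs) against
# hypothetical job records.

MAX_USABLE_CPUS = 6080            # assumed machine size (illustrative only)
CUTOFF = MAX_USABLE_CPUS / 8      # 1/8 of the machine

# Each record: (cpus_used, cpu_hours_charged) -- made-up values for illustration.
jobs = [(64, 1200.0), (1024, 9500.0), (2048, 22000.0), (256, 3100.0)]

total_time = sum(hours for _, hours in jobs)
large_time = sum(hours for cpus, hours in jobs if cpus >= CUTOFF)

print(f"Time at >= 1/8 concurrency: {large_time / total_time:.1%} (target: 40%)")
```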

Page 5

Suggestions for PART Metrics

• Three PART metrics are sufficient to demonstrate DOE Office of Science’s progress in advancing the state of high performance computing.
• Cost-efficient and timely acquisitions clearly important
  – Metric #1 retained but slightly modified (scoring).
• Primary interest of OMB is whether the computational resources in the Office of Science are facilitating scientific discovery: the PART metrics should reflect this interest.
  – Metrics #2 and #3 should be changed

Page 6

Suggestions for PART Metrics

• Scientific Discovery is hard to measure in the near term
• Propose using the following sets of metrics to assess two factors that are highly influential on scientific discovery:
  – how well the computational facilities provision resources and services (Facility Metrics), and
  – how well computational scientists use these resources to produce science (Computational Science Metrics)
• Some combination of these metrics should replace PART #2 and #3

Page 7

Metrics Terminology

• Goal: the behavior being motivated

• Metric: what is being measured

• Value: the value for the metric that must be achieved

Page 8

Facility Metrics

• How well the computational facilities provision resources and services

• The specifics of these goals and metrics impact your experience running at NERSC

Page 9

Facility Metrics: User Satisfaction

Goal #1: User Satisfaction
Meeting the metric means that the users are satisfied with how well the facility provides resources and services.

Metric #1.1: Users find the systems and services of a facility useful and helpful.

Value #1.1: The overall satisfaction score on the annual user survey is 5.25 or better (out of 7).

Metric #1.2: Facility responsiveness to user feedback

Value #1.2: There is an improved user rating in areas where previous user ratings had fallen below 5.25 (out of 7).

Page 10

Facility Metrics: System Availability

Goal #2: Office of Science systems are ready and able to process the user workload.

Meeting this metric means the machines are up and available most of the time. Availability has real meaning to users.

Metric #2.1: Scheduled availability
Scheduled availability is the percentage of time a system is available for users, accounting for any scheduled downtime for maintenance and upgrades.

Value #2.1: Within 18 months of delivery and thereafter, scheduled availability is > 95%
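As an illustration only, the sketch below computes scheduled availability from outage totals, assuming the common convention that scheduled maintenance is excluded from the time the system is expected to be up; the slide itself does not fix the exact accounting, and the numbers are made up.

```python
# Sketch: scheduled availability over a reporting period, assuming scheduled
# maintenance is not counted against the time the system was expected to be up.

def scheduled_availability(period_hours, scheduled_down_hours, unscheduled_down_hours):
    """Percentage of the expected-up time during which the system was actually up."""
    expected_up = period_hours - scheduled_down_hours
    actual_up = expected_up - unscheduled_down_hours
    return 100.0 * actual_up / expected_up

# Illustrative quarter: ~2184 hours, 48 h of scheduled maintenance, 20 h of faults.
print(f"{scheduled_availability(2184, 48, 20):.2f}%  (target: > 95%)")
```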

Page 11

Facility Metrics: Effective Assistance

Goal #3: Facilities provide timely and effective assistance
Helping users effectively use complex systems is a key service that leading computational facilities supply. Users want to know that their inquiries have been heard and are being worked on. Users also need to have their problems answered properly and in a timely manner.

Metric #3.1: Problems are recorded and acknowledged

Value #3.1: 99% of user problems are acknowledged within 4 working hours.

Metric #3.2: Most problems are solved within a reasonable time

Value #3.2: 80% of user problems are addressed within 3 working days, either by resolving them or (for longer term problems) by informing the user of a longer term plan and providing periodic updates
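A minimal sketch of how values #3.1 and #3.2 might be checked against a ticket log; the timestamps are made up, and "working hours/days" are simplified to plain elapsed time, which a real report would replace with a business-hours calendar.

```python
# Sketch: check ticket-response values #3.1 and #3.2 against a hypothetical log.
from datetime import datetime, timedelta

# Hypothetical tickets: (opened, acknowledged, addressed) timestamps.
tickets = [
    (datetime(2006, 5, 1, 9), datetime(2006, 5, 1, 10), datetime(2006, 5, 2, 12)),
    (datetime(2006, 5, 3, 14), datetime(2006, 5, 3, 17), datetime(2006, 5, 8, 9)),
]

acked_fast = sum(ack - opened <= timedelta(hours=4) for opened, ack, _ in tickets)
addressed_fast = sum(done - opened <= timedelta(days=3) for opened, _, done in tickets)

print(f"Acknowledged within 4 hours: {100 * acked_fast / len(tickets):.0f}% (target: 99%)")
print(f"Addressed within 3 days:     {100 * addressed_fast / len(tickets):.0f}% (target: 80%)")
```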

Page 12

Facility Metrics: Facilitating Capability Jobs

Goal #4: Facilitate running capability jobs
Major computational facilities have to run capability jobs. What counts as a capability job needs to be defined by agreement between the Program Office and the Facility. The number of processors that defines a capability job is a function of the number of available processors and the number and kind of projects or users that the facility supports. This function has not yet been determined.

Metric #4.1: The majority of computational time goes to capability jobs.

Value #4.1: T% of all computational time will go to jobs that use more than N CPUs (or x% of the available processors)

Metric #4.2: Capability jobs are provided excellent turnaround

Value #4.2: For capability jobs, the expansion factor is X or less.

Page 13

Discussion: What is a Capability Job?

• A job using 1/8 of the processors?
• A job using 1/10 of the processors?
• A project that received ≥ 3% of the DOE allocation (3 such projects at NERSC)?
• A project that received ≥ 2% of the total allocation (12 projects)?
• A project that received ≥ 1% of the total allocation (25 projects)?
• A function of both the number of processors and the number of projects at a facility? E.g. 10 * max procs / num projects (see the sketch below):
  – NERSC: 10 * 6080 procs / 300 projects = 202 processors
  – Leadership: 10 * 10,000 procs / 20 projects = 5,000 processors
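The last candidate above, 10 * max procs / num projects, can be written as a small helper; this is only a sketch of that one proposal, reproducing the slide's two examples.

```python
# Sketch of the proposed capability-job threshold: a function of both the
# machine size and the number of projects the facility supports.

def capability_threshold(max_procs, num_projects, scale=10):
    """Minimum processor count for a job to count as a capability job."""
    return scale * max_procs // num_projects

print(capability_threshold(6080, 300))     # NERSC:      202 processors
print(capability_threshold(10000, 20))     # Leadership: 5000 processors
```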

Page 14

Discussion: Should we have a Target Expansion Factor?

• Relationship between Expansion Factor and Allocations:
  – The expected expansion factor depends on the percentage of the resource that is allocated
  – the more that gets allocated, the longer the wait times and the higher the expansion factor
• For which class of jobs should an Expansion Factor metric apply?
  – Capability jobs only?
  – All regular charge jobs?
  – Other?
• For which machines should an Expansion Factor metric apply?
  – Only the largest machine at a facility?
  – All machines, each weighted by their contribution to the total allocation?

Page 15

Discussion: What should the Target Expansion Factor Be?

• Traditional Expansion Factor: E(job) = (wait_time + run_time) / run_time
• Proposed Formula (only request time can influence scheduling decisions):
  E(job) = (wait_time + request_time) / request_time
• Weight to use in computing the Expansion Factor for a class of jobs (see the sketch below):
  – Simple average
  – Request time
  – Request time * number of processors (this gives more weight to capability jobs)
• When to start counting wait time?
  – On Seaborg and Bassi: when the job enters the Idle state
  – On Jacquard: when the job was submitted (this will change with the Maui scheduler)
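To make the formula and the weighting options concrete, here is a minimal sketch that computes E(job) = (wait_time + request_time) / request_time for a class of jobs under the three candidate weightings; the job tuples are invented for illustration.

```python
# Sketch: proposed per-job expansion factor and three candidate ways of
# aggregating it over a class of jobs (simple average, request-time weight,
# request-time * processors weight).

# Hypothetical jobs: (wait_hours, request_hours, processors)
jobs = [(12.0, 6.0, 2048), (2.0, 4.0, 64), (30.0, 12.0, 1024)]

def ef(wait, request):
    return (wait + request) / request

def weighted_ef(jobs, weight):
    total_w = sum(weight(w, r, p) for w, r, p in jobs)
    return sum(weight(w, r, p) * ef(w, r) for w, r, p in jobs) / total_w

print(weighted_ef(jobs, lambda w, r, p: 1))      # simple average
print(weighted_ef(jobs, lambda w, r, p: r))      # weighted by request time
print(weighted_ef(jobs, lambda w, r, p: r * p))  # weighted by request time * processors
```

Weighting by request time * number of processors is what gives capability jobs more influence on the aggregate figure, as the slide notes.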

Page 16

Past NERSC Expansion Factors for Regular Charge Class

Quarter             Allocation Pressure   Seaborg EF   Bassi EF   Jacquard EF   NERSC EF
FY05 Q3             Over-allocated        6.72         n/a        1.39          5.14
FY05 Q4             Over-allocated        6.62         n/a        1.50          5.10
FY06 Q1             Mixed                 5.69         4.61       3.62          4.89
FY06 Q2             Very Low              2.48         2.00       1.96          2.00
FY06 Q3 thru 6/5    Low                   4.00         1.50       2.24          2.72

(n/a: no value reported for that quarter)

Page 17

Past Seaborg Expansion Factors for Regular Charge Class

Quarter             1-112 procs   128-240 procs   256-496 procs   512-1,008 procs   1,024-2,032 procs   2,048+ procs   All
FY05 Q2             3.97          7.06            9.87            5.52              7.16                17.76          6.72
FY05 Q3             4.96          10.06           13.68           5.38              7.12                8.63           6.72
FY05 Q4             4.04          5.20            10.10           5.29              7.81                9.25           6.62
FY06 Q1             2.48          5.41            7.08            5.47              8.04                6.58           5.69
FY06 Q2             1.39          1.71            2.55            1.92              2.96                4.20           2.48
FY06 Q3 thru 6/5    1.92          3.73            5.37            4.12              4.65                4.29           4.00

Page 18

Computational Science Metrics

• Ability of projects to use facility resources for science

Page 19

Computational Science Metrics: Science Progress

CS Goal #1: Science Progress
While there are many laudable science goals, it is vital that significant computational progress is made against the Nation’s science challenges and questions.

Metric #CS1.1: Progress is demonstrated toward the scientific milestones in the top 20 projects at each facility based on the simulation results planned and promised in their project proposals.

Value #CS1.1: For the top 20 projects at each facility, an assessment is made by the related program office regarding how well scientific milestones were met or exceeded relative to plans determined during the review period.

Page 20

Computational Science Metrics: Code Scalability

CS Goal #2: Scalability of Computational Science Applications
The major challenge facing computational science during the next five to ten years is the increased parallelism needed to use more computational resources. Multi-core chips accelerate the need to respond to this challenge.

Metric #CS2.1: Science applications should increase in scalability.

Value #CS2.1: The scalability of selected applications increases by a factor of 2 every three years. The definition of scalability (strong, weak, etc.) might be domain- and/or code-specific.