Transcript of "Active Sampling for Accelerated Learning of Performance Models"
Piyush Shivam, Shivnath Babu, Jeff Chase, Duke University

Page 1:

Active Sampling for Accelerated Learning of

Performance Models

Piyush Shivam, Shivnath Babu, Jeff Chase

Duke University

Page 2:

[Figure: a task scheduler maps task workflows onto clusters C1-C3 at sites A, B, and C.]

A network of clusters or grid sites.

Each site is a pool of heterogeneous resources (e.g., CPU, memory, storage, network)

Managed as a shared utility.

Jobs are task/data workflows.

Challenge: choose the ‘best’ resource mapping/schedule for the job mix.

Instance of “utility resource planning”.

Solution under construction: NIMO

Networked Computing Utility

Page 3:

Subproblem: Predict Job Completion Time

Sample | CPU speed | Memory size | Network latency | Disk spindles | Execution time
-------+-----------+-------------+-----------------+---------------+---------------
s1     | 2.4 GHz   | 2 GB        | 1 ms            | 10            | 2 hours
...    | ...       | ...         | ...             | ...           | ...
Page 4:

Premises (Limitations)

• Important batch applications are run repeatedly.
  – Most resources are consumed by applications we have seen in the past.
• Behavior is predictable across data sets.
  – ...given some attributes associated with the data set.
  – Stable behavior per unit of data processed (D).
  – D is predictable from data set attributes.
• Behavior depends only on resource attributes.
  – CPU type and clock, seek time, spindle count.
• Utility controls the resources assigned to each job.
  – Virtualization enables precise control.
• Your mileage may vary.

Page 5:

NIMO: NonInvasive Modeling for Optimization

• NIMO learns end-to-end performance models.
  – Models predict performance as a function of (a) the application profile, (b) the data set profile, and (c) the resource profile of a candidate resource assignment.
• NIMO is active.
  – NIMO collects training data for learning models by conducting proactive experiments on a 'workbench'.
• NIMO is noninvasive.

[Figure: the model takes app/data profiles and candidate resource profiles as input, and answers "what if..." queries about (target) performance.]

Page 6:

The Big Picture

[Figure: an application profiler and a resource profiler feed a training-set database used for active learning; a scheduler maps jobs and benchmarks onto sites A, B, and C (clusters C1-C3). Pervasive instrumentation correlates metrics with job logs.]

Page 7:

Generic End-to-End Model

Execution alternates between compute phases (compute resource busy) and stall phases (compute resource stalled on I/O). Occupancy is the average time consumed per unit of data, and is directly observable.

    T = D * (O_a + O_s)

where T is the total completion time, D is the total data processed, O_a is the compute occupancy, and O_s is the stall occupancy. O_s reflects O_d (storage occupancy) and O_n (network occupancy).
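The end-to-end formula T = D * (O_a + O_s) can be evaluated directly; here is a minimal numeric sketch with hypothetical occupancy values.

```python
def completion_time(D, O_a, O_s):
    """Total completion time: total data D times the per-unit-of-data
    occupancy (compute occupancy O_a plus stall occupancy O_s, in
    seconds per unit of data)."""
    return D * (O_a + O_s)

# e.g., 1e6 data units at 3 ms compute + 2 ms stall per unit:
# roughly 5000 seconds of total completion time
T = completion_time(1e6, 0.003, 0.002)
```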

Page 8:

Statistical Learning

[Figure: statistical learning maps the independent variables (resource profile, data profile) to the dependent variables.]

Complexity (e.g., latency hiding, concurrency, arm contention) is captured implicitly in the training data rather than in the structure of the model.
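As an illustrative sketch of that independent-to-dependent mapping (not NIMO's actual learner), one could fit an occupancy from a single resource attribute by least squares. All numbers here are hypothetical, and real profiles have many attributes.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical samples: independent variable 1/CPU-speed (1/GHz),
# dependent variable observed compute occupancy O_a (ms per data unit).
inv_speed = [1 / 1.0, 1 / 2.0, 1 / 3.0]
occupancy = [6.0, 3.0, 2.0]          # perfectly linear: O_a = 6 * (1/speed)
slope, intercept = fit_line(inv_speed, occupancy)
```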

Page 9:

Sampling Challenges

• Full system operating range
  – Samples must cover the space of candidate resource assignments.
• Cost of sample acquisition
  – Acquiring a sample has a non-negligible cost, e.g., the time to acquire the sample, or the opportunity cost for the application.
• Curse of dimensionality
  – Too many parameters!
  – E.g., 10 dimensions X 10 values per dimension.
  – At 5 minutes per sample, even 1% of the samples takes 951 years!
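The back-of-the-envelope arithmetic above checks out: 10 dimensions with 10 values each is a space of 10^10 points, and sampling 1% of it at 5 minutes per sample takes roughly 951 years.

```python
# Curse-of-dimensionality arithmetic from the slide above.
points = 10 ** 10                  # 10 dimensions x 10 values each
samples = points // 100            # 1% of the space
minutes = samples * 5              # 5 minutes per sample
years = minutes / (60 * 24 * 365)  # roughly 951 years
```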

Page 10:

Active Learning in NIMO

How to learn accurate models quickly?

[Figure: accuracy of the current model vs. number of training samples; the active-sampling curve approaches 100% accuracy with far fewer samples than the passive-sampling curve.]

• Passive sampling might not expose the system operating range.
• Active sampling using "design of experiments" collects the most relevant training data.
• Automatic and quick.

Page 11:

Sample Carefully

[Figure: accuracy of the current model vs. number of training samples for passive sampling, active sampling without acceleration, and active sampling with acceleration; acceleration reaches 100% accuracy with the fewest samples.]

Page 12:

Active Sampling Challenges

• How to expose the main factors and interactions in the shortest time?
  – Which dimensions/attributes to perturb?
  – What values to choose for the attributes?
• Where to conduct the experiment?
  – On a separate system ("workbench") or "live"?

Page 13:

Planning 'active' experiments

1. Choose a predictor function to refine.
   • Focus on the most significant/relevant predictors ... or ... the least accurate.
   • Example: a CPU-intensive app needs an accurate compute-time predictor.
2. Choose an attribute (if any) to add to the predictor.
   • Example: CPU speed.
3. Choose the values of the attributes.
4. Conduct the experiment.
5. Compute the current prediction error; go to Step 1.
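The five-step loop above can be sketched schematically. All class and function names here are hypothetical, not NIMO's; ToyPredictor is a stand-in whose error simply halves with each new training sample, so the control flow can be seen end to end.

```python
def active_sampling_loop(predictors, run_experiment, target_error, max_rounds=100):
    """Schematic sketch of the five-step active-experiment loop."""
    for _ in range(max_rounds):
        p = max(predictors, key=lambda q: q.error)        # Step 1: least accurate predictor
        attr, values = p.next_attribute_and_values()      # Steps 2-3: pick attribute and values
        sample = run_experiment(attr, values)             # Step 4: conduct the experiment
        p.train(sample)
        if all(q.error <= target_error for q in predictors):  # Step 5: check current error
            break
    return predictors

class ToyPredictor:
    """Hypothetical stand-in for one of NIMO's predictor functions."""
    def __init__(self, error):
        self.error = error
    def next_attribute_and_values(self):
        return "cpu_speed", [1.0, 2.0]    # placeholder choice of attribute/values
    def train(self, sample):
        self.error *= 0.5                 # pretend each new sample halves the error

predictors = [ToyPredictor(0.4), ToyPredictor(0.2)]
active_sampling_loop(predictors, run_experiment=lambda attr, vals: (attr, vals),
                     target_error=0.05)
```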

Page 14:

Choosing the Next Predictor

• Learn the most significant/relevant predictors first.
  – Static vs. dynamic ordering.
  – Static: define a total order, e.g., a priori or by pre-estimates of influence (Plackett-Burman).
    • Cycle through the order: round-robin vs. improvement threshold.
  – Dynamic: choose the predictor with the maximum current error.

Page 15:

Choosing New Attributes

• Include the most significant/relevant attributes.
  – Choose attributes to expose main factors and interactions.
• Add an attribute when the error reduction from further training with the current set falls below a threshold.
• Choose the attribute with the maximum potential improvement in accuracy.
  – Establish a total order using a pre-estimate of relevance via Plackett-Burman.
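The threshold rule above can be sketched as a simple predicate over the recent error history. The function name, threshold, and numbers are all hypothetical.

```python
def should_add_attribute(error_history, threshold=0.01):
    """True when the error reduction from the latest training sample
    fell below the threshold, i.e., the current attribute set has
    stopped paying off."""
    if len(error_history) < 2:
        return False
    return (error_history[-2] - error_history[-1]) < threshold

still_improving = should_add_attribute([0.30, 0.18, 0.10])  # large recent drop: keep training
plateaued = should_add_attribute([0.10, 0.095])             # drop below 0.01: add an attribute
```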

Page 16:

Choosing New Values

• Select a new value sample to train the selected predictor function with the chosen set of attributes.
• A range of approaches balance coverage vs. interactions:
  – Binary search/bracketing.
  – Plackett-Burman (PB) to identify interactions.
  – La-Ib designs, where a = number of levels per value and b = degree of interactions.

Page 17:

Experimental Results

• Biomedical applications: BLAST, fMRI, NAMD, CardioWave.
• Resources: 5 CPU speeds, 6 network latencies, 5 memory sizes.
  – 5 X 6 X 5 = 150 resource assignments.
• Goal: learn an execution-time model with the fewest training assignments.
• A separate test set evaluates the accuracy of the current model.

Page 18:

BLAST Application

• Total time for all 150 assignments: 130 hrs.
• Active sampling: 5 hrs, 2% of the sample space.
• With an incorrect order of predictor refinement: 12 hrs, 10% of the sample space.

Page 19:

BLAST Application

• Total time for all 150 assignments: 130 hrs.
• Active sampling: 5 hrs, 2% of the sample space.
• With an incorrect order of attribute refinement: 12 hrs, 10% of the sample space.

Page 20:

Summary/Conclusions

• Current SLT (statistical learning): given the right data, learn the right model.
• Use active sampling to acquire the right data.
• Ongoing experiments demonstrate the importance/potential of guided active sampling.
  – 2% of the sample space, >= 90% model accuracy.
• Upcoming VLDB paper...