
Quality of Service Profiling

Sasa Misailovic, MIT CSAIL ([email protected])
Stelios Sidiroglou, MIT CSAIL ([email protected])
Henry Hoffmann, MIT CSAIL ([email protected])
Martin Rinard, MIT CSAIL ([email protected])

ABSTRACT
Many computations exhibit a trade off between execution time and quality of service. A video encoder, for example, can often encode frames more quickly if it is given the freedom to produce slightly lower quality video. A developer attempting to optimize such computations must navigate a complex trade-off space to find optimizations that appropriately balance quality of service and performance.

We present a new quality of service profiler that is designed to help developers identify promising optimization opportunities in such computations. In contrast to standard profilers, which simply identify time-consuming parts of the computation, a quality of service profiler is designed to identify subcomputations that can be replaced with new (and potentially less accurate) subcomputations that deliver significantly increased performance in return for acceptably small quality of service losses.

Our quality of service profiler uses loop perforation (which transforms loops to perform fewer iterations than the original loop) to obtain implementations that occupy different points in the performance/quality of service trade-off space. The rationale is that optimizable computations often contain loops that perform extra iterations, and that removing iterations, then observing the resulting effect on the quality of service, is an effective way to identify such optimizable subcomputations. Our experimental results from applying our implemented quality of service profiler to a challenging set of benchmark applications show that it can enable developers to identify promising optimization opportunities and deliver successful optimizations that substantially increase the performance with only small quality of service losses.

Categories and Subject Descriptors
D.2.8 [Software Engineering]: Metrics – performance

General Terms
Performance, Experimentation, Measurement


Keywords
Profiling, Loop Perforation, Quality of Service

1. INTRODUCTION
Performance optimization has been an important software engineering activity for decades. The standard approach typically involves the use of profilers (for example, gprof [13]) to obtain insight into where the program spends its time. Armed with this insight, developers can then focus their optimization efforts on the parts of the program that offer the most potential for execution time reductions.

In recent years a new dimension in performance optimization has emerged — many modern computations (such as information retrieval computations and computations that manipulate sensory data such as video, images, and audio) exhibit a trade off between execution time and quality of service. Lossy video encoders, for example, can often produce encoded video frames faster if the constraints on the quality of the encoded video are relaxed [16]. Developers of such computations must navigate a more complex optimization space. Instead of simply focusing on replacing inefficient subcomputations with more efficient subcomputations that produce the same result, they must instead consider (a potentially much broader range of) subcomputations that 1) increase performance, 2) may produce a different result, but 3) still maintain acceptable quality of service.

For such developers, existing profilers address only one side (the performance side) of the trade off. They provide no insight into the quality of service implications of modifying computationally expensive subcomputations. To effectively develop computations with an appropriate trade off between performance and quality of service, developers need new profilers that can provide insight into both sides of the trade off.

1.1 Quality of Service Profiling
The goal of quality of service profiling is to provide information that can help developers identify forgiving subcomputations — i.e., subcomputations that can be replaced with different (and potentially less accurate) subcomputations that deliver increased performance in return for acceptable quality of service losses. Note that a necessary prerequisite for quality of service profiling is the availability of multiple implementations (with different performance and quality of service characteristics) of the same subcomputation.

We present a technique, loop perforation, and an associated profiler that is designed to give developers insight into the performance versus quality of service trade-off space of a given application. Loop perforation automatically generates multiple implementations (with different performance and quality of service characteristics) of a given subcomputation. The profiling tool leverages the availability of these multiple implementations to explore a range of implementation strategies and generate tables that summarize the results of this exploration. The goal is to help developers focus their optimization efforts on subcomputations that do not just consume a significant amount of the computation time but also offer demonstrated potential for significant performance increases in combination with acceptable quality of service losses.

1.1.1 Loop Perforation
Given a loop, loop perforation transforms the loop so that it performs fewer iterations (for example, the perforated loop may simply execute every other iteration of the original loop). Because the perforated loop executes fewer iterations, it typically performs less computational work, which may in turn improve the overall performance of the computation. However, the perforated loop may also produce a different result than the original loop, which may, in turn, reduce the computation’s quality of service. An appropriate analysis of the effect of loop perforation can therefore help the developer identify subcomputations that can be replaced with less computationally expensive (and potentially less accurate) subcomputations while still preserving an acceptable quality of service.

The rationale behind the use of loop perforation to identify promising optimization opportunities is that many successful optimizations target partially redundant subcomputations. Because this partial redundancy often manifests itself as extra loop iterations, one way to find such subcomputations is to find loops that the profiler can perforate with only small quality of service losses. Our experimental results (see Sections 4 and 5) support this rationale — many of the optimization opportunities available in our set of benchmark applications correspond to partially redundant subcomputations realized as loops in heuristic searches of complex search spaces. Our results indicate that quality of service profiling with loop perforation is an effective way to uncover these promising optimization targets.

1.1.2 Quality of Service Metrics and Requirements
To quantify the quality of service effects, the profiler works with a developer-provided quality of service metric. This metric takes an output from the original application and a corresponding output from the perforated application run on the same input, and returns a non-negative number that measures how much the output from the perforated application differs from the output from the original application. Zero indicates no difference, with larger numbers indicating correspondingly larger differences. One typical quality of service metric uses the scaled difference of selected output fields. Another uses the scaled difference of output quality metrics such as peak signal to noise ratio.

A quality of service requirement is simply a bound on the quality of service metric. For example, if a 10% difference from the output of the original program is acceptable, the appropriate quality of service requirement for a scaled difference metric is 0.1.

1.1.3 Individual Loop Profiling
The profiler starts by characterizing the impact of perforating individual loops on the performance and quality of service. For each loop in the program that consumes a significant amount of the processing time, it generates a version of the program that perforates the loop. It then runs this perforated version, recording the resulting execution time and quality of service metric. It is possible for the perforation to cause the program to crash or generate clearly unacceptable output. In this case the profiler identifies the loop as a critical loop. Loops that are not critical are perforatable.
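The following sketch (ours, not the profiler's actual code) illustrates the shape of this step; the callback runPerforated, which rebuilds the program with one loop perforated, runs it, and reports its execution time and output quality, is a hypothetical stand-in for the tool chain described above:

#include <functional>
#include <string>
#include <vector>

struct LoopResult { std::string loop; bool critical; double qos; double speedup; };

// Profile each hot loop in isolation: perforate it, rerun the program, and record
// the quality of service metric and the speedup relative to the original run.
// A loop whose perforation crashes the program or produces clearly unacceptable
// output is classified as critical; all other loops are perforatable.
std::vector<LoopResult> profileIndividualLoops(
    const std::vector<std::string>& hotLoops,   // loops with a significant instruction share
    double baselineSeconds,                     // execution time of the original program
    double unacceptableQos,                     // threshold for "clearly unacceptable" output
    const std::function<bool(const std::string&, double*, double*)>& runPerforated) {
  std::vector<LoopResult> results;
  for (const std::string& loop : hotLoops) {
    double seconds = 0.0, qos = 0.0;
    bool ran = runPerforated(loop, &seconds, &qos);  // false if the perforated run crashed
    bool critical = !ran || qos > unacceptableQos;
    double speedup = ran ? baselineSeconds / seconds : 0.0;
    results.push_back({loop, critical, qos, speedup});
  }
  return results;
}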

The developer can use the individual loop profiling results (as presented in the generated profiling tables, see Section 4) to obtain insight into potentially fruitful parts of the program to explore for optimizations. The profiling tables can also identify parts of the program that show little potential as optimization targets.

1.1.4 Combined Loop Profiling
The profiler next explores potential interactions between perforated loops. Given a quality of service requirement, the profiler searches the loop perforation space to find the point that maximizes performance while satisfying the quality of service requirement. The current algorithm uses a cost/benefit analysis of the performance and quality of service trade off to order the loops. It then successively and cumulatively perforates loops in this order, discarding perforations that exceed the quality of service requirement. The result is a set of loops that, when perforated together, deliver the required quality of service.
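A minimal sketch of this search, again with a hypothetical callback (runWithLoops reruns the program with a given set of loops perforated and returns the measured quality of service metric); the specific cost/benefit ordering shown, speedup gained per unit of quality of service lost, is our assumption and only one plausible choice:

#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct Candidate { std::string loop; double qos; double speedup; };  // from the individual runs

// Order the perforatable loops by an estimated cost/benefit score, then cumulatively
// perforate them, keeping each loop only if the combined run still satisfies the
// quality of service requirement (e.g. a bound of 0.10 for a 10% difference).
std::vector<std::string> searchPerforationSpace(
    std::vector<Candidate> candidates, double qosBound,
    const std::function<double(const std::vector<std::string>&)>& runWithLoops) {
  std::sort(candidates.begin(), candidates.end(),
            [](const Candidate& a, const Candidate& b) {
              // Benefit per unit cost: speedup gained over quality of service lost.
              return (a.speedup - 1.0) / (a.qos + 1e-9) > (b.speedup - 1.0) / (b.qos + 1e-9);
            });
  std::vector<std::string> accepted;
  for (const Candidate& c : candidates) {
    accepted.push_back(c.loop);
    if (runWithLoops(accepted) > qosBound)
      accepted.pop_back();  // this perforation pushes the combined run over the bound
  }
  return accepted;
}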

The developer can use the combined loop profiling results (as presented in the generated tables, see Section 4) to better understand how the interactions (both positive and negative) between different perforated loops affect the performance and quality of service. Positive interactions can identify opportunities to optimize multiple subcomputations with more performance and less quality of service loss than the individual loop perforation results might indicate. Conversely, negative interactions may identify subcomputations that cannot be optimized simultaneously without larger than expected quality of service losses.

1.2 Results and Scope
We have applied our techniques to a set of applications from the PARSEC benchmark suite [9]. This benchmark suite is designed to contain applications representative of modern workloads for the emerging class of multicore computing systems. Our results show that, in general, loop perforation can increase the performance of these applications on a single benchmark input by a factor of between two and three (the perforated applications run between two and three times faster than the original applications) while incurring quality of service losses of less than 10%. Our results also indicate that developers can use the profiling results to develop new, alternative subcomputations that deliver significantly increased performance with acceptably small quality of service losses.

Separation of Optimization Targets: Our results show that quality of service profiling produces a clear separation of optimization targets. Some perforated loops deliver significant performance improvements with little quality of service loss. The subcomputations in which these loops appear are good optimization targets — at least one optimization attempt (loop perforation) succeeded, and the application has demonstrated that it can successfully tolerate the perturbations associated with this optimization attempt. Other, potentially more targeted, optimization attempts may therefore also succeed.

Other perforated loops, on the other hand, either fail to improve the performance or produce large quality of service losses. These loops are much less compelling optimization targets because the first optimization attempt failed, indicating that the application may be unable to tolerate perturbations associated with other optimization attempts.

Application Properties: Guided by the profiling results, we analyzed the relevant subcomputations to understand how loop perforation was able to obtain such significant performance improvements with such small quality of service losses. In general, we found that the successfully perforated loops typically perform (at least partially) redundant subcomputations in heuristic searches over a complicated search space. For five out of seven of our applications, the results show that it is possible to obtain a significantly more efficient search algorithm that finds a result that is close in quality to the result that the original search algorithm finds. This analysis clearly indicates that, for our set of benchmark applications, loops that perform partially redundant computations in heuristic search algorithms are an appropriate optimization target (and an optimization target that our profiler can enable developers to quickly and easily find).

Manual Optimization Results: We also used the profiling results to identify optimization opportunities in two PARSEC applications (x264, a video encoder, and bodytrack, a computer vision application). Guided by the profiling results, we manually developed alternate implementations of the identified subcomputations. These alternate implementations deliver significantly increased performance with small quality of service losses.

Scope: We note that our overall approach is appropriate for computations with a range of acceptable outputs. Computations (such as compilers or databases) with hard logical correctness requirements and long dependence chains that run through the entire computation may not be appropriate targets for optimizations that may change the output that the program produces.

1.3 Contributions
This paper makes the following contributions:

• Quality of Service Profiling: It introduces the concept of using a quality of service profiler to help the developer obtain insight into the performance and quality of service implications of optimizing different subcomputations.

• Loop Perforation: Any quality of service profiler needs a mechanism to obtain alternative subcomputations with a range of performance and quality of service characteristics. This paper identifies loop perforation as an effective mechanism for automatically obtaining alternative subcomputations for this purpose.

• Rationale: It explains why loop perforation is an effective mechanism for identifying promising optimization opportunities. Many optimizable subcomputations manifest themselves as loops that perform extra iterations. Eliminating loop iterations, in combination with measuring quality of service effects, is an effective way to find such subcomputations.

• Experimental Results: It presents experimental results that characterize the effectiveness of quality of service profiling. These results use a set of benchmark applications chosen to represent modern workloads for emerging multicore computing platforms.

– Loop Perforation Results: It presents individual and cumulative loop perforation tables for our benchmark applications. The information in these tables provides a good separation between optimization targets, enabling developers to identify and focus on promising targets while placing a lower priority on less promising targets.

– Manual Optimization Results: It presents results that illustrate how we were able to use the profiling results in two manual optimization efforts. These efforts produced optimized versions of two benchmarks. These versions combine significantly improved performance with small quality of service losses.

• Unsound Program Transformations: Over the last several years the field has developed a range of unsound program transformations (which may change the semantics of the original program in principled ways) [23, 8, 20, 10, 11, 18, 17, 14]. This paper presents yet another useful application of an unsound transformation (loop perforation), further demonstrating the advantages of this approach.

2. LOOP PERFORATION
We implemented the loop perforation transformation as an LLVM compiler pass [15]. Our evaluation focuses on applications written in C and C++, but because the perforator operates at the level of the LLVM bitcode, it can perforate applications written in any language (or combination of languages) for which an LLVM front end exists.

The perforator works with any loop that the existing LLVM loop canonicalization passes, loopsimplify and indvars, can convert into the following form:

for (i = 0; i < b; i++) { ... }

In this form, there is an induction variable (in the code above, i) initialized to 0 and incremented by 1 on every iteration, with the loop terminating when the induction variable i exceeds the bound (in the code above, b). The class of loops that LLVM can convert into this form includes, for example, for loops that initialize an induction variable to an arbitrary initial value, increment or decrement the induction variable by an arbitrary constant value on each iteration, and terminate when the induction variable exceeds an arbitrary bound.
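As an illustration (ours, not the literal output of the LLVM passes; process and the specific constants are hypothetical), a loop with an arbitrary start, a negative step, and an arbitrary bound performs the same iterations as a canonical loop whose trip count is computed up front:

void original(void (*process)(int)) {
  // Starts at 100, decrements by 3, and terminates once j reaches 10.
  for (int j = 100; j > 10; j -= 3) {
    process(j);
  }
}

void canonical(void (*process)(int)) {
  // Canonical form: i runs from 0 up to the trip count b = 30, incrementing by 1,
  // and the original induction variable is recomputed from i inside the body.
  for (int i = 0; i < 30; i++) {
    int j = 100 - 3 * i;
    process(j);
  }
}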

The perforation rate (r) determines the percentage of iterations that the perforated loop skips — the loop perforation transformation takes r as a parameter and transforms the loop to skip iterations at that rate. Modulo perforation transforms the loop to perform every nth iteration (here the perforation rate is r = (n-1)/n):

for (i = 0; i < b; i += n) { ... }

Our implemented perforator can also apply truncation perforation (which skips a contiguous sequence of iterations at either the beginning or the end of the loop) or random perforation (which randomly skips loop iterations).
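These two forms are not shown in the text; as an illustration only (ours, in the style of the loops above, where k denotes the number of skipped iterations, roughly r * b, and the loop bodies are elided):

// Truncation perforation: skip a contiguous block of k iterations at the end ...
for (i = 0; i < b - k; i++) { ... }

// ... or at the beginning of the loop.
for (i = k; i < b; i++) { ... }

// Random perforation: skip each iteration independently with probability r.
for (i = 0; i < b; i++) {
  if (rand() < r * RAND_MAX) continue;  // skip this iteration
  ...
}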

The loop perforator takes as input a specification of which loops to perforate, what kind of perforation to apply (modulo, truncation, or random), and the perforation rate r. It produces as output the corresponding perforated program. All of the experimental results in this paper use modulo perforation with an n of 2 and a perforation rate of 1/2 (i.e., the perforated loop skips half the iterations of the original loop).

3. QUALITY OF SERVICE METRIC
In general, it is possible to use any quality of service metric that, given two outputs (typically one from the original application and another from the perforated application), produces a measure of the difference between the outputs.

The quality of service metrics for our benchmark applications all use a program output abstraction to obtain numbers from the output to compare. These abstractions typically select important output components or compute a measure (such as peak signal to noise ratio) of the quality of the output. The quality of service metric simply computes the relative scaled difference between the produced numbers. Specifically, we assume the output abstraction produces a sequence of numbers o1, . . . , om. Given numbers o1, . . . , om from an unmodified execution and numbers ô1, . . . , ôm from a perforated execution, the following quantity d, which we call the distortion [21], measures the accuracy of the output from the perforated execution:

d = \frac{1}{m} \sum_{i=1}^{m} \left| \frac{\hat{o}_i - o_i}{o_i} \right|

The closer the distortion d is to zero, the less the perforated execution distorts the output. By default the distortion equation weighs each component equally, but it is possible to modify the equation to weigh some components more heavily than others.

The distortion measures the absolute error that loop perforation (or, for that matter, any other transformation) induces. It is also sometimes useful to consider whether there is any systematic direction to the error. We use the bias [21] metric to measure any such systematic bias. If there is a systematic bias, it may be possible to compensate for the bias to obtain a more accurate result.
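As a concrete reading of these definitions, the sketch below computes the distortion for two output abstractions of equal length; the bias computation is our hedged reading of the metric from [21] (the signed counterpart of the distortion), since its exact formula is not reproduced here:

#include <cmath>
#include <cstddef>
#include <vector>

// Distortion: the mean absolute scaled difference between corresponding
// output-abstraction components (all components weighted equally).
double distortion(const std::vector<double>& original, const std::vector<double>& perforated) {
  double sum = 0.0;
  for (std::size_t i = 0; i < original.size(); i++)
    sum += std::fabs((perforated[i] - original[i]) / original[i]);
  return sum / original.size();
}

// Bias (our reading of [21]): the same mean without the absolute value, so a
// nonzero result indicates a systematic direction to the error.
double bias(const std::vector<double>& original, const std::vector<double>& perforated) {
  double sum = 0.0;
  for (std::size_t i = 0; i < original.size(); i++)
    sum += (perforated[i] - original[i]) / original[i];
  return sum / original.size();
}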

4. PROFILING RESULTS
To evaluate the effectiveness of our approach, we apply quality of service profiling to seven benchmarks chosen from the PARSEC benchmark suite [9]. Unless otherwise noted, the input for the profiling runs is the simlarge input provided as part of the benchmark suite.

• x264. This media application performs H.264 encoding on a video stream. The quality of service metric includes the mean distortion of the peak signal-to-noise ratio (PSNR) (as measured by the H.264 reference decoder) and the bitrate of the encoded video. We ran this application on the tractor input from xiph.org (available at http://media.xiph.org/video/derf/).

• streamcluster. This data mining application solves the online clustering problem. The quality of service metric uses the BCubed (B3) clustering quality metric [4]. This clustering metric calculates the homogeneity and completeness of the clustering generated by the application, based on external class labels for data points. The value of the metric ranges from 0 (bad clustering) to 1 (excellent clustering). The quality of service metric itself is calculated as the difference between the clustering quality of the perforated and original application. It is possible for the perforated program to perform better than the unmodified version, in which case the quality of service metric is zero. For this application, the simlarge input contains a uniformly distributed set of points with no clusters. This input is therefore not representative of production data. We instead ran streamcluster on the covtype input from the UCI machine learning repository [2].

• swaptions. This financial analysis application uses Monte-Carlo simulation to solve a partial differential equation and price a portfolio of swaptions. The quality of service metric uses the scaled difference of the swaption prices from the original and perforated applications. With the simlarge input from the PARSEC benchmark suite, all swaptions have the same value. To obtain a more realistic computation, we changed the input to vary the interest rates of the swaptions.

• canneal. This engineering application uses simulated annealing to minimize the routing cost of microchip design. The quality of service metric is the scaled difference between the routing costs from the perforated and original versions.

• blackscholes. This financial analysis application solves a partial differential equation to compute the price of a portfolio of European options. The quality of service metric is the scaled difference of the option prices.

• bodytrack. This computer vision application uses an annealed particle filter to track the movement of a human through a scene. The quality of service metric uses the relative mean squared error of the series of vectors that the perforated application produces to represent the changing configurations of the tracked body (as compared with the corresponding series of vectors from the original application). The metric divides the relative mean squared errors by the magnitudes of the corresponding vectors from the original application, then computes the mean error for all vectors as the final quality of service metric. The simlarge input from the PARSEC benchmark suite has only four frames. We therefore used the first sixty frames from the PARSEC native input. These sixty frames contain the four frames from the simlarge input.

• ferret. This search application performs content-based similarity search on an image database. It returns a list of images present in the database whose content is similar to an input image. The quality of service metric is based on the intersection of the sets returned by the original and perforated versions. Specifically, the quality of service metric is 1 minus the number of images in the intersection divided by the number of images returned by the original version (a minimal sketch of this metric appears immediately after this list).
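The ferret metric just described can be read as the following sketch (the function name and the use of image identifiers as strings are our assumptions):

#include <cstddef>
#include <set>
#include <string>

// Quality of service metric for ferret: 1 minus the fraction of the images returned
// by the original version that the perforated version also returns.
double ferretQos(const std::set<std::string>& originalResults,
                 const std::set<std::string>& perforatedResults) {
  std::size_t common = 0;
  for (const std::string& image : originalResults)
    common += perforatedResults.count(image);
  return 1.0 - static_cast<double>(common) / originalResults.size();
}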

In addition to these benchmarks, the PARSEC benchmark suite contains the following benchmarks: facesim, dedup, fluidanimate, freqmine, and vips. We do not include freqmine and vips because these benchmarks do not successfully compile with the LLVM compiler. We do not include dedup and fluidanimate because these applications produce complex binary output files. Because we were unable to decipher the meaning of these files given the time available to us for this purpose, we were unable to develop meaningful quality of service metrics. We do not include facesim because it does not produce any output at all (except timing information).

4.1 Single Loop Profiling Results
Table 1 presents the results of the single loop profiling runs. The rows of the table are grouped by application. Each row of the table presents the results of the profiling run for a single perforated loop. The rows are sorted according to the number of instructions executed in the loop in the original unperforated application.¹ Our profiling runs perforated all loops that account for at least 1% of the executed instructions. For space reasons, we present at most the top eight loops for each application in Table 1.

First Column (Function): The first column contains the name of the function that contains the loop, optionally augmented with additional information to identify the specific loop within the function. Many functions contain a single loop nest with an outer and inner nested loop; the loops in such functions are distinguished with the outer and inner labels. Other functions contain multiple loops; the loops in such functions are distinguished by presenting the line number of the loop in the file containing the function.

Second Column (Instruction %): The second column presents the percentage of the (dynamically executed) instructions in the loop for the profiling run. Note that because of both interprocedural and intraprocedural loop nesting, instructions may be counted in the execution of multiple loops. The percentages may therefore sum to over 100%.

Third Column (Quality of Service): The third column presents the quality of service metric for the run. Recall that a quality of service metric of zero indicates no quality of service loss. A quality of service metric of 0.5 typically indicates a 50% difference in the output of the perforated application as compared with the output of the original application. A dash (-) indicates that the perforated execution either crashed or produced a clearly unacceptable output.

Fourth Column (Speedup): The fourth and final column presents the speedup of the perforated application — i.e., the execution time of the original application divided by the execution time of the perforated application. Numbers greater than 1 indicate that the perforated application runs faster than the original; numbers less than 1 indicate that the perforated application runs slower than the original.

Interpreting the Results: Good candidate subcomputations for manual optimization contain loops with the following three properties: 1) the application spends a significant amount of time in the loop, 2) perforating the loop significantly increases the performance, and 3) perforating the loop causes small quality of service losses. The developer can easily identify such loops by scanning the data in Table 1. For example (and as discussed further in Section 5), the pixel_satd_wxh loops in x264 are promising candidates for optimization because they have all three properties. The success of the loop perforation optimization provides evidence that the computation is amenable to optimization and that other optimization attempts may also succeed.

¹ Our technique is designed to work with any execution time profiler. Our profiling runs use an execution time profiler that counts the number of times each basic block executes. This execution time profiler uses an LLVM compiler pass to augment the program with the instrumentation required to count basic block executions.

Perhaps more importantly, the developer can easily reject otherwise promising candidates that fail to satisfy one of the properties. For example, the pixel_sub_wxh2018, outer loop in x264 would be a reasonable candidate for optimization except for the large quality of service loss that the application suffers when the loop is perforated. In comparison with a standard profiler, which only provides information about where the application spends its execution time, the additional information present in a quality of service profile can help the developer focus on promising optimization opportunities with demonstrated potential while placing a lower priority on opportunities with more potential obstacles.

4.2 Multiple Loop Profiling Results
Table 2 presents the profiling results for the multiple loop profiling runs. The quality of service requirement for each application is set to 0.10 (which corresponds to a 10% difference between the outputs from the original and perforated applications). The rows are grouped by application, with each row presenting the profiling results from augmenting the existing set of perforated loops with the next loop. The first column (function) identifies the loop added to the set of perforated loops from the preceding rows. The next two columns repeat the profiling results for that loop from the single loop profiling runs as presented in Table 1. The final two columns present the quality of service metric and speedup from the corresponding cumulative profiling run. This run perforates all loops from all rows up to and including the current row.

Interpreting the Results: The multiple loop profiling results identify a group of loops that have demonstrated potential when optimized together. The lack of negative interactions between the perforated loops in the group indicates that other optimization attempts that target all of the corresponding computations as a group may also succeed.

4.3 Individual Application Results
Profiling Results for x264: The single loop profiling results for x264 (Table 1) indicate that the top two loops in the table (the outer and inner loops from pixel_satd_wxh) are promising optimization candidates — these loops account for a significant percentage of the executed instructions and perforation delivers a significant performance improvement with more than acceptable quality of service loss. Guided by these profiling results, we were able to develop an optimized implementation of the corresponding functionality with even more performance and less quality of service loss than the perforated version (see Section 5).

The profiling results also identify subcomputations that are poor optimization candidates. Perforating the loops in pixel_sub_wxh2018, for example, produces a large quality of service loss for relatively little performance improvement. Perforating the loop in x264_mb_analyse_inter_p8x8 (as well as several other loops) causes x264 to crash.

The cumulative profiling results in Table 2 highlight these distinctions. The final perforated collection of loops contains all of the promising optimization candidates from Table 1 and none of the poor optimization candidates. The cumulative profiling results also include loops that were not in the top 8 loops from Table 1. Because perforating these loops adds relatively little performance but also relatively little quality of service loss, they made it into the cumulative set of perforated loops.


x264
Function                                      Instruction %   Quality of Service   Speedup
pixel satd wxh, outer                         41.70%          0.0365               1.459
pixel satd wxh, inner                         40.90%          0.0466               1.452
refine subpel                                 40.90%          0.0005               1.098
pixel sub wxh2018, outer                      20.50%          –                    –
x264 mb analyse inter p8x8                    20.10%          –                    –
pixel sub wxh2018, inner                      16.20%          0.2787               1.210
pixel avg, outer                              12.90%          0.0158               1.396
x264 mb analyse inter p8x16, outer            12.30%          –                    –

streamcluster
Function                                      Instruction %   Quality of Service   Speedup
pFL, inner                                    99.10%          0.0065               1.137
pgain                                         95.30%          0.2989               0.865
dist                                          92.00%          0.5865               1.249

swaptions (with bias correction)
Function                                      Instruction %   Quality of Service   Speedup
worker                                        100.00%         1.0000               1.976
HJM Swaption Blocking, outer                  100.00%         0.0285               1.979
HJM SimPath Forward Blocking, L.74            45.80%          1.5329               1.146
HJM SimPath Forward Blocking, L.75            45.80%          1.5365               1.167
HJM SimPath Forward Blocking, L.77            43.70%          1.5365               0.982
HJM SimPath Forward Blocking, L.80            31.00%          0.3317               1.011
HJM SimPath Forward Blocking, L.41            15.10%          0.0490               1.008
HJM SimPath Forward Blocking, L.42            15.00%          1.9680               1.005

canneal
Function                                      Instruction %   Quality of Service   Speedup
annealer thread::run                          68.40%          0.0014               0.992
netlist elem::swap cost, L.80                 26.20%          0.0048               0.981
netlist elem::swap cost, L.89                 26.20%          –                    –
netlist::netlist                              25.30%          0.0000               0.989
MTRand::reload, L.307                         2.79%           0.0414               1.079
netlist elem::routing cost given loc, L.56    1.84%           0.2000               1.002
netlist elem::routing cost given loc, L.62    1.79%           –                    –
MTRand::reload, L.309                         1.42%           0.0254               1.079

blackscholes
Function                                      Instruction %   Quality of Service   Speedup
main, outer                                   98.70%          0.0000               1.960
main, inner                                   98.70%          0.4810               1.885

bodytrack
Function                                      Instruction %   Quality of Service   Speedup
mainPthreads                                  99.40%          1.00000              1.992
ParticleFilter::Update                        74.70%          0.00050              1.542
ImageMeasurements::ImageErrorInside, outer    36.00%          0.00038              1.201
ImageMeasurements::ImageErrorInside, inner    35.90%          0.00028              1.165
ImageMeasurements::InsideError, outer         28.10%          0.00036              1.121
ImageMeasurements::ImageErrorEdge, outer      27.30%          0.00062              1.112
ImageMeasurements::ImageErrorEdge, inner      27.30%          0.00146              1.033
ImageMeasurements::InsideError, inner         24.00%          0.00023              1.134

ferret
Function                                      Instruction %   Quality of Service   Speedup
raw query, L.168                              62.30%          –                    –
emd, L.134                                    37.60%          0.0020               1.020
LSH query, L.569                              27.70%          0.4270               1.60
LSH query bootstrap, L.243                    27.10%          –                    –
LSH query bootstrap, L.247                    26.80%          0.2523               1.162
emd, L.418                                    19.10%          0.8391               6.663
emd, L.419                                    18.70%          –                    –
LSH query bootstrap, L.254                    15.30%          –                    –

Table 1: Single loop profiling results. For Quality of Service, smaller is better. For Speedup, larger is better.


x264                                          Individual                     Cumulative
Function                                      Quality of Service   Speedup   Quality of Service   Speedup
pixel satd wxh, outer                         0.0365               1.459     0.0365               1.459
pixel satd wxh, inner                         0.0466               1.452     0.0967               1.672
refine subpel                                 0.0005               1.098     0.0983               1.789
pixel sad 8x8, outer                          0.0001               1.067     0.0996               1.929
pixel sad 8x8, inner                          0.0003               1.060     0.0994               1.986
x264 me search ref                            0.0021               1.014     0.0951               2.058

streamcluster                                 Individual                     Cumulative
Function                                      Quality of Service   Speedup   Quality of Service   Speedup
pFL, inner                                    0.0065               1.137     0.0065               1.137

swaptions (with bias correction)              Individual                     Cumulative
Function                                      Quality of Service   Speedup   Quality of Service   Speedup
HJM Swaption Blocking, outer                  0.0285               1.979     0.0285               1.979
HJM Swaption Blocking, middle                 0.0111               1.015     0.0245               2.044
HJM Swaption Blocking, inner                  0.0131               1.009     0.0255               2.068
HJM SimPath Forward Blocking, L.41            0.0490               1.008     0.0799               2.117

canneal                                       Individual                     Cumulative
Function                                      Quality of Service   Speedup   Quality of Service   Speedup
MTRand::reload, L.307                         0.0254               1.079     0.0254               1.079

blackscholes                                  Individual                     Cumulative
Function                                      Quality of Service   Speedup   Quality of Service   Speedup
main, outer                                   0.0000               1.960     0.0000               1.960

bodytrack                                     Individual                     Cumulative
Function                                      Quality of Service   Speedup   Quality of Service   Speedup
ParticleFilter::Update                        0.00050              1.542     0.00050              1.542
ImageMeasurements::ImageErrorInside, outer    0.00038              1.201     0.00032              1.809
ImageMeasurements::ImageErrorInside, inner    0.00028              1.165     0.00040              1.960
TrackingModel::GetObservation                 0.00371              1.200     0.00309              2.282
ImageMeasurements::InsideError, inner         0.00023              1.134     0.00573              2.376
ImageMeasurements::InsideError, outer         0.00036              1.121     0.00684              2.420
ImageMeasurements::ImageErrorEdge, inner      0.00146              1.033     0.00684              2.627
ImageMeasurements::EdgeError, L.71            0.00099              1.049     0.00500              2.742
ImageMeasurements::EdgeError, L.64            0.00056              1.042     0.00564              2.827
ImageMeasurements::ImageErrorEdge, outer      0.00062              1.112     0.01260              2.984

ferret                                        Individual                     Cumulative
Function                                      Quality of Service   Speedup   Quality of Service   Speedup
emd, L.134                                    0.0020               1.020     0.0020               1.020
LSH query bootstrap, L.257                    0.0066               1.021     0.0066               1.021

Table 2: Multiple loop profiling results using 0.1 quality of service bound. For Quality of Service, smaller is better. For Speedup, larger is better.

An analysis of x264 shows that the vast majority of successfully perforated loops perform computations that are part of motion estimation [12]. Motion estimation performs a heuristic search for similar regions of different frames. The profiling results indicate that perforation somewhat degrades the quality of this heuristic search, but by no means disables it. Indeed, eliminating motion estimation entirely makes the encoder run more than six times faster than the original version but, unfortunately, increases the size of the encoded video file by more than a factor of three. The cumulatively perforated version, on the other hand, runs more than a factor of two faster than the original program, but increases the size of the encoded video file by less than 18% (which indicates that motion estimation is still working well even after perforation). This result (which is directly reflected in the profiling tables) shows that there is a significant amount of redundancy in the motion estimation computation, which makes this computation a promising target for optimizations that trade small quality of service losses in return for substantial performance increases.


Profiling Results for streamcluster: Perforating the loop in pgain actually decreases the performance. Upon examination, the reason for this performance decrease becomes clear — this loop is embedded in a larger computation that executes until it satisfies its own internally calculated result quality metric. Perforating this loop causes the computation to take longer to converge, which decreases the performance.

Perforating the inner loop in pFL, on the other hand, does produce a performance improvement. This loop controls the number of centers considered when generating a new clustering. Perforation causes the computation to relax the constraint on the actual number of centers. In practice this modification increases performance but in this case does not harm quality of service — there is enough redundancy in the default set of potential new centers that discarding other new centers does not harm the overall quality of service.

The loop in dist computes the Euclidean distance between two points. Perforating this loop causes the computation to compute the distance in a projected space with half the dimensions of the original space, which in turn causes a significant drop in the quality of service.

Profiling Results for swaptions: The single loop profiling results for swaptions indicate that there is only one promising optimization candidate: the outer loop in the function HJM_Swaption_Blocking. Further examination reveals that perforating this loop reduces the number of Monte-Carlo trials. The result is a significant performance increase in combination with a reduction in the swaption prices proportional to the percentage of dropped trials. The bias correction mechanism (see Section 3) corrects this reduction and significantly decreases the drop in quality of service.

Perforating the other loops either crashes the application or produces an unacceptable quality of service loss. The cumulative profiling results reflect this fact by including HJM_Swaption_Blocking from the single loop profiling runs plus several other loops that give a minor performance increase while preserving acceptable quality of service.

Profiling Results for canneal: The profiling results for canneal identify no promising optimization candidates — perforating the loops typically provides little quality of service loss, but also little or no performance gain.

Profiling Results for blackscholes: The profiling results for blackscholes indicate that there are only two loops of interest. Perforating the first loop (the outer loop in main) produces a significant speedup with no quality of service loss whatsoever. Further investigation reveals that this loop was apparently added to artificially increase the computational load to make the benchmark run longer. While this loop is therefore not interesting in a production context, these profiling results show that our technique is able to identify completely redundant computation. Perforating the other loop (in main, inner) produces unacceptable quality of service loss coupled with significant performance improvement.

Profiling Results for bodytrack: The profiling results for bodytrack indicate that almost all of the loops are promising optimization candidates. Only one of the loops (the top loop, mainPthreads) has substantial quality of service loss when perforated. The others all have very small quality of service losses. And indeed, the cumulative profiling results show that it is possible to perforate all of these loops (and more, the cumulative profiling results in Table 2 present only the first eight perforated loops) while keeping the quality of service losses within more than acceptable bounds.

We attribute this almost uniformly good quality of service even after perforation to two sources of redundancy in the computation. First, bodytrack uses a Monte Carlo approach to sample points in the captured images. It subsequently processes these points to recognize important image features such as illumination gradients. Sampling fewer points may reduce the accuracy of the subsequent image processing computation, but will not cause the computation to fail or otherwise dramatically alter its behavior. Second, bodytrack does not simply perform one sampling phase. It instead performs a sequence of sampling phases, with the results of one sampling phase used to drive the selection of points during the next sampling phase. Once again, performing fewer sampling phases may reduce the accuracy of the overall computation, but will not cause the computation to fail or dramatically alter its behavior.

Optimizations that target such sources of redundancy can often deliver significant performance gains without corresponding reductions in the quality of service. This application shows how quality of service profiling can help the developer identify such sources of redundancy. See Section 5 for a discussion of how we used quality of service profiling to develop a manually optimized version of this application.

Profiling Results for ferret: For each query image, ferret performs two processing phases. In the first phase ferret divides the image into segments. In the second phase ferret uses the image segments to find similar images in its database. Our profiling results show that the most time consuming loops are found in the database query phase. For most of these loops, perforation produces unacceptable output. For two loops the output distortion is acceptable, but perforation does not significantly improve the performance.

5. CASE STUDIES
We next present two case studies that illustrate how we used the profiling information to guide application optimization while preserving acceptable quality of service.

x264: The top two perforated loops in the x264 profile (see Tables 1 and 2) both occur in the pixel_satd_wxh() function. The profiling results indicate that perforating these loops delivers a significant performance improvement with acceptable quality of service loss, which makes the pixel_satd_wxh() subcomputation a promising candidate for optimization.

A manual examination of the pixel_satd_wxh() function indicates that it implements part of the temporal redundancy computation in x264, which finds similar regions of different frames for motion estimation [12]. The function pixel_satd_wxh() takes the difference of two regions of pixels, performs several Hadamard transforms on 4 × 4 subregions, then computes the sum of the absolute values of the transform coefficients (a Hadamard transform is a frequency transform which has properties similar to the Discrete Cosine Transform).

One obvious alternative to the Hadamard-based approach is to eliminate the Hadamard transforms and simply return the sum of absolute differences between the pixel regions without computing the transform. Replacing the original pixel_satd_wxh() function with this simpler implementation delivers a speedup of 1.47 with a quality of service loss of 0.8%. This loss is due to a 1.4% increase in the size of the encoded video and a 0.1 dB loss in PSNR. As Table 2 indicates, the automatically perforated version of pixel_satd_wxh() delivers a speedup of 1.67 with a quality of service loss of 9.67% — a larger performance improvement than the manually optimized version, but also a larger quality of service loss.
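As a sketch of the kind of replacement involved (not x264's actual code; the function name sad_wxh, the byte-pixel layout, and the stride parameters are our assumptions), a sum-of-absolute-differences variant of the region comparison looks like:

#include <cstdlib>

// Sum of absolute differences over a w x h block of pixels, comparing the two
// regions directly with no Hadamard transform (the first manual optimization).
int sad_wxh(const unsigned char* a, int strideA,
            const unsigned char* b, int strideB, int w, int h) {
  int sum = 0;
  for (int y = 0; y < h; y++)
    for (int x = 0; x < w; x++)
      sum += std::abs(a[y * strideA + x] - b[y * strideB + x]);
  return sum;
}

The subsampling optimization described next keeps this structure and simply steps x by 2, discarding every other value in each row.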

Our next optimization subsamples the sum of absolute differences computation by discarding every other value in each row of the subregion. This optimization produces a speedup of 1.68 and a quality of service loss of 0.4%. This loss results from a 0.3 dB loss in PSNR with the size of the encoded video remaining the same. Note that this PSNR loss remains well below the accepted 0.5 dB perceptibility threshold. These results show that, guided by the profiling information, we were able to identify and develop an optimized alternative to an important x264 subcomputation while preserving acceptable quality of service.

bodytrack: Bodytrack processes video streams from four coordinated cameras to identify and track major body components (torso, head, arms, and legs) of a subject moving through the field of view (see Figure 1). It uses a particle-based Monte-Carlo technique — it randomly samples a collection of pixels (each sampled pixel corresponds to a particle), then processes the samples to identify visual features (such as sharp illumination gradients) and map the visual features to body components. The application itself performs an annealing computation with several phases, or layers, of samples. Each layer uses the image processing results from the previous layer to guide the selection of the particles during its sampling process.

The top loop in the individual loop profile table is the mainPthreads loop. But the results show that perforating this loop causes an unacceptable quality of service loss. We therefore focus our optimization efforts on the next loop in the profile (ParticleFilter::Update), which can be perforated with significant performance improvement and very little quality of service loss. This loop performs the core of bodytrack’s computation in which it randomly samples and processes particles (using the annealed particle filter algorithm) to perform edge and foreground silhouette detection. The loop itself iterates through the annealing layers performing directed particle filtering at each layer.

Guided by the profiling results, we developed an optimization that reduces the number of particles sampled at each layer. Specifically, the optimization reduces the number of sampled particles from 4000 to 2000. This optimization produces a speedup of 1.86 with a quality of service loss of 0.01% — a larger performance increase than the automatically perforated version of this loop with a smaller quality of service loss. Figure 2 presents the output of this optimized version. This figure shows that this optimized version of bodytrack successfully identifies all of the major body components (with the exception of one forearm). Each figure presents four frames, one from each of the cameras.

6. RELATED WORK
Performance Profiling. Profiling a system to understand where it spends its time is an essential component of modern software engineering. Standard profilers simply identify the amount of time spent in each subcomputation [1, 3, 5, 13, 19, 26]. A quality of service profiler, in contrast, adds the extra dimension of providing developers with information about the quality of service implications of changing the implementation of specific subcomputations. This additional information can enhance developer productivity by enabling the developer to focus on promising subcomputations that loop perforation has already shown can be optimized with acceptable quality of service losses while avoiding less promising subcomputations for which one optimization attempt has already failed.

Automatic Generation and Management of Performance versus Quality of Service Trade Offs. We have also used loop perforation to automatically enhance applications with the ability to execute at a variety of different points in the underlying performance versus quality of service trade-off space [14]. We have demonstrated how an application can use this capability to automatically increase its performance or (in combination with an appropriate monitoring and control system) dynamically vary its perforation policy to adapt to clock frequency changes, core failures, load fluctuations, and other disruptive events in the computing substrate [14].

Rinard has developed techniques for automatically deriving empirical probabilistic quality of service and timing models that characterize the trade-off space generated by discarding tasks [21, 22]. Loop perforation operates on applications written in standard languages without the need for the developer to identify task boundaries.

An alternate approach to managing performance/quality of service trade offs enables developers to provide alternate implementations for different pieces of functionality, with the system choosing implementations that are appropriate for a given operating context [6, 7]. Loop perforation, in contrast, automatically identifies appropriately optimizable subcomputations.

Unsound Program Transformations. We note that loop perforation is an instance of an emerging class of unsound program transformations. In contrast to traditional sound transformations (which operate under the restrictive constraint of preserving the semantics of the original program), unsound transformations have the freedom to change the behavior of the program in principled ways. Unsound transformations have been shown to enable applications to productively survive memory errors [23, 8, 24], code injection attacks [23, 20, 25, 24], data structure corruption errors [10, 11], memory leaks [18], and infinite loops [18]. They have also been shown to be effective for automatically improving application performance [21, 22, 14] and parallelizing sequential programs [17]. The success of loop perforation in enabling quality of service profiling provides even more evidence for the value of this class of program transformations.

7. CONCLUSION
To effectively optimize computations with complex performance/quality of service trade offs, developers need tools that can help them locate promising optimization opportunities. Our quality of service profiler uses loop perforation to identify promising subcomputations. Developers can then leverage the generated profiling tables to focus their optimization efforts on optimization targets that have already demonstrated the potential for significant performance improvements with acceptably small quality of service losses.

Experimental results from our set of benchmark applications show that our quality of service profiler can effectively separate promising optimization opportunities from opportunities with less promise. And our case studies provide concrete examples that illustrate how developers can use the profiling information to guide successful efforts to develop new, alternate, and effectively optimized implementations of important subcomputations.


Figure 1: bodytrack reference output.
Figure 2: bodytrack output after manual optimization.

8. ACKNOWLEDGEMENTS
This research was supported in part by the National Science Foundation under Grant Nos. CNS-0509415, CCF-0811397 and IIS-0835652, DARPA under Grant No. FA8750-06-2-0189 and Massachusetts Institute of Technology. We would like to thank Derek Rayside for his input on earlier drafts of this work.

9. REFERENCES
[1] prof. Digital Unix man page.

[2] UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/.

[3] VTune Performance Analyser, Intel Corp.

[4] E. Amigo, J. Gonzalo, and J. Artiles. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval Journal, Springer Netherlands, July 2008.

[5] J. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, R. Sites, M. Vandevoorde, C. Waldspurger, and W. Weihl. Continuous profiling: where have all the cycles gone? ACM Transactions on Computer Systems, 15(4):357–390, 1997.

[6] J. Ansel, C. Chan, Y. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: a language and compiler for algorithmic choice. In PLDI ’09.

[7] W. Baek and T. Chilimbi. Green: A system for supporting energy-conscious programming using principled approximation. Technical Report TR-2009-089, Microsoft Research, Aug. 2009.

[8] E. Berger and B. Zorn. DieHard: probabilistic memory safety for unsafe languages. In PLDI, June 2006.

[9] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. In PACT ’08.

[10] B. Demsky, M. Ernst, P. Guo, S. McCamant, J. Perkins, and M. Rinard. Inference and enforcement of data structure consistency specifications. In ISSTA ’06.

[11] B. Demsky and M. Rinard. Data structure repair using goal-directed reasoning. In ICSE ’05.

[12] B. Furht, J. Greenberg, and R. Westwater. Motion Estimation Algorithms for Video Compression. Kluwer Academic Publishers, Norwell, MA, USA, 1996.

[13] S. Graham, P. Kessler, and M. McKusick. Gprof: A call graph execution profiler. In SCC ’82.

[14] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Rinard. Using Code Perforation to Improve Performance, Reduce Energy Consumption, and Respond to Failures. Technical Report MIT-CSAIL-TR-2009-042, MIT, Sept. 2009.

[15] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO ’04.

[16] D. Le Gall. MPEG: A video compression standard for multimedia applications. CACM, Apr. 1991.

[17] S. Misailovic, D. Kim, and M. Rinard. Automatic Parallelization with Automatic Accuracy Bounds. Technical Report MIT-CSAIL-TR-2010-007, 2010.

[18] H. Nguyen and M. Rinard. Detecting and eliminating memory leaks using cyclic memory allocation. In ISMM ’07.

[19] G. Pennington and R. Watson. jProf – a JVMPI based profiler, 2000.

[20] J. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W. Wong, Y. Zibin, M. Ernst, and M. Rinard. Automatically patching errors in deployed software. In SOSP ’09.

[21] M. Rinard. Probabilistic accuracy bounds for fault-tolerant computations that discard tasks. In ICS ’06.

[22] M. Rinard. Using early phase termination to eliminate load imbalances at barrier synchronization points. In OOPSLA ’07.

[23] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee, Jr. Enhancing Server Availability and Security Through Failure-Oblivious Computing. In OSDI ’04.

[24] S. Sidiroglou, O. Laadan, C. Perez, N. Viennot, J. Nieh, and A. D. Keromytis. Assure: automatic software self-healing using rescue points. In ASPLOS ’09.

[25] S. Sidiroglou, M. E. Locasto, S. W. Boyd, and A. D. Keromytis. Building a reactive immune system for software services. In USENIX ’05.

[26] D. Viswanathan and S. Liang. Java virtual machine profiler interface. IBM Systems Journal, 39(1), 2000.