Performance Measurement Research: A High-Level Overview ...brecht/courses/854... · The importance...

PerformanceMeasurementResearch:A High-Level Overview

Draft – Version0.9�

Tim [email protected]

May, 2001

Abstract

The importanceof goodtools to evaluateandunderstandthe performanceof variousas-pectsof a computingenvironmentis far too oftenunderestimated.Thethoughtof going intoa hospitalbecauseyou arefeeling ill andhaving a surgeonmove you into anoperatingroomto replaceyour, heart,liver, or kidney without first conductingextensive andconclusive testsseemsabsurd.

Andyetananalogoussituationtakesplaceonaregularbasisin thecomputerandcommuni-cationsindustry. Whenapplications,systemsor servicesaren’t performingup to expectations,it is far toooftenthecasethatsufficientdiagnosticsarenotperformedor thatthey lack theso-phisticationrequiredto accuratelypinpoint thesourceof theproblem.Oftenapplicationsandsubsystemsaremodifiedor replaced,orextramemory, disks,bandwidthandevenmachinesareaddedto theenvironment– all in thehopethesechangeswill solvetheproblem(andoftentheydo). Unfortunately, theapproachusedto make thedecisionregardingwhatactionto take toofrequentlylackssolidevidencethatpointsdirectly to thebottleneck.Additionally it is seldomclearfrom the informationprovidedwhatactionis appropriate.Beingableto quickly, easilyandaccuratelylocateandalleviateperformanceproblemsin complicatedsystemsis becomingincreasinglycritical andincreasinglydifficult.

In thisdocumentweprovideahigh-level overview of someof theresearchbeingconductedrelatedto this problem.

�A final versionreflectingfeedbackfrom thetalk anddiscussionswill besubmitted.

1

Contents

1 Intr oduction 4

2 PossibleCategorizations 42.1 An AlternateCategorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Dir ect Observation Using Instrumentation 63.1 SourceCodeInsertion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.1.1 Pros: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73.1.2 Cons: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.1.3 Scopeof Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.1.4 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 8

3.2 Compiler-basedCodeInsertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.2.1 Pros: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2.2 Cons: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2.3 Scopeof Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2.4 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 10

3.3 Linker-basedCodeInsertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3.1 Pros: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.3.2 Cons: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.3.3 Scopeof Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.3.4 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 13

3.4 Rewriting Binariesto InsertCode . . . . . . . . . . . . . . . . . . . . . . . . . . 133.4.1 Pros: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.4.2 Cons: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.4.3 Scopeof Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.4.4 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 16

4 Dir ect Observation Without Instrumentation 174.1 InstructionSetSimulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.1.1 Pros: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.1.2 Cons: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.1.3 Scopeof Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.1.4 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 18

4.2 HardwareProfiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2.1 HardwarePerformanceCountersandInstrumentation. . . . . . . . . . . . 194.2.2 UsingHardwarePerformanceCountersWithout Instrumentation. . . . . . 204.2.3 Pros: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.2.4 Cons: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.2.5 Scopeof Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.2.6 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 21

4.3 ExecutionTracing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.3.1 Pros: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.3.2 Cons: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2

4.3.3 Scopeof Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.3.4 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 21

5 Indir ect Observation 225.1 BenchmarksandWorkloadGenerators. . . . . . . . . . . . . . . . . . . . . . . . 22

5.1.1 ParallelApplicationBenchmarks . . . . . . . . . . . . . . . . . . . . . . 225.1.2 World-Wide-WebWorkloadGenerators. . . . . . . . . . . . . . . . . . . 225.1.3 Pros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235.1.4 Cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235.1.5 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 23

5.2 ObservationTools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245.2.1 Pros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245.2.2 Cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245.2.3 PossibleAvenuesfor Investigation. . . . . . . . . . . . . . . . . . . . . . 25

6 Summary 256.1 SomePotentialOpportunities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6.1.1 MeasuringandImproving Performancein Complex Environments. . . . . 256.1.2 New andImprovedToolsthatLeverageExistingTechnology . . . . . . . 266.1.3 BinaryRewriting to ReduceLock Contention . . . . . . . . . . . . . . . . 266.1.4 A NoteonTuningNotes– Autotuning. . . . . . . . . . . . . . . . . . . . 26

7 Future Work 27

3

1 Intr oduction

While implementing,testing,anddebuggingapplications,systemsandservices1 will continuetobeamajorconcernin mostsoftwaredevelopmentprojects,agrowing concernis theirperformance.This is especiallytruein environmentswheresignificantmonetarygainscanbeobtainedby offer-ing a competitive advantageor throughcostsavingsdueto a reductionin therequiredresources.As a result,significantinterestandresearchin performancehasbeenfocussedon areaswith largemonetaryincentives (e.g., computerarchitecturedesign,databasesand e-commerceservers) orareaswith highperformancedemands(e.g.,parallelcomputingandembeddedsystems).

In orderto demonstrablyimprove performanceonemustbeableto: identify theenvironmentandworkloadin which performanceis anissue,determineappropriateperformancemetrics,reli-ably measureperformanceusinga representative environmentandworkload,modify the systemunderstudy, re-measureperformance,andto draw statisticallysignificantconclusionsregardingthedifferencesin performance.

In this document,we provide a high-level overview of someof the researchbeingconductedrelatedto measuringthe performanceof applications,systemsandservices. We looselydividethecurrentwork into categoriesandthenprovide a slightly moredetailedexampleof oneor twopiecesof work in eachcategory. Thegoalat this time is not to provideadetailedsurvey of work inthis area,but insteadto provide a high-level roadmapin orderto furtherexploreareasof mutualinterestandto identify possibleavenuesfor moredetailedexploration.

Sincethe ultimate goal of this project is to investigateanddevelop techniquesand tools tomorequickly andeasilypin-point bottlenecksand to help to determinemethodsfor alleviatingthosebottlenecks,we focusonwork relatedto measurementratherthanmodellingor simulation.

Weassumethatthereaderis reasonablyfamiliarwith issuesrelatedto performanceevaluation(for backgroundinformationsee[33] [24]). As a resultwe forego theusualdiscussionrelatedtotheprosandconsof analyticmodels,simulationandempiricalmeasurements.Wealsotry to avoidreplicatingsomeof themorecommoninformationthatis availableelsewhere.For example,wedonot discussinformationregardingstandardUNIX tools,techniquesandtoolsthataredescribedinbookslike [17] [14] [27], or webpagesrelatedto thesubject[57] [42] [22].2

2 PossibleCategorizations

Onedifficulty in summarizinga significantbody of work is trying to make senseof andprovideinsightsinto the relationshipsandsimilaritiesbetweenelementsof that work. An approachthatis often helpful in this regard is to attemptto categorize the work and to placeeachindividualcontribution into anappropriatecategory.

While we attemptonesuchcategorizationhere,our goal is not to deviseanall encompassingcategorizationbut to provide a simpleframework within which to begin describingsomeof theresearchrelatedto performancemeasurement3.

1We will frequentlyjustusethetermsystemsto referto applications,systemsandservices.2I’m in theprocessof building awebpagethatwill providelinks to variouscompanies,researchprojectsandother

sourcesof informationrelatedto thetopic. With permission,I would like to make this informationpublicly available– of courseI would acknowledgeSUN’ssupportfor its creation.

3Feedbackregardingthisandotherpossiblecategorizationswould beappreciated.

4

Our currentcategorizationis derived from the tools andtechniquesusedto measureandan-alyzeperformance.At thefirst level we divide techniquesandtoolsaccordingto whetheror notthey directly or indirectly observe the performanceof the target application,systemor service(sometimesreferredto simplyasthesystem) while operatingundernormalconditions.

An exampleof directobservationmight beto run andmeasuretheexecutiontime of anappli-cationin orderto understandhow it executeson a particularmachine.On theotherhand,indirectobservationmight involve runninga setof benchmarkapplications(but not theactualapplicationof interest)ontheparticularmachineandthenattemptingto draw conclusionsaboutthebehaviourof theapplicationof intereston thatmachinebasedon theobservationsobtainedfrom thebench-mark.Anotherexampleof indirectobservationmight involvetheuseof aprogramlikenetperf[25] to observe thebandwidthof network transfersbetweentwo hostsin anattemptto determineif thenetwork maybeabottleneck.

While directobservationis typically preferable,it maynotalwaysbepossibleor appropriate.We further sub-divide direct observation into categoriesthat requireor do not requireinstru-

mentingthe systembeingobserved. Hereinstrumentationof the systeminvolvesmodifying theexecutionof the systemby modifying its underlyingsourcecode,libraries,run-timesystem,orexecutables.

Beforedelving into eachof thesecategoriesin detail we briefly summarizethembelow andincludesomeexamplesof thetypesof toolsandtechniquesthatwould fall undereachcategory.

Dir ectObservation� InstrumentingTechniques

Instrumentinga systemrequiresthata personor programeitherunderstandsthesystemoris somehow ableto discernits behaviour. For this reason,this approachcanbe viewed asa white-boxapproachbecausesomeunderstandingof what the systemis doing is usedinorderto instrumentits behaviour. Examplesof techniquesthatfall underthis categoryare:

– Manualinsertionof timing andor statisticsgatheringcode.

– Compiler-basedautomaticprofiling (e.g.,gprof).

– Usingshims,wrappersor sharedlibrary interposition.

– Rewriting theexecutable.

� Non-instrumentingTechniques

If instrumentationis notbeingused,thesystemcanbetreatedasablack-box.In thiscasetheoriginal systemis not modifiedandexternalaidsareusedto measurebehaviour. Examplesof someof thetoolsandtechniquesthatwould fall underthis categoryare:

– Tracingexecutionusingasimulator.

– Usinghardwareprofiling (e.g.,performancecountersandregisters).

– Tracingexecution(e.g.,usingstrace, truss or somethingsimilar).

5

Indir ectObservation� Benchmarksto examinetheperformanceof thesystem.

– Splash-2parallelapplicationbenchmarks.

– httperf workloadgeneratorfor measuringwebserverperformance

� Systemmeasurementandobservationtools(e.g.,netperf, ping, vmstat, iostat andnetstat).

2.1 An Alter nateCategorization

Anotherpossibleapproachto categorizationwouldbeto considerthescopewith which thetool ortechniquecanbeapplied.For example,thedivision might reflectwhetherthetool or techniqueisdesignedto measuretheperformanceof:

� Theunderlyinghardware.

� TheOperatingSystem.

� Libraries.

� An application.Thismaybefurtherdividedinto categorieslike:

– Single-threadeduserapplications.

– Multi-threadeduserapplications.

– Paralleland/ordistributedapplications.

– Specializedserverapplications(e.g.,DNS,databases,httpd,etc.)

� Aspectsof anetwork.

� A combinationof someof theabove.

� A serviceor combinationof services(e.g.,multi-tieredInternetservice).

Throughoutthis documentwe’ll attemptto placeeachtool andtechniqueinto oneof themaincategoriesoutlinedabove (directobservationwith instrumentation,directobservationwithout in-strumentation,or indirectobservation)andalsotry to outlinethescopeof applicability.

3 Dir ectObservation Using Instrumentation

The mostpopulartechniquefor observingthe performanceof an application,systemor serviceis to instrumentthat system,run that system,directly observe andmeasureits behaviour andtoanalyzeits performance.Arguablythemostactiveareaof recentresearchappearsto bein theareaof instrumentionof systemsfor directobservation. In additionto naturalimprovementsmadeto

6

existing techniques,recentdevelopmentsin rewriting programexecutablesandbinarytranslationshave leadto someinterestingbreakthroughsin this area.

Figure1 shows possiblepointsat which instrumentationcodecanbeinsertedinto anapplica-tion (denotedby thegrayareas).Wenext briefly describeanddiscussthetechniquesusedto insertinstrumentationcodeat eachof thesepoints.

Preprocessor

Source Code

Source Code

Compiler

Object Files

Linker

Executable

ExecutionPlatform Specific

Language Specific

Library

Figure1: Pointsof possibleinstrumentation(diagramfrom [53])

3.1 SourceCodeInsertion

This approachinvolvesmodifying thesourcecodeof a systemto insertcodeto track informationof interest.Most of thework in this area[29] [52] [13] involvesdevelopingmacrosandlibrariesthattheprogrammercancall. Thesemacrosandlibrariesimplementcommonportionsof codethatwouldoftenbeusedwheninstrumentingaprogramto measureits performance.Examplesof suchsupportincludestartingandendingtimings, print timing information,andtrackingandprintingstatisticsor visualizingcollecteddata.

3.1.1 Pros:� As portableastheinstrumentedsourcecode.

7

� Canobtaindetailedinformationon targetedinformationof interest.

� Canbe usefulwherecompiler-basedprofiling is weak(e.g.,threaded/parallelapplicationsandfor includingthingslikeoverheadsdueto C++ templatesandvirtual functions).

� Quiteagoodscopeof applicability.

3.1.2 Cons:� Requiressourcecode.

� Timeconsuming.

� Errorprone.

� Requiresfairly detailedunderstandingof thesystem.

� May belesseffective in distributedand/orclient-serverapplications.

3.1.3 Scopeof Applicability

Becausewhat is beingmeasuredis directly controlledby theprogrammer, this approachhasper-hapsthewidestscopeof applicability. If usedwith hardwareperformancecountersit canbeusedto monitorhardwareperformance.It canalsobeusedto measuretheperformanceof anapplica-tion, operatingsystemandnetwork, andpossiblycombinationsof theabove. With thesupportofanexternalmethodfor synchronizingclocks(e.g.,NTP or GPSreceivers),it canevenbeusedtomeasureseveralaspectsof client/serveror distributedsystems.

3.1.4 PossibleAvenuesfor Investigation� It might be possibleto develop techniquesto automatethe processof codeinsertionforcertaintypesof measurements.For example,it wouldbetrivial to instrumentspecifiedfunc-tionsby insertingtiming codeat functionentryandexit points. More interestinghowever,might be to develop a systemor languagethat permitsone to specify codefragmentsorpatterns(e.g.,usingregularexpression)thatareto instrumented.

� It also seemsthat bettersupportcould be provided to assistprogrammerswhen they areaddingmeasurementcode.

3.2 Compiler-basedCodeInsertion

The notion of insertingextra instructionsinto the instructionstreamin orderto help to measureandunderstandtheperformanceof applicationshasbeenaroundfor quitesometime. A tool likegprof usesa combinationof compilerinsertedinstructionsandinterruptsto collectdataregardingtheexecutionof anapplication.Themainideais to periodicallyinterrupttheexecutingapplicationand,while in the interrupthandler, readthe programcounterandincrementan appropriatedata

8

structure.If theinterruptsarerelatively frequentandtheapplicationexecutesfor sufficiently long,theresultis afairly accuratepictureof wheretheapplicationis spendingmostof its timeexecuting.

Thecompileris usedto insertcodeto storeandlatergatherinformationaboutprocedurecalls.Thiscodeis usedto produceacall graphandto countthenumberof timeseachfunctionis called.Similarly, basic-blockcountingis implementedby insertinginstructionsto incrementanelementof a zero-initializedstaticarray, prior to enteringeachnew block (i.e., for eachif statement).Eachelementof thearrayis usedto countthenumberof timeseachbasicblock is executed.Afterexecutionthedatais writtenandlatercorrelatedwith theprogram’ssourcecode.

The power of this approachlies in the ability to directly pinpoint wherethe applicationisspendingmostof its time executingby correlatingthe collectedexecutiondatawith the sourcecode. Therefore,if onewantsto improve the application’s performanceoneknows wherefocusattention– onthosepartsof theprogramthataccountfor thegreatestportionof theexecutiontime.

A modificationto this approachwasrequiredfor parallelor threadedapplications.Quartz[4]alsointerruptsthe executionof the applicationandsamplesthe programcounter, but during theinterruptit alsochecksto seehow many of theotherCPUsarebusy. This informationis usedtocomputethenormalizedprocessortime:��

� �� ! (1)

Whenreportingnormalizedprocessortime,eachtermis summedseparatelyandthetimespentin eachstateis alsoreported(e.g.,busy, spinning,blocked,ready).

Figure2 showsanexampleof thetypeof outputthatmightbeproducedusingatool likeQuartz(theexampleis a slightly modifiedversionof theoneusedin theoriginal paper[4]). For eachofthe possiblenumbersof processorsthat could be usedit shows a breakdown of the normalizedprocessortime spenton eachof themain routines.Hereonecanseethata significantportionofthe time is spentexecutingthe“cone” routinesequentially. As a result,it providesguidancethatindicatesthat if onecouldparallelizethis routine,theexecutiontime of theapplicationcouldbesignificantlyreducedandbetterspeedupscouldbeobtained.

TheQuartzpaperalsodescribesfurtherimprovementsmadeto theexampleprogramandotherexperiencesusingtheir tool to improveparallelprograms.

3.2.1 Pros:� Easyto use(little or no barrierto entry).

� Languageindependent.

� No understandingof systemrequired.

� Focusesattentionin theright placethroughstrongtieswith thesourcecode.

3.2.2 Cons:� Requiressourcecodeandrecompiling.

9

Figure2: Normalizedprocessortime (diagrammodifiedfrom [4])

� Mostly helpsto figure out whereto work on performanceand not much help in how toimproveperformance.

� Fairly limited scopeof applicability(compiledapplication).


Thescopeof applicability is fairly limited here.It is mainly helpful in trying to determinewhereaprogramis spendingmuchof its timeexecutingandnotnecessarilywhy. It is generallynot veryhelpful in understandingtheperformanceof anythingexceptanapplication.

3.2.4 PossibleAvenuesfor Investigation� Supportfor complex systemsandservices.In applicationsthatconsistof multipleexecutingprograms(possiblyexecutingon differenthosts),it might bepossibleto profile eachof theprogramsandto somehow combineandcorrelatethecollecteddata.Seethework donebyDigital/Compaqon ContinuousProfiling [3] for an exampleof a similar approachwith alargerscopeof applicability.

10

3.3 Link er-basedCodeInsertion

Onetechniquethat hasbeendeployed for sometime but until recentlyhasn’t beendiscussedindetail is the useof a methodfor sharedlibrary interpositionsuchasa shimor wrapper.4 Recentwork hasdescribedthis techniquein detail anddemonstratedhow it canbe deployed [41] [28].While theauthorsof this work makeno claim thattheideais original (asthey’veheardnumerousanecdotesof thisorsimilartechniquesbeingusedby others),thisis thefirst knownworkdescribingtheapproachin detailanddemonstratinghow it canbeusedto measurea realsystem.

A shim is a thin pieceof materialplacedbetweentwo objects. In software, it is a smallfunctionthatessentiallyreplacestherealfunctionandis usedto collectperformancedataregardingcalls to the real function. The shim is responsiblefor implementingthe sameinterface,passingall parametersthroughto andcalling therealfunction,collectingany performancedataof interest,andreturningresultsto thecalling (if thereareany).

This techniquecanbeappliedwhenevertheinterfaceto thefunctionbeinginstrumentedis wellknown. Figure3 shows that theshimcanbeplacedbetweenthecallerandthecalleebecausetheinterfaceis known.

funcB funcAfuncA funcB

shim

Figure3: Usingshimsto insertinstrumentationcode(diagramfrom [41])

In anenvironmentthatsupportssharedlibraries,eachshimimplementstheinterfaceto therealfunctionusingthatfunction’sname.In thiswaytheshimwill becalledinsteadof therealfunction.Thekey is beingableto call therealfunctionfrom within theshim.

This is accomplished,for example,by compiling theshimsinto a dynamicallylinked librarynamedlibshimdo.so andthenrenamingthat library to thenameusedby theoriginal library(e.g.,libdo.so) butstoringtheshimlibrary in adifferentdirectory(e.g.,/usr/lib/shims/libdo.so).Next, thesearchorderfor dynamicallylinkedlibrariesis changedsothattheshimlibrary is foundandusedprior to theoriginal library (e.g.,/usr/lib/libdo.so). This ordercanbechangedon many systemsby changingtheenvironmentvariableLD LIBRARY PATH. Thentheprogrambeinginstrumentedis relinkedusinganoptionthatinstructsthelinker/loadertocall aninitializationfunction,init shims() whenthedynamiclibrariesareloaded. In someenvironmentsthis isdoneby passingtheoption-init init shims to thelinker. Whentheprogramis run andthedynamiclibrariesareloadedit callsthefunctioninit shims, which loadstheactuallibrary and

4This approachis reportedlyusedin theSUN ThreadEventAnalyzer[62].

11

renamesthe functionsto thenamesusedby thecalling shim function. Examplesof a function’sprototype,shim,andinitialization codeareshown in Figure4.

int do(int param); // prototype

int do(int param) // instrument the real function{

int tmp;do_call_count++;start_time(do);tmp = real_do(param);stop_and_record_time(do, do_call_count);return(tmp);

}

void init_shims() // load and link the real function{

dynlib = dlopen("/usr/lib/libdo.so");real_do = dlsym(dynlib, "do");

}

Figure4: Exampleprototype,shimandinitialization code

A similar approachis commonlyknown andusedat the sourcecodelevel, but the the keyadvantageof usinga shim is that it canbe usedwithout the sourcecodefor the programbeinginstrumentedandit canbeeasilyinsertedandremoved.

3.3.1 Pros:� Doesn’t requiresourcecode.

� Cancollectdataof interest.


� Doesnot requireadetailedunderstandingof thesystem.

3.3.2 Cons:� Limited opportunitiesfor instrumentation(canreally only instrumentat interfaces).

� Potentiallytimeconsumingandsomewhaterrorprone(needto createtheshims).

12

� Fairly limited scopeof applicability(informationcapturedmainlyappliesto theapplication,librariesand/ortheoperatingsystem).


If usedwith hardware performancecountersthis techniquecould be usedto monitor hardwareperformance.It canalsobe usedto measurethe performanceof an application,librariesand/oraspectsan operatingsystemandpossiblycombinationsof theabove. Thepapersdescribingthistechnique[41] [28] useit to measurethe performanceof a distributedcomputingenvironment(DCE)andarequitesuccessfulat identifyingwheretimeis beingspentin adistributedapplication.

3.3.4 PossibleAvenuesfor Investigation� The useof shimsseemsto offer considerablebenefitswhile being relatively easyto use.Thedifficulty lies mainly in creatingtheshimsandin determiningwhat to measure.Theremight beconsiderablevaluein producinga setof shimsthatcouldbewidely used.A usercouldthenlink with theversionsof thelibrariescontainingtheshimsandcontroloptionstospecifywhich shimfunctionswouldcollectdataandwhatdatawouldbecollected.

� It mightalsobeinterestingto examinetechniquesfor automaticallygeneratingasetof shims.That is, givena setof functionprototypesmight onebeableto automaticallygeneratea setof usefulshims.

3.4 Rewriting Binaries to Insert Code

Recentlymuchresearch[19] [7] [54] [31] [32] [5] [46] hasbeenconductedinto techniquesandtools for rewriting programexecutablesafter they have beencompiledand linked so that theycontainadditionalcodein order to instrumentsomeaspectof the applicationor systembeingstudied.Interestingly, someof this researchhasinvolvedmodifying executablesat run-time[45][20] [37] [59] [12] in orderto dynamicallyinsertanddeleteinstrumentationcode.Previousworkhaddynamicallymodifiedcodefor debuggingpurposes[26] [10].

Typically eachof thestudiesdescribedabovealsoincludesadescriptionof how they’veappliedtheir tool or techniquein orderto measureandimprovetheperformanceof anexistingsystem.

Furthermore,the ability to instrumentexecutableshasalsobeenusedto developothertools,someof which arenot strictly designedfor measuringperformance.For examplea tool hasbeendevelopedfor: detectingdataracesin multi-threadedprograms[49], instrumentingthreadedappli-cations[63], andimplementingfine-graineddistributedsharedmemorysystems[64] [51] [50].

One of the earlier known tools that utilized binary rewriting techniquesis pixie [15] [38].Pixie relieson thecompilerto instrumentanexecutableto collectperformancerelateddataduringexecution.After executionit usesthatdatato rewrite thebinaryto improveperformancebasedonthepreviousexecutionor executionsof thatprogram.5

5SeveralpapersclassifyPixieasabinaryrewriting tool but Pixiemayactuallyrequireoneto recompiletheprogramandthecompilermayutilize thecollectedperformancedatato generatemoreefficient code.

13

Thecollecteddatais usedto generatemoreefficient code.For example,theprogrammaybemademoreefficient by having betterinformationregardingbranchpredictionsandthenumberoftimesloopsexecute(for loopunrolling).

Theprocedureusedto rewrite binaryfiles in orderto instrumentprogrambehaviour (includingperformance)is asfollows:

1. Determinethelocationwhereinstrumentationcodeis to beinserted.

2. Patch the executableto include the new instrumentationcodebut ensurethat all existinginstructionsareexecutedproperly.

For example,a typical tool might instrumentall loadsandstoresin orderto obtainmemoryreferencetraces,conductcachestudies,or implementa fine-graineddistributedsharedmemorysystem.Figure5 shows anexampleof how aninstruction(e.g.,a loador store)might bereplacedwith codethat calls or trampolines to instrumentationcodeand transfersexecutionback to theoriginal instructions.

instr x instr xinstr x+1 call <code patch>instr x+2 instr x+2

<code patch>// inserted measurement codeinstr x+1 // or equivalent// branch back to next instruction

Figure5: Exampleof rewriting executablecode

The modificationof codecanchangethe addressesof existing instructions.Therefore,theseaddressesmayneedto bemodifiedin thenew executablebeingproduced.Theapproachtypicallyusedroughlythreephases(basedon thedescriptionin [5]):

1. Analysis

� Breakprograminto functionsandbasic-blocks.� Analyzecontroltransfers(jumps,branches,calls).� Generateaflow controlgraphfor theprogram.

2. Instrumentation

� Insert,change,removecode.

14

3. Regeneration� Computeaddresstranslations.� Generatea translationtableif required.� Regeneratethebasic-blocks,patchingcontroltransfersat thattime.� Updatesymboltableandrelocationinformationif required.

Fairly earlyin theprocessof developingbinaryrewriting techniquespeoplerealizedthesizableamountof effort that would be involved in developinga new tool that employs binary rewritingtechniques.As a result,muchinitial work concentratedon tools for developingbinary rewritingtools. A typical exampleof onesuchtool is EEL [32]. EEL providesa setof APIs anda C++library to easethe processof binary rewriting. EEL handlesloadingthe executableandproduc-ing a control-flow graph;it alsoprovidesaccessto anditeratorsfor applyinginstrumentationtocommonlyinstrumentedportionsof aprogram.For example,onecaniterateacrossevery routine,basicblock andedgein thecontrol-flow graphanddeleteinstructionsor addcodeusingsnippetswhichencapsulatemachine-specificinstructionsandprovidecontext dependentregisterallocation.Additionally, EEL handlesthemodificationof calls,branchesandjumpsto ensurethatthecontrolflow is correctin theeditedprogram.

Providedoneis ableto easilydescribeandlocatepointsin theexecutableat which codeis tobe insertedor modified,thetechniqueof rewriting executablescanbeutilized to implementverypowerful tools.Althoughit is not strictly designedasa tool for performancemeasurement,Eraser[49] is a goodexampleof onesuchtool. Eraseris designedto detectandreporton dataracesinmultithreadedprograms.This is doneby rewriting binariesto instrumentevery shared-memoryreferenceandby verifying that locks areusedconsistently. Unprotectedaccessesto shareddataarereportedaswell asinconsistentuseof locks. Thegeneralideabehindhow this workscanbeseenby examiningtheir first locksetalgorithmwhich is usedto trackthesetof locksusedby eachthread[49]. The algorithmis subsequentlyrefinedlater in the paperbut this versionprovidesagoodunderstanding.

Let "$# �&%'� �(� "$)+* ��, bethesetof locksheldby thread�

For eachsharedvariable- , initialize thecandidatelocks ./*0- , to thesetof all locks.On eachaccessto - by thread

�,

set ./*0- ,21 ./*0- ,43 "$# �&%'� �(� "$)�* ��,if ./*0- ,21 576

, thenissuea warning

Oneof the reasonsfor describingthis approachin detail is that it might be possibleto useasimilar approachin orderto measurelock performanceandpossiblyevento provide suggestionsregardinglock splitting. This is discussedin slightly moredetailin thesectiononpossibleavenuesfor investigation(Section3.4.4).6

A more flexible approachto performancemeasurementscan be achieved by modifying theprogrambinarydynamicallyduring its execution[20]. Oneof thekey advantagesof dynamicin-strumentationis that it permitstheprogramto executeun-instrumenteduntil it reachesthepoint

6Onewould have to seeif this is a betterapproachthanusingdynamicinstrumentationfor threadedprogramsasdescribedin [63].

15

of interest.This not only avoidstheoverheadsdueto instrumentationprior to thepoint of interestbut alsocansignificantlyreducetheamountof datacollected(becauseinstrumentingcodecanbeaddedanddeletedwhenneeded).Thisapproachhasbeenusedto build asystemfor measuringtheperformanceof parallelapplicationsandsystemscalledParadyn[37]. In Paradyn,instrumenta-tion is controlledby thePerformanceConsultantmodule,which is usedto automaticallydirecttheplacementof instrumentationcode. Sinceit hasinformationregardingperformancebottlenecksandprogramstructureit canassociatebottleneckswith specificpartsof a program.Additionally,overheaddueto instrumentationcanbe limited by settinga usercontrolledthresholdandperfor-mancedatacanbevisualizedusingasuppliedvisualizationlibrary.

Thiswork ondynamicinstrumentationhascontinuedby extendingthetechniquesothatit canalsobeusedwith threadedapplications[63] andoperatingsystems[59]. Additionally, work hasbeendoneto make thecodeavailableandto publishanAPI for runtimecodepatching[12].

3.4.1 Pros:� Doesn’t requiresourcecode.


� In somecasesdirectcorrelationbetweenperformancedataandsourcecode(e.g.,[37]).

� Fairly goodopportunitiesfor instrumentation(canreally instrumentat many places/levels).

3.4.2 Cons:� It wouldseemthatagreaternumberof usefultoolswouldhavebeengenerated.

� Fairly largebarrierto entryunlessthereexistsa tool thatmeasureswhatyouwant.

� Potentiallytimeconsuming(if learningandusingonetool to generateanew tool).

� Fairly limited scopeof applicability(informationcapturedmainlyappliesto theapplication,librariesand/ortheoperatingsystem).


If usedwith hardwareperformancecountersthis techniquecouldbeusedto monitorhardwareper-formanceashasbeendonein [2]. It canalsobeusedto measuretheperformanceof anapplication,librariesand/oraspectsanoperatingsystem,andpossiblycombinationsof theabove.

3.4.4 PossibleAvenuesfor Investigation� Utilize the technologyto createusefultools. Might it bepossibleto createsomethinglikeEraserbut useit for measuringoverheaddueto locksandlock contentionandfor providinginsightsinto how to split locks. For example,might onebeableto monitor theshareddatathatis protectedby thesamelock but notaccessedconcurrentlyanddeterminethatthesinglelock might besplit.

16

� Moreandbetterdemonstrationsof utility of thetoolsandtechniques.Typically papersshowa relatively trivial applicationof the toolsor techniqueto improve theexecutionof a fairlysimpleprogram.

� Look at methodsfor improving the scopeof applicability. For example,could oneinstru-mentseveralprogramssothatapplicationsconsistingof multipleexecutingprogramsand/orprocesses(possiblycommunicating)couldbemeasured.This might beplausiblegiventhework thathasbeendonewith Shasta[50].

4 Dir ectObservation Without Instrumentation

This category of tools andtechniquesinvolvesexecutingthe systemof interestwithout makingany modifications. In this case,performanceis observed usingmethodsthat areexternal to theexecutingsystem.Commontechniquesincludeinstructionsetsimulation,hardwareprofiling, andexecutiontracing.

4.1 Instruction SetSimulation

Simulationhasbeendeployed as a tool for performanceanalysisfor a long time. One of theproblemsis that it typically involvestradingoff speedof simulationfor accuracy in what is be-ing simulated.Advancesin instructionsetsimulation[8] [16] [36] [30] [39] [35], combinedwithincreasingprocessorspeedshave madeit feasibleto provide highly accurateinstructionsetsimu-lators.Usingsuchsimulatorsoneis ableto take unmodifiedbinariesthathavebeencompiledandlinkedfor thetargetarchitectureandexecutethemona simulator.

Someof the importantwork in this areahasbeenthe developmentof techniquesfor greatlyimproving the speedof the simulationwhile maintainingaccuracy. Shade[16] drastically im-provedsimulationspeedsby cross-compilinginstructionsfor thetargetmachineinto instructionsfor thehostmachinethat is runningthesimulator. By cachinghost-codeandby integratingtraceinformationinto thehost-codethey areableto amortizethecostof cross-compilationandenablehigh-speedsimulationanddatacollection.

An alternateapproachtaken by SimGen[30] andsubsequentwork on SimICS[39] usesanintermediaterepresentationfor instructionswhich is designedto beeasyto simulatein software.SimGentakesadefinitionof thetargetinstructionsetandgeneratesahighly-efficient intermediaterepresentation,instructiondecoder, disassembler, encoder, andrequiredserviceroutines.Thebe-lief is thatthisapproachis ableto generatesimulatorsthatexecutemorequickly thanthosethatcanbegeneratedmanually. Oneway thatspeedis improvedis by generatingversionsof thesimula-tor thatcollectinformationabouttheexecutionof thesimulatorandthenfeedingthat informationaboutthesimulator’s behaviour backinto thegeneratorto producea fasterversion.Thesepaperstypically includesomemeasurementof theslowdown of anapplicationexecutingonthesimulator.Thisis how muchslowertheprogramexecutesonthesimulatorwhencomparedwith theexecutionon theactualtargetmachine.In thecaseof thebenchmarksusedin [39] slowdownsaretypicallyin therangeof 30-50timesslower.

Originally simulatorswererestrictedto executingonly individual programs.However, recentadvances[47] [35] have enabledthesimulationof hardwaredevicesandnow enablethefull scale

17

simulationof a completeoperatingsystem.A standardoperatingsystemis bootedandexecutedinsidethesimulator. This providesuserswith new insightsconcerningboth theexecutionof theoperatingsystemandthebehaviour of complex systemsundermorerealisticandmixedworkloads.

4.1.1 Pros:� Runexecutablesonsimulator.


� Doesnot requirea detailedunderstandingof the systemin order to useit (but might tointerprettheresults).

� Quitegoodscopeof applicability.

� Simulatorcanbeasdetailedasdesired.Cancollectany or all dataof interest.

4.1.2 Cons:� Slowdownsaretoo largeto make this approachpracticalfor someapplications.

� Performanceinformationprovidedis typically gearedmoretowardshardwaredesignersandcompilerwritersbecausethesehave usuallybeenthepeoplewho have developedandusedsuchsimulators. Informationthat is typical concernscachemisses,instructionstalls,andinformationregardingbranchpredictions.


Becausethe simulatorcanbe modified,onecancollectany or all informationthat is of interest.Thisenablesdatato becollectedregardingtheunderlyinghardware,theapplication,libraries,andtheoperatingsystem.

4.1.4 PossibleAvenuesfor Investigation� While simulatorscanprovide significantinsightsit is not clearthatthey hasbeenusedverywidely. It would be interestingto studyandrectify this situation. For example,it is quitelikely thattheinformationprovidedby simulatorsis at a level thatis too low for theaverageuser.

� Onemight beableto simulatemorecomplex environmentsby connectingseveralmachinesimulators(possiblyevenexecutingon differenthosts).This might beused,for example,tomoreeffectively studyclient/serverandor distributedsystems.

� It mightalsobepossibleto combinetheuseof machinesimulatorswith anetwork simulator,like ns [60] [6] to provide a morecompleteenvironmentanda very detailedpictureof theperformanceof fairly complex systems.

18

4.2 Hardware Profiling

As modernprocessordesignhasbecomemorecomplex andasworkloadsexecutingon systemsbuilt usingtheseprocessorhave increasedin diversity, thesetof tradeoffs madewhendesigningprocessorshasbecomeoverwhelming.Onetechniquebeingusedby designersin orderto betterunderstandboththeefficacy of their designsandtheworkloadsbeingexecutedon themhasbeento includemethodsfor obtaininginformationrelatedto performancedirectly from theprocessor.

This is doneby includingoneor morecontrolandcountingregistersdirectlyon theprocessor.By manipulatingthecontrolregister(s),theprogrammeris ableto controlwhatinformationis to becollectedin theperformancecountingregisters.While executing,theprocessorupdatescountersasguidedby thecontrolregister(s).An interruptcanbegeneratedwhenacounterregisteris aboutto overflow in order for the valueto be recordedin software. As a result,an advantageof thistechniqueis thatit canpotentiallyoffer veryaccurateinformationwith extremelylow overhead.Afew examplesof the typesof datathatcanbecollectedare: instructioncounts,instructionstalls,andcachereads,writesandmisses.

Hardwareperformancecounterscanbeusedin oneof two ways.Firstaprogramcanbeinstru-mentedto make directcalls to obtainthedatafrom thecounters(this would fit undertheclassifi-cationof instrumentedprograms).Thiswouldbeclassifiedasadirectobservationtechniqueusinginstrumentation.Theotherway they canbeusedis to utilize theperformancecountersat a differ-entlevel, thusnotrequiringmodificationsto theprogrambeingmeasured.Thiswouldbeclassifiedasa directobservationapproachwithout instrumentationandhasbeenusedin Digital/Compaq’sContinuousProfiling Infrastructure(DCPI) [3].

4.2.1 HardwarePerformanceCountersand Instrumentation

Althoughout of placewith respectto theothertechniquesin this category we now briefly discussissuesrelatedto instrumentingprogramsto usehardwareperformancecounters.This is doneinorderto keepthediscussionof thetwo differentusesof hardwareperformancecounterstogether.

Oneof the problemsfacingpeoplewho want to usetheseperformancecountersis that eachprocessorprovidesdifferentcountersanddifferentmethodsfor controllingtheiruse.Theproblemis exacerbatedby thefactthateachoperatingsystemprovidesadifferentmechanismfor accessingthesecounters.In orderto provideacleanandportableinterface,somework hasbegunto providea standardlibrary for accessingperformancecounterson commonplatforms.PCL (PerformanceCounterLibrary) [9] providessuchsupportby designingandimplementinga library thathasbeenportedto severalcombinationsof processorsandoperatingsystems.

A competingproject,PAPI (PerformanceAPI) [11] is attemptingto definea standardthatcanbe usedto programandaccessperformancecounterson differentplatforms. They have definedtwo interfaces.Thefirst is ahigh-level interfacefor acquiringsimplemeasurementsandthesecondis a fully programmable,threadsafe,low-level interfacefor moresophisticatedusers.PAPI pro-videssomeimportantfeaturesthatarenot supportedby PCL.For examplePCL doesnot providemethodsfor handlingcounteroverflow or for softwaremultiplexing.

As hardwaredesignersarenotyet readyto commitmuchrealestateto whatis oftenperceivedasfrivolousfunctionalitysuchasperformancecountingregisters,thenumberof countersis typi-cally severelylimited andasa resultthedataoneis ableto gatherat onetime is severely limited.

19

As a resultDCPI [3] andPAPI [11] supportthemultiplexing of hardwareperformancecountersinorderto provideamorecompletepictureof aprogram’sexecution.

It is interestingto note that somework [2] hascombinedthe useof hardware performancecounterswith programinstrumentationtechniques.

4.2.2 UsingHardwarePerformanceCountersWithout Instrumentation

TheDigital/CompaqContinuousProfilingInfrastructure(DCPI)[3] utilizeshardwareperformancecountersto implementa sampling-basedprofiling systemdesignedto run continuouslyon pro-ductionsystemswith relatively low overhead(1-3%slowdown for mostworkloads).Thesystemconsistsof two parts:adatacollectionsystemthatreliesonperiodicinterruptsgeneratedby perfor-mancecountersavailableonAlphaprocessorsto sampleprogramcounters(periodicallyrecordingthemto disk) andtools for analyzingthe storeddata. Collecteddatais usedto producetypicalinformationof interest,suchasthe time spentper image,procedure,sourceline andinstruction.In addition,a morecomplex analysistool is ableto determinetheaveragenumberof cyclesspentexecutingeachinstruction.

Thepower of this techniqueof continuousprofiling is thatbecauseof its low overheadit canbe usedon productionsystemsto examinecommercialworkloads.Much of the paperdescribesthetechniquesusedin orderto supportthehighsamplingrate.

4.2.3 Pros:� One of the advantagesof performancecountersis that they can often be usedto obtaininformation at a much higher resolutionthan other techniques.As an example,one canobtainacountof theactualnumberof instructionsusedto executea function.� The techniqueof continuousprofiling doesn’t requiresourcecodeandit is languageinde-pendent.� Continuousprofiling hasa very low barrier to entry. Essentiallyall programsare beingmeasuredso it’ s a matterof using the analysistools to determineinformation regardingperformance.

4.2.4 Cons:� As is the casefor simulators,performanceinformationprovided is typically gearedmoretowardshardwaredesignersandcompilerwritersbecausethesehaveusuallybeenthepeoplewho have developedandusedhardwareperformancecounters.Informationthat is typicalconcernscachemissesandinstructionstalls.� Correlatingthe informationobtainedwith the sourcecodeis typically not aseasyasonewould like.


This techniquecanalsobe usedto measurethe performanceof an the underlyingprocessor, anapplication,librariesand/oraspectsof anoperatingsystemandpossiblycombinationsof theabove.

20

4.2.6 PossibleAvenuesfor Investigation� While mostmicroprocessorscontainhardwareprofile countersit’ s not clear that they arebeingutilized,exceptperhapsby themostperformancehungryusersof parallelcomputers.Onewould think thatbettertoolscouldbedevelopedthatwould make it easierto useper-formancecountersandthatwould provide a moreclearlink betweentheperformancedataandwhereandhow thesystemwouldbemodifiedto improveperformance.� Onemightbeableto useinformationfrom separatesystemsrunningcontinuousprofiling toobtainbetterandmoredetailedinformationabouttheperformanceof morecomplex systems.For example,client/serverand/ordistributedsystems.

4.3 ExecutionTracing

Thereexist sometools that permit oneto tracean existing programwithout modificationto thatprogram.An exampleis thoseprogramsdesignedto tracesystemcalls,which include:strace[34], truss [58], andtusc [21].

We’re not awareof researchrelatedto suchtools but wantedto include them becausetheycanbe very powerful for understandinganddebuggingsystemsthat aren’t working. We’re alsonot surehow often they areusedto measureperformancebut perhapsthey couldbe extendedorimprovedto make themmoreeffectiveor morewidely applicable.

4.3.1 Pros:� Doesn’t requiresourcecodeandis languageindependent.� Veryeasyto use.

4.3.2 Cons:� Only provide informationrelatedto systemcallsandsignals.� Fairly limited scopeof applicability.


Theexisting toolsreporton all systemcallsandsignalsandwith theproperoptionswill reportonthetime thatelapsesbetweeneachsystemcall.

4.3.4 PossibleAvenuesfor Investigation� Might it bepossibleto modify strace to includethetime spentin eachsystemcall andtoprovideasummaryreportat theendof theexecution.� Might it be possibleto producea similar tool that can be usedfor a broaderscopethansystemcall. For example,mightonebeableto leverageshimtechnologyto build a tool thatinsteadof tracingandreportingonsystemcallsreportson library calls.Canthisbeextendedto includeall functioncalls?

21

5 Indir ectObservation

Thereis a subtlebut importantdifferencebetweenour category of “Direct Observation WithoutInstrumentation”and “Indirect Observation”. The key differenceis whetherthe actualsystemthatonewould like to improve is executedwhile measurementsarebeingmade.If thesystemofinterestis runningwhile datais beingcollected,we view this asdirect observation. If it is notrunningwhile datais beingcollectedor only a portionof thesystemis running,we considerthisto beindirectobservation.

The techniqueof indirect observation is often usedwhenthe systemof interestcan’t be ex-ecutedin order to measureperformanceor the metric of interestcan’t be obtainedduring thesystem’sexecution.

As an example,considera web server that is not providing the level of desiredthroughput.Onemight suspectthat the 1 Gbpsnetwork connectionis a bottleneckandobservesthat duringexecutiona goodputof 560 Mbps is obtained. Theremay not be muchthat canbe donein thecontext of therunningwebserver to testthis hypothesis.An indirectapproachmight beto run anexternalprogram(for example,netperf [25] usingappropriatesizesfor theaveragerequestandresponseandto observe thegoodputobtainedover thesuspectedbottlenecklink. If theobservedgoodputwith netperf is near560 Mbps thusmay indicatethat thenetwork is a potentialbot-tleneck.However, if thegoodputobtainedusingnetperf is significantlyhigherthan560Mbpsthatwould indicatethatthenetwork bandwidthis likely not thebottleneck.

5.1 Benchmarksand Workload Generators

Therearealargenumberandvarietyof methodsfor performingindirectmeasurements.Two of themainapproachesareto usebenchmarksor workloadgeneratorsof somesort (which mayincludeartificial or tracedrivenworkloads)or to usesomeform of specialpurposeobservation tool. Wenow verybriefly giveexamplesof somework beingconductedin eachcategory.

5.1.1 Parallel Application Benchmarks

An exampleof somebenchmarksthat have beenproducedand studiedin somedepthare theSPLASHparallelapplicationbenchmarks[61]. This setof benchmarkapplicationshasbeenusedfor a variety of purposesincluding: microprocessordesign,multiprocessorinterconnectdesign,compilerdesignandimplementation,andevaluatingandcomparingcachecoherenceschemesinmultiprocessoranddistributedvirtual sharedmemorysystems.Theseapplicationshavebeenusedin multiprocessorenvironmentsin muchthesameway thattheSPECbenchmarkshavebeenusedin uniprocessorenvironments.

5.1.2 World-W ide-WebWorkload Generators

A significanttopic of interestis the performanceof web andproxy servers. As a result,severaltechniqueshave beenusedto measuretheir performance.Onetechniqueis to usedataobtainedfrom tracesgatheredfrom realwebsitesandthento usethetracesto drivequeriesfrom aworkloadgenerator.

22

Oneof theproblemswith theseapproachesis thatit hasn’t beenclearhow to scaletheworkloadfor larger systemsnor how to generatea workloadthat is significantlymoredemandingthantheserveris capableof handling.Oneapproachto offeringsustainedloadandoverloadfor webserversis providedin thehttperf [40] tool. Thekey observationmadein this work is thatonceserversbecomesaturated,they areunableto acceptandprocessnew connectionsfrom clientsandasaresultworkloadgeneratorsaresignificantlyslowedbecauseof delaysandtimeoutscausedby TCP.In orderto maintaina high rateof requests,clientsthataregeneratingloadmustalsotimeoutandterminateunresponsiveconnections.

5.1.3 Pros� If benchmarksare representative and repeatablethey can be usedto gain very powerfulinsights.

� Onecandevelopbenchmarksthatarebasedon predictionsof futureworkloads.

5.1.4 Cons� Thebenchmarksmustberepresentativeof realworkload.

� Systemdesignersandbuilderscanspendtoomuchtimedevelopingsystemsthatrunbench-marksreally well.

� Runningbenchmarksandanalyzingthedatacanconsumeconsiderabletime andeffort. Forexample,benchmarkinga highly-scalablemulti-tierede-servicearchitecturecould requireconsiderablehardware, software and humanresourcesjust to get to the point wherethesystemcouldbetested.

5.1.5 PossibleAvenuesfor Investigation

It wouldseemthatthis is anareawherethereis muchpotentialfor futurework. Somepossibilitiesinclude:

� How doesonedesignworkloadsthatscalein orderto effectively drive largescalesystemsandsupposedlyscalablesystems?

� It would be interestingto investigatetechniquesfor monitoringand reportingweb serverresponsetimesasseenby clients. Onewould likely startby looking at productsthathaveappearedon themarket recentlythatreportedlyprovide this service.

� A difficult problemfor e-serviceproviders is figuring out how to measuretheir system’sperformancein sucha way thatbottlenecksareidentified. It would seemthatthereis muchwork thatcouldbedonein this area.

23

5.2 Observation Tools

Tools designedfor observingthe performanceof a systemcould be usedto conductdirect orindirectobservations.However, becausesomeof thesetoolsareunableto provideonly informationregardingaspecificprogramor subsystemwediscussthemhere,underthecategoryof techniquesandtoolsfor indirectobservation.

Two examplesof recentwork conductedin this areainvolve tools for observingnetwork per-formance.

Thefirst exampleis clink [18] which wasdevelopedasa resultof conductingexperimentsandobservingproblemswith theaccuracy of thebandwidthandlatenciesreportedby pathchar[23]. By usingbetterandadaptive samplingtechniquesclink is ableto provide moreaccurateresults,oftenin ashortertime.

A secondexampleis Sting [48] which hasbeendevelopedto measurenetwork packet lossrates.Someof theadvantagesof Sting are:� It is ableto measurepacket lossratesin both forwardandbackwarddirections.ping and

traceroute (see[55] [56] for more information)can’t handlelossasymmetryandaresusceptibleto ICMP filtering.� It doessowithout requiringmeasurementinfrastructuresuchassoftwaredaemonsrunningonboththesenderandreceiverof interestasrequiredby existingapproaches[43] [1] [44].

In orderto measurepacket lossoneneedsto know thenumberof packetssentandthenumberof packetsreceived.A one-way lossrateis calculatedas 8:9<;>=@?0ACB�DFE GHB ?$B

� I B�J;K=@?$ALB�DFE E�B0M&D . At thesourceyoucanknow thenumberof packetssentbut not thenumberreceivedandat thedestinationyou know thenumberof packetsreceivedbut not thenumbersent.Thekey questionis thenhow do you obtainsuchmeasurementswithout usingmeasurementinfrastructure.

The big contribution madein this work is to recognizethat onecanuseTCP’s error controlmechanismsto derivetheunknown variable.Stingcleverly usesthesemechanismsandknowledgeof TCP’s behaviour in orderto first sendsomedata(a phasethey call data-seeding)andthentodiscover which packetswerelost (a phasethey call hole-filling). Hole-filling is accomplishedbyfirst sendingapacketwith asequencenumberthatis onelargerthanthelastpacketsentin thedata-seedingphase.If thedestinationrespondswith anacknowledgement,it eitherindicateswherethehole is or that thereis no hole (becauseTCP usesinclusive ACKS andit will ACK the last lastpacketsent).For eachholediscoveredthesourceretransmitsthemissingpacketandtheprocedureis repeateduntil all holeshavebeendiscoveredandfilled.

5.2.1 Pros� Thesetoolsaretypically veryeasyto use.� They canbeuseddirectlywhile thesystemof interestis executingor indirectly.

5.2.2 Cons� Someof thesetoolsprovideonly generalsystemwide informationsoit canoftenbedifficultto determinetheperformanceof anindividualentity of interest.

24

� Sometoolsaredesignedto very specificallymeasureor observe only oneaspectof systembehaviour.

5.2.3 PossibleAvenuesfor Investigation� Of courseoneimaginesthereis alwaysa needfor new andbettertools. Oneapproachherewould be to determinewhat aspectsof performancearen’t coveredwell by existing tools.Anotherapproachmight be to extendexisting tools (e.g., to perhapsdirect more focusedattentionto particularaspectsof thesystem).

For example,wearenotawareof a tool or optionsto a tool thatcanbeusedto show thenet-work bandwidthbeingconsumedby anindividualprocess.It might proveusefulto developa tool thatoperateslike top but thatcanbeusedto show thebandwidthbeingconsumedbyeachprocess(andonwhich interface),aswell asthebandwidthbeingconsumedoneachin-terface,andthetotalamountof bandwidthbeingconsumedin theentiresystem.Becausethismayrequirekernelsupportaninterestingquestionwouldbe,couldthis bedoneefficiently.

� We expect that more work could be donetowardseffectively using information obtainedfrom combinationsof tools to provide betterinsightsinto systemperformanceandto helppinpoint bottlenecks.A startingpoint heremight be to determinethe main weaknessesofSE[17].

6 Summary

Becauseapplications,systemsandservicesarebecomingincreasinglytied with a corporation’sbottomline, thereis a growing concernwith their performance.In many instancesgoodperfor-manceis essential– or moreto thepoint, poorperformancecanbecostly. As a result,muchtimeandeffort is expendedmeasuring,improving andcontinuallymonitoringtheperformanceof keycomputerandcommunicationcomponents.Webelievethattheamountof timeandeffort spentontheseendeavourscanbesignificantlyreducedby usingtoolsandtechniquesthatclearlypinpointproblemareas.With the growing complexity of computerandcommunicationsystemsit is alsobecomingmore importantfor thesetools to not only help determinewhereproblemslie but toprovidehelpin determininghow to alleviatetheseproblems.

6.1 SomePotential Opportunities

Our goal is to identify key areasof needand to investigatetools and techniquesfor to help toquickly, easily and accuratelydiagnoseandsolve performanceproblems. To this end we nowbriefly discussa few of themoreinterestingor importantareaswherewebelieveexisting researchcouldbeimprovedor extendedor wherenew opportunitiesmight exist.

6.1.1 Measuring and Impr oving Performancein Complex Envir onments

Many applicationsandservicesarebeingcreatedby combiningseveralcommunicatingprograms,oftenexecutingon differentsystems.We believe that furtherwork is requiredto betterserve the

25

needsof this type of environment. This may requirenew approachesandnew tools or perhapsa combinationof existing methodscanbe deployed simultaneously. Interestingquestionsherearehow doesone: determinethe right combinationof tools, avoid excessive overhead,andcancollecteddatabecorrelatedin sucha way that it canidentify problemareasandprovide insightsinto alleviating theproblems.

6.1.2 Newand Impr oved Toolsthat LeverageExisting Technology

Someof therecentlydevelopedtechnologiesoffer considerableopportunitiesif utilizedeffectively.Sincesomeof thesetechniqueshave only recentlybeendevelopedandbecausetools developedduringresearcharequiteoftendesignedmainly asa proof of conceptfor theunderlyingmethods,someexisting tools could be significantly improved. Sometools appearto be fairly difficult todeploy, producedatathatwill bedifficult for anaveragepersonto interpret,do not supplyenoughinformationwith regardsto pinpointingthesourceof theproblem,arenotveryhelpfulwhentryingmodify thesystemto alleviateproblems,or all of theabove. Potentialwork in this arewould beto moreeffectively utilize existing technologyto morequickly, easilyanddirectly locateproblemsandimproveperformance.

6.1.3 Binary Rewriting to ReduceLock Contention

This work would examinethe useof binary rewriting technologyin orderto track the time andapplicationspendsholding variouslocks, track shareddataaccessedwhile thoselocks areheldandinvestigatetechniquesfor determiningwherelocks might be split and/ormoved in ordertoenhanceperformance.This would operatein a fashionthat is similar to Eraserbut with a signif-icantly differentgoal. Informationcould alsobe gatheredduring this time that might be usefulin projectingboundson parallelapplicationperformancewhenexecutingin largermultiprocessorenvironments.

Although it is not clear whetherit would be possible,ultimately a tool suchas this wouldrewrite thebinariesto automaticallysplit locksand/ormove locksto reducelock contention.

6.1.4 A Noteon Tuning Notes– Autotuning

Evenbeforetheadventof theWorld-Wide-Web,but especiallysinceits developmentcompaniesandorganizationshaveproduceda numberof shortdocuments(sometimeswhitepapers)explain-ing how to tuneanapplicationor operatingsystemin orderto improveperformance.Goodexam-plesof this typeof documentarethevastnumberof webpagesdevotedto tuningyour particularcombinationof operatingsystemandwebserver in orderto obtainpeakperformance.Thereareoftenseveraldifferentversionsavailablethathave beenproducedby theorganizationwho devel-opedthe operatingsystem,the organizationthat producedthe web server, and individualswhoaretrying to helpothersby providing themwith insightsthey’ve gainedwhile trying to improveperformancein theirenvironment.

Oneof thebiggestcomplaintsI’ve heardandexperiencedmyselfwith thesedocumentsis thatthey fail to explain which of the tensor hundredsof “tweaks” applyunderwhich circumstances,whicharethemostimportant,andtheimpactthey arelikely to haveon performance.

26

Useful work in this areacould involve conductinga systematicstudyof the impactof suchproposedtweaksandto documenttheir impacton performancein order to provide moreusefulinformationto thosewhoarefacedwith theprospectof tuningsystemperformance.

However, a moreinteresting,challengingandpotentiallyrewardingapproachwould be to in-vestigatemethodsfor automaticallymeasuringand tuning systems. The idea would be to runa representative workload(or even the actualworkload),gatherperformancedata,automaticallyanalyzethedata,make thosetweaksdeterminedto mostlikely behelpful, andto repeatthepro-cess.Ideallyperformancewouldconvergesothatpeak-performancewouldbeobtainedaftera fewiterations.

This would first requirefirst ensuringthat mechanismsexist for easily changingimportantperformanceparameters(i.e.,suchanapproachwouldbemuchlesshelpful if eachtestrunrequiresrecompilingthekerneland/orrebootingtheoperatingsystem).

7 Futur eWork

Clearly many performancemeasurementandanalysistools arebeingdevelopedby a numberofcompanies,many of which specializein performancetools. Beforereinventingan existing tool,furtherwork is requiredto survey existingproducts.Thiswill bemoredifficult thansurvey existingresearchsincein somecasesthe only way to understandwhat a tool is really capableof is topurshaseand usethat tool. An additionaldifficult with respectto developing new techniquesis that companiesdo not like to disclosethe methodsusedto obtain information usedby theirtools.Hopefully theincreasein thenumberof availabledemoversionsof toolswill amelioratetheproblemwith testingexpensivesoftware.

References

[1] G. Almes. Metrics and infrastructurefor IP performance. http://www.advanced.org/-surveyor/presentations.html,1997.

[2] GlennAmmons,ThomasBall, andJamesR. Larus.Exploiting hardwareperformancecoun-terswith flow andcontext sensitiveprofiling. ACM SIGPLAN Notices, 32(5):85–96,1997.

[3] J.M. Anderson,L.M. Berc,J.Dean,S. Ghemawat,M.R. Henzinger, S.A. Leung,R.L. Sites,M.T. Vandevoorde,C.A. Waldspurger, andW.E.Weihl. Continuousprofiling: Wherehaveallthecyclesgone?ACM Transactions on Computer Systems, 15(4):357–390,November1997.

[4] T.E.AndersonandE.D.Lazowska.Quartz:A tool for tuningparallelprogramperformance.In The ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems,pages115–125,May 1995.

[5] R.ArpaciandM. Fahndrich.revEELingSolaris.IRAM projectSpring96,ComputerScienceDivision,EECSUniversityof Californiaat Berkeley, May, 1996.

[6] SandeepBajaj, LeeBreslau,DeborahEstrin,Kevin Fall, Sally Floyd, PadmaHaldar, MarkHandley, AhmedHelmy, JohnHeidemann,Polly Huang,SatishKumar, Steven McCanne,

27

RezaRejaie,PuneetSharma,KannanVaradhan,Ya Xu, HaoboYu, andDanielZappala.Im-proving simulationfor network research.TechnicalReport99-702b,Universityof SouthernCalifornia,March1999.revisedSeptember1999,to appearin IEEE Computer.

[7] ThomasBall andJamesR. Larus. Optimally profiling andtracingprograms. In The 19thSymposium on Principles of Programming Languages, pages59–70,January1992.

[8] R.C. Bedichek. Someefficient architecturesimulationtechniques. In Proceedings of theWinter 1990 USENIX Conference, pages53–63,January1990.

[9] R. BerrendorfandH. Zieler. PCL – the performancecounterlibrary: A commoninterfaceto accessperformancecounterson microprocessors.Version1.3,http://www.fz-juelich.de/-zam/PCL.

[10] J.S.Brown. Theapplicationof codeinstrumentationtechnologyin theLosAlamosdebugger.Technicalreport.Los AlamosNationalLaboratory, October, 1992.

[11] S. Browne, J. Dongarra,N. Garner, G. Ho, andP. Mucci. A portableprogramminginter-facefor performanceevaluationon modernprocessors.The International Journal of HighPerformance Computing Applications, 14(3):189–204,2000.

[12] BryanBuck andJeffrey K. Hollingsworth. An API for runtimecodepatching.The Interna-tional Journal of High Performance Computing Applications, 14(4):317–329,Winter 2000.

[13] PeterA. Buhr andRobertDenda.uprofiler: Profiling user-level threadsin a shared-memoryprogrammingenvironment.In The Second International Symposium on Computing in Object-Oriented Parallel Environments (ISCOPE’98), pages159–166,December1998.

[14] H. FrankCervone. Solaris 7 Performance Administration Tools. McGraw Hill, New York,2000.

[15] F. Chow, A.M. Himelstein,E. Killian, andL. Weber. Engineeringa RISCcompilersystem.In IEEE COMPCON, March1986.

[16] Bob Cmelik andDavid Keppel.Shade:A fastinstruction-setsimulatorfor executionprofil-ing. ACM SIGMETRICS Performance Evaluation Review, 22(1):128–137,May 1994.

[17] A. CockcroftandR. Pettit. Sun Performance And Tuning: Java and the Internet. SunMi-crosystemsPress(Prentice-Hall),MountainView, CA, 1998.

[18] Allen B. Downey. Usingpathcharto estimateinternetlink characteristics.In Measurementand Modeling of Computer Systems, pages222–223,1999.

[19] A.J.Goldberg andJ.Hennessy. Performancedebuggingsharedmemorymultiprocessorpro-gramswith MTOOL. In Supercomputing ’91, pages189–200,November1991.

[20] J.K. Hollingsworth,B.P. Miller, andJ.Cargille. Dynamicprograminstrumentationfor scal-ableperformancetools. In Scalable High-Performance Computing Conference, pages841–850,May 1994.

28

[21] HP-UX ManualPages.tusc man page.

[22] IEEE ComputerSociety. ParaScope:A List of Parallel ComputingSites. http://www.-computer.org/parascope/.

[23] V. Jacobson.Pathchar:A tool to infer characteristicsof internetpaths.http://www.caida.org/-tools/utilities/others//pathchar.

[24] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for ExperimentalDesign, Measurement, Simulation, and Modeling,. Wiley-Interscience,New York, NY, USA,1991.

[25] Rick Jones.PublicNetperfHomepage.http://www.netperf.org/netperf/NetperfPage.html.

[26] P.B. Kessler. Fast breakpoints:Designand implementation. In Proceedings of the ACMSIGPLAN ’90 Conference on Programming Language Design and Implementation (PLDI),pages78–84,June1990.

[27] Patrick Killelea. Web Performance Tuning Speeding Up the Web. O’Reilly andAssociates.Inc.,October1998.

[28] D.P. Konkin, G.M. Oster, andR.B. Bunt. Exploiting software interfacesfor performancemeasurement.In Workshop on Software Performance, pages208–218,SantaFe,NM, 1998.

[29] F. Lange,R. Kroger, andM. Gergeleit. JEWEL:Designandimplementationof a distributedmeasurementsystem.IEEE Transactions on Parallel and Distributed Systems, 3(6):657–671,1992.

[30] F. Larsson,P. Magnusson,andB. Werner. SimGen:Developmentof efficient instructionsetsimulators.Technicalreport. SIC-R97/03,SwedishInstituteof ComputerScience,Novem-ber, 1997.

[31] JamesR. LarusandThomasBall. Rewriting executablefiles to measureprogrambehaviour.Software, Practice and Experience, 24(2):197–218,February1994.

[32] JamesR. LarusandEric Schnarr. EEL: machine-independentexecutableediting. ACM SIG-PLAN Notices, 30(6):291–300,1995.

[33] E. D. Lazowska,J. L. Zahorjan,G. S. Graham,andK. C. Sevcik. Quantitative System Per-formance. Prentice-Hall,1984.

[34] Linux ManualPages.strace man page.

[35] P. Magnusson,F. Dahlgren,H. Grahn,M. Karlsson,F. Larsson,F. Lundholm,A. Moestedt,J. Nilsson,P. Stenstrom,andB. Werner. SimICS/sun4m:A virtual workstation. In Usenix1998 Annual Technical Conference, pages119–130,1998.

[36] P. MagnussonandB. Werner. Efficientmemorysimulationin SimICS.In Proceedings of the28th Annual Simulation Symposium, pages62–73,1995.

29

[37] BartonP. Miller, JonCargille, R. BruceIrvin, Tia Newhall, Mark D. Callaghan,Jeffrey K.Hollingsworth, Karen L. Karavanic, and Krishna Kunchithapadam.The Paradynparallelperformancemeasurementtools. IEEE Computer, pages37–46,November1995.

[38] MIPSComputerSystems.UMIPS-V Reference Manual (pixie and pixstats), 1990.

[39] P. MagnussonJ.Montelius.Performancedebuggingandtuningusinganinstructionsetsim-ulator. Technicalreport. SIC-T–97/02-SE,SwedishInstitute of ComputerScience,June,1997.

[40] David MosbergerandTai Jin. httperf:A tool for measuringwebserverperformance.In FirstWorkshop on Internet Server Performance, pages59–67,June1998.

[41] Greg OsterandRick Bunt. UsingShimtechnologyto monitorDCEruntimeperformance.InIBM Center for Advanced Studies Conference (CASCON), 1997.

[42] ParallelToolsConsortium.TheParallelToolsConsortium.http://www.ptools.org/.

[43] V. Paxson.End-to-endroutingbehaviour in theinternet.In Proceedings of SIGCOMM 1996,pages25–38,August1996.

[44] V. Paxson,G. Almes,J. Mahdavi, andM. Mathis. Framework for IP performancemetrics.RFC-2230,May, 1998.

[45] B. Ries, R. Anderson,W. Auld, D. Breazeal,K. Callaghan,E. Richard, and W. Smith.Theparagonperformancemonitoringenvironment.In Supercomputing ’93, pages850–859,November1993.

[46] T. Romer, G. Voelker, D. Lee, A. Wolman,W. Wong, H. Levy, B. Bershad,andB. Chen.Instrumentationandoptimizationof Win32/IntelexecutablesusingEtch. In Proceedings ofthe USENIX Windows NT Workshop, pages1–8,August1997.

[47] MendelRosenblum,StephenA. Herrod,EmmettWitchel,andAnoopGupta.Completecom-putersystemsimulation: The SimOSapproach.IEEE parallel and distributed technology:systems and applications, 3(4):34–43,Winter 1995.

[48] StefanSavage. Sting: A TCP-basednetwork measurementtool. In USENIX Symposium onInternet Technologies and Systems, 1999.

[49] Stefan Savage,MichaelBurrows, Greg Nelson,Patrick Sobalvarro,andThomasAnderson.Eraser:A dynamicdataracedetectorfor multithreadedprograms. ACM Transactions onComputer Systems, 15(4):391–411,1997.

[50] D.J. Scales,K. Gharachorloo,andC.A. Thekkath. Shasta:A low overhead,software-onlyapproachfor supportingfine-grainsharedmemory. In The 7th Symposium on ArchitecturalSupport for Programming Languages and Operating Systems, October1996.

30

[51] I. Schoinas,B. Falsafi,A.R. Lebek,S.K. Reinhardt,J.R.Larus,andD.A. Wood. Fine-grainaccesscontrolfor distributedsharedmemory. In The 6th Symposium on Architectural Supportfor Programming Languages and Operating Systems, pages297–306,October1994.

[52] S. Shende.Portableprofiling andtracingfor parallelscientificapplicationsusingC++. InACM SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT), pages134–145,1998.

[53] SameerShende.Profiling andtracingin Linux. In Proceedings of the Extreme Linux Work-shop, June1999.

[54] AmitabhSrivastava andAlan Eustace.ATOM: A systemfor building customizedprogramanalysistools. ACM SIGPLAN Notices, 29(6):196–205,1994.

[55] W.R.Stevens.AddisonWesley, 1994.

[56] W.R.Stevens.PrenticeHall, 2ndedition,1998.

[57] Sun MicrosystemsInc. Sun PerformanceInformation. http://www.sun.com/sun-on-net/-performance.html.

[58] SUNOSManualPages.truss man page.

[59] Ariel TamchesandBartonP. Miller. Fine-graineddynamicinstrumentationof commodityoperatingsystemkernels.In Operating Systems Design and Implementation, pages117–130,1999.

[60] UCB/LBNL/VINT. Network simulator- ns (version2). http://www-mash.CS.Berkeley.-EDU/ns/.

[61] S. Woo, M. Ohara,E. Torrie, J.P. Singh,andA. Gupta. TheSPLASH-2programs:Charac-terizationandmethodologicalconsiderations.In The 22nd Annual International Symposiumon Computer Architecture, pages24–36,1995.

[62] P.T. Wu andPNarayan.Multithreadedperformanceanalysiswith sunworkshopthreadeventanalyzer(Revision 03). Technicalreport. SunMicrosystemsInc., AuthoringandDevelop-mentTools,SunSoft,April, 1998.

[63] ZhichenXu, BartonP. Miller, andOscarNaim. Dynamicinstrumentationof threadedappli-cations.In Principles Practice of Parallel Programming, pages49–59,1999.

[64] M.J. Zekauskas,W.A. Sawdon,andB.N. Bershad.Softwarewrite detectionfor distributedsharedmemory. In The First Symposium on Operating System Design and Implementation,pages87–100,November1994.

31

Performance Measurement Research: A High-Level Overview ...brecht/courses/854... · The importance...

Documents

Transcript of Performance Measurement Research: A High-Level Overview ...brecht/courses/854... · The importance...