
Aachen Institute for Advanced Study in Computational Engineering Science

Preprint: AICES-2009-13

05/May/2009

Space-Efficient Time-Series Call-Path Profiling of

Parallel Applications

Zoltán Szebenyi, Felix Wolf, Brian J.N. Wylie


Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 is gratefully acknowledged.

©Zoltán Szebenyi, Felix Wolf, Brian J.N. Wylie 2009. All rights reserved

List of AICES technical reports: http://www.aices.rwth-aachen.de/preprints


Space-Efficient Time-Series Call-Path Profiling of Parallel Applications

Zoltán Szebenyi∗#1, Felix Wolf∗#2, Brian J. N. Wylie#3

# Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, D-52425 Jülich, Germany

{1 z.szebenyi, 2 f.wolf, 3 b.wylie}@fz-juelich.de

∗ Aachen Institute for Advanced Study in Computational Engineering Science

RWTH Aachen University, D-52056 Aachen, Germany

Abstract—The performance behavior of parallel simulations often changes considerably as the simulation progresses, with potentially process-dependent variations of temporal patterns. While call-path profiling is an established method of linking a performance problem to the context in which it occurs, call paths reveal only little information about the temporal evolution of performance phenomena. However, generating call-path profiles separately for thousands of iterations may exceed available buffer space, especially when the call tree is large and more than one metric is collected. In this paper, we present a runtime approach for the semantic compression of call-path profiles based on incremental clustering of a series of single-iteration profiles that scales in terms of the number of iterations without sacrificing important performance details. Our approach offers low runtime overhead by using only a condensed version of the profile data when calculating distances and accounts for process-dependent variations by making all clustering decisions locally.

I. INTRODUCTION

As numerical simulations model the temporal evolution of a system, their progress occurs via a series of discrete points in time. According to this iterative nature, the core of such an application is typically a loop that advances the simulated time step by step, the entire loop often preceded by initialization and concluded by finalization procedures. However, the performance behavior may vary between individual iterations, for example, due to periodically re-occurring extra activities [1] or when the state of the computation adjusts to new conditions in so-called adaptive codes [2]. As we have shown in a recent performance study of the SPEC MPI2007 benchmark suite [3], temporal variations are widespread and may appear in highly diverse patterns ranging from gradual changes and sudden transitions of the baseline behavior to both periodically and irregularly occurring extrema. To complicate matters further, the temporal behavior can be subject to significant process-dependent variations, that is, the performance behavior can be a function of both time and space. However, recognizing the relationship between temporal and spatial patterns can be crucial for the understanding of performance problems. For example, the PEPC Coulomb solver was found to suffer from a gradually increasing communication imbalance caused by a small group of processes with time-dependent constituency [4]. Sometimes, behavioral deviations may also be caused by outside influences, such as operating system jitter or application interference, which may limit their reproducibility.

Call-path profiling [5], [6], [7], [8], [9], which aggregates performance metrics across the entire execution broken down by call path, is a widely used method of linking a performance problem to the context in which it occurs. Especially when investigating the use of library functions, call-path information can be critical in tracking performance problems back to their roots in the user code. For example, during the analysis of MPI programs, call-path information is often essential to decide where in the program a communication bottleneck occurs. However, the context expressed in the call path usually reveals little about the temporal evolution of a performance phenomenon. As a result, important insights as to how and why certain behaviors develop may be lost.

On the other hand, writing a separate call-path profile for every iteration, a special use case of a technique commonly referred to as phase profiling [8], may pose scalability problems in terms of the number of iterations that can be captured and analyzed. The most severe restriction arises from the buffer space required to store such a time-series call-path profile measurement, because adding a time dimension multiplies the amount of data by the number of timesteps. If flushing of profile data buffers is to be avoided during measurement, due to its disruptive impact on the behavior to be observed, time-series call-path profiles can exceed available buffer space, especially when the call tree is large and more than one metric is collected. Moreover, even if the perturbation caused by flushing can be tolerated or compensated for, the aggregate size of the profile across a potentially large number of iterations and processes may hinder post-processing and interactive end-user analysis, which often occurs on a front-end node or desktop with moderate processing and memory capacity.

In this paper, we present a runtime approach for the (lossy) semantic compression of time-series call-path profiles. Our approach of incremental on-line clustering of single-iteration profiles enables the collection of multi-metric call-path-level data for thousands of iterations, while consuming only a few times the space of a single-iteration profile to circumvent the need to intermittently flush the data to disk. Exploiting repetition in the time-dependent behavior of the target application, we achieve good compression rates without sacrificing important performance details or introducing major artefacts. Since calculating the similarity between two iterations based on one or more metrics and a potentially large number of call paths may be prohibitively time consuming, we keep the runtime overhead low by calculating the separation distance based on only a condensed version of the profile data. At the same time, this measure also improves the quality of the clustering algorithm by reducing the dimensionality of the distance operator. Moreover, to account for the fact that different processes may exhibit different temporal patterns, the clustering is a purely local operation, resulting in a custom-tailored process-local partitioning of the iteration space not requiring any communication or synchronization at runtime.

The article is structured as follows: We start with a review of related work in Section II. Then, we provide a brief description of our general profiling methodology along with an outline of the key data structures in Section III, before explaining our compression algorithm along with our design choices in Section IV. In Section V, we offer a quantitative evaluation of our approach based on the SPEC MPI2007 benchmark suite in terms of (i) the accuracy of the compressed data, and (ii) the runtime overhead incurred. A detailed large-scale application example with PEPC presented in Section VI demonstrates the value of our solution for the performance analysis of long-running applications with significant time-dependent behavior. Finally, in Section VII, we conclude the paper and discuss future work.

II. RELATED WORK

Our approach builds on the concept of phase-based performance characterization. The execution of a program can be naturally divided into phases, which form the building blocks of its performance behavior. Since phases have different execution characteristics and may react differently to external stimuli such as the change of the execution configuration or of the input problem, it seems reasonable to analyze their performance behavior independently instead of looking at the execution only as a whole. While phases provide a general concept to represent arbitrary logical and runtime aspects of the computation, our approach concentrates only on major timestep loop iterations to allow comparisons between execution intervals that occur on the same logical level.

Phase profiling in the performance system TAU allows the user to obtain a separate profile for every marked program interval, distinguishing between static and dynamic phases [8]. Whereas performance metrics are aggregated across all executions of a static phase, a separate profile object is created for every instance of a dynamic phase. According to this distinction, the time intervals we consider can be classified as instances of a dynamic phase. Similar in spirit, incremental profiling in the OpenMP profiler ompP takes a separate profile for every fixed-size execution time interval, using elapsed wall-clock time instead of the dynamic program structure as the delimiter between phases [10].

The principle of dividing the execution of a program into intervals and grouping them into phases according to their performance characteristics has already been studied by Sherwood et al. in the context of microprocessor hardware simulation [11]. Using clustering algorithms, they identified representative subsections of the instruction stream of a program that can be used as input for simulations that would otherwise be too slow if fed with the complete stream. The results of these shorter simulations can subsequently be extrapolated to reflect the execution of the entire program. In the context of large-scale parallel applications, various statistical and data mining techniques have been employed to reduce the size and improve the understanding of performance data. While projection pursuit [12] and clustering [13] have been applied to improve the understanding of real-time performance data, Ahn et al. have shown that multivariate statistical techniques provide a useful means to identify correlations between hardware performance metrics and to highlight clusters of similar behavior in process topologies [14]. Moreover, PerfExplorer [15] structures profile data of parallel programs post-mortem by performing hierarchical and k-means clustering on vectors of metric values whose elements correspond to program regions.

Space limitations of highly structured performance data sets that include a time dimension are most apparent for event traces containing timestamped records for a huge number of program actions. Complete call graphs are able to compress event traces by exploiting repetitive patterns [16], but require their prior existence at full length, as runtime overhead prevents the method from being applied on-line. Similarly, automatic structure extraction starts from very large trace files, explores their internal structure using signal processing techniques, and selects meaningful parts [17]. Whereas the previous approaches target the time dimension, Gamblin et al. lower the overall trace volume by tracing only a subset of the processes [18], with sample size periodically readjusted based on summary performance data sent to a central client process at runtime. Finally, application signatures provide a way of summarizing the time-dependent behavior expressed in historical trace data in a much more compact representation [19]. Here, the temporal evolution of a metric vector is described as a metric trajectory using curve fitting as compression mechanism, simplifying the comparison between two signatures.

III. PROFILING METHODOLOGY

While our general approach is independent of a particular parallel programming model, we choose single-threaded MPI applications as a starting point, which finds its expression in our selection of test cases and profiling metrics. In the remainder of the paper, we therefore always use the term process to denote an independent locus of execution, which may be translated to thread in a multithreaded context. Our compression algorithm is intended to be used in the runtime measurement system of Scalasca [20], an open-source toolset for diagnosing inefficiencies in parallel applications written in C, C++ and Fortran on a wide range of HPC platforms. Since in addition to pure MPI, Scalasca also supports measurement and analysis of OpenMP and hybrid OpenMP+MPI codes, we plan to evaluate our algorithm with respect to those programming models in the near future. Users of Scalasca can choose between two types of performance data: (i) call-path profiles summarized during measurement, and (ii) call-path profiles derived from patterns in event traces. Whereas the former are needed to identify the most resource-intensive call paths in the program, the latter can be used to identify call paths that exhibit a significant fraction of idle time. The result of such a trace analysis is therefore similar in structure to a runtime call-path profile but enriched with higher-level communication and synchronization inefficiency metrics. While in principle our algorithm applies to those data sets as well, we consider only runtime call-path profiling for now.

A. Call-Path Profiling

When configured for runtime profiling, Scalasca aggregates performance metrics such as execution time, message statistics, and hardware counters individually for every thread and call path encountered during the entire program execution.

The metrics collected during measurement of MPI applications are listed in Table I with indentation representing the hierarchical relationship between them. They are divided into a subset M = {M1, ..., Mm} that is directly measured and another subset D = {D1, ..., Dd} derived by later analysis (set in italic). Major measurement overhead and MPI initialization and finalization are typically one-off expenses outside the primary computational loop. Optional hardware-counter metrics and specific MPI metrics related to file I/O, collective communication, etc., may not exist or be entirely zero-valued in certain measurements.

Different from sampling-based approaches, all measurements are obtained via the direct instrumentation of function or region entries and exits, either via PMPI library interposition, compiler-generated instrumentation, or source-code preprocessing. During program finalization, Scalasca collates all process-local measurement data sets into a global profile report and writes it to disk.

Scalasca defines call paths as lists of visited regions (usually starting from the main function) and maintains them in a tree data structure. Thus, a new call path is specified as an extension of a previously defined call path to the new terminal region. When a region is entered from the current call path, any child call path and its siblings are checked to determine whether they match the new call path; if not, a new call path is created and appropriately linked (to both parent and last sibling). Exiting a region is then straightforward, as the new call path is the current call path's parent call path. When execution is complete, a full set of locally executed call paths has been defined, which are merged into a global set during program finalization, serving as a basis for later call-tree visualizations.
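The enter/exit bookkeeping on the call-path tree can be sketched as follows; the class and method names are illustrative and are not taken from Scalasca's actual implementation:

```python
class CallPath:
    """One node of the call-path tree; a call path is identified by its
    terminal region plus the chain of parents up to the root."""

    def __init__(self, region, parent=None):
        self.region = region
        self.parent = parent
        self.children = []

    def enter(self, region):
        # Reuse an existing child call path if the terminal region matches ...
        for child in self.children:
            if child.region == region:
                return child
        # ... otherwise extend the tree with a new call path linked to its parent.
        child = CallPath(region, parent=self)
        self.children.append(child)
        return child

    def exit(self):
        # Leaving a region simply returns to the parent call path.
        return self.parent

# Simulated execution: main -> solve -> MPI_Send, then exit back to main.
root = CallPath("main")
current = root.enter("solve").enter("MPI_Send")
assert current.region == "MPI_Send"
assert current.exit().exit() is root
```

Re-entering a region reuses the existing node, so repeated executions of the same call path aggregate into one tree node rather than growing the tree.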

B. Time-Series Profiling

TABLE I
SUMMARY METRICS COLLECTED AND DERIVED BY SCALASCA

Time
    Total
        Measurement overhead
        Execution
            MPI
                MPI init/exit
                MPI synchronization
                    MPI collective synchronization
                MPI communication
                    MPI point-to-point communication
                    MPI collective communication
                MPI file I/O
                    MPI collective file I/O

Counts
    Call path visits
    Hardware counters (optional)
    MPI synchronizations
        MPI point-to-point synchronizations
            MPI point-to-point send synchronizations
            MPI point-to-point receive synchronizations
        MPI collective synchronizations
    MPI communications
        MPI point-to-point communications
            MPI point-to-point send communications
            MPI point-to-point receive communications
        MPI collective communications
            MPI collective communications as source
            MPI collective communications as destination
            MPI collective exchange communications
    MPI bytes transferred
        MPI point-to-point bytes transferred
            MPI point-to-point bytes sent
            MPI point-to-point bytes received
        MPI collective bytes transferred
            MPI collective bytes incoming
            MPI collective bytes outgoing

To measure time-dependent application behavior, Scalasca was equipped with phase instrumentation capabilities following TAU's dynamic phase model [8], which implies that performance data are stored separately for every interval instance. With such instrumentation it becomes possible to distinguish the performance data produced by different iterations after enclosing the timestep loop body with enter and exit markers. Translated to a call-path profile, this means that every iteration populates the call tree with its own set of metric values.

Let C be the set of call paths reached from within the timestep loop body (those outside can be ignored because they are irrelevant for drawing comparisons between iterations), I the set of loop iterations, P the set of application processes, and M the directly measured metrics among the ones listed in Table I. Then we can define a time-series call-path profile as a mapping

    t : (c, i, p) ↦ (m1, ..., mm)

which maps a call path c ∈ C, an iteration i ∈ I, and a process p ∈ P onto a vector of metric values (m1, ..., mm) ∈ M1 × ... × Mm (e.g., the time spent by process p in call path c while executing iteration i). Note that the derived metric values are implied in this definition. The process dimension P can be omitted when considering the data of only a single process, which will


often be the case as our compression is a purely process-local operation.
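Conceptually, a process-local slice of the mapping t can be held as a nested table keyed by iteration and call path. The sketch below uses only two metrics (time and visit counts) as a simplification; the layout and names are illustrative, not Scalasca's data structures:

```python
from collections import defaultdict

# Process-local time-series call-path profile: (i, c) -> metric vector.
# Metric vector here is simply [time, visits]; Table I lists the real metrics.
profile = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))

def record(call_path, iteration, time, visits):
    """Accumulate measured values for one call path within one iteration."""
    m = profile[iteration][call_path]
    m[0] += time
    m[1] += visits

record("main/solve/MPI_Send", 0, 0.41, 3)
record("main/solve/MPI_Send", 1, 0.01, 3)
# Each iteration populates the call tree with its own set of metric values:
assert profile[0]["main/solve/MPI_Send"] == [0.41, 3]
assert profile[1]["main/solve/MPI_Send"] == [0.01, 3]
```

Because the structure is process-local, no inter-process key is needed; the process dimension only appears when profiles from all ranks are merged at finalization.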

C. Analysis

In Scalasca, a time-series call-path profile can be analyzed in several ways. The most common methods are listed below, and illustrated in Figure 1 with one of the SPEC MPI2007 benchmark applications, 129.tera_tf.

• A multi-dimensional tree browser [21] that allows a user to interactively explore the entire performance space spanned by the mapping t, including the derived metrics, and, in particular, to examine the performance behavior of individual call paths (Fig. 1(a)).

• 2-dimensional iteration graphs, which for each iteration (x-axis) show the maximum, median, and minimum values of a given metric (y-axis) calculated from all processes (Fig. 1(b)).

• Value maps, which display the value of a given metric color-encoded as a rectangular block in a 2-dimensional grid with the x-axis representing iterations and the y-axis representing processes (Fig. 1(c)).

The quantitative evaluation of our algorithm presented in Section V assesses the influence of the compression on the expressiveness of these three analysis methods.

IV. COMPRESSION ALGORITHM

Our compression algorithm is based on the idea of incremental clustering. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to others within the same cluster and dissimilar to the objects in other clusters [22]. In our case, the objects to be clustered are the metric profiles collected for individual iterations as they are generated while the target application progresses. As the goal is to achieve the best overall characterization, clustering is performed independently for each iteration without considering sequence order. To accurately cover process-dependent variations of temporal patterns, each application process makes independent clustering decisions, which has the additional advantage that neither communication nor synchronization among different processes is required at runtime. Thus, we can restrict our discussion to the actions performed by a single process. The algorithm is lossy in the sense that the compressed data no longer contain all the information necessary to restore the original data, although we will show that the result of such a reverse transformation comes very close to the original data.

A. Clustering

With every new iteration i, a new process-local iteration profile ti : c ↦ (m1, ..., mm) is created, which maps a call path onto a vector of directly measured metric values and which can be stored as a matrix Ti with one row per call path and one column per metric. The clustering occurs incrementally because every new iteration initially constitutes a new cluster. Once a predefined maximum number of clusters has been exceeded, the two clusters closest to each other are merged.
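The incremental scheme can be sketched as follows. For brevity the closest pair is found by a naive quadratic search over cluster means rather than the cached distance bookkeeping described below, and each profile is flattened to a plain vector; all names are illustrative:

```python
def incremental_cluster(iteration_profiles, max_clusters, distance):
    """Incremental clustering sketch: every new iteration first forms its
    own cluster; once the limit is exceeded, the two closest clusters are
    merged. A cluster is a pair (iteration_indices, mean_profile)."""
    clusters = []
    for i, profile in enumerate(iteration_profiles):
        clusters.append(([i], profile))
        if len(clusters) > max_clusters:
            # Find the pair of clusters whose means are closest ...
            a, b = min(
                ((x, y) for x in range(len(clusters))
                        for y in range(x + 1, len(clusters))),
                key=lambda p: distance(clusters[p[0]][1], clusters[p[1]][1]))
            ids_a, mean_a = clusters[a]
            ids_b, mean_b = clusters.pop(b)
            # ... and merge them, weighting the means by cluster size.
            na, nb = len(ids_a), len(ids_b)
            merged = [(na * u + nb * v) / (na + nb)
                      for u, v in zip(mean_a, mean_b)]
            clusters[a] = (ids_a + ids_b, merged)
    return clusters

# Two alternating behaviors over six iterations, compressed to two clusters:
series = [[1.0], [1.1], [5.0], [1.05], [5.1], [4.9]]
result = incremental_cluster(series, 2, lambda u, v: abs(u[0] - v[0]))
assert sorted(len(ids) for ids, _ in result) == [3, 3]
```

Note that the space consumed stays proportional to max_clusters, a few times the size of a single-iteration profile, regardless of how many iterations are processed.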

(a) Scalasca analysis report explorer with MPI point-to-point communication time metric selected (left pane). With 6.55 seconds, iteration 232 of 942 (middle pane) is one of the more expensive ones, and the marked call path to the MPI_Send operations is distinguished by a particularly pronounced imbalance across the 32 processes, with times varying from only 0.01 s for rank 24 to 0.41 s for rank 27 (right pane).


(b) MPI point-to-point communication time iteration graph with maximum (red), median (blue), and minimum (green) values.

(c) MPI point-to-point communication time value map with the red end of the color spectrum indicating higher values.

Fig. 1. Different ways of analyzing a 32-process 129.tera_tf experiment with Scalasca.

For each cluster, we store only the iteration indices assigned to it and the mean profile, which is obtained by performing an element-wise arithmetic mean operation on the constituent matrices Ti, exploiting the associativity of the mean operator when merging clusters that represent more than one iteration. Using the disjoint-set forest data structure [23] makes cluster creation and merge plus retrieval of the constituent index set an asymptotically constant-time operation, as opposed to a naive linear-time implementation.
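The index-set bookkeeping with a disjoint-set forest might look like the following sketch (union by size with path compression, a standard formulation; this is not Scalasca's actual code):

```python
class ClusterSets:
    """Disjoint-set forest tracking which iterations belong to which
    cluster; merge and find are near-constant amortized time."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def make_cluster(self, iteration):
        self.parent[iteration] = iteration
        self.size[iteration] = 1

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i

    def merge(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return ri
        if self.size[ri] < self.size[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri          # union by size: attach smaller root
        self.size[ri] += self.size[rj]
        return ri

sets = ClusterSets()
for i in range(4):
    sets.make_cluster(i)
rep = sets.merge(0, 1)
rep = sets.merge(rep, 3)
assert sets.find(1) == sets.find(3)
assert sets.size[sets.find(0)] == 3   # iterations 0, 1, 3 share one cluster
```

The size counter maintained per root doubles as the weight needed for the element-wise weighted mean when two cluster profiles are combined.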

B. Distance Function

Which clusters are closest to each other is determined based on the distance between cluster means. However, calculating a distance function based on entire profile matrices is inefficient and may also adversely affect the clustering quality, which is sensitive to the dimensionality of the distance function (|C| ∗ |M| if the full matrix is considered). If it is too large, similarities between objects may be hidden. For this reason, the distance is computed based on a condensed version of the profile matrix. The condensed version is a vector

    e = (m1, ..., mm, d1, ..., dd)

composed of an extended set of metric values aggregated across the entire iteration, such as the total time spent in that iteration. The values m1, ..., mm represent the directly measured metrics, whereas d1, ..., dd represent the derived metrics, together comprising the full set of metrics listed in Table I. The idea behind the condensed profile is that e still carries enough information to characterize the performance behavior of an iteration accurately enough, although it no longer refers to individual call paths, thus eliminating the need to compare full matrices Ti. Instead, differences with respect to individual call paths are approximated by adding more specific derived metrics. At the same time, the condensed profile lowers the runtime overhead by making the distance calculation much more efficient and enables more conclusive clustering decisions by drastically reducing the dimensionality of the distance function (|M| + |D| usually ≪ |C| ∗ |M|).

The distance between two vectors e1 and e2 is then calculated using the Manhattan distance. Whenever a new cluster is created from a fresh iteration profile, we calculate the distance between the new cluster and those that already exist. In addition, we need to calculate the distance between a cluster formed via merging and all remaining clusters, overall still leading to linear computational complexity O(n) in terms of the maximum number of clusters n. Note that the calculation of derived metrics has to be done at most twice for every new iteration (i.e., once for every new cluster). In contrast, the space complexity is O(n²) because we need to store the distance between every pair of clusters.
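A minimal sketch of the condensed profile and the Manhattan distance on it; the derive function used here (time per visit) is a made-up stand-in for the actual derived metrics of Table I:

```python
def condense(profile, derive):
    """Collapse a per-call-path profile into one vector per iteration:
    each measured metric is summed over all call paths, and the derived
    metrics are appended. 'derive' is a hypothetical function computing
    d_1..d_d from the aggregated measured values."""
    measured = [sum(col) for col in zip(*profile.values())]
    return measured + derive(measured)

def manhattan(u, v):
    """Manhattan (L1) distance between two condensed vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Two iterations, three call paths, two measured metrics (time, visits):
it1 = {"main/a": [1.0, 2], "main/b": [0.5, 1], "main/c": [0.2, 4]}
it2 = {"main/a": [1.1, 2], "main/b": [0.4, 1], "main/c": [0.2, 4]}
time_per_visit = lambda m: [m[0] / m[1]]   # illustrative derived metric
e1 = condense(it1, time_per_visit)
e2 = condense(it2, time_per_visit)
assert len(e1) == 3                        # |M| + |D|, not |C| * |M|
assert manhattan(e1, e2) < 1e-9            # same totals despite shifted paths
```

The last assertion also illustrates the approximation at work: the two iterations differ per call path but have identical totals, so the condensed vectors coincide unless a derived metric separates them.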

Since metrics may have different domains and their values may cover different ranges of magnitude, we face the question of how much weight to assign to an individual metric when calculating distances. Table I shows how Scalasca metrics are organized in tree hierarchies defined by subset relationships, with general metrics as tree roots and more specific metrics as leaves in the tree (e.g., total time → execution time → MPI time → MPI communication time → MPI collective communication time). On the one hand, the subset relationship implies that more general metrics have higher values than more specific metrics. On the other hand, more specific metrics are typically more indicative of performance problems, such as communication overhead. We therefore decided to assign metrics equal weights and normalize the vector elements accordingly.

As reference value for the normalization, we chose the process-local running average over all preceding iterations because it is relatively stable with respect to noise-induced outliers. Where necessary, the average over all iterations could be obtained from a prior run, and can be expected not to vary substantially in the presence of minor run-to-run variations. The first distance calculation is performed only after the desired number of clusters has been reached; therefore a fair number of iterations have passed before the current average is determined for the first time. In principle, the first distance calculation could be delayed even further, depending on the availability of memory to hold iteration profiles representing unmerged clusters. Although the use of the running average may lead to premature distance calculations, which would be hard to redo without incurring substantial runtime overhead, we show in Section V that the inaccuracy resulting from this limitation has little influence on the fidelity of our compression. We therefore believe that in most cases a single measurement will be sufficient.
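The equal-weight normalization against the running average can be sketched as follows; the eps guard against zero-valued averages is our own addition, not taken from the paper:

```python
def normalize(vec, running_avg, eps=1e-12):
    """Give each metric equal weight by dividing it by its process-local
    running average over the preceding iterations (sketch)."""
    return [v / max(a, eps) for v, a in zip(vec, running_avg)]

# Two metrics with very different magnitudes (e.g., seconds vs. visit counts):
history = [[2.0, 100.0], [2.2, 90.0], [1.8, 110.0]]
avg = [sum(col) / len(col) for col in zip(*history)]   # running average
new = normalize([2.4, 80.0], avg)
# After normalization, a 20% deviation counts the same in either metric,
# regardless of the metrics' absolute magnitudes:
assert abs(new[0] - 1.2) < 1e-9 and abs(new[1] - 0.8) < 1e-9
```

In a streaming setting the average would be updated incrementally after each iteration rather than recomputed from the full history as done here for clarity.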

C. Emphasizing the Baseline

The algorithm discussed so far has a serious shortcoming. It emphasizes noise and extrema much more than it characterizes the baseline behavior. For example, in some applications the algorithm might blur periodic low-amplitude changes in the baseline by merging their constituent clusters, while preserving a number of prominent but noise-induced extrema that are distinct enough to have dedicated clusters reserved for them. The problem occurs whenever the distance between extrema is large compared to the distances within the small-scale repetitive pattern. However, baseline changes often carry valuable information on general performance trends and are therefore desirable to keep, whereas extrema often turn out to be irreproducible noise and a minor contribution to overall performance.

To better accentuate the baseline in comparison to extrema, we exploit the fact that the number of iterations that stand out is usually much smaller than the number of iterations on the baseline level. Using a heuristic, we distort the distance metric in such a way that the distance among larger clusters becomes higher, whereas the distance between smaller clusters with only a few elements becomes lower. Two small clusters are then much more likely to be merged than two larger clusters representing many iterations. Using this method, distances between clusters remain valid until one of the clusters is merged with another one, leaving the complexity linear in terms of the maximum number of clusters.
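The paper does not spell out the distortion formula, so the scaling below is only a plausible stand-in: multiplying the raw distance by a function of the two cluster sizes makes pairs of small (often noise-induced) clusters effectively closer, and pairs of large baseline clusters effectively farther apart. Both the functional form and the exponent alpha are assumptions for illustration:

```python
def distorted_distance(raw_dist, size_a, size_b, alpha=0.5):
    """Hypothetical size-based distortion of the cluster distance: the
    geometric mean of the two cluster sizes scales the raw distance, so
    small clusters are preferentially merged. alpha is an assumed tuning
    parameter, not a value from the paper."""
    return raw_dist * (size_a * size_b) ** alpha

# A pair of 1-iteration (noise) clusters vs. a pair of 20-iteration baseline
# clusters at the same raw distance: the small pair is merged first.
assert distorted_distance(0.3, 1, 1) < distorted_distance(0.3, 20, 20)
```

Since the distortion depends only on the two clusters' sizes and raw distance, a stored pairwise distance stays valid until one of its clusters changes, matching the linear update cost noted above.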

D. Call-Tree Equivalence

The algorithm explained so far is based on an elastic distance criterion, allowing iterations with very different call trees to be merged. This property may allow the occurrence of phantom call paths in the compressed profile, which are call paths associated with an iteration although they have never been visited during this iteration. Since phantom call paths may lead to wrong conclusions, they are a serious problem and should be avoided whenever possible. Furthermore, call-tree equivalence between two iterations is often a good indicator of similar performance characteristics. For these two reasons, we only merge iterations with equivalent call trees. Call-tree equivalence between two iterations i1 and i2 can be defined in two ways:


TABLE II
APPLICATION CHARACTERISTICS: EXECUTION TIMES AND ITERATION, CALL-PATH AND EQUIVALENCE CLASS COUNTS

                  Execution             Call paths per iteration      MPI per iteration     Call-tree equiv. classes
Application       time (s)  Iterations   total    min.    avg.   max.   min.   avg.  max.     weak    strong
104.milc               765          -        -      -       -      -      -      -     -        -       -
107.leslie3d           847      2,000   30,802     15    15.4     17      4    4.2     5        2       2
113.GemsFDTD         1,051      1,000   46,001     46    46.0     47      1    1.0     2        2       2
115.fds4               427      2,363   86,346     36    36.5     45      2    2.1     5       13      19
121.pop2               242      2,000  182,233     84    91.1    103     20   23.0    28        3      11
122.tachyon          3,905          -        -      -       -      -      -      -     -        -       -
126.lammps             930        500   38,397     76    76.8     96     36   36.9    56       23      24
127.wrf2             1,282      1,375  158,075    111   115.0    489      3    3.1    37        7     348
128.GAPgeofem          376        235    5,641     24    24.0     25      8    8.0     8        2       7
129.tera_tf          1,217        942   10,369     11    11.0     18      4    4.0     8        2       2
130.socorro            391         20   49,820    933  2491.0   2609     28   77.4    81        8      19
132.zeusmp2            998        200   22,399    111   112.0    112     56   56.0    56        2       2
137.lu                 554        180    2,881     16    16.0     17      7    7.0     8        2       2
PEPC                13,643      1,300   67,470     51    52.0     66     28   29.0    35        4    1298

• Weak equivalence: every call path visited in i1 has also been visited in i2 and vice versa.

• Strong equivalence: every call path has been visited as many times in i1 as it has been visited in i2.

Weak equivalence excludes phantom call paths, while strong equivalence also considers quantitative similarity. To enforce call-tree equivalence between clusters before they are merged, we partition the clusters into call-tree equivalence classes. Then every new iteration profile either forms a new class or is added to an existing one, depending on whether there is already a cluster whose call tree is equivalent to the new one. If the maximum number of clusters is exceeded, the two clusters closest to each other that are located within the same partition are merged.
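Representing an iteration's call tree as a map from call path to visit count, the two equivalence notions can be written down directly (an illustrative sketch, not Scalasca's representation):

```python
def weak_equivalent(tree1, tree2):
    """Weak call-tree equivalence: the same set of call paths was visited,
    regardless of how often. Keys are call paths, values are visit counts."""
    return set(tree1) == set(tree2)

def strong_equivalent(tree1, tree2):
    """Strong call-tree equivalence: the same call paths were visited
    the same number of times."""
    return tree1 == tree2

i1 = {"main/solve": 1, "main/solve/MPI_Send": 4}
i2 = {"main/solve": 1, "main/solve/MPI_Send": 8}
i3 = {"main/solve": 1, "main/io/write": 2}

assert weak_equivalent(i1, i2)            # same paths, different counts
assert not strong_equivalent(i1, i2)
assert not weak_equivalent(i1, i3)        # merging i1 and i3 would create
                                          # phantom call paths
```

Merging only weakly equivalent iterations is exactly what rules out phantom call paths: no call path can end up attributed to an iteration in which it was never visited.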

Since call-tree equivalence is no longer an elastic criterion, the number of equivalence classes is not configurable. Especially when strong call-tree equivalence is used, the total number of clusters may exceed the predefined threshold, resulting in a large number of partitions each with only a single cluster, increasing storage requirements and also potentially degrading the compression fidelity as distance-based clustering is then suppressed. On the other hand, if their number is small enough, enforcing call-tree equivalence can improve the compression fidelity with respect to call-tree structure and count-based metrics, as we will see in Section V. Whether and to which degree call-tree equivalence should be enforced is therefore highly application dependent and must be decided based on the total number of call-tree equivalence classes, which would have to be determined from a prior measurement. To avoid this extra measurement, a dynamic scheme can be employed that starts with strong equivalence and switches to weak equivalence after re-grouping the existing clusters if the number of partitions grows too large in comparison to the desired number of clusters, thus keeping the number of clusters in check. Since in our test cases the number of weak equivalence classes never exceeded 23, as can be seen in Table II, we expect at least weak equivalence to be a viable option for most applications.

While enforcing call-tree equivalence adds the cost of comparing potentially large call trees to determine the equivalence class of an iteration (i.e., O(log(n)) comparisons for n existing classes), it also reduces the number of distance calculations, because distances are calculated only within each class. The performance disadvantage therefore diminishes as the number of clusters is increased.

E. Reconstructing Aggregate Profiles

Although our algorithm performs a lossy compression, we can accurately reconstruct an aggregate profile that summarizes across all iterations. Adding up all clusters weighted by the number of iterations they represent will yield an exact profile without any errors introduced by the lossy compression, as if obtained without phase instrumentation. While small differences between an original and a reconstructed summary profile may be caused by clustering overhead, compression errors manifest only when considering individual iterations.
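A sketch of this reconstruction, assuming each cluster is stored as a (representative, weight) pair in which the representative holds the average metric value over the iterations it represents (a hypothetical layout for illustration):

```python
def reconstruct_aggregate(clusters):
    """Sum all cluster representatives weighted by the number of
    iterations they represent, yielding the exact across-iteration
    aggregate despite the lossy per-iteration compression."""
    total = {}
    for representative, weight in clusters:
        for call_path, value in representative.items():
            total[call_path] = total.get(call_path, 0.0) + value * weight
    return total

# Iterations {"a": 1.0} and {"a": 3.0} compressed into one cluster whose
# representative {"a": 2.0} carries weight 2: the aggregate stays exact.
clusters = [({"a": 2.0}, 2), ({"b": 5.0}, 1)]
```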

V. EVALUATION

A. Measurements Used

For evaluation purposes, we focus on the applications of the SPEC MPI2007 benchmark suite [3], [24] in the medium-sized reference configuration with 32 MPI processes on a dedicated 32-core IBM p5-575 SMP node. Execution characteristics of the 13 applications are summarized in Table II, along with the PEPC application run on an IBM Blue Gene/P rack that will be considered in Section VI. Only 10 of the 13 applications in the suite were evaluated, as in two of the applications no suitable main loop could be identified for phase instrumentation (due to code complexity for 104.milc and the absence of such a loop for the farming-based 122.tachyon), and one application (130.socorro) had too few iterations in the supplied test case. In the medium-sized reference configuration, the 121.pop2 application executes 9,000 iterations; however, we reduced this to 2,000 iterations due to memory limitations for storing measurement data, because our reference data sets were obtained without compression, as explained in Section V-B.

[Figure: two log-scale bar charts of the proportion of call paths (10⁻⁶ to 1) per application, with bars for 8, 16, 32, 64, 128, and 256 clusters: (a) call paths added; (b) MPI call paths added.]

Fig. 2. Proportion of erroneously added ‘phantom’ call paths in reconstructed profiles when call-tree equivalence is not enforced during clustering.

In addition to the wall-clock execution time and number of iterations (or timesteps) in the main computational loop of each application, full summary measurements with instrumentation distinguishing each iteration allow analysis of the call-path complexity, which is included in Table II. Call paths for each iteration are further distinguished to consider only those ending with MPI communication and synchronization functions, and minimum, average, and maximum call-path count statistics are calculated. (The number of call paths included in a summary profile depends on function in-lining by the compiler and filtering applied during measurement [3].) Furthermore, by enforcing call-tree equivalence among the sets of call paths, the numbers of classes resulting from weak and strong equivalence were determined. Where strong equivalence resulted in an excessive number of equivalence classes, weak equivalence was used instead (i.e., for 127.wrf2 and PEPC). The SPEC MPI2007 benchmark suite and the PEPC application are seen to have a rich variety of execution and call-tree characteristics.

Based on the numbers of iterations and call-tree equivalence classes found in these application executions, we chose to compare clusterings using powers of two from 8 to 256. Eight clusters offer a relatively small storage overhead but require aggressive compression, and can be expected to represent complex execution characteristics only coarsely. On the other hand, 256 clusters should provide improved fidelity, but at a corresponding storage cost factor for the results.

B. Off-line Version

Evaluating the quality of the compressed data based on the on-line approach would not be feasible by simply taking measurements with and without compression and comparing the results: differences due to run-to-run variation could not be distinguished from those due to the lossiness of the compression. Additionally, accurate time and memory usage measurements of the algorithm could not be taken while it ran together with the actual application measurement. Evaluation is therefore based on an off-line version of the algorithm which works on already collected, full phase-instrumented measurement results. The compression algorithm is applied to this input data to create measurement results equivalent to those that would be collected by the on-line version of the algorithm, and this is performed on the same system where the original measurements were collected to make timings comparable. The effectiveness of different configurations of the compression algorithm can also be readily compared, as they are all based on the same real measurement data.

C. Quality Assessment

A variety of quality characteristics were investigated for PEPC and the SPEC MPI2007 applications with different numbers of clusters. Due to space limitations, only the most important and illustrative assessments are shown and evaluated, with discussion of results from associated analyses concisely summarized.

Erroneous call paths: All execution call paths are accurately captured when call-tree equivalence is enforced; without it, however, the merging operation on distinct call trees can erroneously associate call paths with clusters of iterations in which they do not actually occur. Figure 2 shows that significant numbers of ‘phantom’ call paths are indeed added for 107.leslie3d, 115.fds4, 127.wrf2 and PEPC. While four of the applications never had ‘phantom’ call paths introduced, the remaining three applications each had a few erroneous call paths (sometimes only with smaller numbers of clusters), which are also potentially a serious concern. Although many of the ‘phantom’ call paths are not MPI call paths, and therefore less of a concern, ‘phantom’ MPI call paths are found for 107.leslie3d, 115.fds4 and 121.pop2 with smaller numbers of clusters, and predominate for PEPC even with 256 clusters. Since the call trees often differ, and can differ significantly especially at low cluster counts, enforcing call-tree equivalence is essential for accurate time-series call-path profiling. For this reason, call-tree equivalence is always enforced in the remainder of our evaluation, with the equivalence relation determined based on the number of resulting classes, as indicated by the bold numbers in the rightmost two columns of Table II. In most cases, we were able to apply strong equivalence; only for 127.wrf2 and PEPC did we have to resort to weak equivalence.

Error rates for entire iterations: Figure 3 shows the average error rates in the mean values of all metrics, considering only the aggregate metric values for entire iterations, as presented in iteration graphs. Error rates for each metric are normalized using the average value from all iterations. (Squared errors are not used, as they would emphasize rare extrema, which are not of interest for our analysis.) Average error rates are high for 115.fds4 and PEPC with only a few clusters, but improve as larger numbers of clusters are used, dropping to around 1% and 3%, respectively, with 64 clusters, while the other applications have negligible errors in their mean values. Errors in the maximum values for iterations are generally only marginally worse, though PEPC is again an exception, with an average error in the maximum metric values reaching 10% for 64 clusters and still 6% for 256 clusters. Average error rates for count-based metrics are considerably better than those for time-based metrics, since call-path visits and bytes transferred are usually not subject to noise and small measurement variations like the time-based metrics. For PEPC, however, the count error rates remain comparably high.
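The per-iteration error measure can be written down compactly: for each metric, the absolute deviation between original and reconstructed per-iteration values is averaged over all iterations and normalized by the metric's mean value. A sketch, under the assumption that the figures use exactly this weighting (the paper's implementation may differ in detail):

```python
def mean_normalized_error(original, reconstructed):
    """Average absolute error of reconstructed per-iteration values,
    normalized by the mean of the original series (errors are not
    squared, so rare extrema are not over-emphasized)."""
    n = len(original)
    mean = sum(original) / n
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / (n * mean)

# Four iterations of some metric and a 2-cluster reconstruction:
orig = [1.0, 1.2, 5.0, 5.2]
reco = [1.1, 1.1, 5.1, 5.1]  # each iteration replaced by its cluster mean
```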

Error rates for individual call paths: To evaluate the quality of the metric data for each call path within each iteration, a different method is employed. Figure 4 shows the average error rates in the values of time-based metrics for each call path excluding its children (i.e., the exclusive value). To emphasize the most significant call paths, error rates for each metric are normalized using the maximum value from all iterations. Error rates are generally low (well below 0.1%), especially with larger numbers of clusters, which suggests good characterization; while 129.tera_tf is seen to be particularly error prone with small numbers of clusters, it improves rapidly with more clusters. 129.tera_tf has continuously changing performance behavior, as evident in Figure 1, which makes it a much more challenging case than the other SPEC MPI2007 applications. As before, the count-based metrics have even lower error rates, with a zero error rate for 8 of the 10 applications (notably including 129.tera_tf), apart from PEPC, where error rates are somewhat higher than for the time-based metrics.

[Figure: bar chart of average error (%, 0–9) per application, with bars for 8, 16, 32, 64, 128, and 256 clusters.]

Fig. 3. Average error rates of mean values of all metrics for entire iterations.

[Figure: bar chart of average error ratio (%, 0.0–1.1) per application, with bars for 8, 16, 32, 64, 128, and 256 clusters.]

Fig. 4. Average error rate of time-based metrics for each call path.

Quantized call-path error rates: Where call paths themselves are spurious, they contribute to the error present in all of the metrics at the call-path level; errors are, however, also inherent in the metric values reconstructed from the compressed representations for every true call path. For each call path, the magnitude of the error in the metrics can be determined from the difference of metric values between the full-fidelity and reconstructed call-path profiles. The absolute differences for each metric have been normalized by the corresponding maximum exclusive call-path value, before they are averaged and histogrammed in buckets by order of magnitude: Figure 5 summarizes the results for the case using 64 clusters. (Call paths which actually have zero-valued metrics are not included in this comparison to avoid dilution.)
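The quantization into order-of-magnitude buckets can be sketched as follows; the bucket thresholds follow the legend of Figure 5, the normalization constant is the call path's maximum exclusive value, and zero-valued call paths are assumed to be filtered out beforehand (an illustrative reconstruction, not the evaluation scripts themselves):

```python
def error_bucket(abs_error, max_value):
    """Classify a call path's normalized reconstruction error into
    order-of-magnitude buckets: '0', '>0', '>0.001%', ..., '>100%'."""
    if abs_error == 0.0:
        return "0"
    relative = abs_error / max_value
    label = ">0"
    for threshold, name in [(1e-5, ">0.001%"), (1e-4, ">0.01%"),
                            (1e-3, ">0.1%"), (1e-2, ">1%"),
                            (1e-1, ">10%"), (1e-0, ">100%")]:
        if relative > threshold:
            label = name
    return label
```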

With 64 clusters, only three of the studied applications have any error in their count-based metrics, and while less than one in a million 127.wrf2 call paths have errors of more than 0.1% (which is negligible), for 126.lammps around one in every thousand call paths has errors exceeding 1%, and with PEPC around one in every two hundred call paths has more than 10% error. The 127.wrf2 errors occur only on non-MPI call paths, where they are less of a concern; however, the count-based errors in the latter two applications are serious, since they are for MPI metrics and around 10% of all call paths are affected by such error to a lesser degree. These same two applications are further distinguished from the others in sometimes having marginally lower errors in the time-based metrics, which is unusual. Entirely error-free reconstructions are rare for time-based metrics, and call-path error rates over 10% for the time-based metrics are found for PEPC and half of the SPEC MPI2007 applications. Such large errors affect less than one in every thousand call paths in these applications, with MPI call paths somewhat more frequently impacted. The largest call-path errors are found to have a magnitude of 64% for PEPC and around 33% for 121.pop2 and 132.zeusmp2. On the other hand, all applications (except 113.GemsFDTD) have call paths with more than 1% error in the time-based metrics, which affects one in five 129.tera_tf MPI call paths. Larger numbers of clusters can be employed to reduce the magnitude and frequency of these errors.

[Figure: two log-scale charts of the proportion of call paths (10⁻⁶ to 1) per application, bucketed as non-zero, over 0.001%, over 0.01%, over 0.1%, over 1%, over 10%, and over 100%: (a) MPI time-based metrics; (b) MPI count-based metrics.]

Fig. 5. Proportion of call paths in profiles reconstructed from 64 clusters having quantized error rates for MPI time-based and count-based metrics.

D. Cluster Processing Overheads

[Figure: bar chart of clustering-time ratio (%, 0–7) per application, with bars for 8, 16, 32, 64, 128, and 256 clusters.]

Fig. 6. Average clustering time as a proportion of iteration execution time.

The cost of determining call-tree equivalence and the clustering costs were measured on the same systems used when running the applications themselves, so that they can be related to the average duration of iterations to indicate the processing overhead that would be introduced. Figure 6 shows that on average this overhead is around 6% for 121.pop2, around 1% for 115.fds4 and 127.wrf2, and less than 0.5% for PEPC and the other SPEC MPI2007 applications. Overheads for particular iterations and processes are found to be up to twice the average overhead. Where application iterations are short and the call tree relatively large, the processing time overhead therefore becomes considerable. While the processing time increases only slightly for larger numbers of clusters, the storage requirements for the cluster distances grow as O(n²), but remain less than 4 MB with 256 clusters and are therefore negligible.
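The quadratic growth of the distance storage is easy to check with back-of-the-envelope arithmetic. The sketch below assumes one double-precision entry per unordered cluster pair; the actual layout (distances kept per partition and per metric) will differ, which is presumably why the measured total approaches 4 MB at 256 clusters rather than the quarter megabyte of a single triangular matrix:

```python
def pairwise_distance_bytes(n_clusters, bytes_per_entry=8):
    """Storage for one triangular matrix of pairwise cluster
    distances: n*(n-1)/2 entries, growing as O(n^2)."""
    return n_clusters * (n_clusters - 1) // 2 * bytes_per_entry
```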

VI. DETAILED EXAMPLE

As a large-scale example, we chose the PEPC Coulomb-solver run with 1024 processes on an IBM Blue Gene/P, in which we were able to find the root causes of a serious performance problem using our phase-instrumentation approach [4]. It is an adaptive code, re-balancing the workload after every iteration, which makes its time-dependent behavior crucial to understanding its performance. This also makes PEPC much harder to characterize than the previously considered SPEC MPI2007 applications, in that its performance is completely different on each process, its baseline for many metrics is continually growing, and the few bottleneck processes with very different communication behavior from all the others are constantly relocated by the computational load-balancing algorithm. (As it turned out from our analysis, the algorithm balancing the computational load is highly effective, but makes the communication load increasingly imbalanced over time.)

It is therefore no surprise that the quality analysis and comparison in the previous section show PEPC to be much more susceptible to ‘phantom’ call paths (Fig. 2), especially ‘phantom’ MPI call paths, and to exhibit substantial error rates, requiring 256 or more clusters for reasonable accuracy of both entire-iteration aggregate metric values (Fig. 3) and individual call-path exclusive metric values (Figs. 4 and 5). On the other hand, even with 256 clusters, processing overheads are reasonable (Fig. 6).

Figure 7 gives an impression of the quality of the data produced by different maximum cluster counts, by showing the iteration graphs and value maps of two of the key metrics, MPI point-to-point communication count and time, reconstructed from compressed data with different cluster counts and compared to the full-fidelity originals. This shows that PEPC is a good example of an application where the count-based metric graphs and value maps (in the leftmost two columns) are complex, yet the reconstructions are surprisingly good, and with 32 clusters all the main features of the originals are already present.

[Figure: for 16, 32, 64, 128, and 256 clusters and the original measurement, iteration graphs over iterations 0–1200 of MPI point-to-point communication counts (0–9000) and times (0–0.8).]

Fig. 7. Comparison of iteration graphs and value maps of MPI point-to-point communication count (left) and time (right) for PEPC with different cluster counts.

Fig. 8. Scalasca analysis reports of the full PEPC measurement (left) and that reconstructed from 128 clusters (right), showing the MPI time metric, and with the call path selected for MPI_Waitany in the tree_walk routine of the penultimate timestep.

The MPI point-to-point communication time metric graphs and value maps (in the rightmost two columns), however, show that even 64 clusters are insufficient to reconstruct this application's execution behavior. With 128 clusters the reconstruction is relatively close, especially for the average values, but clearly 256 clusters are needed for a really good characterization of the maximum values. The maximum value is difficult to characterize perfectly, as a single outlier on any of the processes changes the maximum for a given iteration. The value map of the same metric reconstructed from 128 clusters is very close to the original, however, and the important features needed to find the performance problem in this application (the dark diagonally moving lines and the growth of the point-to-point communication time with the process rank) are already clearly visible with 64 clusters.

Figure 8 compares the MPI time profile of the full 1300-iteration PEPC analysis report with a reconstruction from 128 clusters. Aggregate metric values for entire timesteps are in broad agreement, with up to 5% error for timestep 1292, and even the detailed call-path profiles show good agreement, with the selected MPI_Waitany call path in penultimate timestep 1299 having only 9% error. The distribution of the selected call-path metric values over the 1024 processes of the Blue Gene topology also shows reasonably good characterization, although significant differences for certain processes are evident.

At least 256 clusters therefore seem to be required to characterize all important features of PEPC, which means that we need less than 20% of the buffer space to store the call-path profile data at run-time and still get sufficiently high-quality data. This is especially important for PEPC on a Blue Gene machine with limited per-core memory: the developers naturally optimize their code to use as much of the machine's memory as possible, leaving only a small amount for Scalasca measurement data.

VII. CONCLUSION

A method for time-series call-path profiling is introduced and evaluated which uses incremental clustering of call trees with similar metric values for space-efficient storage. While using a distance function based on aggregate metric values allows efficient clustering decisions, the accuracy of the compressed call-path profiles can be ensured by enforcing call-tree equivalence based on visited call paths and by heuristics that reduce the impact of extrema. An implementation is demonstrated which has acceptable processing time and memory overheads.

Whereas the call-tree and execution characteristics of the SPEC MPI2007 applications 113.GemsFDTD, 121.pop2, and 137.lu lend themselves to accurate representation with as few as 8 clusters, the more complex applications 115.fds4 and 129.tera_tf are found to need 128 clusters (with 32 or 64 clusters sufficient for the others). For the local PEPC application, however, at least 256 clusters are required for good time-series call-path characterization of its complicated execution behavior. A setting of 64 clusters therefore generally seems to be a reasonable default for a first measurement, as this value is usually high enough to obtain a good-quality clustering, and still has relatively low time and memory requirements. When this is considered insufficient, the measurement can be repeated with a larger number of clusters.

On-line application of our method will open the possibility of time-series call-path profile analysis of long-running codes consisting of thousands of iterations or timesteps. While this will offer benefits for many important applications, as demonstrated by 10 of the 13 SPEC MPI2007 benchmark codes, 121.pop2 and PEPC are particular examples where it will become practical to measure and analyze full-length executions for the first time.

With the on-line implementation, it will be necessary to study the variation of processing costs on each process in each iteration, and the associated execution perturbation. Where global synchronizations by the application are identified, these may be exploited for intermediate analysis processing with reduced execution disruption. It is also desirable to automatically determine the required number of clusters, and whether enforcing call-tree equivalence can be safely omitted to lower processing overheads. OpenMP and non-MPI metrics, such as those from hardware counters, will also be incorporated to enrich the analysis.

ACKNOWLEDGMENT

Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 and the Helmholtz Association of German Research Centers through grant VH-NG-118 is gratefully acknowledged.

REFERENCES

[1] D. J. Kerbyson, K. J. Barker, and K. Davis, “Analysis of the weather research and forecasting (WRF) model on large-scale systems,” in Proc. of the Conference on Parallel Computing (ParCo, Aachen/Jülich, Germany), September 2007.

[2] S. Shende, A. Malony, A. Morris, S. Parker, and J. Davison de St. Germain, “Performance evaluation of adaptive scientific applications using TAU,” in Proc. of the International Conference on Parallel Computational Fluid Dynamics (Washington DC, USA), May 2005.

[3] Z. Szebenyi, B. J. N. Wylie, and F. Wolf, “SCALASCA parallel performance analyses of SPEC MPI2007 applications,” in Proc. of the 1st SPEC Int’l Performance Evaluation Workshop (SIPEW, Darmstadt, Germany), ser. Lecture Notes in Computer Science, vol. 5119. Springer, June 2008, pp. 99–123.

[4] ——, “Scalasca parallel performance analyses of PEPC,” in Proc. of the Workshop on Productivity and Performance (PROPER) in conjunction with Euro-Par 2008 (Las Palmas de Gran Canaria, Spain), ser. Lecture Notes in Computer Science. Springer, August 2008, (to appear).

[5] S. L. Graham, P. B. Kessler, and M. K. McKusick, “Gprof: A call graph execution profiler,” in Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction, SIGPLAN Notices, vol. 17, no. 6, June 1982, pp. 120–126.

[6] T. Ball and J. R. Larus, “Efficient path profiling,” in Proc. of the 29th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO, Paris, France). IEEE Computer Society, 1996, pp. 46–57.

[7] L. A. DeRose and F. Wolf, “CATCH – A call-graph based automatic tool for capture of hardware performance metrics for MPI and OpenMP applications,” in Proc. of the 8th Euro-Par Conference (Euro-Par, Paderborn, Germany), ser. Lecture Notes in Computer Science, vol. 2400. Springer, August 2002, pp. 167–176.

[8] A. D. Malony, S. S. Shende, and A. Morris, “Phase-based parallel performance profiling,” in Proc. of the Conference on Parallel Computing (ParCo, Málaga, Spain), September 2005.

[9] A. R. Bernat and B. P. Miller, “Incremental call-path profiling,” Concurrency and Computation: Practice and Experience, vol. 19, no. 11, pp. 1533–1547, 2007.

[10] K. Fürlinger, M. Gerndt, and J. Dongarra, “On using incremental profiling for the performance analysis of shared memory parallel applications,” in Proc. of the 13th Euro-Par Conference (Euro-Par, Rennes, France), ser. Lecture Notes in Computer Science, no. 4641. Springer, August 2007, pp. 62–71.

[11] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in Proc. of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA, USA), 2002, pp. 45–57.

[12] J. S. Vetter and D. Reed, “Managing performance analysis with dynamical statistical projection pursuit,” in Proc. of the Supercomputing Conference (SC1999, Portland, OR, USA). IEEE Computer Society, November 1999.

[13] O. Y. Nickolayev, P. C. Roth, and D. A. Reed, “Real-time statistical clustering for event trace reduction,” The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 144–159, 1997.

[14] D. H. Ahn and J. S. Vetter, “Scalable analysis techniques for microprocessor performance counter metrics,” in Proc. of the Supercomputing Conference (SC2002, Baltimore, MD, USA). IEEE Computer Society, November 2002.

[15] K. A. Huck and A. D. Malony, “PerfExplorer: A performance data mining framework for large-scale parallel computing,” in Proc. of the Supercomputing Conference (SC2005, Seattle, WA, USA). IEEE Computer Society, November 2005.

[16] A. Knüpfer and W. E. Nagel, “Construction and compression of complete call graphs for post-mortem program trace analysis,” in Proc. of the International Conference on Parallel Processing (ICPP, Oslo, Norway). IEEE Computer Society, June 2005, pp. 165–172.

[17] M. Casas, R. M. Badia, and J. Labarta, “Automatic phase detection of MPI applications,” in Proc. of the Conference on Parallel Computing (ParCo, Aachen/Jülich, Germany), September 2007.

[18] T. Gamblin, R. Fowler, and D. A. Reed, “Scalable methods for monitoring and detecting behavioral equivalence classes in scientific codes,” in Proc. of the 22nd Int’l Parallel and Distributed Processing Symposium (IPDPS, Miami, FL, USA). IEEE Computer Society, 2008.

[19] C. Lu and D. A. Reed, “Compact application signatures for parallel and distributed scientific codes,” in Proc. of the Supercomputing Conference (SC2002, Baltimore, MD, USA). IEEE Computer Society, November 2002.

[20] Scalasca, http://www.scalasca.org/.

[21] M. Geimer, B. Kuhlmann, F. Pulatova, F. Wolf, and B. J. N. Wylie, “Scalable collation and presentation of call-path profile data with CUBE,” in Proc. of the Conference on Parallel Computing (ParCo, Aachen/Jülich, Germany), September 2007, pp. 645–652.

[22] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006.

[23] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. The MIT Press, 2001.

[24] Standard Performance Evaluation Corporation. (2007) SPEC MPI2007 benchmark suite. [Online]. Available: http://www.spec.org/mpi2007/
