
Seminar ”Software Visualization”

Visualizing Software Evolution

Adrian Ulges

October 29, 2005

University of Kaiserslautern

Advisor: Dr. Andreas Kerren

Contents

1 Introduction

2 Software Metrics

3 First Idea: Aggregation - RELVIS
  3.1 Data Model
  3.2 Star Plot-based Visualization in RELVIS
  3.3 Results
  3.4 Discussion

4 Second Idea: Data Mining - EPOSEE
  4.1 Related Approaches
  4.2 Evolutionary Coupling
  4.3 Visualizing Evolutionary Coupling
  4.4 Discussion

5 Third Idea: Visual Compression - CVSSCAN
  5.1 A Snapshot Approach: SEESOFT
  5.2 Evolutionary Visual Compression: CVSSCAN
  5.3 Discussion

6 Conclusions and Discussion

Abstract

The maintenance of software systems is a time-consuming and cost-sensitive task, especially when confronted with so-called legacy systems: large, long-lived systems characterized by unstable requirements and outdated development methods. For the understanding of such systems, their release history, usually encoded in software repositories, provides valuable information about regions of stability, hidden module dependencies, and programmer competence. Because vast amounts of information have to be analyzed, visualization of the software evolution process plays a key role: it supports the user in searching for interesting patterns and in forming a mental model of the development process.

This seminar paper presents recent research approaches in the area of software evolution visualization. Methods are divided into three classes according to their strategy for coping with the size of the underlying information space: Aggregation, Data Mining, and Visual Compression.


1 Introduction

It has been stated extensively that software development is in a "chronic crisis" [9]. Software systems, while growing larger and more complex, suffer from bugs, unfulfilled requirements, and late delivery. One key problem related to this crisis is the maintenance of existing systems, where changing requirements, porting to new platforms, or the need to remove bugs usually induce a so-called reverse engineering process [17]: models of the system architecture are constructed with the purpose to "increase the overall comprehensibility of the system for both maintenance and new development" [12]. Consequently, the first step towards the maintenance of a software system is to understand it. This is a difficult task for several reasons:

• Systems may comprise several million LOC ("lines of code"), be poorly documented, and have a long lifetime in which they undergo many maintenance cycles. Such systems are referred to as legacy systems, and their maintenance is a difficult challenge.

• Due to employee turnover, the developers of the original system may no longer be available.

• Programming languages, architecture principles, and development methods may have become outdated since the first release of the system.

These facts make software maintenance a challenging task that accounts for 90 % of worldwide software development cost [18]. Consequently, software's chronic crisis can in fact be seen as a crisis of software maintenance.

A typical maintenance situation might be the following: a developer is told to examine an unknown software system, to add or remove features, to port it to a new operating system, or to fix a bug. It is likely that the developer will soon get lost in the depths of the code due to the vast amount of rather unstructured information and the lack of documentation. Furthermore, the object-oriented paradigm, despite its benefits for re-engineering, does not support a sequential reading of the code, such that a developer not familiar with the code simply does not know where to start.

Code is usually modularized, which means that visibilities and effects of changes are limited to the interior of a module. However, modularization may be incomplete in practice. This has a disturbing consequence for the maintainer: even if he tries to make a small local change, he cannot predict its global effects. This is particularly true for poorly modularized code containing many inter-component dependencies [11].

Consequently, it is not surprising that, according to experts, 50 % of maintenance time is required for understanding a system and constructing a mental model of its architecture [18].

A well-known way to overcome this problem is to visualize the entities of a software system, ranging from coarse-grained subsystems and modules over classes to fine-grained ones like methods, attributes, or even single code lines. Software metrics are used to measure the complexity and influence of entities. The resulting numerical values are mapped to visual attributes like color and shape, and the resulting pictures give an overview of the system without digging into the source code. Figure 1 illustrates this process as a visualization pipeline: for each software entity, some measurements are derived using software metrics, and each metric is then mapped to a visual attribute (e.g., the height, width, and color of rectangles in a resulting graph). The benefits of the resulting visualization lie in the fundamental laws of human perception: while reading source code is a sequential process, image information can be processed in parallel at very high rates [19].
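The pipeline described above can be sketched in a few lines. This is only a minimal illustration, not the implementation of any of the cited tools: the function names, the linear scaling, and the output range of 10 to 100 drawing units are assumptions made for the example.

```python
def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] to [out_lo, out_hi]."""
    if hi == lo:
        return out_lo
    return out_lo + (value - lo) / (hi - lo) * (out_hi - out_lo)


def to_rectangles(entities, mapping, out_range=(10.0, 100.0)):
    """Map the metric values of each entity to visual attributes.

    entities: {entity_name: {metric_name: value}}
    mapping:  {visual_attribute: metric_name}, e.g. {"height": "LOC"}
    (Both structures are hypothetical; they only mirror the pipeline idea.)
    """
    # Determine the value range of each mapped metric over all entities.
    ranges = {}
    for attr, metric in mapping.items():
        vals = [m[metric] for m in entities.values()]
        ranges[attr] = (min(vals), max(vals))
    # Scale every metric value into the drawing range of its attribute.
    return {
        name: {attr: scale(m[metric], *ranges[attr], *out_range)
               for attr, metric in mapping.items()}
        for name, m in entities.items()
    }
```

Called with a mapping such as {"height": "LOC", "width": "methods"}, the sketch yields one rectangle description per entity, where the largest entity receives the maximum height and the smallest the minimum.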

Furthermore, visualizations are good abstractions of the code, giving an overview of the underlying system and guiding the human eye to anomalies, clusters, and outliers. They have proven useful in other areas like Scientific Visualization and Information Visualization. A good illustration of a software system can be characterized by several criteria like the capacity of information displayed, the possibility of user interaction, or, according to the well-known detail-on-demand principle, the possibility of having a fast look at the underlying code (which is referred to as code proximity [12]). Several conceptual considerations have been made concerning the quality of visualizations, leading to widely accepted principles like "focus & context".

Adrian Ulges, Matr.Nr. 343 114, "Visualizing Software Evolution"

Figure 1: A processing pipeline of a typical software visualization system: metrics are used to extract numerical information from the underlying software entities. Afterwards, the results are displayed using visual attributes.

However, software has special properties (e.g., its structure consisting of a text corpus on the one hand and hierarchical entities on the other), such that elaborate approaches have been designed specifically for this field of software visualization.

Snapshot approaches. Many earlier software visualization approaches (e.g., [12, 21]) have addressed snapshots of a system and displayed the status quo of its architecture. Such visualizations can help to understand the overall system structure, identify important classes, and detect coupling between modules. Well-known examples are UML diagrams. Often the sheer amount of code forces designers to split the visualization into multiple views for certain aspects. An interesting approach for this is given by Lanza and Ducasse [12], who stated that several entity granularities, combined with many metrics and possible visualization layouts, lead to multiple views on the system, referred to as Polymetric Views. The authors describe special views visualizing certain system aspects (ranging from general views for "system hotspots" to focused ones for special design patterns like "storage class detection"). They also present a detailed methodology for browsing through views from general aspects to a detailed examination. A similar approach is followed by Wu [21], whose spectrograph views can be tailored using varying color schemes, entities, and time units to emphasize certain aspects of a software system.

Other researchers have focused on the design of visualization toolkits themselves: since software systems can be viewed from many aspects, the visualization architecture should be adaptable by the user and allow for the definition of customizable views [16]. Telea et al. [17] suggest an architecture with an upgradable core of generic components written in C++, surrounded by easily modifiable TCL scripts. This allows for an easy combination of layouts, attributes, and entities.

All this work shows that static visualization of software systems is a difficult task and an active area of research. The key problem is that the amount of information pushes both the display resolution and the capacity of human perception to their limits. This is from now on referred to as the size problem of the underlying information space.

Exploiting software evolution. Unfortunately, the approaches introduced so far neglect valuable information about software, more precisely about its evolution process. In most cases, only one release (a "snapshot") of the system is displayed. On the other hand, information on the system evolution is usually present in the form of repositories like CVS¹, such that it can be derived without much extra effort.

Consequently, the key idea addressed in this paper is to understand the evolution of software systems by exploiting their release histories. This can help answer many questions that cannot be covered by a static analysis of the system:

• When and why did a certain module of the system grow?

• What are unstable parts of the system that have undergone lots of changes in the past (these might make good refactoring candidates)?

• Which modules are the most critical cost factors (this is where the most manpower should be assigned)?

• Which parts have been buggy in the past (they need to be checked more intensively, since they are more error-prone)?

• How productive are the developers?

• Who is an expert for a certain part of the system (this is probably the developer who wrote the corresponding code)?

• Why is the structure of the current system as it is (the motivation of former design decisions might be recovered)?

• Is the project on schedule? What work has been done in the last two weeks?

While offering these clear benefits, the analysis of software histories suffers even more from the size problem than snapshot visualizations do. For the latter, the size of the information depends on the number of entities representing the system, #ENT (e.g., code lines or classes). Furthermore, a number of metrics #MET can be assigned to each entity (see Section 2 for more information). This leads to an information space of size #ENT · #MET.

When analyzing the evolution of a system, a number of releases #REL has to be analyzed, yielding an amount of information that is a multiple of the static case: #ENT · #MET · #REL. For typical legacy systems, hundreds of releases may exist, which simply cannot be managed by just inspecting the code. Rather, elaborate visualizations are required.
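To give a feeling for the orders of magnitude, consider a purely hypothetical system with 1000 entities, 10 metrics per entity, and 200 releases (these numbers are assumptions chosen for illustration, not taken from any of the cited case studies):

```latex
\[
\#\mathrm{ENT} \cdot \#\mathrm{MET} = 1000 \cdot 10 = 10^{4}
\qquad\text{versus}\qquad
\#\mathrm{ENT} \cdot \#\mathrm{MET} \cdot \#\mathrm{REL} = 1000 \cdot 10 \cdot 200 = 2 \cdot 10^{6}
\]
```

The evolutionary view thus has to cope with two hundred times as many scalar values as a single snapshot of the same system.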

Extending snapshot approaches. One straightforward idea to visualize software evolution is to extend snapshot approaches with animation such that the user can click through the sequence of release images [6]. An example of this is the dynamic graph drawing approach by Collberg et al. [4]: for each day, a graph is laid out (e.g., a call graph, from which strong couplings between components of the system can be derived), and the user can navigate forward and backward in time to follow the development process of the system. The basic problem is the layout of these graphs:

¹ https://www.cvshome.org/


• On the one hand, the shape of the graph should not change abruptly, and nodes should be traceable for the user. This eases the construction of a mental model.

• On the other hand, node positions should not remain fixed, since this gives a poor layout for single graphs where parts of the screen remain unused.

Collberg et al. suggest a correction step called "hierarchical graph drawing" based on spring-based layout, where nodes repel each other such that the layout of a graph changes smoothly. A color map is used to encode the age of entities and emphasize recent changes: an added component fades from red to pale blue over time. The result can be observed in Figure 2, where two consecutive versions of an inheritance graph are illustrated. It can be seen that some branches of the graph have been slightly shifted, and that newly added components are highlighted in red.
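The age-based color map can be illustrated with a simple interpolation. The following sketch is an assumption about how such a fade could be computed: the linear interpolation, the function name, and the concrete RGB value chosen for "pale blue" are not taken from [4].

```python
def age_color(age, max_age, new=(255, 0, 0), old=(173, 216, 230)):
    """Interpolate linearly from `new` (age 0) to `old` (age >= max_age).

    The RGB triples are assumptions: pure red for new components and a
    light blue for old ones.
    """
    if max_age <= 0:
        raise ValueError("max_age must be positive")
    # Clamp the relative age into [0, 1] so very old entities stay pale blue.
    t = min(max(age / max_age, 0.0), 1.0)
    return tuple(round(n + t * (o - n)) for n, o in zip(new, old))
```

A component added in the current release is drawn in red, and its color approaches pale blue as it ages over max_age time steps.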

Figure 2: Two versions of an inheritance graph, associated with consecutive releases. Slight changes in the layout can be observed, as well as new components highlighted in red (picture taken from [4]).

The basic shortcoming of this approach is that the evolution of an entity cannot be viewed at a single glance. In many cases, a time-consuming and exhausting search process is required, for example if the user wants to go back to some release that he has already viewed, or if he wants to know when a component was deleted.

New approaches for software evolution. Due to these shortcomings of animated approaches, more elaborate visualization methods have to be designed specifically to provide insight into the evolution of a software system. This paper provides an overview of new techniques in this field. The data source used in the approaches is in most cases a software repository like CVS, which is parsed to extract information on the entities of the software system and their relationships. The resulting model yields the lifetime of all entities over time. Ways are described to feed this information into the visualization pipeline.

The methods introduced differ in their strategy for overcoming the size problem, such that they can be subdivided into three classes corresponding to three key ideas:

1. Aggregation. The first approach is not to visualize the code itself, but to aggregate the information to coarse-grained entities like subsystems, components, modules, files, or classes. Software metrics are used to describe entity properties, according to the software visualization pipeline (see Figure 1). Mostly, graph-based approaches have been used to visualize the entities and their relationships. This topic will be addressed in Section 3.


entity granularity | entity metrics | relationship metrics
coarse-grained (system, subsystem, module) | LOC, # files, # classes (size); execution time in tests (influence) | # couplings, # calls in test runs
fine-grained (class, method, attribute) | # LOC (size); # calls from other entities, # calls of other entities, # accesses of attributes, # superclasses/subclasses, # inherited methods (influence) | fan-in, fan-out², # calls in test runs

Table 1: Some examples for software metrics, structured according to the granularity of the underlying entities.

2. Data Mining. The second idea is to filter the vast amounts of information and present only interesting trends to the user. Data mining methods that are well known in the database community are used to derive significant patterns in the software development process in the form of rules. Despite the reduction of information to be visualized, the illustration of rules is still a difficult task, which will be addressed in Section 4.

3. Visual Compression. The third approach is to stick to a visualization at code level. The size problem is overcome to some extent by a strong compression of the illustration, yielding a "microfilm view" of the code. This approach will be discussed in Section 5.

The remainder of this paper is organized as follows: in Section 2, a short overview of metrics used in the area of software evolution visualization is given. After this, each of the three approaches is introduced and discussed in its own section. The last section provides a general discussion.

2 Software Metrics

It has been stated by DeMarco [5] that "you cannot control what you cannot measure". Since one of the fundamental purposes of software engineering is to gain control over software development, many attempts have been made to "measure" the system design process and its products, to estimate risks, and to learn clearer strategies for design decisions from former projects.

Software metrics are a fundamental concept in this area: they measure the underlying textual software system and describe it using numerical attributes. The most famous example is the LOC ("lines of code") metric, by which the complexity of a software system is measured based on the number of its code lines. A selection of typical software metrics is given in Table 1. This paper focuses on design metrics, which are derived from an analysis of the code. Metrics have been divided into two groups: the first describes the complexity and influence of entities in the software system (like classes, modules, or routines), the second measures the relations between them, like coupling between modules or the calls between classes (note that the focus is on binary relationships). For a more extensive overview of software metrics, see the listings in [7, 12, 15].

It should be kept in mind that software metrics aggregate knowledge such that information loss occurs, and that metrics can even be misleading: for example, the LOC metric does not tell anything about the complexity of the underlying code. Two classes with the same LOC may differ strongly: one might consist mostly of get() and set() methods, while the other performs complicated tasks and therefore makes many external calls.

² Number of external entities called, describing the dependency of an entity on external code.
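As an illustration of relationship metrics that complement LOC, fan-in and fan-out can be derived from a list of call relations. The sketch below assumes a hypothetical input format of (caller, callee) pairs and counts distinct external entities; how a concrete tool extracts these pairs is outside its scope.

```python
from collections import defaultdict


def fan_in_out(calls):
    """Return {entity: (fan_in, fan_out)} from (caller, callee) pairs.

    fan_out counts the distinct external entities an entity calls,
    fan_in the distinct external entities that call it.
    Self-calls are ignored because they are not external.
    """
    callers = defaultdict(set)
    callees = defaultdict(set)
    for caller, callee in calls:
        if caller == callee:
            continue
        callees[caller].add(callee)
        callers[callee].add(caller)
    entities = set(callers) | set(callees)
    return {e: (len(callers[e]), len(callees[e])) for e in sorted(entities)}
```

Two classes with identical LOC can then be told apart: a pure data holder typically shows a fan-out of zero, while a class performing complicated tasks calls many external entities.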

Nevertheless, software metrics are very widespread and mostly supported by management, since they provide an easy way to check the complexity and structure of a software system. Metrics also play a key role in software visualization: they map entity properties to numerical values, which can then easily be mapped to pictures. When speaking of the visualization of a software system, it is in most cases actually software metric values that are displayed.

3 First Idea: Aggregation - RELVIS

Figure 3: A typical aggregation-based visualization. Information in the form of software metric values is associated with coarse-grained software entities, in this case modules. For the visualization, optical attributes like color and object shape are used (picture taken from [15]).

The first approach to overcome the size problem addressed in this paper has already been outlined in the introduction: information is aggregated to coarse-grained entities of the software system, whose size, influence, and relationships are then described using software metrics. Graph-based visualization techniques are used to visualize the entities and the relationships between them (well-known examples are UML and ERM diagrams). Metric values are encoded as visual attributes. A typical example can be seen in Figure 3: seven rectangles correspond to seven modules, and metrics are encoded using the shape and the color of the rectangles.

It is obvious that the resulting illustrations are often very simple due to the strong reduction of information. However, extending aggregation approaches to visualize the evolution of the software system is not trivial, since the amounts of information are significantly higher than in the static case. A promising approach for this has been introduced with the RelVis system by Pinzger et al. [15]. The key idea of the approach is not to animate static approaches (as described in Section 1), but to compress the evolution of entities into so-called Kiviat diagrams (also known as Star Plots).

Figure 4: An evolution matrix for an entity in a software system. There are n releases and k metrics (like LOC or fan-out) used, yielding kn scalar values describing the evolution of the entity.


3.1 Data Model

The underlying data model for RelVis is the so-called evolution matrix, which is obtained by parsing the release repository of a system, identifying entities, and measuring entities and their relationships by static code analysis. For each entity and each relationship between two entities, an evolution matrix is obtained providing all metric information that is to be visualized.

See Figure 4 for a typical evolution matrix of a software entity. Note that each row in the matrix corresponds to the history of a metric, while each column represents one release as a snapshot of the system during its development.

Since each matrix contains #REL · #MET scalar values, the overall amount of information to be visualized is #ENT · #REL · #MET for the entities and #ENT² · #REL · #MET for relationships. If coarse-grained entities are used, this amount of data is moderate.
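The data model can be sketched as a small data structure. The following is a minimal illustration under assumptions: per-release metric values are taken as given dictionaries (the repository parsing and static analysis steps are not shown), and all names are hypothetical rather than part of RelVis.

```python
def evolution_matrix(snapshots, metrics):
    """Build a k x n matrix: rows are metrics, columns are releases.

    snapshots: list of per-release dicts {metric_name: value},
               ordered from the oldest to the newest release.
    metrics:   list of the k metric names defining the row order.
    """
    return [[snap[m] for snap in snapshots] for m in metrics]


def information_volume(n_entities, n_releases, n_metrics):
    """Scalar counts for entities and for relationships, as in the text."""
    per_matrix = n_releases * n_metrics
    return n_entities * per_matrix, n_entities ** 2 * per_matrix
```

For seven modules, three releases, and four metrics, this yields 84 values for the entities and 588 for the relationships, which is still a moderate amount of data.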

3.2 Star Plot-based Visualization in RELVIS

RelVis bases the visualization of software entities on Kiviat diagrams instead of rectangle-based shapes as in Figure 3. The basic principle is displayed in Figure 5: axes associated with metrics are arranged around a center, and each attribute of a data record is encoded by an intersection with the corresponding axis. The data record is then visualized by connecting adjacent intersections.

Figure 5: A Kiviat diagram for a single data record. An axis is associated with each metric. Intersections with these axes yield a star-like shape (picture taken from [15]).

Since the resulting shape is similar to a star, Kiviat diagrams are often also referred to as Star Plots. Note that two circles restrict the distance of the intersections from the center of a diagram to a maximum and minimum value, which prevents degenerate diagrams with extreme peaks or values collapsed into the center.

Visualizing entities. Kiviat diagrams have already been used for snapshot visualizations in the static case [14], but they are even more suitable for visualizing a development process. To this end, all stars associated with releases are inserted into the same diagram. See Figure 6 for an illustration of the concept: there are three stars associated with three releases. Note that the order of the releases is not clear a priori: it is encoded separately using color (an alternative would be to introduce a separate axis encoding the release number).

The resulting impression is similar to the annual rings of a tree: in phases of strong growth of the star, the visualized entity has expanded. This visualization provides a fast, convenient overview of the development of a single entity.
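The geometry behind such a diagram can be sketched as follows. This is an illustrative computation only: the per-metric normalization by the maximum value, the radii of the two bounding circles, and the function names are assumptions and are not taken from [15].

```python
import math


def kiviat_polygon(values, max_values, r_min=0.2, r_max=1.0):
    """Return the (x, y) vertices of one star, one vertex per metric axis.

    Each value is scaled into [r_min, r_max], mirroring the two bounding
    circles, so no vertex collapses into the center or leaves the outer ring.
    """
    k = len(values)
    points = []
    for i, (v, v_max) in enumerate(zip(values, max_values)):
        share = min(max(v / v_max, 0.0), 1.0) if v_max > 0 else 0.0
        radius = r_min + share * (r_max - r_min)
        angle = 2 * math.pi * i / k  # axes are spread evenly around the center
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


def kiviat_evolution(matrix, r_min=0.2, r_max=1.0):
    """One polygon per release (column) of a metrics-by-releases matrix."""
    max_values = [max(row) for row in matrix]
    n_releases = len(matrix[0])
    return [kiviat_polygon([row[j] for row in matrix], max_values, r_min, r_max)
            for j in range(n_releases)]
```

Drawing the resulting polygons into the same diagram, one color per release, gives the "annual rings" impression described above.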

Visualizing relationships. Not only metrics associated with entities have to be encoded, but also the binary relationships between entities. This is usually done using edges connecting the entities in a graph. One scalar value can be encoded by the width of this edge. If there are further metrics to be displayed, an additional Kiviat diagram is used.

This principle is illustrated in Figure 7: a pink edge of a certain width spans between the two nodes. Four metrics are encoded by connecting an additional Kiviat diagram with the edge. Note that this Kiviat diagram is divided into two halves because relationships can be asymmetric (for example, module A may depend strongly on module B, but B not on A).


Figure 6: Using Kiviat diagrams to visualize the evolution of an entity. One star is drawn for each release, and the areas between adjacent stars are filled with color (picture taken from [15]).

Figure 7: A Kiviat diagram illustrating a relationship between the two modules A and B. The diagram is linked to the edge between A and B using an arrow (picture taken from [15]).


(a) The first part of a RelVis visualization, presenting the entity data...

(b) ...and the second one for the dependencies between entities.

Figure 8: A two-step RelVis visualization: entities and relations are visualized in two separate graphs with similar layouts (picture taken from [15]).

3.3 Results

The authors tested the approach in an informal case study: the Mozilla project³ was chosen as a test system. Several key milestones were selected as releases to be visualized. RelVis presents the resulting information in two steps: two separate graphs of the same structure, but with different Kiviat diagrams, are displayed as in Figure 8. While the first graph shows the entities, the second one illustrates the associated relations.

The resulting visualization provides many insights at one glance that could not be obtained by animating snapshot approaches:

• Phases of strong growth can be identified for both entities and dependencies between them.

• God classes that suffer from extreme growth of responsibilities can be identified as good refactoring candidates.

• Important parts of the system as the most critical cost factors are made obvious.

• Discontinuities in the development process are visible.

• Strong coupling between modules can be detected.

3.4 Discussion

With the RelVis system, Pinzger et al. have developed an approach that clearly outperforms the animation of static approaches when it comes to the visualization of software evolution. Undoubtedly, many insights into the basic system structure can be gained by a single glance at the resulting pictures. Furthermore, visualizations are made more concise by the separation into two graphs and by the trick of putting related metrics next to each other such that the annual rings obtain a smooth shape.

³ http://www.mozilla.org

Despite its benefits, the limitations of the approach should be stated clearly. One problem is that occlusions may occur if an entity does not develop monotonically (this can be observed in Figure 8). A more general problem is that the approach seems to scale badly. Even for only seven modules as displayed in Figure 8, the resulting images become cluttered, especially in the case of the relations display. The graph-based visualization is not very compact: wide areas are simply left white.

These properties make RelVis a good approach for a very first, high-level view of an unknown system. To really understand a software system, the maintainer needs to go from this overview to a more detailed level. Note also that information may get lost due to aggregation: for example, an illustration at module level does not disclose awkward dependencies between submodules within one of the illustrated modules.

4 Second Idea: Data Mining - EPOSEE

One phenomenon crucially influencing the maintainability of software systems is inter-modular coupling. This term refers to dependencies between components of a software system: when coupling is high between two such components, it is a sign of poor modularization of the code. Changes made can have a global effect on other components, which makes maintenance a painful task. Consequently, a frequently cited goal of software design is low coupling, such that components are kept as independent from each other as possible.

Conventional approaches to detect coupling measure it by a static program analysis of the source code [22] (e.g., by analyzing caller-callee relationships). Some derived metrics can be found in Table 1, for example fan-in and fan-out. The problem with this approach is that it cannot recover all dependencies between components in general. Influences may be transitive and only be recoverable in dynamic test cases, which usually cannot be performed exhaustively.

In this section, an alternative way is described, namely to derive inter-module dependencies from the evolution of the underlying software system. To this end, methods of data mining are used. Note that, as in the last section, these approaches are based on a reduction of information, but, in contrast to aggregation, intelligent algorithms are used to detect interesting information.

This section is organized as follows: some related approaches are described [8, 10] before the fundamental concept of evolutionary coupling [3] is introduced. The underlying theory derived from data mining is outlined, based on the concept of rules. The last part of this section describes approaches to visualize detected dependencies, which have been subsumed by Burch et al. in their EpoSee system [3].

4.1 Related Approaches

In their work, Girba et al. [10] address the hypothesis that the parts of the system that have been changed very often recently usually make good refactoring candidates. However, this does not hold in general, but depends on the development process: for some systems, changes may focus on certain areas of the system for a longer time, while for others programmers may work on many components in parallel. The authors present an interesting approach to examine the associated "change climate" of a software system. They describe the predictability of future changes by deriving a value called "Yesterday's Weather" (for a low value, changes are very discontinuous, while for a higher one they are more predictable).

Unfortunately, only a limited, informal evaluation is presented, and concrete applications for the approach are hard to find.


Another approach by Gall et al. [8], called Caesar, promises more relevance for practical applications, since it detects coupling between system components by exploiting software evolution. The basic idea is to detect dependencies based on the fact that components are changed simultaneously during the development process. The authors examined a typical legacy system using the following formal description: the approach associates every system component $c_i$ with the sequence of releases $s_i = \langle r^i_1, r^i_2, \ldots, r^i_{n_i} \rangle$ in which $c_i$ has been changed. Furthermore, a component is said to support a sequence $s$ of releases if $s$ is a subsequence of $s_i$ ($s \sqsubseteq s_i$).

The definition of coupling between two components is now based on sequences of changes: two components $c_i, c_j$ are said to be connected via a sequence $s$ if $s \sqsubseteq s_i \wedge s \sqsubseteq s_j$.
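The notions of support and connection can be sketched directly from these definitions. The snippet assumes that "subsequence" allows gaps (the releases of s must occur in the component's change sequence in the same order, but not necessarily consecutively); whether [8] uses exactly this reading is an assumption, and the function names are hypothetical.

```python
def is_subsequence(s, seq):
    """True if the releases in s occur in seq in the same order (gaps allowed)."""
    it = iter(seq)
    # `r in it` advances the iterator, so matches must appear in order.
    return all(r in it for r in s)


def connected(s, s_i, s_j):
    """Two components are connected via s if both change sequences support s."""
    return is_subsequence(s, s_i) and is_subsequence(s, s_j)


def supporters(s, change_sequences):
    """Names of all components whose change sequence supports s."""
    return sorted(c for c, seq in change_sequences.items()
                  if is_subsequence(s, seq))
```

Given change sequences such as {"c1": [1, 2, 4, 6], "c2": [2, 3, 4, 6]}, the two components are connected via the release sequence [2, 4, 6], since both were changed in these releases.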

To detect hidden dependencies in the examined legacy system, the authors used a two-step procedure:

1. Change Sequence Analysis: Mine the release database for sequences that connect pairs of components. In particular, such sequences should be as long as possible and strongly supported by many components.

2. Change Report Analysis: Exploit meta information to verify that the dependencies resulting from Step 1 are valid and have not been derived accidentally. To this end, change reports are used that give the specific reason for a code change (like "removing bug no. BR456"). If two components have not only been changed at the same time, but also for the same reason, a dependency is detected.

Although interesting insights were obtained in the case study [8], some shortcomings make the approach unsuitable for practical applications: no visualization concept is provided, and dependency is only defined via sequences of releases. A more general concept for such dependencies is provided by evolutionary coupling, addressed in the next subsection.

4.2 Evolutionary Coupling

The most famous concept for the detection of hidden dependencies in software histories is evolutionary coupling [3] (also referred to as logical coupling [8]). Its key idea is to detect software entities that have been changed in the same releases very often during the evolution of a system. Rules express this fact, and methods of data mining are used to detect hidden dependencies in the form of such rules from a release database.

Formal Description. Evolutionary coupling is a widespread concept successfully used in several areas, e.g., economics⁴, DNA analysis [13], and text indexing [20]. Consequently, a formal description can be found in many papers (e.g., see [3, 20, 22]): one release r of an underlying software system is modeled as the set of all entities changed (usually, rather fine-grained entities like classes, methods, or even code lines are chosen). The system history H is viewed as a sequence of releases $\langle r_1, \ldots, r_n \rangle$. Given a set of entities R, its frequency is defined as the number of releases in H containing R:

$$\mathrm{freq}(R) = |\{ r_i \in H \mid R \subseteq r_i \}|$$

To detect evolutionary coupling between entities, H is mined for two kinds of rules:

1. Association rules like A → B: "if A is changed, B is usually changed in the same release" (A, B are sets of entities). There are two fundamental measures describing the relevance of such an association rule:

• $\mathrm{supp}(A \to B) = \frac{1}{n}\,\mathrm{freq}(A \cup B)$: the support describes how often A and B have been changed together and thus indicates the statistical significance of a rule.

• $\mathrm{conf}(A \to B) = \frac{\mathrm{freq}(A \cup B)}{\mathrm{freq}(A)}$: the confidence corresponds to a conditional probability: it indicates the probability that B is changed if A is changed.

2. Sequence rules like <A B> → <A B C>: in this case, both sides are sequences, such that the order of changes is also included. The example can be read as "if B has been changed after A has been changed, then C will be changed afterwards". Support and confidence are defined analogously for these rules.

Footnote 4: http://www.iiit.net/~vikram/mining.html

Figure 9: A release history showing an interesting trend: modules A and B are mostly changed at the same time, such that the rule B → A holds with support 3/6 and confidence 3/4.

Mining Evolutionary Coupling "Mining" means searching the database for rules with high confidence and high support. An example is illustrated in Figure 9: while C is changed completely independently of A and B (e.g., C → B has supp = 0, conf = 0), the rules A → B (supp = 3/6, conf = 1) and B → A (supp = 3/6, conf = 3/4) express a strong dependency between A and B. Note also that rules are not symmetric in general, as can be seen from the different confidences of B → A and A → B.
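The measures above can be checked with a few lines of Python. This is a minimal sketch: the six-release `history` below is hypothetical and was chosen only so that it reproduces the support and confidence values quoted for Figure 9; the function names are illustrative and not taken from any of the cited tools.

```python
from fractions import Fraction

# Hypothetical history of six releases; each release is the set of changed entities.
# It is an assumption chosen to match the values quoted for Figure 9.
history = [
    {"A", "B"},
    {"C"},
    {"A", "B"},
    {"B"},
    {"A", "B"},
    {"C"},
]


def freq(entities, history):
    """Number of releases in which all given entities were changed."""
    return sum(1 for release in history if entities <= release)


def supp(lhs, rhs, history):
    """Support of the rule lhs -> rhs: freq(lhs | rhs) divided by n."""
    return Fraction(freq(lhs | rhs, history), len(history))


def conf(lhs, rhs, history):
    """Confidence of the rule lhs -> rhs: freq(lhs | rhs) / freq(lhs)."""
    denominator = freq(lhs, history)
    if denominator == 0:
        return Fraction(0)  # rule never applicable; treated as confidence 0 here
    return Fraction(freq(lhs | rhs, history), denominator)


A, B, C = {"A"}, {"B"}, {"C"}
for name, lhs, rhs in [("A -> B", A, B), ("B -> A", B, A), ("C -> B", C, B)]:
    print(name, "supp =", supp(lhs, rhs, history), "conf =", conf(lhs, rhs, history))
```

Exact fractions are used instead of floats so that the asymmetry between A → B and B → A can be compared directly with the values in the text.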

Algorithms for the efficient detection of association and sequential patterns in large databases are well known. One popular example is the Apriori algorithm introduced by Agrawal and Srikant. The authors provide a detailed description of the algorithm, a proof of correctness, and performance evaluations. The interested reader is referred to the publications [1, 2].
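To give an impression of the level-wise search behind Apriori, the following simplified sketch finds all entity sets that occur in at least `min_count` releases. It is not the optimized algorithm from [1]: candidate counting is done naively, and the `history` data and the function name are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical release history (sets of changed entities).
history = [{"A", "B"}, {"C"}, {"A", "B"}, {"B"}, {"A", "B"}, {"C"}]


def frequent_itemsets(history, min_count):
    """Return {itemset: frequency} for all itemsets with frequency >= min_count."""

    def count(itemset):
        return sum(1 for release in history if itemset <= release)

    singles = {frozenset([e]) for release in history for e in release}
    level = {s for s in singles if count(s) >= min_count}
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = count(itemset)
        # Join step: combine frequent k-sets into candidate (k+1)-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        level = {
            c
            for c in candidates
            if all(frozenset(s) in result for s in combinations(c, k))
            and count(c) >= min_count
        }
        k += 1
    return result


for itemset, n in sorted(frequent_itemsets(history, 2).items(), key=lambda p: sorted(p[0])):
    print(sorted(itemset), n)
```

The pruning relies on the observation that a set can only be frequent if all of its subsets are frequent, which keeps the number of candidates that have to be counted small.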

Using Evolutionary Coupling In contrast to static program analysis, evolutionary coupling does not use the code itself but change logs from the system history. Note that this approach also covers metadata, such that, e.g., dependencies of the documentation on the code may be detected. Despite the differences, both evolutionary coupling and static code analysis basically describe dependencies between entities in a software system. Consequently, applications are similar:

• design weaknesses like redundant code or strong coupling are indicated.

• designers may use the knowledge to check whether the system corresponds to their modularization, or whether hidden dependencies exist [3]. For this purpose, a concise visualization of rules is needed (see Section 4.3).

• a maintainer may be guided when changing software. This idea has been followed by Zimmermann et al. [22] with their Rose system: if a developer changes a certain entity A, the system (implemented as an Eclipse plugin, see footnote 5) uses rules A → B to suggest a list of locations B where related changes might be necessary. This makes software development more efficient and prevents bugs due to incomplete or inconsistent changes. Unfortunately, practical evaluations disclosed that the recall and precision of suggestions were rather low (for stable systems, 28% and 40% were reached): suggestions by the Rose system delivered many false alarms, and of course only locations that had already been changed in the past could be predicted. Nevertheless, the quality of Rose suggestions increases with the length of the history, such that the approach may be especially interesting for the maintenance of problematic legacy systems.
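The idea of turning rules A → B into change suggestions can be sketched as follows. This is not the Rose implementation; the function `suggest`, its thresholds, and the `history` data are hypothetical and only illustrate how co-changed entities could be ranked by confidence.

```python
# Hypothetical release history (sets of changed entities).
history = [{"A", "B"}, {"C"}, {"A", "B"}, {"B"}, {"A", "B"}, {"C"}]


def suggest(changed, history, min_count=2, min_conf=0.5):
    """Rank entities that were frequently changed together with `changed`."""
    touching = [release for release in history if changed in release]
    if not touching:
        return []  # entity never changed before: nothing can be predicted
    counts = {}
    for release in touching:
        for entity in release - {changed}:
            counts[entity] = counts.get(entity, 0) + 1
    ranked = [
        (entity, n / len(touching))
        for entity, n in counts.items()
        if n >= min_count and n / len(touching) >= min_conf
    ]
    # Highest confidence first; ties are broken alphabetically.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


print(suggest("B", history))
print(suggest("D", history))
```

The empty result for an entity without history mirrors the limitation mentioned above: only locations that already occur in the change history can be suggested.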

4.3 Visualizing Evolutionary Coupling

Data mining is usually followed by a second step of visual data mining: the set of all evolutionary couplings derived is so large that it needs to be inspected by a human viewer to discover the most important trends (e.g., inter-module dependencies). Since rules are usually numerous, elaborate visualization methods are required that illustrate trends as visual patterns (as outliers in graphs, for example). Different visualization concepts have been developed depending on the type of the underlying rules, and Burch et al. provide several rule views in their data mining system EpoSee [3]:

Unary association rules The simplest class of rules are unary association rules A_i → A_j, where both A_i and A_j are singletons. In this case, rules can be visualized by a matrix with the entities as both rows and columns. In the matrix cell c_ij, the confidence of the associated rule A_i → A_j can be encoded using color maps, as illustrated in Figure 10 (3D bar charts are an alternative, but unfortunately they suffer from occlusion problems [20]).

Note also that in the case of Figure 10 rule entities are hierarchical: the term "a/a/c" corresponds to "method c in class a in module a". Burch et al. adapt the visualization to this case by sorting the columns and rows of the matrix hierarchically. As a result, square, diagonal-centered areas indicate dependencies between components within the same module; such an intra-module area is highlighted with a green box in Figure 10. In contrast, rules with high confidence that lie outside these intra-module areas indicate disadvantageous inter-module dependencies. Such outliers in the matrix can be detected easily, as also illustrated in Figure 10. The authors successfully applied their method in a case study on the CVS repository of the Mozilla project, where dependencies between modules could be discovered using data mining followed by visual mining, and the resulting observations were related to the system design.
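The data behind such a matrix view can be sketched in a few lines. The example below builds the confidence matrix for unary rules with hierarchically sorted entities and lists high-confidence cells that cross module boundaries. The `history` data, the threshold, and the function names are assumptions for illustration and are not taken from EpoSee.

```python
# Hypothetical release history with hierarchical names "module/class/method".
history = [
    {"a/a/a", "a/a/c"},
    {"a/a/a", "a/a/c"},
    {"b/a/a", "a/b/a"},
    {"b/a/a", "a/b/a"},
    {"a/b/a"},
]


def module_of(entity):
    return entity.split("/")[0]


def confidence_matrix(history):
    """Return (entities, matrix) with matrix[i][j] = conf(entities[i] -> entities[j])."""
    # Sorting by path components keeps entities of one module (and class) adjacent.
    entities = sorted({e for r in history for e in r}, key=lambda e: e.split("/"))
    freq = {e: sum(1 for r in history if e in r) for e in entities}
    matrix = [[0.0] * len(entities) for _ in entities]
    for i, a in enumerate(entities):
        for j, b in enumerate(entities):
            if i != j:
                both = sum(1 for r in history if a in r and b in r)
                matrix[i][j] = both / freq[a]
    return entities, matrix


def inter_module_outliers(entities, matrix, threshold):
    """High-confidence rules whose two sides belong to different modules."""
    return [
        (a, b, matrix[i][j])
        for i, a in enumerate(entities)
        for j, b in enumerate(entities)
        if module_of(a) != module_of(b) and matrix[i][j] >= threshold
    ]


entities, matrix = confidence_matrix(history)
print(entities)
print(inter_module_outliers(entities, matrix, 0.9))
```

In a color-map rendering of `matrix`, the cells reported by `inter_module_outliers` are exactly those that lie outside the diagonal-centered intra-module blocks.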

n-ary association rules A more general rule type are so-called n-ary association rules with sets of entities on the left-hand side. In this case, one straightforward idea would be to use sets of entities as the rows and columns of the matrix. The problem with this approach is that the number of such sets is exponential in the number of entities, such that this visualization scales poorly. Furthermore, entity identity is lost: it is very difficult to find all rules in which a certain entity occurs.

An elegant way out is to switch from an entity-to-entity to a rule-to-entity design of the matrix, obtaining so-called association rule matrices [20]. The rows of the matrix correspond to entities, and each column is used to encode a rule A_1, ..., A_n → B. In this column, the entities A_i are highlighted blue, and B is highlighted red. An example is illustrated in Figure 11: here, a 3D design is used to further indicate the confidence and support of rules. Rules can also be sorted, and entity identity is preserved well.
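The rule-to-entity layout can be illustrated with a small text-based sketch. The rules below are hypothetical, and the markers "A" (antecedent, blue in the original design) and "B" (consequent, red) are stand-ins for the colors; confidence and support, which the 3D design encodes additionally, are omitted here.

```python
# Hypothetical n-ary rules: (antecedent entities, consequent entity).
rules = [
    (("a/a/a", "a/a/b"), "a/a/c"),
    (("a/a/c",), "b/a/a"),
]


def rule_matrix(rules):
    """Return {entity: row}; row[c] marks the role of the entity in rule c."""
    entities = sorted({e for lhs, rhs in rules for e in (*lhs, rhs)})
    rows = {entity: ["."] * len(rules) for entity in entities}
    for column, (lhs, rhs) in enumerate(rules):
        for entity in lhs:
            rows[entity][column] = "A"  # antecedent
        rows[rhs][column] = "B"  # consequent
    return rows


def rules_containing(entity, rows):
    """Column indices of all rules in which the entity occurs."""
    return [column for column, mark in enumerate(rows[entity]) if mark != "."]


rows = rule_matrix(rules)
for entity, row in rows.items():
    print(f"{entity:6} {' '.join(row)}")
print(rules_containing("a/a/c", rows))
```

Because every entity keeps its own row, finding all rules that involve a given entity reduces to scanning a single row, which is the sense in which entity identity is preserved.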

Footnote 5: http://www.eclipse.org


Figure 10: A matrix-based visualization of unary association rules like a/a/c → a/a/a. The green box indicates an intra-module area; high-confidence rules outside this area (like b/a/a → a/b/a) indicate inter-module dependencies.

Figure 11: An illustration of an association rule matrix due to Wong et al. [20]. Advantages: no occlusion, compact (picture taken from [20]).


Alternative visualizations are graph representations: entities are mapped to nodes of a graph, with a directed edge between two nodes in case of an evolutionary coupling. These displays are well suited to detecting outliers and clusters, but they also tend to scale poorly for many nodes.

Sequence rules Another class of rules already introduced in Section 4.2 are sequence rules, where sequences of entities stand on both sides of a rule. For an overview of sequence rules, Burch et al. [3] use parallel coordinates as illustrated in Figure 12. Entities are arranged in columns, and a rule is visualized by linking entities in different columns. For example, the rule <a/a/a, a/a/b> → <a/a/c> is highlighted in the illustration: the red connections between the three entities represent the rule.

The limitations of the approach are obvious. First of all, the maximum number of entities per rule is limited by the number of columns. Another problem is that the approach scales poorly: even in very small illustrations, it is difficult to track individual rules. On the other hand, the visualization allows inter-module dependencies to be detected, which show up as outliers with very steep connections.

Figure 12: Parallel coordinate views are used to visualize sequence rules. The rule <a/a/a, a/a/b> → <a/a/c> is highlighted (picture taken from [3]).

4.4 Discussion

In their publication [3], Burch et al. address the visualization of rules mined from software repositories. No new visualization techniques are presented; instead, well-known methods are slightly modified to suit hierarchical entities.

However, the presented techniques may be useful in practice to support "visual mining", i.e., the detection of visual patterns like clusters and outliers indicating shortcomings of the system design (e.g., outliers in matrices and graph visualizations may indicate inter-module dependencies).

Furthermore, the implementation of the approach provides some useful features: following the basic principles of visualization, the user may go from an overview into detail by filtering rules and zooming to more detailed illustrations. In addition, subwindows are linked such that rules selected by brushing are highlighted in all views simultaneously.

5 Third idea: Visual Compression - CVSSCAN

Note that both basic ideas introduced before (aggregation as well as data mining) were based on a reduction of the displayed information, causing a potential information loss. Furthermore, the displayed structures (mostly graphs and matrices), although derived from a textual basis, give no insight into the underlying code. It has been stated clearly by several authors [3, 12, 21] that visualization gives an overview and points out interesting trends, but cannot completely replace the study of the code. Consequently, a good visualization should provide a way to easily view the code related to the displayed graphical objects. This is referred to as code proximity. Good code proximity is hard to achieve, especially for high-level visualizations with coarse-grained entities like RelVis (Section 3).

Approaches described in this section visualize the most fine-grained entities at the other end of the granularity spectrum [15], namely the code lines themselves. Of course, a high visual compression is necessary for this.

5.1 A Snapshot Approach: SEESOFT

One popular visual compression approach named SeeSoft has been developed by Eick et al. [6]: the displayed entities are miniaturized code lines shrunk to small strokes of pixels. The resulting pictures give a "microfilm view" of the code. A typical sample is illustrated in Figure 13. A file corresponds to a rectangle containing a sequence of small code line strokes. Color is used to encode additional information (in this case, the author who wrote the code). To provide code proximity, a code reading window is integrated that shows the lines the cursor is focused on.

The basic benefit of such visualizations is that, while giving a better overview than the code itself, they allow programmers to "use the same spatial context as in which they construct the code" [18]: the illustration provides insight into the files used, their structure, and possibly additional information like the stability of code. Visual patterns like copied code, blocks of get() and set() methods, and comment areas may be discovered easily. The authors report enthusiastic feedback from informal practical tests.

Nevertheless, the limitations of the approach should be kept in mind. According to the authors, up to 50,000 lines of code can be displayed simultaneously. A more general limitation is the resolution of monitors (usually, displays do not provide more than 1 million pixels, whereas software systems may consist of several million LOC).

Figure 13: A typical SeeSoft visualization of several files. The impression is that of a miniaturization of the code. Color coding provides insight into additional statistical measurements.

Griswold et al. [11] examined the benefits of the SeeSoft approach in another practical case study. A test subject was given the task of removing a feature from a concrete software system. Enhanced search facilities for regular expressions (for example, to find functions with certain parameters) and user-defined highlighted aspects were used to guide the developer through this maintenance task.

It was discovered that the SeeSoft view was experienced as a spatial map that the developer navigated through, using typical map facilities like zooming, scrolling, and folding. For the test subject, working with the system felt like "walking through the code", using highlighted parts as landmarks where changes were left to be made. The SeeSoft view (see Figure 13) provided an excellent granularity level for this work.

5.2 Evolutionary Visual Compression: CVSSCAN

The SeeSoft approach described so far is limited to the visualization of one system snapshot only. In the following, CvsScan will be introduced as a method to display the evolution of single code lines over time [18].

Figure 14: The general idea of CvsScan: the SeeSoft view (left) is modified such that the length of lines is traded off for the time dimension (picture taken from [18]).

CVSSCAN: General idea The general idea of CvsScan is to extend SeeSoft with time as a new dimension. This is illustrated in Figure 14: while in SeeSoft the x-dimension is used to encode the length of text lines, this is traded off for the capability of displaying the evolution of each code line, e.g., when it was written, deleted, or modified. Each column in the resulting display matrix represents one release of the visualized file, and rows correspond to code lines.

This has two fundamental consequences: the shape of the code is no longer displayed, and the visualization usually fills the whole screen such that only one file can be displayed.

CVSSCAN: Layout There are two layout schemes in CvsScan, illustrated in Figure 15: file-based layout handles the releases independently of each other and displays a line at its local position within a file. Consequently, different columns of the matrix generally have different heights, corresponding to the varying size of the underlying source file. Periods of strong file growth can be tracked well using file-based layout.

In contrast to this, line-based layout maps each global line of code to a row in the displayed matrix. Areas where code has been deleted remain white. This allows a user to track the evolution of a code line.

CVSSCAN: Acquisition of code line information Note that for the presentation of a line-based CvsScan layout, local code lines have to be mapped to global ones. This is a non-trivial problem for which the UNIX diff tool is used. Given a sequence of versions v_1, ..., v_n of the same file associated with system releases, diff is used to compare subsequent versions v_i ↔ v_{i+1}. The tool detects inserted and deleted lines and thus makes it possible to identify local text lines as occurrences of global ones. This is done via a function L, defined on pairs (i, k) with i ∈ {1, ..., n} and k ∈ {1, ..., n_i} (where n_i presumably denotes the number of lines of v_i), that maps a version number i and a line number k within that file version to a global line label.
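The tracking step can be sketched as follows. Note the substitution: instead of calling the UNIX diff tool, this sketch uses `difflib.SequenceMatcher` from the Python standard library, and the integer labels and the function name `global_line_labels` are assumptions made for illustration, not the procedure from [18].

```python
import difflib
from itertools import count


def global_line_labels(versions):
    """For each version (a list of text lines), return the list of global labels.

    Lines reported as unchanged between consecutive versions keep their label;
    inserted or replaced lines receive a fresh label.
    """
    if not versions:
        return []
    fresh = count()
    labels = [[next(fresh) for _ in versions[0]]]
    for old, new in zip(versions, versions[1:]):
        previous = labels[-1]
        current = [None] * len(new)
        matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                current[j1:j2] = previous[i1:i2]
            else:
                # "insert" and "replace" create new global lines;
                # for "delete" the range j1..j2 is empty.
                for j in range(j1, j2):
                    current[j] = next(fresh)
        labels.append(current)
    return labels


v1 = ["int a;", "a = 1;", "return a;"]
v2 = ["int a;", "int b;", "a = 1;", "return a;"]
v3 = ["int a;", "int b;", "return a;"]
print(global_line_labels([v1, v2, v3]))
```

Like diff, this sketch treats a modified line as a deletion followed by an insertion, so a line whose variable type changes loses its identity and receives a new label.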


Figure 15: Two layout schemes in CvsScan, (a) file-based and (b) line-based: while file-based layout represents the look of a single code file, line-based layout is used to track the evolution of code lines (picture taken from [18]).

Unfortunately, the diff tool suffers from strong limitations: for example, it cannot detect that two lines have been swapped. Furthermore, the similarity of lines is not checked. For example, diff does not recognize that only the type of a variable has been changed.

Given the global line function L, a graph G can be derived from the history v1, .., vn as

$G := \big( L(\{1,\dots,n\} \times \{1,\dots,n_i\}),\ \{ (L(l^i_k), L(l^i_{k+1})) \} \big) \qquad (1)$

Figure 16: From the order of local text lines and the information obtained from tracking, a graph G (see Equation (1)) is derived. The global order of code lines can be obtained by a topological sort of the graph (picture taken from [18]).

The global lines are the nodes of this graph. A directed edge exists between two global lines if there is a version with two corresponding local lines that follow each other directly. An example for two versions of a small piece of code is given in Figure 16, including the resulting graph G. From a topological sort of G, a global order of code lines can be obtained. This order is used to sort the rows of the display matrix in the line-based layout (see Figure 15).
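The derivation of a global order can be sketched with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The label lists below are hypothetical tracking results for three versions of a file, and the function name `global_order` is an assumption for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical tracking result: one list of global line labels per file version.
labels_per_version = [
    [0, 1, 2],
    [0, 3, 1, 2],
    [0, 3, 2],
]


def global_order(labels_per_version):
    """Topologically sort the graph whose edges link directly consecutive lines."""
    sorter = TopologicalSorter()
    for labels in labels_per_version:
        for label in labels:
            sorter.add(label)  # make sure isolated lines become nodes, too
        for before, after in zip(labels, labels[1:]):
            sorter.add(after, before)  # edge before -> after
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(sorter.static_order())


print(global_order(labels_per_version))
```

For this input the order is unique. For an input such as `[[0, 1, 3], [0, 2, 3]]`, lines 1 and 2 are not comparable, so the graph only defines a partial order and the sort returns one of several valid sequences.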

Furthermore, information on the status of code lines is derived: a line can be constant, inserted, deleted, or modified (which is the case if a line has been deleted and another one has been inserted at the same position). This information is also derived using the diff tool.

Unfortunately, the corresponding publication [18] leaves some open questions concerning the derivation of the function L. For example, it is not made clear why the graph G induces a linear order and not a partial one (in fact, counter-examples in which no linear order is given can be constructed easily).

CVSSCAN: Additional Features Besides its detailed main window (a typical sample is illustrated in Figure 17), CvsScan provides a number of additional features to display extra information, as well as possibilities for user interaction that improve the handling of the tool. This was validated in an informal case study.


Figure 17: A typical main window in CvsScan displaying the development of a file on one screen. Line-based layout is used, as well as a certain color scheme to encode the status of a line (picture taken from [18]).

• Colored bars as illustrated at the left and at the bottom of Figure 17 can be used to display additional metric information associated with code lines (vertical bar, e.g., the lifetime or the number of modifications) or releases (horizontal bar, e.g., LOC).

• Customizable color schemes are provided to display properties of a code line. For example, in Figure 17 blue indicates a line that has not been inserted yet, green an existing line, and pink a deleted one.

• If the cursor is moved across the cells of the main window, an additional window A is used to display the underlying code, such that optimal code proximity is guaranteed. A problem is the browsing of "empty" areas with code that does not exist in the current release. In these cases, an additional window B pops up: while the code in window A freezes, window B scrolls over the virtual code. Grayscale coding of the window background indicates when the corresponding code line exists.

• The whole file information can be scaled to screen size by zooming. Since the resolution may not be sufficient for very large files, downsampling is necessary such that several lines have to be combined into one pixel. For this purpose, a simple anti-aliasing as known from image processing is applied.

• Other features allow for further user interaction: lines can be filtered and time intervals can be selected. Skipped lines are removed from the presentation, which yields pictures of a higher resolution.
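The downsampling mentioned in the list above can be illustrated with a simple box filter that averages the values of all code lines falling onto the same pixel row. Which filter CvsScan actually applies is not stated here, so the averaging scheme, the numeric encoding of line status, and the function name `downsample` are assumptions.

```python
def downsample(values, target_rows):
    """Combine per-line values of one release column into `target_rows` pixel rows."""
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    n = len(values)
    if n <= target_rows:
        return list(values)  # enough resolution: one row per line
    pixels = []
    for row in range(target_rows):
        start = row * n // target_rows
        end = (row + 1) * n // target_rows
        bucket = values[start:end]  # never empty because n > target_rows
        pixels.append(sum(bucket) / len(bucket))
    return pixels


# Hypothetical column: 1 marks a modified line, 0 an unchanged one.
column = [0, 0, 1, 1, 1, 0]
print(downsample(column, 3))
```

Averaging turns a pixel that covers both changed and unchanged lines into an intermediate intensity, so small changes remain faintly visible instead of disappearing.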

5.3 Discussion

Obviously, CvsScan provides a very detailed view close to the code itself. It requires no reduction of information, and the visualization takes place directly in the textual context. On the other hand, the approach is strongly limited by the size of the display used. This is why the authors suggest integrating the CvsScan view into a context of higher-level visualizations. Another weakness is the diff tool, which performs rather poorly when mapping local code lines to global ones.

Nevertheless, CvsScan offers high usability due to multiple user interaction facilities. The resulting visualizations are very close to the code, but provide useful extra insights that cannot be derived from the source code alone: unstable parts may be identified, the authors of the code may be displayed, or correlated changes made at about the same time can be tracked.

6 Conclusions and Discussion

In this seminar paper, it was first motivated that software maintenance can benefit strongly from information on the evolution of the system to be maintained, which is usually present in software repositories like CVS or Subversion. The basic challenge when exploiting this knowledge is the size of the underlying information space, which is aggravated by the additional dimension of time: for long-term legacy systems, more than a hundred releases may exist.

Three basic ideas have been introduced to overcome this size problem on different levels: the first one (RelVis [15]) is to aggregate information in coarse-grained entities and display metric attributes in a graph. Such approaches give a good overview of an observed system. To visualize the evolution of entities, Kiviat diagrams have emerged as a suitable concept.

The second idea in this context was to use data mining algorithms to extract hidden dependencies between software entities. Burch et al. [3] present some visualizations for the resulting rules, implemented in their EpoSee system. A promising future direction of this approach might be to combine dependencies derived from the system history with the results of static program analyses.

Last, visual compression was introduced as an idea to preserve all information and design compact views on the code level (CvsScan [18]). The development of a code line is displayed as a row in a compact matrix. The approach offers manifold user interaction facilities, as well as multiple color schemes to display several aspects of information.

As the reader has seen from the discussion sections for each approach, none of the methods is a "silver bullet". Rather, a good visualization is characterized by its adaptivity and flexibility, displaying various aspects of a software system and ranging from a first overview to more detail.

A negative aspect concerning most papers is the poor evaluation of methods. In most cases, only informal case studies with a single test system were presented. A detailed discussion of the approach including possible shortcomings was hardly ever provided, nor was a comparative evaluation between methods carried out. The author is aware that such case studies are a painful, time-consuming task, particularly in the field of visualization, where the quality of an approach is significantly determined by the likes and dislikes of the user. However, more detailed evaluations should be provided to improve the acceptance of promising visualization systems for practical applications.

References

[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. 20th International Conference on Very Large Data Bases, pages 487–499. Morgan Kaufmann, 1994.

[2] R. Agrawal and R. Srikant. Mining Sequential Patterns. In 11th International Conference on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. IEEE Computer Society Press.

[3] M. Burch, S. Diehl, and P. Weißgerber. Visual Data Mining in Software Archives. In SoftVis '05: Proceedings of the 2005 ACM Symposium on Software Visualization, pages 37–46, New York, NY, USA, 2005. ACM Press.

[4] C. Collberg, S. Kobourov, J. Nagra, J. Pitts, and K. Wampler. A System for Graph-based Visualization of the Evolution of Software. In SoftVis '03: Proceedings of the 2003 ACM Symposium on Software Visualization, pages 77–ff, New York, NY, USA, 2003. ACM Press.

[5] T. DeMarco. Controlling Software Projects. Yourdon Press, New York, NY, 1982.

[6] S. G. Eick, J. L. Steffen, and E. E. Sumner Jr. Seesoft: A Tool for Visualizing Line Oriented Software Statistics. IEEE Transactions on Software Engineering, 18(11):957–968, 1992.

[7] N. Fenton and S. L. Pfleeger. Software Metrics (2nd ed.): a Rigorous and Practical Approach. PWS Publishing Co., Boston, MA, USA, 1997.

[8] H. Gall, K. Hajek, and M. Jazayeri. Detection of Logical Coupling Based on Product Release History. In Proceedings of the International Conference on Software Maintenance, Bethesda, Washington D.C., 1998.

[9] W. W. Gibbs. Software's Chronic Crisis. Scientific American, 271(3):86–95, 1994.

[10] T. Girba, S. Ducasse, and M. Lanza. Yesterday's Weather: Guiding early Reverse Engineering Efforts by Summarizing the Evolution of Changes. In Proceedings of the International Conference on Software Maintenance, 2004.

[11] W. G. Griswold, J. J. Yuan, and Y. Kato. Exploiting the Map Metaphor in a Tool for Software Evolution. In Proceedings of the 23rd International Conference on Software Engineering, pages 265–274, Washington, DC, USA, 2001. IEEE Computer Society.

[12] M. Lanza and S. Ducasse. Polymetric views: a Lightweight Visual Approach to Reverse Engineering. IEEE Transactions on Software Engineering, 29(9):782–795, 2003.

[13] N. Lesh, M. J. Zaki, and M. Ogihara. Scalable Feature Mining for Sequential Data. IEEE Intelligent Systems, 15(2):48–56, 2000.

[14] C. Lewerentz and F. Simon. A Product Metrics Tool Integrated into a Software Development Environment. In Proceedings of the Object-Oriented Technology ECOOP'98 Workshop Reader, 1998.

[15] M. Pinzger, H. Gall, M. Fischer, and M. Lanza. Visualizing Multiple Evolution Metrics. In SoftVis '05: Proceedings of the 2005 ACM Symposium on Software Visualization, pages 67–75, New York, NY, USA, 2005. ACM Press.

[16] S. P. Reiss. Bee/Hive: A Software Visualization Backend. In IEEE Workshop on Software Visualization, 2001.

[17] A. Telea, A. Maccari, and C. Riva. An Open Toolkit for Prototyping Reverse Engineering Visualizations. In Proceedings of the Symposium on Data Visualisation 2002, pages 241–ff, Aire-la-Ville, Switzerland, 2002. Eurographics Association.

[18] L. Voinea, A. Telea, and J. J. van Wijk. CVSScan: Visualization of Code Evolution. In SoftVis '05: Proceedings of the 2005 ACM Symposium on Software Visualization, pages 47–56, New York, NY, USA, 2005. ACM Press.

[19] J. K. Walter Bauegg-Wabnegg. Skript "Visualisierung und Design – Grundlagen von Softwareergonomie und Mediendesign", 2003.

[20] P. C. Wong, P. Whitney, and J. Thomas. Visualizing Association Rules for Text Mining. In Proceedings of the 1999 IEEE Symposium on Information Visualization, page 120, Washington, DC, USA, 1999. IEEE Computer Society.

[21] J. Wu, R. C. Holt, and A. E. Hassan. Exploring Software Evolution Using Spectrographs. In Proceedings of the 11th IEEE Working Conference on Reverse Engineering (WCRE 2004), pages 80–89, Delft, Netherlands, 2004.

[22] T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining Version Histories to guide Software Changes. In Proceedings of the International Conference on Software Engineering, Edinburgh, UK, 2004.
