Factor analysis of source code metrics



J. SYSTEMS SOFTWARE 1990; 12: 263-269

Factor Analysis of Source Code Metrics

Daniel Coupal and Pierre N. Robillard
École Polytechnique de Montréal, Montréal, Québec, Canada

Conventional metrics based on Halstead's Software Science and McCabe's Cyclomatic Complexity have been studied extensively. A statistical procedure for validating these metrics is presented, based on the factor analysis approach. The model also yields a methodology for analyzing large software projects with a single set of metrics. Results are based on 19 projects consisting of 14,348 routines. Data are taken from commercial software and previous articles. The projects show that current static metrics measure only a few dimensions of complexity.

1. INTRODUCTION

Research in software engineering over the last decade has shown the importance of software metrics and the value of a quality assurance policy for projects [1]. These policies are based on the measurement of some aspects of software projects, such as program volume, function enhancement effort, development time, maintenance time, test effort, number of tests to be carried out, and mean time between failures. DeRose and Nyman [2] have said that 60%-70% of software costs are attributable to maintenance, while Cashman and Holt [3] propose that the maintenance phase consumes up to 80% of total costs. One of the goals might be to identify those routines that will result in high maintenance costs.

Software engineering research has provided a number of techniques for evaluating these various aspects of software quality. Static code metrics is one of these techniques; it is, for the most part, derived from Halstead's software science parameters [4], McCabe's Cyclomatic Complexity measure [5], and variations of these [6].

Research carried out on classical metrics has generated so many metrics that it is very difficult to choose from among them. In fact, there is no consensus on the use of any particular metric.

Address correspondence to Daniel Coupal, Department of Electrical Engineering, Computer Section, École Polytechnique, University of Montréal, P.O. Box 6079, Station A, Montréal, Québec, H3C 3A7 Canada.

© Elsevier Science Publishing Co., Inc.
655 Avenue of the Americas, New York, NY 10010

Another fuzzy point concerns the conclusions drawn from project data. Some data sets have led authors to conclude that these metrics measure nothing more than lines of code [7], but other data sets have been fitted correctly to models predicted by metrics [8]. Conclusions seem to differ depending on the data used.

Much research has been done on small projects in the past. Through the evolution of tools and methods, we are now able to tackle larger projects, and it is also easier to obtain metrics on them. For software quality assurance engineers, analysis of these projects is not a trivial matter. In this article, we present a method based on factor analysis theory, a multidimensional statistical procedure. We also demonstrate the limited usefulness of any one set of metrics alone. Factor analysis shows that, in fact, only a few dimensions are measured by conventional metric systems.

2. FACTOR ANALYSIS MODEL

Factor analysis is a field of statistical analysis that deals with multidimensional observations. The development of factor analysis is attributed to Spearman (1904), while the principal component method is credited to Hotelling (1933) [9].

The goal of factor analysis is to transpose observations measured in an N-dimensional space into a reduced space in which interpretation may be easier. With the factor model and extraction by the principal component method, the first axis represents the maximum variance between observations. The second axis is found under the constraint of maximizing the remaining variance, and so forth for the other axes. We call these axes factors. There is no correlation between factors, which means that each axis has a unique information content and is orthogonal to the other axes. The problem to be solved can be expressed in matrix form by the following equation [10]:

DATA = PROJECTIONS*PATTERN + ERROR (1)

where DATA is the observation matrix in which each variable has been normalized to a mean of 0 and a variance of 1 (Eq. (2)), PROJECTIONS holds the values of the normalized observed data on the factor axes, PATTERN is the regressor matrix from the factor space to the data space, and ERROR is the residual for the transformation of each observation. This error matrix represents the parts of the variance of the observations that are not explained by the model.

DATA = (RAW DATA - MEANS) / STANDARD DEVIATION (2)
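The normalization of Eq. (2) can be sketched in a few lines of Python with NumPy (the paper's own analysis used SAS; the matrix below is purely hypothetical, with each row a routine and each column a metric):

```python
import numpy as np

# Hypothetical raw metrics matrix: 4 routines x 2 metrics (e.g., lines, V(g))
raw = np.array([[10.0, 2.0],
                [20.0, 4.0],
                [30.0, 6.0],
                [40.0, 8.0]])

# Eq. (2): subtract each column's mean and divide by its standard deviation
data = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(data.mean(axis=0))  # each column now has mean ~0
print(data.std(axis=0))   # and standard deviation 1
```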

Because PATTERN is not a square matrix, we cannot refer to PATTERN^-1 for the inverse transformation. The matrix of regressors from the data space to the factor space is called the SCORE matrix. The product of these two matrices gives the identity matrix (Eq. (3)).

PATTERN*SCORE = I (3)

The factor analysis model can be derived from the correlation or covariance matrix; there is no need to have complete observations. In the case of the correlation matrix, the model is given by Equation (4), which must be solved under the following conditions:

• Maximizing the variance for the next factor to be extracted.
• Having factors uncorrelated with each other.
• Having factors uncorrelated with the residuals.

PATTERN'*PATTERN + RESIDUALS = CORR (4)

The RESIDUALS matrix represents the parts of the covariance between variables that are not explained by the model, CORR is the correlation matrix of the variables, and PATTERN' is the transpose of PATTERN. Factor models found from the correlation or covariance matrix will be identical to the one found from the observations matrix. Having complete observations is preferable, since Eqs. (1) and (5) can then be applied to verify the correctness of the model on the observations through the ERROR matrix; otherwise, only the correctness of the transformation of the variables can be verified, with the RESIDUALS matrix.

FACTOR = DATA*SCORE (5)

With the scoring coefficients (the SCORE matrix), the user can project observations into the factor space. Because the factor space has fewer dimensions, and we know the relative importance of these dimensions, plotting observations in two dimensions provides more information. These graphs are easily interpreted when the significance of the factors can be seen.
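Eqs. (2)-(5) can be sketched with an eigendecomposition of the correlation matrix, the standard principal component construction. This is a NumPy illustration on synthetic data, not the SAS FACTOR procedure the authors used; the variable names mirror the paper's matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 routines x 3 highly correlated "metrics"
base = rng.normal(size=(200, 1))
raw = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

data = (raw - raw.mean(0)) / raw.std(0)   # Eq. (2): standardized observations
corr = (data.T @ data) / len(data)        # correlation matrix of the variables

eigval, eigvec = np.linalg.eigh(corr)     # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]          # reorder by decreasing variance
eigval, eigvec = eigval[order], eigvec[:, order]

# PATTERN: regressors factor space -> data space (rows scaled by sqrt(eigenvalue))
# SCORE:   regressors data space -> factor space, so PATTERN*SCORE = I (Eq. (3))
pattern = (eigvec * np.sqrt(eigval)).T
score = eigvec / np.sqrt(eigval)

factor = data @ score                     # Eq. (5): observations in factor space
```

By construction, `pattern.T @ pattern` reproduces the correlation matrix (the full-rank case of Eq. (4)), and the columns of `factor` are mutually uncorrelated with unit variance.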

The percentage of variability explained is often used as a measure of correctness for the deduced model (Eq. (6)). The sum of the eigenvalues of the covariance or correlation matrix equals the number of variables. Each factor is found with one of these eigenvalues, and the proportion of variance explained by a factor is proportional to its eigenvalue.

[Figure 1. Projecting the observations in the factor space. (Left panel: data space; right panel: factor space.)]

%Var. Exp. = (sum of the eigenvalues of the retained factors / nb. of variables) x 100 (6)
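Eq. (6) is a direct computation on the eigenvalues. A minimal sketch, with hypothetical eigenvalues for a correlation matrix of 5 metrics (so they sum to 5):

```python
import numpy as np

# Hypothetical eigenvalues, sorted in decreasing order; sum = 5 variables
eigenvalues = np.array([3.0, 1.2, 0.5, 0.2, 0.1])

# Eq. (6): per-factor percentage of variability explained
pct = eigenvalues / eigenvalues.sum() * 100

print(pct[:2].sum())  # retaining the first two factors explains 84.0%
```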

If the covariance matrix of the observation matrix has X dimensions (X variables per observation) and a rank of Y, then only Y factors explain the totality of the observations' variance. In most practical cases, the dimensionality of a set of variables can be reduced below Y because the variances explained by the last factors are insignificant for the model. Figure 1 depicts the visual process of finding factors, showing the transformation from a two-dimensional data space into a two-dimensional factor space. The principal axis, labeled Factor 1, is oriented in the direction of maximum variance, while Factor 2, uncorrelated with the first, explains the variability left. In an N-dimensional space with high correlation between variables, a factor space could look like the one depicted in Figure 2.

[Figure 2. Variables in the factor space. (Axes: Factor 1, Factor 2, Factor 3.)]

[Figure 3. The goal of the factor analysis for a study with metrics.]

3. METHODOLOGY FOR METRICS

Analyzing source code metrics for large projects can be difficult. A project of N routines for which P metrics have been computed will generate an N*P matrix. Looking at 20 metrics over 1,000 routines gives a 20,000-value data set!

Thus, the reduction ratio for the RAW DATA matrix is given by Eq. (9):

(N*P - (N*Q - M*Q + M*P)) / (N*P) (9)
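Eq. (9) is simple arithmetic; a sketch with hypothetical project sizes (the numbers below are illustrative, not from the paper's data sets):

```python
# Hypothetical project: N routines, P metrics, Q retained factors,
# M ill-modelized routines
N, P = 1000, 20
Q, M = 4, 50

kept = N * Q - M * Q + M * P          # values still to be examined
reduction = (N * P - kept) / (N * P)  # Eq. (9): fraction of data eliminated

print(reduction)  # 0.76, i.e., a 76% reduction of the data set
```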

We propose a method for reducing this amount of data (Figure 3). This method significantly reduces the data to be observed without losing large amounts of the information carried by the whole set of metrics.

4. DATA

Figure 4 shows an example of this method. Starting with a complete RAW DATA set, principal factors are extracted. We keep only a subset of these factors based on one criterion. Usually this criterion is either a fixed number of factors, to allow comparison between projects, or a criterion based on the eigenvalues of the factors. Statisticians prefer the latter: a factor is retained if its eigenvalue is over a minimum value. It is common, but not mandatory, to use 1 as the minimum. This number is justified by the fact that the eigenvalues of an N-variable matrix sum to N, so a factor with an eigenvalue of at least 1 explains at least as much information as one independent variable.
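The eigenvalue-over-1 retention rule can be sketched directly (hypothetical eigenvalues; for a correlation matrix of 6 variables they sum to 6, so a retained factor carries at least one variable's worth of information):

```python
import numpy as np

# Hypothetical eigenvalues of a 6-variable correlation matrix (sum = 6)
eigenvalues = np.array([3.1, 1.4, 0.7, 0.4, 0.3, 0.1])

# Retain each factor whose eigenvalue exceeds 1
retained = eigenvalues[eigenvalues > 1]

print(len(retained))  # 2 factors retained out of 6 possible
```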

As we have seen, factors are found by using the complete data from observations or by using the covariance or correlation matrix of the variables. Two approaches are used here: one starts from the source code of 7 commercial software programs, and the other from 12 correlation matrices of data in published articles.

With the regressor matrix (SCORE), we find the PROJECTIONS on the principal factors. Experiments have shown that transformations from data space to factor space reduce the volume of initial observations by about 80%. Because factor analysis is a linear model, some routines may be ill-represented by it. By rewriting Eq. (1), we compute the ERROR matrix (Eq. (7)).

DATRIX, a CASE tool developed at the Software Engineering Laboratory of the École Polytechnique de Montréal with the support of Bell Canada, Inc., is used to analyze the seven source code projects. These commercial projects include an engineering application, an operating system, an editor, two simulators, a database application, and DATRIX itself. The tool produces about 32 metrics from the source code. C, FORTRAN, Pascal, and dialects of these languages can be parsed. Source code is translated into a language-independent representation, and the metrics are computed on this representation.

ERROR = DATA - PROJECTIONS*PATTERN (7)

Other data are correlation matrices found in the literature. Table 1 lists these projects, which are identified by the source of the article. The names are built using the first four letters of the main author's name and the year of publication of the article. The articles are


Using least-squares approximation on the ERROR matrix, we find the routines that are ill-modelized by the factor pattern.

Information from RAW DATA and the ILL-MODELIZED ROUTINES gives the RAW DATA OF ILL-MODELIZED ROUTINES matrix. Overall, for a complete project, we only have to look at the PROJECTIONS matrix for almost all routines, except the ill-modelized ones. For these, we look at the RAW DATA OF ILL-MODELIZED ROUTINES.
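The residual check of Eqs. (5) and (7) can be sketched end to end in NumPy (synthetic data; the cutoff of five flagged routines is an arbitrary illustration, not the paper's least-squares criterion):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical raw metrics: 100 routines x 5 metrics, with one redundant pair
raw = rng.normal(size=(100, 5))
raw[:, 1] = raw[:, 0] + 0.05 * rng.normal(size=100)

data = (raw - raw.mean(0)) / raw.std(0)        # Eq. (2)
corr = (data.T @ data) / len(data)

eigval, eigvec = np.linalg.eigh(corr)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

q = int((eigval > 1).sum())                    # retain factors with eigenvalue > 1
pattern = (eigvec[:, :q] * np.sqrt(eigval[:q])).T
score = eigvec[:, :q] / np.sqrt(eigval[:q])

projections = data @ score                     # Eq. (5)
error = data - projections @ pattern           # Eq. (7)

# Squared residual per routine; the largest values flag ill-modelized routines
resid = (error ** 2).sum(axis=1)
ill = np.argsort(resid)[-5:]                   # e.g., the 5 worst-fitted routines
```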

For example, take a project of N routines for which P metrics have been computed, and translate it into a Q-dimensional factor space in which M routines are ill-modelized. The numbers of elements of the well-modelized routines matrix in the factor space and of the ill-modelized routines matrix in the metric space are, respectively,

N*Q - M*Q and M*P (8)


[Figure 4. Processing the data set of a project with the factor analysis. (Matrices shown: raw data and projections; ERROR; raw data of ill-modelized routines and projections of ill-modelized routines; metric space vs. factor space.)]

referenced in the bibliography by a number in square brackets. In cases where multiple sets are available from a publication, lowercase letters are used to differentiate them. Projects analyzed with DATRIX are referenced by DATR89. Number of routines, language used, type of data available, and number of metrics are reported in this table.

Table 2 gives the names and definitions of the metrics for the projects analyzed with DATRIX. Metrics for the other projects are subsets of these, including Software Science and the metrics defined by each author. Almost all the authors include a lines-of-code metric. Comparing this metric from project to project is difficult because its definition differs among authors. There is a need for formal definitions of basic software elements such as this.

5. RESULTS

The factor analysis is carried out using SAS software version 6.03 with the SAS/STAT package. The FACTOR

Table 1. Data Sets Used

Source          Routines  Language  Obs.  Metrics
DATR89_a             187  FORTRAN   Data       30
DATR89_b             718  FORTRAN   Data       30
DATR89_c             216  FORTRAN   Data       29
DATR89_d             161  FORTRAN   Data       27
DATR89_e             255  C         Data       28
DATR89_f             503  C         Data       31
DATR89_g            2112  C         Data       31
ELSH84 [11]          585  PL/1      Corr       20
HENR84 [12]          136  C         Corr        6
LI87 [13]            255  FORTRAN   Corr       18
LIND89_a [14]       3442  Pascal    Corr       11
LIND89_b [14]       1123  FORTRAN   Corr       11
SCHR84 [15]          921  Pascal    Corr        9
SUNO81 [16]          200  FORTRAN   Corr        8
VERI88_a [17]       1020  FORTRAN   Corr       20
VERI88_b [17]        927  Pascal    Corr       20
VERI88_c [17]        914  C         Corr       20
VERI88_d [17]        417  Modula-2  Corr       20
VERI88_e [17]        256  COBOL     Corr       20

Total: 19 projects, 14,348 routines, 6 languages


Table 2. Metric Set from DATRIX

Bs — Number of breaches of structure: the number of arc crossings in the control graph that violate the principles of structured programming, i.e., the use of one-entry, one-exit control structures.
Bsp — Weighted number of breaches of structure: the sum of the weights of all pairs of arcs involved in breaches of structure.
E — Number of arcs in the control graph.
K — Number of knots: the number of arc crossings in the control graph.
Ls — Mean length of variable names: the mean number of characters of all variables declared and/or used in the software unit.
Nbe — Number of executable statements: a count of all executable statements in the software unit.
Nbt — Total number of lines: a count of all lines in the software unit.
Nc — Mean node complexity: the arithmetic mean of the complexity attribute for all nodes having an out-degree greater than one; the complexity is defined as the number of operators and variables in the node, minus one.
Ncmax — Maximal node complexity: the maximal value of the complexity attribute for all nodes having an out-degree greater than one.
Ncn — Number of conditional nodes: the number of nodes in the control graph having an out-degree greater than one.
Ne — Number of exit nodes: the number of nodes in the control graph where the control flow stops or returns to a calling software unit.
Nel — Mean nesting level: the arithmetic mean of the nesting levels of all arcs.
Nelmax — Maximal nesting level: the maximal value of nesting level in the software unit.
Nelw — Weighted mean nesting level: the arithmetic mean of the nesting levels of all arcs, considering their weights.
Ni — Number of entry nodes in the control graph.
Nl — Number of loop constructs: the number of backward arcs in the control graph.
Np — Number of independent paths: the number of paths that the control flow can follow from the entry node(s) to the exit node(s); loops are visited only once (the logarithm has been taken).
Nr — Number of recursive nodes in the control graph where the software unit calls itself.
Psc — Mean conditional node span: the mean number of nodes located within the span of conditional nodes.
Pscmax — Maximum conditional node span: the maximum number of nodes located within the span of a conditional node.
Rac — Commented arcs ratio: the percentage of commented arcs over the total number of arcs with a weight greater than zero.
Rls — Loop structural ratio: the ratio of the structural volume contained in subgraphs delimited by backward arcs (loops) to the total structural volume.
Rlw — Loop weight ratio: the ratio of the sum of weights of the subgraphs delimited by backward arcs (loops) to the total sum of weights.
Rnc — Commented nodes ratio: the percentage of commented nodes over the total number of nodes (only conditional nodes are considered).
RVc — Comments volume ratio: RVc = Vs / (Vcd + Vcs).
V — Number of nodes in the control graph.
Vcd — Comments volume in declarations: the total number of alphabetical characters found in the comments located in the declaration section.
Vcs — Comments volume in structures: the total number of alphabetical characters found in the comments located anywhere in the software unit except the declaration section.
Vg — Cyclomatic number: Vg = E - V + 2p, where p = 1.
Vp — Number of pending nodes: the number of nodes in the control graph having an in-degree of zero that are not entry nodes.
Vs — Structural volume: Vs = (E + V - 1) / 6.
Vw — Sum of weights: the sum of the weights of all arcs in the control graph.
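Several Table 2 metrics are simple counts over the control graph. As a sketch (a hypothetical toy graph represented as a list of directed arcs, not DATRIX's actual internal representation):

```python
# Hypothetical control graph of a small routine: directed arcs between nodes.
# Node 1 is the entry; node 6 is the exit; (5, 2) is a backward (loop) arc.
arcs = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5), (5, 2), (5, 6)]
nodes = {n for arc in arcs for n in arc}

E = len(arcs)    # number of arcs
V = len(nodes)   # number of nodes
out_deg = {n: sum(1 for a, _ in arcs if a == n) for n in nodes}

Ncn = sum(1 for n in nodes if out_deg[n] > 1)  # conditional nodes (out-degree > 1)
Ne = sum(1 for n in nodes if out_deg[n] == 0)  # exit nodes (flow stops)
Vg = E - V + 2                                 # cyclomatic number, with p = 1
Vs = (E + V - 1) / 6                           # structural volume

print(E, V, Ncn, Ne, Vg, Vs)  # 7 6 2 1 3 2.0
```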


Table 3. Percent of Variability Explained for Each Project

            Factors    Sum Var.   % Var. Exp. by Factor
Project     (EV > 1)   Exp.          1     2     3     4
DATR89_a        6      84.1       57.3   8.3   6.5   4.4
DATR89_b        8      81.9       37.2  12.9   8.8   5.8
DATR89_c        6      88.3       54.7  10.0   7.0   6.7
DATR89_d        5      82.2       55.4  11.3   6.0   5.3
DATR89_e        7      83.8       49.6   9.7   6.3   5.5
DATR89_f        7      83.8       38.0  15.7  11.1   6.0
DATR89_g        7      81.7       39.4  15.2   8.9   5.5
ELSH84          4      81.6       55.3  13.2   7.0   6.0
HENR84          1      79.1       79.1
LI87            1      90.6       90.6
LIND89_a        1      76.2       76.2
LIND89_b        1      79.6       79.6
SCHR84          2      88.3       71.7  16.5
SUNO81          1      87.2       87.2
VERI88_a        4      86.6       60.3  12.1   8.9   5.3
VERI88_b        3      81.2       62.5  10.7   8.0
VERI88_c        4      84.6       58.4  10.6   8.1   7.5
VERI88_d        4      84.0       58.0  13.4   7.5   5.0
VERI88_e        4      85.4       62.5  10.0   7.3   5.6
Means           4.0    83.7       61.7

procedure is used with the principal component extraction method. The criterion for retaining a factor is an eigenvalue > 1. Factors are not rotated: because it is necessary to have the maximum variance explained by each factor, it is a mistake to do a rotation in this situation. Rotations distribute variance from the major axes to less significant ones to explain the meaning of factors. Practitioners should be very careful when using rotation techniques.

Table 3 reports the results of the analysis. For each project, the number of factors retained having an eigenvalue > 1 is listed. The next column shows the percentage of variability explained by the retained factors. The next four columns give the percentage of variability explained by each factor. Because the last factors are less interesting, only the first four are shown for projects where more than four factors were retained.

The results of the analysis show that almost 63% of the variability in the measurement of classical metrics is represented by only one factor in the projects analyzed. This factor has been identified with volume, because every volume metric has a high projection on it. Table 3 also shows that few dimensions of complexity are measured, given that an average of 4 factors is extracted per project. In addition, the last two factors each contributed less than 10%.

The ratio of compression in dimension from data space to factor space is about 5, starting with a mean of 20.5 metrics and projecting them onto 4 factors. These projections carried 83.7% of the initial information.

[Figure 5. Example of slopes for different projects. (Regressions of Nelw on Nel for the DATR89 data sets a-g.)]

We saw earlier that metrics can be projected on the factor axes. A high projection indicates that the information carried by a metric is similar to the information represented by the factor. Table 4 reports all the projects measured by DATRIX; the projection of each metric on the principal axis is given there. Only numbers

Table 4. Metrics with High Projections on First Factor Axis

Data Sets from DATR89 (columns A-G)

METRIC   values over .8
Vs       .97 .96 .96 .97 .95 .94 .95
Vw       .82 .88 .92 .85
Vcs      .91 .87 .86
Bs       .83
K        .&I .89 .85
Vg       .98 .94 .96 .90 .95 .94 .93
Np       .84 .87 .82
Nel      .93 .82 .91 .88 .86 .80
Nelmax   .94 .82 .91 .87 .86 .80
Nelw     .92 .91 .87 .84 .81
V        .98 .95 .95 .97 .95 .94
E        .98 .96 .96 .96 .95 .94 .94
Ncn      .96 .95 .90 .95 .93 .95
Psc      .95 .83 .85 .94
Pscmax   .96 .92 .92 .95 .90 .91
Nbt      .89 .88 .92 .94
Nbe      .90 .80 .95 .95


over .8 are printed. Metrics that do not appear in this table are uncorrelated with factor 1 for every project. We note that the metrics listed are almost all volume metrics. Because no rotation has been done, interpretation of the other axes is difficult. However, the purpose of this analysis is to outline volume-related information.

Another fact emerges from the correlation matrix. Even if R² is high between two metrics, the regressors are not identical for different projects. The metrics Nel (mean nesting level) and Nelw (weighted mean nesting level) have an R² value over .88 for six projects; the other has a value of 0.58. Figure 5 shows the regression slopes between these two metrics for all the projects in the DATR89 data sets. Even if these two metrics are highly correlated, it is difficult to discard one in favor of the other: there is no way to estimate the slope and predict values for the other metric.
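The point can be illustrated with two simulated projects in which the two metrics are almost perfectly correlated yet the regression slopes differ (synthetic data standing in for Nel and Nelw; the slopes 0.8 and 1.3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=50)  # hypothetical Nel values for 50 routines

results = []
# Two hypothetical projects: Nelw tracks Nel tightly in both,
# but with a different slope in each project
for true_slope in (0.8, 1.3):
    y = true_slope * x + 0.1 * rng.normal(size=50)  # hypothetical Nelw
    r2 = np.corrcoef(x, y)[0, 1] ** 2               # very high in both cases
    slope = np.polyfit(x, y, 1)[0]                  # but the slopes differ
    results.append((r2, slope))
```

High R² within each project says nothing about which slope the next project will exhibit, which is why discarding one of the two metrics is risky.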

6. CONCLUSIONS

Despite the fact that factor analysis is a linear model, we have found that many of the metrics proposed by software engineers measure the same aspects of software quality. Practitioners should be wary of trying to find different meanings in metrics having the same dominant factor. It may be of no value, e.g., to work with a set of metrics that all load on the same factor, meaning that their variability will nearly always behave in the same way.

It is also shown that volume is the principal aspect measured by these metrics.

Even if metrics are correlated, their slopes can differ considerably between projects. Keeping one metric instead of another is, on the whole, dangerous. We can predict high correlations, but it is harder to predict values for the discarded metric because its slope cannot be predicted.

Even in published articles it is difficult to find unanimity on the definition of some metrics. There is a need to formalize the definitions of basic software elements.

Inference in statistical analysis is based on the assumption that observations are independent. Because routines in a project are not, it would be wrong to examine a subset of routines from a project and generalize the results to the whole project, or to generalize the results obtained from one project to all of the others. It would be difficult to claim to have found the ultimate metric.

The future goal of metrics will be to explain aspects of software that are not currently being measured. A subset of independent metrics will be more useful than a number of different metrics that all measure the same software quality aspects. The use of multidimensional statistical methods may prove very useful in this search for new metrics, or to validate the information content of others.

REFERENCES

1. S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, Menlo Park, California, 1986.
2. B. DeRose and T. Nyman, The Software Life Cycle - A Management and Technological Challenge in the Department of Defense, IEEE Trans. Software Engineering SE-4, 309-318 (1978).
3. P. M. Cashman and A. W. Holt, A Communication-Oriented Approach to Structuring the Software Maintenance Environment, Software Engineering Notes 5, 4-17 (1980).
4. M. H. Halstead, Elements of Software Science, Elsevier North-Holland, New York, 1977.
5. T. J. McCabe, A Complexity Measure, IEEE Trans. Software Engineering SE-2, 308-320 (December 1976).
6. V. Côté, P. Bourque, S. Oligny, and N. Rivard, Software Metrics: An Overview of Recent Results, J. Syst. Software 8, 121-131 (1988).
7. M. Shepperd, A Critique of Cyclomatic Complexity as a Software Metric, Software Engineering J., 30-36 (March 1988).
8. N. Beser, Foundations and Experiments in Software Science, Performance Evaluation Review 11, 48-72 (1982).
9. G. A. F. Seber, Multivariate Observations, John Wiley & Sons, New York, 1984.
10. SAS/STAT Guide for Personal Computers, SAS Institute Inc., Cary, North Carolina, 1987, pp. 449-492.
11. J. L. Elshoff, Characteristic Program Complexity Measures, Proc. Int. Conf. on Software Engineering, 288-293 (1984).
12. S. Henry and D. Kafura, The Evaluation of Software Systems' Structure Using Quantitative Software Metrics, Software - Practice and Experience 14, 561-573 (1984).
13. H. F. Li and W. K. Cheung, An Empirical Study of Software Metrics, IEEE Trans. Software Engineering, 697-708 (1987).
14. R. K. Lind and K. Vairavan, An Experimental Investigation of Software Metrics and Their Relationship to Software Development Effort, IEEE Trans. Software Engineering SE-15, 649-653 (1989).
15. A. Schroeder, Integrated Program Measurement and Documentation, Proc. Int. Conf. on Software Engineering, 304-313 (1984).
16. T. Sunohara, A. Takano, K. Uehara, and T. Ohkawa, Program Complexity Measure for Software Development Management, Proc. 5th Int. Conf. on Software Engineering, San Diego, 100-106 (1981).
17. Logiscope - Measurement Study, Verilog, Toulouse, October 1988, pp. 1.4-1.13.