
SOFTWARE—PRACTICE AND EXPERIENCE

Softw. Pract. Exper. 2007; 37:581–641

Published online 24 October 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/spe.776

An empirical study of Java bytecode programs

Christian Collberg, Ginger Myles∗,† and Michael Stepp

Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.

SUMMARY

We present a study of the static structure of real Java bytecode programs. A total of 1132 Java jar-files were collected from the Internet and analyzed. In addition to simple counts (number of methods per class, number of bytecode instructions per method, etc.), structural metrics such as the complexity of control-flow and inheritance graphs were computed. We believe this study will be valuable in the design of future programming languages and virtual machine instruction sets, as well as in the efficient implementation of compilers and other language processors. Copyright © 2006 John Wiley & Sons, Ltd.

Received 18 August 2004; Revised 28 June 2006; Accepted 28 June 2006

KEY WORDS: Java; bytecode; measure; software complexity metrics

1. INTRODUCTION

In a much cited study [1], Knuth examined FORTRAN programs collected from printouts found in a

computing center. Among other things, he found that arithmetic expressions tend to be small, which,

he argued, has consequences for code-generation and optimization algorithms chosen in a compiler.

Similar studies have been carried out for COBOL [2,3], Pascal [4], and APL [5,6] source code.

In this paper we report on a study on the static structure of real Java bytecode programs.

Using information gathered from an automated Google search, we collected a sample of 1132 Java

programs, in the form of jar-files (collections of Java class files). The static structure of these programs

was analyzed automatically using SandMark, a tool which, among other things, performs static

analysis of Java bytecode.

It is our hope that the information gathered and presented here will be of use in a variety of settings.

For example, information about the structure of real programs in one language can be used to design

∗Correspondence to: Ginger Myles, Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.

†E-mail: [email protected]

Contract/grant sponsor: National Science Foundation; contract/grant number: 324360

Contract/grant sponsor: Air Force Research Lab; contract/grant number: F33615-02-C-1146


future languages in the same family. One example is the finally clause of Java exception handlers.

Special instructions (jsr and ret) were added to Java bytecode to handle this construct efficiently.

These instructions turn out to be a major source of complexity for the Java verifier [7]. If, instead,

the Java bytecode designers had known (from a study of MODULA-3 programs, for example) that the

finally clause is very unusual in real programs, they may have elected to keep jsr/ret out of the

instruction set. This would have simplified the Java bytecode verifier while imposing little overhead on

typical programs‡.

There are many types of tools that operate on programs. Compilers are an obvious example, but there

are many software engineering tools which transform programs in order to improve on their structure,

readability, modifiability, etc. Such language processors can benefit from knowing typical and extreme

counts of various aspects of real programs. For example, in our study we have found that while, on

average, a Java class file has 9.0 methods, in the extreme case we found a class with 570 methods.

This information can be used to select appropriate data structures, algorithms, and memory allocation

strategies.

As a further example, we have found that the average Java class has no more than one method that

overrides a method of its superclass. This means that most methods are written ‘from scratch’, and will

not be present in any of the superclasses of that class. Furthermore, it means that most methods written

in a given class are unlikely to be overridden in its subclasses. Thus, aggressive inlining appears to be

a good candidate for optimization. The usual obstacle to inlining in Java is virtual method invocation,

where a single method callsite could have many potential targets. However, given these data, we can see

that often this will not be a problem, because methods are rarely overridden. Combined with the fact

that the average method has 33.2 instructions and is thus quite small, we see that aggressive inlining is

an excellent candidate for optimization. We hope that this study will be useful in providing many other

insights that will facilitate the design of tools to study and improve programs.

Our own research is focused on the protection of software from piracy, tampering, and reverse

engineering, using code obfuscation and software watermarking [8]. Code obfuscation attempts to

introduce confusion in a program to slow down an adversary’s attempts at reverse engineering it.

Software watermarking inserts a copyright notice or customer identification number into a program

to allow the owner to assert their intellectual property rights. An important aspect of these techniques

is stealth. For example, a software watermarking algorithm should not embed a mark by inserting code

that is highly unusual, since that would make it easy to locate and remove. Our study of instruction

frequencies and instruction n-grams directly addresses this concern, by showing us exactly which

instruction sequences are common and which are not. For example, any code that contains the jsr_w

or goto_w instructions would be extremely unstealthy, since not a single one of our 1132 test jars

contains either of these instructions. We believe the information presented in this paper will be useful

in developing future evaluation models for the stealth of software protection algorithms.

The remainder of this paper is structured as follows. In Section 2 we describe how our statistics were

gathered. In Section 3 we give a brief overview of Java bytecode. In Sections 4, 5, 6, and 7, we present

application-level, class-level, method-level, and instruction-level statistics, respectively. In Section 8

we discuss related work, and in Section 9 we summarize our findings.

‡The finally clause can be implemented by copying code.


Table I. Collected jar-file statistics.

Measure                                       Count
Total number of jar-files                      1132
Total size of jar-files (bytes)         198 945 317
Total number of class files                 102 688
Total number of packages                       7682
Total number of classes                      90 500
Total number of interfaces                   12 188
Total number of constant pool entries    12 538 316
Total number of methods                     874 115
Total number of fields                      422 491
Total number of instructions             26 597 868

2. EXPERIMENTAL METHODOLOGY

Table I shows some statistics of the applications that were gathered. Figure 1 shows an overview of

how our statistics were collected.

To obtain a suitably random set of sample data, we queried the Google search engine using the

key-phrase ‘"index of" jar’. This query was designed to find Web pages that display server directory

listings that contain files with the extension .jar. In the resulting HTML pages we searched for any

<A> tag whose HREF attribute designated a jar-file. These files were then downloaded.

The initial collection of jar-files numbered in excess of 2000. An initial analysis discarded any files

that contained no Java classes, or were structurally invalid. Static statistics were next gathered using

the SandMark tool.
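
The link-extraction step described above (finding <A> tags whose HREF attribute names a jar-file) can be sketched as follows. This is only an illustration under stated assumptions, not the script used in the study: the class name JarLinkExtractor, the regular expression, and the sample HTML are all hypothetical, and a regular expression is a rough stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the study's tooling.
class JarLinkExtractor {
    // Matches an anchor tag whose href value (single- or double-quoted)
    // ends in ".jar". Tag and attribute names are matched case-insensitively.
    private static final Pattern JAR_LINK = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])([^\"']+?\\.jar)\\1",
        Pattern.CASE_INSENSITIVE);

    // Returns the href values of all jar-file links, in document order.
    static List<String> extractJarLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = JAR_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up directory listing used only to exercise the sketch.
        String html = "<html><body><h1>Index of /lib</h1>"
            + "<A HREF=\"mysql.jar\">mysql.jar</A> "
            + "<a href='docs/readme.txt'>readme</a> "
            + "<a href=\"mosaic.jar\">mosaic.jar</a>"
            + "</body></html>";
        System.out.println(extractJarLinks(html));
    }
}
```

A production crawler would also need to resolve relative links against the page URL before downloading; that step is omitted here.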

SandMark [9] is a tool developed to aid in the study of software-based software protection

techniques. The tool is implemented in Java and operates on Java bytecode. Included in SandMark

are algorithms for code obfuscation, software watermarking, and software birthmarking. A variety of

static analysis techniques are included to aid in the development of new algorithms and as a means to

study the effectiveness of these algorithms. Examples of such techniques are: class hierarchy,

control-flow, and call graphs; def-use and liveness analysis; stack simulation; forward and backward slicing;

various bytecode diffing algorithms; a bytecode viewer; and a variety of software complexity metrics.

Not all well-formed jar-files could be completely analyzed. In most cases this was because the jar-file

was not self-contained, i.e. it referenced classes that were not in the jar or in the Java standard library.

Missing class files prevent the class hierarchy from being constructed, for example. In these cases

we still computed as many statistics as possible. For example, while an incomplete class hierarchy

prevented us from gathering accurate statistics of class inheritance depth, it still allowed us to gather

control-flow graph (CFG) statistics. Our SandMark tool is also not perfect. In particular, it is known to

build erroneous CFGs for methods with complex subroutine structures (combinations of the jsr and

ret instructions used for Java’s finally clause). There are few such CFGs in our sample set, so this

problem is unlikely to adversely affect our data.

[Figure 1: diagram not reproduced. Labels in the diagram: the query ‘"index of" jar’; SCRIPT; the jar-files mosaic.jar, mysql.jar, and OligoWiz 1.0.1.jar through OligoWiz 1.0.5.jar; the output files classesPERpackage.jgr, methodsPERclass.jgr, and basicblocksPERmethod.jgr.]

Figure 1. Overview of how our statistics were gathered.

Owing to our random sampling of jar-files from the Internet, the collection is somewhat

idiosyncratic. We assume that any two jar-files with the same name are in fact the same program, and

keep only one. However, we kept those files whose names indicated that they were different versions

of the same program, as shown by the OligoWiz files in Figure 1. Most likely, these files are very

similar and may contain methods that are identical between versions. It is reasonable to assume that

such redundancy will have somewhat skewed our results. An alternative strategy might have been to

guess (based on the file name) which files are versions of the same program, and keep only the

higher-numbered file. A less random sampling of programs could also have been collected from well-known

repositories of Java code, such as sourceforge.net.

Giving an informative presentation of this type of data turns out to be difficult. In many applications

we will only be interested in typical values (such as mode or mean) or extreme values (such as min

and max). Such values can easily be presented in tabular form. However, we would also like to be able

to quickly get a general ‘feel’ for the behavior of the data, and this is best presented in a visual form.

The visualization is complicated by the fact that most of our data have sharp ‘spikes’ and long ‘tails’.

That is, one or a few (typically small) values are very common, but there are a small number of large

outliers which by themselves are also interesting. This can be seen, for example, in Figure 30(b) below,

which shows that out of the 801 117 methods in our data, 99% have fewer than two subroutines but one

method has 29 subroutines.

[Figure 2: bar graph not reproduced. Summary box: MIN: 4, MAX: 26708, AVG: 122.1, MODE: 24, MEDIAN: 65, STD DEV: 215, SAMPLES: 102688, TOTAL: 12538316. The x-axis (Value) runs over bins from single values up to 20000-29999. Annotations: the cumulative percentage is printed above each bar; striped bars have been truncated; values have been binned.]

Figure 2. Illustration and explanation of the bar graphs used throughout the paper for data visualization.

We have chosen to visualize much of our data using binned bar graphs where extremely tall bars are

truncated to allow small values to be visualized. For example, consider the graph in Figure 2, which

shows the number of constants in the constant pool of the Java applications we studied. Most of our

graphs will have the same structure. Along the x-axis we show the bins into which our data have been

classified. On top of each bar the actual count and cumulative percentage are shown. Very tall bars are

truncated and shown striped. In a separate table to the top right of the graph we show the total number

of data points, the minimum, maximum, and average x-values, the mode§, the median (the middle

value), and the standard deviation. The SAMPLES value is the total number of items inspected for the

given statistic, and the TOTAL value is the total number of sub-items counted. For example, in the

above graph, the SAMPLES value will be the number of classes analyzed and the TOTAL value will

be the sum of all of the constant pool entries over all of the classes analyzed. The TOTAL value is

only included where appropriate. The FAILED value gives the number of unsuccessful measurements,

when appropriate.
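
The summary box and the binning described above can be sketched in a few lines of Java. This is a minimal illustration, not SandMark code, and several details are assumptions: the class name SummaryStats, the population form of the standard deviation, the choice of the upper middle element as the median for even sample counts, and the bin scheme (single values below 10, then decades, hundreds, and so on), which is inferred from the axis labels of the graphs. The cumulative percentages are truncated rather than rounded, which appears to match the printed graphs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; names and conventions are assumptions, see lead-in.
class SummaryStats {
    final int samples;               // SAMPLES: number of items measured
    final long total;                // TOTAL: sum of all measured values
    final int min, max, mode, median;
    final double avg, stdDev;

    SummaryStats(int[] values) {
        if (values.length == 0) throw new IllegalArgumentException("no samples");
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        samples = sorted.length;
        min = sorted[0];
        max = sorted[samples - 1];
        median = sorted[samples / 2];            // assumption: upper middle when even

        long sum = 0;
        int bestValue = sorted[0], bestRun = 0, run = 0;
        for (int i = 0; i < samples; i++) {
            sum += sorted[i];
            // Length of the current run of equal values in the sorted array.
            run = (i > 0 && sorted[i] == sorted[i - 1]) ? run + 1 : 1;
            if (run > bestRun) { bestRun = run; bestValue = sorted[i]; }
        }
        total = sum;
        mode = bestValue;                        // smallest value wins ties
        avg = (double) sum / samples;

        double squares = 0;
        for (int v : sorted) squares += (v - avg) * (v - avg);
        stdDev = Math.sqrt(squares / samples);   // assumption: population form
    }

    // Lower bound of the bin holding a non-negative value v:
    // 0..9 are their own bins, then 10-19, ..., 100-199, ..., 1000-1999, ...
    static int binStart(int v) {
        if (v < 10) return v;
        int width = 1;
        while (width * 10L <= v) width *= 10;
        return (v / width) * width;
    }

    // Maps each bin's lower bound to the truncated cumulative percentage.
    static Map<Integer, Integer> cumulativePercent(int[] values) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int v : values) counts.merge(binStart(v), 1, Integer::sum);
        Map<Integer, Integer> result = new LinkedHashMap<>();
        long seen = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            seen += e.getValue();
            result.put(e.getKey(), (int) (100 * seen / values.length));
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {2, 4, 4, 4, 5, 5, 7, 9};   // made-up sample values
        SummaryStats s = new SummaryStats(data);
        System.out.println("MIN: " + s.min + " MAX: " + s.max + " AVG: " + s.avg
            + " MODE: " + s.mode + " MEDIAN: " + s.median + " STD DEV: " + s.stdDev
            + " SAMPLES: " + s.samples + " TOTAL: " + s.total);
        System.out.println(cumulativePercent(data));
    }
}
```

Because the mode is computed on raw values while the bars show binned counts, the mode need not coincide with the tallest bar, as the footnote on the mode points out.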

3. THE STRUCTURE OF JAVA BYTECODE PROGRAMS

A Java application consists of a collection of classes and interfaces. Each class or interface is compiled

into a class file. A program consists of a number of class files which are collected together into a jar-file.

§The mode is the most frequently occurring value. This is often—but because of binning not always—the tallest bar of the graph.

[Figure 3: diagram not reproduced. It lists the sections of a class file (Magic Number, Constant Pool, Access Flags, This Class, Super Class, Interfaces, Fields, Methods, Attributes); example constant pool entries (String: "HELLO", Method: "C.M(int)", Field: "int F"); a field table with columns Flags, Name, Type, Attributes (entries [pub] "x" int and [priv] "y" C); a method table with columns Flags, Name, Sig, Attributes (entry [pub] "q" ()int, with Code and Exceptions attributes); and a Code attribute containing Code[] = push, add, store, ..., MaxStack=5, MaxLocals=8, an ExceptionTable, and further Attributes.]

Figure 3. A view of the Java class file format.

A jar-file is directly executable by a Java virtual machine interpreter. The Java class file stores all

necessary data regarding the class. There is a symbol table (called the Constant Pool) which stores

strings, large literal integers and floats, and names and types of all fields and methods. Each method is

compiled to Java bytecode, a stack-based virtual machine instruction set. Figure 3 shows the structure

of the Java class file format. The JVM is defined by Lindholm and Yellin [10].

The Java bytecodes can manipulate data in several formats: integers (32 bits), longs (64 bits), floats

(32 bits), doubles (64 bits), shorts (16 bits), bytes (8 bits), Booleans (1 bit), chars (16 bit Unicode),

object references (32 bit pointers), and arrays. The Boolean, byte, char, and short types are compiled

down into integers.

Bytecode instructions have variable widths. Simple instructions such as iadd (integer addition) are

one byte wide, while some instructions (such as tableswitch) can be multiple bytes. Each method

can have up to 65 536 local variables and formal parameters, called slots. The bytecodes reference

slots by number. For example, the instruction ‘iload 3’ pushes the integer stored in slot 3 onto the

stack. In order to access high-numbered slots, a special wide instruction can be used to modify load

and store instructions to use 16-bit indexes. The Java execution stack is 32 bits wide. Longs and doubles

take up two stack entries and two slot numbers.
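
To make the variable-width encoding concrete, the sketch below picks the shortest encoding for loading an integer from a given slot. It is an illustration only: the class and method names are hypothetical, and the opcode numbers are taken from Tables III and VI (iload is 21, iload_0 to iload_3 are 26 to 29, wide is 196).

```java
// Hypothetical helper illustrating the three encodings of an integer load.
class IntLoadEncoder {
    static final int ILOAD = 21;    // iload with an 8-bit slot index
    static final int ILOAD_0 = 26;  // iload_0; iload_1..iload_3 follow
    static final int WIDE = 196;    // prefix enabling a 16-bit slot index

    // Returns the bytes of the shortest instruction that pushes the int in 'slot'.
    static byte[] encodeIload(int slot) {
        if (slot < 0 || slot > 0xFFFF) {
            throw new IllegalArgumentException("slot out of range: " + slot);
        }
        if (slot <= 3) {
            // One byte: the slot number is folded into the opcode.
            return new byte[] { (byte) (ILOAD_0 + slot) };
        }
        if (slot <= 0xFF) {
            // Two bytes: opcode followed by an 8-bit index.
            return new byte[] { (byte) ILOAD, (byte) slot };
        }
        // Four bytes: wide prefix, opcode, then a big-endian 16-bit index.
        return new byte[] { (byte) WIDE, (byte) ILOAD, (byte) (slot >>> 8), (byte) slot };
    }

    public static void main(String[] args) {
        for (int slot : new int[] {3, 4, 300}) {
            System.out.println("slot " + slot + " needs "
                + encodeIload(slot).length + " byte(s)");
        }
    }
}
```

The same three-way choice applies to the other load and store families, with different base opcodes.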


Table II. Notation used to refer to data values in the bytecode.

Notation   Explanation
B          An 8-bit integer value
S          A 16-bit integer value
L          A 32-bit integer value
Cb         An 8-bit constant pool index
Cs         A 16-bit constant pool index
Fb         An 8-bit local variable index
Fs         A 16-bit local variable index
C[i]       The ith constant pool entry
V[i]       The ith variable/formal parameter in the current method

Local variable slots are untyped. In fact, a particular slot can hold different types of data at different

locations in a method. However, regardless of how execution reaches a given location in the method,

the type of data stored in a particular slot at that location will always be the same. A static analysis

known as a stack simulation can compute slot types without executing a method.

Some bytecodes reference data from the class’ constant pool, for example to push large constants

or to invoke methods. Constant pool references are 8 or 16 bits long. To push a reference to a literal

string with constant pool number 4567, the compiler would issue the instruction ‘ldc_w 4567’.

If the constant pool number instead fits into a byte (such as 123), the shorter instruction ‘ldc 123’

would suffice.
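
The choice between the two instructions can be sketched as follows. This is a hedged illustration rather than any particular compiler's code: the class and method names are hypothetical, and the opcode numbers (18 for ldc, 19 for ldc_w) come from Table III.

```java
// Hypothetical helper choosing between ldc and ldc_w for a constant pool index.
class LdcEncoder {
    static final int LDC = 18;    // ldc: 8-bit constant pool index
    static final int LDC_W = 19;  // ldc_w: 16-bit constant pool index

    static byte[] encodeLdc(int cpIndex) {
        // Index 0 is not a valid constant pool entry in the class file format.
        if (cpIndex < 1 || cpIndex > 0xFFFF) {
            throw new IllegalArgumentException("bad constant pool index: " + cpIndex);
        }
        if (cpIndex <= 0xFF) {
            return new byte[] { (byte) LDC, (byte) cpIndex };
        }
        // The 16-bit index is stored high byte first.
        return new byte[] { (byte) LDC_W, (byte) (cpIndex >>> 8), (byte) cpIndex };
    }

    public static void main(String[] args) {
        System.out.println("index 123 uses " + encodeLdc(123).length + " bytes");
        System.out.println("index 4567 uses " + encodeLdc(4567).length + " bytes");
    }
}
```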

Some information is stored in attributes in the class file. This includes exception table ranges, and

(for debugging) line-number ranges and local variable names.

Table II explains the notation used in Tables III–VI, which give an overview of the JVM instruction

set.

4. PROGRAM-LEVEL STATISTICS

In this and the following three sections we will present the data collected about applications

(this section), classes (Section 5), methods (Section 6), and instructions (Section 7).

Figures 4–7 visualize application-level data about the programs we gathered.

4.1. Packages

Classes in Java are optionally organized into a hierarchy of packages. For example, Java’s String

class is in the package java.lang, and can be referred to as java.lang.String. As can be seen

from Figure 4(a), many Java programs put all classes into the same package. In fact, half of the 1132

applications we gathered have three or fewer packages, and only four have 50 or more.

Table III. The first 87 Java bytecode instructions.

Opcode     Mnemonic      Args   Stack               Description
0          nop                  [] ⇒ []
1          aconst_null          [] ⇒ [null]         Push null object.
2          iconst_m1            [] ⇒ [−1]           Push −1.
3...8      iconst_n             [] ⇒ [n]            Push integer constant n, 0 ≤ n ≤ 5.
9...10     lconst_n             [] ⇒ [n]            Push long constant n, 0 ≤ n ≤ 1.
11...13    fconst_n             [] ⇒ [n]            Push float constant n, 0 ≤ n ≤ 2.
14...15    dconst_n             [] ⇒ [n]            Push double constant n, 0 ≤ n ≤ 1.
16         bipush        n:B    [] ⇒ [n]            Push 1-byte signed integer.
17         sipush        n:S    [] ⇒ [n]            Push 2-byte signed integer.
18         ldc           n:Cb   [] ⇒ [C[n]]         Push item from constant pool.
19         ldc_w         n:Cs   [] ⇒ [C[n]]         Push item from constant pool.
20         ldc2_w        n:Cs   [] ⇒ [C[n]]         Push long/double from constant pool.
21...25    Xload         n:Fb   [] ⇒ [V[n]]         X ∈ {i,l,f,d,a}. Load int, long, float, double, object from local var.
26...29    iload_n              [] ⇒ [V[n]]         Load local integer var n, 0 ≤ n ≤ 3.
30...33    lload_n              [] ⇒ [V[n]]         Load local long var n, 0 ≤ n ≤ 3.
34...37    fload_n              [] ⇒ [V[n]]         Load local float var n, 0 ≤ n ≤ 3.
38...41    dload_n              [] ⇒ [V[n]]         Load local double var n, 0 ≤ n ≤ 3.
42...45    aload_n              [] ⇒ [V[n]]         Load local object var n, 0 ≤ n ≤ 3.
46...53    Xload                [A, I] ⇒ [V]        X ∈ {ia,la,fa,da,aa,ba,ca,sa}. Push the value V (an int, long, etc.) stored at index I of array A.
54...58    Xstore        n:Fb   [V[n]] ⇒ []         X ∈ {i,l,f,d,a}. Store int, long, float, double, object to local var.
59...62    istore_n             [V[n]] ⇒ []         Store to local integer var n, 0 ≤ n ≤ 3.
63...66    lstore_n             [V[n]] ⇒ []         Store to local long var n, 0 ≤ n ≤ 3.
67...70    fstore_n             [V[n]] ⇒ []         Store to local float var n, 0 ≤ n ≤ 3.
71...74    dstore_n             [V[n]] ⇒ []         Store to local double var n, 0 ≤ n ≤ 3.
75...78    astore_n             [V[n]] ⇒ []         Store to local object var n, 0 ≤ n ≤ 3.
79...86    Xstore               [A, I, V] ⇒ []      X ∈ {ia,la,fa,da,aa,ba,ca,sa}. Store the value V (an int, long, etc.) at index I of array A.

A package α is counted if there exists some class β such that the fully qualified classname of β is

α.β. Thus, if an application has classes java.pack1.Class1 and java.pack2.Class2 then java.pack1 and

java.pack2 would be counted, but java would not. Also, the default or ‘null’ package is counted exactly

once, if there is a class in that package.
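
The counting rule above can be expressed as a short sketch. It is illustrative only: the class name PackageCounter is hypothetical, class names are assumed to be given in dotted form, and the default package is represented by the empty string.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper applying the package-counting rule to class names.
class PackageCounter {
    // A package is counted only if some class sits directly in it, so
    // enclosing packages such as "java" are not added on their own.
    static Set<String> packagesOf(Iterable<String> fullyQualifiedClassNames) {
        Set<String> packages = new TreeSet<>();
        for (String name : fullyQualifiedClassNames) {
            int lastDot = name.lastIndexOf('.');
            // No dot means the class is in the default package (""),
            // which the set stores at most once.
            packages.add(lastDot < 0 ? "" : name.substring(0, lastDot));
        }
        return packages;
    }

    public static void main(String[] args) {
        Set<String> packages = packagesOf(Arrays.asList(
            "java.pack1.Class1", "java.pack2.Class2", "Main", "Helper"));
        System.out.println(packages.size() + " packages: " + packages);
    }
}
```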

Figure 5(a) shows that while a small number of programs have packages with hundreds of classes,

the typical package will have only one, and the average is about 11.8.

Packages can be nested inside of other packages, allowing for the easy creation of unique names.

While it is possible to create a package hierarchy of arbitrary depth, Figure 4(b) shows that the

maximum depth for an application is 8, with an average depth of 3.9.

A Java interface is a special type of class that only contains constant declarations or method

signatures. A class which implements an interface must provide implementations of the methods.


Table IV. Java bytecode instructions 87 to 169.

Opcode      Mnemonic       Args        Stack                                Description
87          pop                        [A] ⇒ []                             Pop top of stack.
88          pop2                       [A, B] ⇒ []                          Pop 2 elements.
89          dup                        [V] ⇒ [V, V]                         Duplicate top of stack.
90          dup_x1                     [B, V] ⇒ [V, B, V]
91          dup_x2                     [B, C, V] ⇒ [V, B, C, V]
92          dup2                       [V, W] ⇒ [V, W, V, W]
93          dup2_x1                    [A, V, W] ⇒ [V, W, A, V, W]
94          dup2_x2                    [A, B, V, W] ⇒ [V, W, A, B, V, W]
95          swap                       [A, B] ⇒ [B, A]                      Swap top stack elements.
96...99     Xadd                       [A, B] ⇒ [R]                         X ∈ {i,l,d,f}. R = A + B.
100...103   Xsub                       [A, B] ⇒ [R]                         X ∈ {i,l,d,f}. R = A − B.
104...107   Xmul                       [A, B] ⇒ [R]                         X ∈ {i,l,d,f}. R = A ∗ B.
108...111   Xdiv                       [A, B] ⇒ [R]                         X ∈ {i,l,d,f}. R = A / B.
112...115   Xrem                       [A, B] ⇒ [R]                         X ∈ {i,l,d,f}. R = A % B.
116...119   Xneg                       [A] ⇒ [R]                            X ∈ {i,l,d,f}. R = −A.
120...121   Xshl                       [A, B] ⇒ [R]                         X ∈ {i,l}. R = A << B.
122...123   Xshr                       [A, B] ⇒ [R]                         X ∈ {i,l}. R = A >> B.
124...125   Xushr                      [A, B] ⇒ [R]                         X ∈ {i,l}. R = A >>> B.
126...127   Xand                       [A, B] ⇒ [R]                         X ∈ {i,l}. R = A & B.
128...129   Xor                        [A, B] ⇒ [R]                         X ∈ {i,l}. R = A | B.
130...131   Xxor                       [A, B] ⇒ [R]                         X ∈ {i,l}. R = A xor B.
132         iinc           V:Fb, B:B   [] ⇒ []                              V += B.
133...144   X2Y                        [F] ⇒ [T]                            Convert F from type X to T of type Y. X ∈ {i,l,f,d}, Y ∈ {i,l,f,d}.
145...147   i2X                        [F] ⇒ [T]                            X ∈ {b,c,s}. Convert integer F to byte, char, or short.
148         lcmp                       [A, B] ⇒ [V]                         Compare long values. A > B ⇒ V = 1, A < B ⇒ V = −1, A = B ⇒ V = 0.
149,151     Xcmpl                      [A, B] ⇒ [V]                         Compare float or double values. X ∈ {f,d}. A > B ⇒ V = 1, A < B ⇒ V = −1, A = B ⇒ V = 0. A = NaN ∨ B = NaN ⇒ V = −1.
150,152     Xcmpg                      [A, B] ⇒ [V]                         Compare float or double values. X ∈ {f,d}. A > B ⇒ V = 1, A < B ⇒ V = −1, A = B ⇒ V = 0. A = NaN ∨ B = NaN ⇒ V = 1.
153...158   if<op>         L:S         [A] ⇒ []                             <op> ∈ {eq,ne,lt,ge,gt,le}. If A <op> 0 goto L + pc.
159...164   if_icmp<op>    L:S         [A, B] ⇒ []                          <op> ∈ {eq,ne,lt,ge,gt,le}. If A <op> B goto L + pc.
165...166   if_acmp<op>    L:S         [A, B] ⇒ []                          <op> ∈ {eq,ne}. A, B are object refs. If A <op> B goto L + pc.
167         goto           I:S         [] ⇒ []                              Jump to I + pc.
168         jsr            I:S         [] ⇒ [A]                             Jump subroutine to instruction I + pc. A = the address of the instruction after the jsr.
169         ret            L:Fb        [] ⇒ []                              Return from subroutine. Address in local var L.

Interfaces are often used to compensate for Java’s lack of multiple inheritance. Figure 5(b) shows

that over 70% of Java packages contain 0 or 1 interface.

4.2. Protection

A Java class can be declared as abstract (it serves only as a superclass to classes which actually

implement its methods) or final (it cannot be extended). These declarations are, however, fairly

unusual. Figures 6(a) and 6(b) show that over 70% of all packages contain no abstract or final classes.


Table V. Java bytecode instructions 170 to 195.

Opcode      Mnemonic          Args                     Stack
170         tableswitch       D:L, l,h:L, o^(h−l+1)    [K] ⇒ []
            Jump through the K:th offset. Else goto D.
171         lookupswitch      D:L, n:L, (m,o)^n        [K] ⇒ []
            If, for one of the (m, o) pairs, K = m, then goto o. Else goto D.
172...176   Xreturn                                    [V] ⇒ []
            X ∈ {i,f,l,d,a}. Return V.
177         return                                     [] ⇒ []
            Return from void method.
178         getstatic         F:Cs                     [] ⇒ [V]
            Push value V of static field F.
179         putstatic         F:Cs                     [V] ⇒ []
            Store value V into static field F.
180         getfield          F:Cs                     [R] ⇒ [V]
            Push value V of field F in object R.
181         putfield          F:Cs                     [R, V] ⇒ []
            Store value V into field F of object R.
182         invokevirtual     P:Cs                     [R, A1, A2, ...] ⇒ []
            Call virtual method P, with arguments A1 ... An, through object reference R.
183         invokespecial     P:Cs                     [R, A1, A2, ...] ⇒ []
            Call private/init/superclass method P, with arguments A1 ... An, through object reference R.
184         invokestatic      P:Cs                     [A1, A2, ...] ⇒ []
            Call static method P with arguments A1 ... An.
185         invokeinterface   P:Cs, n:S                [R, A1, A2, ...] ⇒ []
            Call interface method P, with n arguments A1 ... An, through object reference R.
187         new               T:Cs                     [] ⇒ [R]
            Create a new object R of type T.
188         newarray          T:B                      [C] ⇒ [R]
            Allocate new array R, element type T, C elements long.
189         anewarray         T:Cs                     [C] ⇒ [A]
            Allocate new array A of reference types, element type T, C elements long.
190         arraylength                                [A] ⇒ [L]
            Determines the length L of array A.
191         athrow                                     [R] ⇒ [?]
            Throw exception.
192         checkcast         C:Cs                     [R] ⇒ [R]
            Ensures that R is of type C.
193         instanceof        C:Cs                     [R] ⇒ [V]
            Push 1 if object R is an instance of class C, else push 0.
194         monitorenter                               [R] ⇒ []
            Get lock for object R.
195         monitorexit                                [R] ⇒ []
            Release lock for object R.


Table VI. Java bytecode instructions 196 to 201.

Opcode      Mnemonic          Args                     Stack
196         wide              C:B, I:Fs                [] ⇒ []
            Perform opcode C on variable V[I]. C is one of the load/store instructions.
197         multianewarray    T:Cs, D:Cb               [d1, d2, ...] ⇒ [R]
            Create new D-dimensional multidimensional array R. d1, d2, ... are the dimension sizes.
198         ifnull            L:S                      [V] ⇒ []
            If V = null goto L.
199         ifnonnull         L:S                      [V] ⇒ []
            If V ≠ null goto L.
200         goto_w            I:L                      [] ⇒ []
            Goto instruction I.
201         jsr_w             I:L                      [] ⇒ [A]
            Jump subroutine to instruction I. A is the address of the instruction right after the jsr_w.

[Figure 4: bar graphs not reproduced. Summary boxes: (a) MIN: 1, MAX: 74, AVG: 6.8, MODE: 1, MEDIAN: 2, STD DEV: 9, SAMPLES: 1132, TOTAL: 7682; (b) MIN: 1, MAX: 8, AVG: 3.9, MODE: 5, MEDIAN: 3, STD DEV: 1, SAMPLES: 1132.]

Figure 4. Program-level statistics: (a) number of packages per application; (b) package depth per application.


[Figure 5 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 310, AVG: 11.8, MODE: 1, MEDIAN: 5, STD DEV: 21, SAMPLES: 7682, TOTAL: 90500.

(b) MIN: 0, MAX: 160, AVG: 1.6, MODE: 0, MEDIAN: 0, STD DEV: 4, SAMPLES: 7682, TOTAL: 12188.

Figure 5. Program-level statistics: (a) number of classes per package; (b) number of interfaces per package.

4.3. Inheritance graphs

In addition to a class implementing an interface, a class can also extend another class. In this case

the subclass inherits all of the variables and methods of the class which it extends (the superclass),

thus creating an inheritance relationship. An inheritance graph can be constructed to represent the

superclass/subclass relationship. An inheritance graph is a rooted, directed, acyclic graph where the

nodes are classes and interfaces. There is an edge from node A to node B iff node B directly extends

or implements node A. Thus, classes will have edges to their direct subclasses, and interfaces will

have edges to the classes that implement them and to the interfaces that extend them. In order to make

this a rooted graph, we assert that all interfaces extend the class java.lang.Object (and, hence,

java.lang.Object becomes the root). Using this definition, we can see that node B is reachable

from node A iff it is possible to assign a value of type B to a variable of type A. The inheritance graph


[Figure 6 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 37, AVG: 0.6, MODE: 0, MEDIAN: 0, STD DEV: 1, SAMPLES: 7682, TOTAL: 4768.

(b) MIN: 0, MAX: 270, AVG: 1.3, MODE: 0, MEDIAN: 0, STD DEV: 6, SAMPLES: 7682, TOTAL: 10299.

Figure 6. Program-level statistics: (a) number of abstract classes per package; (b) number of final classes per package.

height for a given application is the maximum number of superclasses that any class in the application

has. This will include some but not all of the Java library classes.
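The notions of depth and height defined above can be illustrated with a minimal Java sketch. It assumes the classes of interest are loaded in a running JVM and inspected through reflection, which is only a stand-in for the static class-file analysis used in this study; the class name InheritanceHeight and its methods are hypothetical. Interfaces are not handled here, since getSuperclass() returns null for them, whereas the text treats them as extending java.lang.Object.

```java
class InheritanceHeight {
    // Number of superclasses of c: java.lang.Object has 0,
    // a class that directly extends Object has 1, and so on.
    static int depth(Class<?> c) {
        int d = 0;
        for (Class<?> s = c.getSuperclass(); s != null; s = s.getSuperclass()) {
            d++;
        }
        return d;
    }

    // Height for a set of classes: the maximum number of
    // superclasses that any of the given classes has.
    static int height(Class<?>... classes) {
        int h = 0;
        for (Class<?> c : classes) {
            h = Math.max(h, depth(c));
        }
        return h;
    }

    public static void main(String[] args) {
        // ArrayList -> AbstractList -> AbstractCollection -> Object
        System.out.println(height(Object.class, String.class, java.util.ArrayList.class)); // prints 3

        // Reachability in the graph corresponds to assignability:
        // an ArrayList value can be assigned to an AbstractList variable.
        System.out.println(java.util.AbstractList.class.isAssignableFrom(java.util.ArrayList.class)); // prints true
    }
}
```

Because the walk follows superclass links up to java.lang.Object, library classes on the path are counted, which matches the remark that the height includes some of the Java library classes.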

Figure 7(a) shows that on average the height of an inheritance graph for an application is 4.5 and

that over 90% of all applications have an inheritance graph with a height of less than 7.

It is important to note that for 303 of our applications, the inheritance graph construction failed due

to the jar-file not being self-contained. This means that some class in the application had a superclass

that was unavailable for analysis, so we could not place it correctly in the inheritance graph.

This is typical of applications that rely on external libraries which are not packaged with them.

Figure 7(b) shows the number of classes in each application that extend other classes in the same

application, as opposed to Java library classes. Some classes in each application must directly extend

a Java library class (most often java.lang.Object), but it is interesting to note that only about

one-third (32 899/90 500) extend other classes inside the same application. Table VII shows that the largest

share of application classes (47.1%) extend java.lang.Object directly.


[Figure 7 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 1, MAX: 10, AVG: 4.5, MODE: 5, MEDIAN: 4, STD DEV: 1, SAMPLES: 1132, FAILED: 303.

(b) MIN: 0, MAX: 641, AVG: 29.1, MODE: 0, MEDIAN: 5, STD DEV: 67, SAMPLES: 1132, TOTAL: 32899.

Figure 7. Inheritance graphs: (a) height per application; (b) number of user-class extenders per application.

5. CLASS-LEVEL STATISTICS

In this section we present data regarding the top-level structure of class files. This includes the number,

type/signature, and protection of fields and methods, and the class’ or interface’s position in the

application’s inheritance graph.

5.1. Fields

A Java class can contain data members, called fields. Fields are either class variables (they are declared

static and only one instance exists at runtime) or instance variables (every instantiation of the class

contains a unique copy).

Figures 8–10 show field statistics. In Figure 8(a) we see that 60% of all classes have two or fewer

fields, but in one extreme case a class declared almost a thousand fields. Instance variables are more

common than class variables. On average, a class will contain 2.8 instance variables and 1.6 class

variables, and 44% of all classes have more than one instance variable but only 17% have more than


Table VII. Most common standard classes to be extended by application classes.

Class  Count  %

java.lang.Object  42 629  47.1
user class  34 805  38.5
java.lang.Exception  1089  1.2
javax.swing.AbstractAction  893  1.0
java.lang.Thread  738  0.8
javax.swing.JPanel  691  0.8
java.lang.RuntimeException  464  0.5
java.awt.event.WindowAdapter  341  0.4
java.awt.Panel  313  0.3
java.awt.event.MouseAdapter  309  0.3
java.util.ListResourceBundle  276  0.3
java.util.EventObject  248  0.3
java.io.FilterInputStream  232  0.3
org.omg.CORBA.portable.ObjectImpl  226  0.2
org.omg.CORBA.SystemException  217  0.2
org.xml.sax.helpers.DefaultHandler  203  0.2
java.awt.Dialog  203  0.2
java.io.FilterOutputStream  202  0.2
java.applet.Applet  202  0.2
java.awt.Canvas  197  0.2
java.io.OutputStream  196  0.2
java.awt.Frame  194  0.2
java.io.IOException  192  0.2
java.io.InputStream  183  0.2
javax.swing.JFrame  149  0.2
javax.swing.JDialog  135  0.1
org.omg.CORBA.UserException  126  0.1
java.lang.Error  120  0.1
java.beans.SimpleBeanInfo  119  0.1
java.awt.event.KeyAdapter  118  0.1
javax.swing.table.AbstractTableModel  104  0.1
java.awt.event.FocusAdapter  101  0.1
java.util.AbstractSet  94  0.1
java.security.Signature  80  0.1
javax.swing.beaninfo.SwingBeanInfo  79  0.1
java.security.GeneralSecurityException  78  0.1
org.xml.sax.SAXException  70  0.1
javax.swing.JComponent  70  0.1
javax.swing.event.InternalFrameAdapter  60  0.1
java.util.Hashtable  57  0.1
java.lang.IllegalArgumentException  56  0.1
java.io.Writer  54  0.1
java.util.AbstractList  51  0.1
java.util.Properties  50  0.1
java.io.Reader  49  0.1


[Figure 8 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 968, AVG: 4.4, MODE: 1, MEDIAN: 2, STD DEV: 11, SAMPLES: 90500, TOTAL: 396816.

(b) MIN: 0, MAX: 742, AVG: 2.8, MODE: 0, MEDIAN: 1, STD DEV: 6, SAMPLES: 90500, TOTAL: 248930.

(c) MIN: 0, MAX: 555, AVG: 1.6, MODE: 0, MEDIAN: 0, STD DEV: 9, SAMPLES: 90500, TOTAL: 147886.

Figure 8. Field declarations in classes: (a) number of fields per class; (b) number of instance variables per class; (c) number of class (static) variables per class.


[Figure 9 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 553, AVG: 1.5, MODE: 0, MEDIAN: 0, STD DEV: 7, SAMPLES: 102688, TOTAL: 152468.

(b) MIN: 0, MAX: 968, AVG: 2.6, MODE: 0, MEDIAN: 1, STD DEV: 8, SAMPLES: 102688, TOTAL: 270023.

(c) MIN: 0, MAX: 967, AVG: 1.4, MODE: 0, MEDIAN: 0, STD DEV: 8, SAMPLES: 90500, TOTAL: 126861.

Figure 9. Field declarations in classes: (a) number of primitive fields per class or interface; (b) number of reference fields per class or interface; (c) number of final fields per class.


[Figure 10 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 52, AVG: 0.0, MODE: 0, MEDIAN: 0, STD DEV: 0, SAMPLES: 90500, TOTAL: 3213.

(b) MIN: 0, MAX: 6, AVG: 0.0, MODE: 0, MEDIAN: 0, STD DEV: 0, SAMPLES: 90500, TOTAL: 256.

Figure 10. Field declarations in classes: (a) number of transient fields per class; (b) number of volatile fields per class.

one static variable. It is also more common for a class to have fields of reference types rather than

primitive types. On average, a class will have 1.5 fields of primitive type, but 2.6 fields of reference

type.

Fields may also be declared final, transient, or volatile. A final field is one whose value cannot be

altered after it is first assigned in the instance or class initializer. A transient field is one that is not part

of the persistent state of its parent object. A volatile field is one that cannot be internally cached by

the JVM, since it is assumed to be accessed by multiple threads. Figures 9(c), 10(a), and 10(b) report

on the number of final, transient, and volatile fields per class, respectively. We see that 98% of all

classes have no transient fields, 99% have no volatile fields, and more than half of all classes have no

final fields. The rareness of these modifiers makes the outliers in these graphs particularly interesting.

While in general there are very few transient fields, Figure 10(a) shows us that one class had 52 of

them. Also, when we compare Figures 9(b) and 9(c), we see that they have very similar MAX values.

On closer inspection, this is due to a single class in the file ‘kawa-1.7.jar’ which has 968 fields,

all of which are reference types, and only one of which is not final.
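The field categories discussed in this section can be made concrete with a short Java sketch. It counts fields by modifier using reflection, which is an assumption made for illustration (the study itself analyzed class files statically); the classes FieldModifierCount and Sample and all field names are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

class FieldModifierCount {
    // Hypothetical sample class with one field of each kind discussed in the text.
    static class Sample {
        static int instances;      // class (static) variable, primitive type
        final String name = "x";   // final instance variable, reference type
        transient Object cache;    // not part of the persistent state of the object
        volatile boolean done;     // assumed to be accessed by multiple threads
        int[] data;                // plain instance variable, reference (array) type
    }

    // Counts the declared, non-synthetic fields of c whose modifiers
    // include every bit set in mask (mask 0 counts all fields).
    static int count(Class<?> c, int mask) {
        int n = 0;
        for (Field f : c.getDeclaredFields()) {
            if (!f.isSynthetic() && (f.getModifiers() & mask) == mask) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("fields:    " + count(Sample.class, 0));
        System.out.println("static:    " + count(Sample.class, Modifier.STATIC));
        System.out.println("final:     " + count(Sample.class, Modifier.FINAL));
        System.out.println("transient: " + count(Sample.class, Modifier.TRANSIENT));
        System.out.println("volatile:  " + count(Sample.class, Modifier.VOLATILE));
    }
}
```

Synthetic fields are skipped so that compiler-generated members do not distort the counts; whether the study applied the same filter is not stated in the text.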

Table VIII gives a breakdown of the declared types of fields. Only primitive types and types exported

from the Java standard library are shown. Our data also contained some user defined types with high

usage counts. This is due to idiosyncrasies of our collected programs, such as a program declaring

vast numbers of fields of one of its classes. Table VIII shows that the most common field types are ints,

Strings, and booleans (together 43.2% by the percentages in the table). We note that, somewhat surprisingly, java.lang.Class (Java’s notion

of a class) is a frequent field type, and doubles are more frequent than floats.

5.2. Constant pool

Figure 11 shows the number of entries in the constant pool (the class file’s symbol table) per class.

While small literal integers are stored directly in the bytecode, large integers as well as Strings and

real numbers are, instead, stored in the constant pool. Figures 12–14 show the relative distribution of

literal types.
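The split between constants embedded in the bytecode and constants stored in the constant pool can be sketched as follows. The ranges are those of the JVM instructions iconst_m1 to iconst_5, bipush, and sipush; the choice of instruction is what a typical compiler such as javac would be expected to make, and the class and method names are hypothetical.

```java
class IntConstantEncoding {
    // Returns the instruction family a typical Java compiler would use
    // to push the int constant v onto the operand stack.
    static String pushInstruction(int v) {
        if (v >= -1 && v <= 5) {
            return "iconst";   // iconst_m1 .. iconst_5: value encoded in the opcode itself
        }
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            return "bipush";   // one-byte signed immediate operand
        }
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            return "sipush";   // two-byte signed immediate operand
        }
        return "ldc";          // value is loaded from a constant pool entry
    }

    public static void main(String[] args) {
        int[] samples = {0, 5, 6, 127, 128, 32767, 32768, -129, 1000000};
        for (int v : samples) {
            System.out.println(v + " -> " + pushInstruction(v));
        }
    }
}
```

Only the last case contributes an integer entry to the constant pool, which is consistent with integer entries being absent from most classes.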


Table VIII. Most common field types.

Field type  Count  %

int  153 861  21.8
java.lang.String  105 787  15.0
boolean  44 914  6.4
java.lang.Class  24 355  3.4
long  16 556  2.3
java.lang.Object  14 472  2.0
byte[]  10 229  1.4
int[]  8157  1.1
java.util.Vector  7601  1.0
java.util.Hashtable  7095  1.0
short  7048  1.0
byte  6464  0.9
java.lang.String[]  6412  0.9
java.util.Map  5692  0.8
double  5256  0.7
java.util.List[]  4971  0.7
float  3115  0.4
java.io.File  3019  0.4
char[]  2995  0.4
java.math.BigInteger  2782  0.3
java.lang.StringBuffer  2472  0.3
java.sql.Connection  2443  0.3
javax.swing.JLabel  2066  0.3
java.util.HashMap  2064  0.3
java.awt.Color  2058  0.3
char  1987  0.3
java.util.ArrayList  1748  0.2

The difference between Figures 14(a) and 14(b) is that ‘string constants’ are user-defined strings,

such as all literal strings that appear in the source code. The ‘UTF8 strings’ include all user strings,

but also include strings used internally by the classfile format, as well as the names of all referenced

classes, interfaces, methods, fields, etc. It is interesting to note that UTF8 strings comprise over half of

the total constants counted, whereas the sum of all numeric constants is less than 2% of the total.

5.3. Methods

Figures 15–17 give statistics of methods. Of interest is that 73% of all classes have nine or fewer

methods (Figure 15(a)), and that the vast majority of classes have no abstract or native methods

(Figures 15(b) and 16(a)). Almost all classes have at least one virtual method, with an average of

7.7 virtual methods per class (Figure 17(a)). Static methods are quite rare: 80% of all classes have at most one

static method, with an average of 1.3 static methods per class (Figure 16(b)).


[Figure 11 histogram; summary statistics as printed in the panel:]

MIN: 4, MAX: 26708, AVG: 122.1, MODE: 24, MEDIAN: 65, STD DEV: 215, SAMPLES: 102688, TOTAL: 12538316.

Figure 11. Number of constant pool entries per class or interface.

[Figure 12 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 2818, AVG: 1.9, MODE: 0, MEDIAN: 0, STD DEV: 42, SAMPLES: 102688, TOTAL: 190363.

(b) MIN: 0, MAX: 1033, AVG: 0.3, MODE: 0, MEDIAN: 0, STD DEV: 13, SAMPLES: 102688, TOTAL: 33383.

Figure 12. Literal constants in classes: (a) number of integer entries per class or interface; (b) number of long entries per class or interface.


[Figure 13 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 287, AVG: 0.1, MODE: 0, MEDIAN: 0, STD DEV: 3, SAMPLES: 102688, TOTAL: 7278.

(b) MIN: 0, MAX: 1292, AVG: 0.1, MODE: 0, MEDIAN: 0, STD DEV: 5, SAMPLES: 102688, TOTAL: 12148.

Figure 13. Literal constants in classes: (a) number of float entries per class or interface; (b) number of double entries per class or interface.

5.4. Member protection

Figures 18 and 19 show the frequency of visibility restrictions of class members (fields and methods).

A member can be package private, private, protected, or public. Figure 19(c)

summarizes the information by giving average numbers of members with a particular protection.
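The four protection levels can be distinguished from a member's modifier bits, as in the following minimal Java sketch. It uses java.lang.reflect.Modifier from the standard library; the class Visibility and its method are hypothetical names introduced for illustration only.

```java
import java.lang.reflect.Modifier;

class Visibility {
    // Classifies a field's or method's modifier bits into one of the
    // four protection levels. "Package private" is the default when
    // none of the three visibility keywords is present.
    static String of(int modifiers) {
        if (Modifier.isPublic(modifiers)) {
            return "public";
        }
        if (Modifier.isProtected(modifiers)) {
            return "protected";
        }
        if (Modifier.isPrivate(modifiers)) {
            return "private";
        }
        return "package private";
    }

    public static void main(String[] args) throws NoSuchMethodException {
        // Object.toString() is public; Object.clone() is protected.
        System.out.println(of(Object.class.getDeclaredMethod("toString").getModifiers()));
        System.out.println(of(Object.class.getDeclaredMethod("clone").getModifiers()));
    }
}
```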

5.5. Inheritance

Figure 20 shows information about class inheritance. Figure 20(a) shows the number of immediate

subclasses of a class, i.e. the number of classes that directly extend a particular class. Figure 20(b)

shows the number of classes that directly or indirectly extend a particular class. We found that 97%

of all classes have two or fewer direct subclasses. One of the classes in our collection is extended


[Figure 14 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 13266, AVG: 7.1, MODE: 0, MEDIAN: 1, STD DEV: 73, SAMPLES: 102688, TOTAL: 726412.

(b) MIN: 2, MAX: 13335, AVG: 64.1, MODE: 14, MEDIAN: 38, STD DEV: 101, SAMPLES: 102688, TOTAL: 6582750.

Figure 14. Literal constants in classes: (a) number of string entries per class or interface; (b) number of UTF8 string constants per class.

by 187 classes. In addition, 48% of classes are at depth 1 in the inheritance graph, i.e. they extend

java.lang.Object, the root of the inheritance graph (Figure 20(c)). The average depth of a class

is low (only 2.1), although six of our classes are at depth 30–39. In many cases we failed to build

the inheritance hierarchy due to the program containing references to classes not in the jar-file or the

standard Java library.

Figure 21 shows the same information for interfaces.

Table VII was computed by looking at which classes each application class extended. Every interface

is considered to extend java.lang.Object. Similarly, Table IX looks at which interfaces were


[Figure 15 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 570, AVG: 9.0, MODE: 2, MEDIAN: 5, STD DEV: 15, SAMPLES: 90500, TOTAL: 814603.

(b) MIN: 0, MAX: 123, AVG: 0.1, MODE: 0, MEDIAN: 0, STD DEV: 1, SAMPLES: 90500, TOTAL: 12519.

Figure 15. Method declarations in classes: (a) number of methods per class; (b) number of abstract methods per class.

extended by other interfaces. There is a bit of ambiguity here, because in Java source code an interface

uses the extends keyword to extend another interface, although technically the interface is really being

implemented. Table X shows which interfaces were implemented by any application class, including

other interfaces.

Method overriding occurs when a method in a class has the same name and signature as a method

in its superclass. This is a technique used to provide a more specialized implementation of a particular

method. Figure 22 shows that the majority of classes have at most one overridden method.
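As an illustration of the definition above, the following Java sketch declares a small hierarchy and counts the overridden methods of a class by looking for a method with the same name and parameter types in any superclass. This is a simplified approximation that uses reflection and ignores cross-package visibility rules; it is not the procedure used in this study, and all class and method names are hypothetical.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class OverrideDemo {
    static class Shape {
        String describe() { return "shape"; }
        double area() { return 0.0; }
    }

    static class Circle extends Shape {
        final double r;
        Circle(double r) { this.r = r; }

        @Override
        double area() { return Math.PI * r * r; }  // same name and signature as Shape.area
    }

    // Counts the methods declared in c that override a method of some superclass.
    static int overriddenCount(Class<?> c) {
        int n = 0;
        for (Method m : c.getDeclaredMethods()) {
            int mod = m.getModifiers();
            if (m.isSynthetic() || Modifier.isStatic(mod) || Modifier.isPrivate(mod)) {
                continue;  // such methods cannot override anything
            }
            for (Class<?> s = c.getSuperclass(); s != null; s = s.getSuperclass()) {
                try {
                    int smod = s.getDeclaredMethod(m.getName(), m.getParameterTypes()).getModifiers();
                    if (!Modifier.isStatic(smod) && !Modifier.isPrivate(smod)) {
                        n++;
                        break;
                    }
                } catch (NoSuchMethodException e) {
                    // not declared in this superclass; keep climbing
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(overriddenCount(Shape.class));   // prints 0
        System.out.println(overriddenCount(Circle.class));  // prints 1
    }
}
```

Circle inherits describe() unchanged and supplies a more specialized area(), so exactly one of its methods is counted.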

6. METHOD-LEVEL STATISTICS

In this section we present method-level statistics. This includes information about method signatures,

local variables, CFGs, and exception handlers.


[Figure 16 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 179, AVG: 0.0, MODE: 0, MEDIAN: 0, STD DEV: 0, SAMPLES: 90500, TOTAL: 1676.

(b) MIN: 0, MAX: 533, AVG: 1.3, MODE: 0, MEDIAN: 0, STD DEV: 8, SAMPLES: 90500, TOTAL: 114225.

Figure 16. Method declarations in classes: (a) number of native methods per class; (b) number of static methods per class.

6.1. Method sizes

Figure 23 shows the sizes in bytes and instructions of bytecode methods. The maximum size allowed

by the JVM is 65 535 bytes, but only one of our methods (63 019 bytes long) approached this limit.

6.2. Local variables and formal parameters

Figure 24 shows the maximum number of slots used by a method. All instance methods will use at

least one slot (for the this parameter). No method used more than 157 slots, indicating that the wide instruction (used to access up to 65 536 slots) will rarely be used.
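How parameters map to slots can be sketched with a small Java routine that reads a JVM method descriptor. It relies on two facts of the class file format: an instance method receives this in slot 0, and long (J) and double (D) values occupy two slots each while every other type occupies one. The class and method names are hypothetical, and the descriptor is assumed to be well formed.

```java
class ParameterSlots {
    // Counts the local-variable slots occupied by the parameters of a method
    // with the given descriptor, e.g. "(IJLjava/lang/String;[D)V".
    static int parameterSlots(String descriptor, boolean isStatic) {
        int slots = isStatic ? 0 : 1;          // slot 0 holds `this` for instance methods
        int i = descriptor.indexOf('(') + 1;
        while (descriptor.charAt(i) != ')') {
            char c = descriptor.charAt(i);
            if (c == '[') {
                // Array reference: skip all dimensions, then the element type.
                while (descriptor.charAt(i) == '[') {
                    i++;
                }
                if (descriptor.charAt(i) == 'L') {
                    i = descriptor.indexOf(';', i);
                }
                i++;
                slots += 1;                    // any array reference takes one slot
            } else if (c == 'L') {
                i = descriptor.indexOf(';', i) + 1;
                slots += 1;                    // object reference takes one slot
            } else {
                slots += (c == 'J' || c == 'D') ? 2 : 1;  // long and double take two
                i++;
            }
        }
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(parameterSlots("(IJLjava/lang/String;[D)V", false)); // prints 6
        System.out.println(parameterSlots("(IJLjava/lang/String;[D)V", true));  // prints 5
    }
}
```

Local-variable instructions carry a one-byte slot index and so reach slots 0 to 255 without the wide prefix, which is why a maximum of 157 slots leaves the prefix largely unnecessary for slot access.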


[Figure 17 histogram panels; summary statistics as printed in each panel:]

(a) MIN: 0, MAX: 569, AVG: 7.7, MODE: 2, MEDIAN: 4, STD DEV: 12, SAMPLES: 90500, TOTAL: 700378.

(b) MIN: 0, MAX: 558, AVG: 0.5, MODE: 0, MEDIAN: 0, STD DEV: 9, SAMPLES: 90500, TOTAL: 43710.

Figure 17. Method declarations in classes: (a) number of non-static methods per class; (b) number of final methods per class.

Table XI gives a breakdown of slot types. Note that Java’s short, byte, char, and boolean types are compiled into integers in the bytecode, and thus will not show up as distinct types. Also, a

slot may contain more than one type within a method, although at any one particular location it must

always have the same type. Table XI shows that ints and Strings make up the majority of slot types.

Only 3.8% of slots contain two types, and only 0.6% contain three types. This indicates that the design

of the JVM could have been simplified by requiring each slot to contain exactly one type throughout

the body of a method, without much adverse effect.

Slots are not explicitly typed in the bytecode. Instead, slot types have to be computed using a static

analysis known as stack simulation. This involves simulating the behavior of each instruction on the

stack and the local variable slots, while following all possible paths of control flow within the method.

A similar algorithm is used in the Java bytecode verifier.
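The idea of stack simulation can be sketched in Java for a heavily simplified setting. The sketch below handles only straight-line code with a handful of pseudo-instructions written as strings, so it does not follow multiple control-flow paths or merge states as the full analysis must; the instruction token ldc_string and all class and method names are hypothetical and are not taken from SandMark.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class StackSimulation {
    // Simulates the effect of each instruction on an abstract operand stack of
    // type names and returns, for every slot, the set of types stored into it.
    static Map<Integer, Set<String>> slotTypes(String... code) {
        Deque<String> stack = new ArrayDeque<>();          // abstract operand stack
        Map<Integer, String> current = new HashMap<>();    // type currently held by each slot
        Map<Integer, Set<String>> seen = new TreeMap<>();  // every type a slot has held
        for (String insn : code) {
            String[] parts = insn.split(" ");
            switch (parts[0]) {
                case "iconst_1":
                    stack.push("int");
                    break;
                case "ldc_string":  // stands in for ldc of a String constant
                    stack.push("java.lang.String");
                    break;
                case "iadd":
                    stack.pop();
                    stack.pop();
                    stack.push("int");
                    break;
                case "iload":
                case "aload":
                    stack.push(current.get(Integer.parseInt(parts[1])));
                    break;
                case "istore":
                case "astore": {
                    int slot = Integer.parseInt(parts[1]);
                    String type = stack.pop();
                    current.put(slot, type);
                    seen.computeIfAbsent(slot, k -> new TreeSet<>()).add(type);
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported: " + insn);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Slot 1 first holds an int and is later reused for a String.
        System.out.println(slotTypes(
            "iconst_1", "istore 1", "iload 1", "iconst_1", "iadd", "istore 2",
            "ldc_string", "astore 1"));  // prints {1=[int, java.lang.String], 2=[int]}
    }
}
```

In the example, slot 1 ends up with two types, which is the kind of slot reuse that the percentages above quantify.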


Figure 18. Protection of class members: (a) number of package private members per class; (b) number of private members per class.

(a) MIN: 0; MAX: 743; AVG: 2.0; MODE: 0; MEDIAN: 0; STD DEV: 9; SAMPLES: 90500; TOTAL: 177960

(b) MIN: 0; MAX: 119; AVG: 0.9; MODE: 0; MEDIAN: 0; STD DEV: 3; SAMPLES: 90500; TOTAL: 81574

Figure 24(b) shows the maximum stack depth required by a method. This is stored as an attribute in

the class file, and could thus be larger than the actual stack size needed at runtime.

The number of slots used by a method in Figure 24(a) includes those slots reserved for method

parameters. Figure 25 breaks out the number of formal parameters per method. This is the number

of parameters, not the number of slots those parameters would consume (i.e. longs and doubles

count as one). As expected, the average is low (1.0), with 90% of all methods having two or

fewer formals. Table XII shows the most common method signatures. The signatures are presented

with the method’s parameter type list first, followed by the method’s return type. So, for example,

a method that takes two integer parameters and returns a String would have a signature of

‘(int,int)java.lang.String’. We have also abstracted away any reference types that do not

appear in the standard Java libraries (i.e. user-defined classes). If a user-defined class is a parameter

or the return type of a method, we replace it with ‘user class’. One reason that ‘()void’ is so


Figure 19. Protection of class members: (a) number of protected members per class; (b) number of public methods per class; (c) average of class members with particular protection.

(a) MIN: 0; MAX: 487; AVG: 3.0; MODE: 0; MEDIAN: 1; STD DEV: 10; SAMPLES: 90500; TOTAL: 268810

(b) MIN: 0; MAX: 556; AVG: 7.5; MODE: 1; MEDIAN: 4; STD DEV: 13; SAMPLES: 90500; TOTAL: 683075

(c)
Protection  %
Package private  14.7
Private  6.7
Protected  22.2
Public  56.4


Figure 20. (a) Number of immediate subclasses per class; (b) total number of subclasses per class; (c) inheritance depth of a class.

(a) MIN: 0; MAX: 187; AVG: 0.4; MODE: 0; MEDIAN: 0; STD DEV: 2; SAMPLES: 90500; FAILED: 13724

(b) MIN: 0; MAX: 346; AVG: 0.6; MODE: 0; MEDIAN: 0; STD DEV: 5; SAMPLES: 90500; FAILED: 13724

(c) MIN: 1; MAX: 10; AVG: 2.1; MODE: 1; MEDIAN: 1; STD DEV: 1; SAMPLES: 90500; FAILED: 14383


Figure 21. (a) Number of immediate subinterfaces per interface; (b) total number of subinterfaces per interface; (c) interface extends depth of a class.

(a) MIN: 0; MAX: 69; AVG: 0.4; MODE: 0; MEDIAN: 0; STD DEV: 2; SAMPLES: 12188; FAILED: 1862

(b) MIN: 0; MAX: 105; AVG: 0.7; MODE: 0; MEDIAN: 0; STD DEV: 5; SAMPLES: 12188; FAILED: 1862

(c) MIN: 0; MAX: 16; AVG: 0.9; MODE: 0; MEDIAN: 0; STD DEV: 1; SAMPLES: 12188; FAILED: 1862


Table IX. Most common standard interfaces to be extended by application interfaces.

Interface  Count  %
user interface  3359  57.7
org.w3c.dom.html.HTMLElement  676  11.6
java.util.EventListener  362  6.2
java.io.Serializable  251  4.3
org.w3c.dom.Node  225  3.9
java.lang.Cloneable  118  2.0
org.omg.CORBA.Object  96  1.6
java.security.PrivateKey  43  0.7
org.w3c.dom.CharacterData  42  0.7
org.w3c.dom.events.EventTarget  39  0.7
java.security.PublicKey  39  0.7
org.omg.CORBA.portable.IDLEntity  38  0.7
org.omg.CORBA.IDLType  36  0.6
org.w3c.dom.Element  29  0.5
org.w3c.dom.Document  28  0.5
java.rmi.Remote  24  0.4
org.w3c.dom.css.CSSRule  23  0.4
java.security.Key  23  0.4
org.w3c.dom.events.Event  22  0.4
org.w3c.dom.DOMImplementation  22  0.4
org.w3c.dom.Text  21  0.4
org.xml.sax.XMLReader  20  0.3
org.omg.CORBA.IRObject  18  0.3
org.xml.sax.ContentHandler  16  0.3
java.lang.Comparable  16  0.3
javax.crypto.interfaces.DHKey  14  0.2
java.util.Map  12  0.2
java.sql.ResultSet  11  0.2
java.util.List  10  0.2
java.sql.Connection  9  0.2
java.lang.Runnable  9  0.2
org.w3c.dom.css.CSSValue  8  0.1
org.xml.sax.Locator  7  0.1
org.xml.sax.DTDHandler  7  0.1
java.util.Collection  7  0.1
java.sql.ResultSetMetaData  7  0.1
org.xml.sax.ext.LexicalHandler  6  0.1
org.omg.CORBA.DynAny  6  0.1
org.xml.sax.DocumentHandler  5  0.1
org.omg.CORBA.Policy  5  0.1
javax.xml.transform.SourceLocator  5  0.1
java.sql.PreparedStatement  5  0.1
org.w3c.dom.events.UIEvent  4  0.1
org.w3c.dom.events.DocumentEvent  4  0.1


Table X. Most common standard interfaces to be implemented by application classes.

Interface  Count  %
user interface  21 955  55.9
java.io.Serializable  3534  9.0
java.awt.event.ActionListener  2880  7.3
java.lang.Runnable  1447  3.7
java.lang.Cloneable  1009  2.6
org.omg.CORBA.portable.Streamable  793  2.0
java.awt.event.ItemListener  302  0.8
java.lang.Comparable  266  0.7
java.util.Iterator  262  0.7
java.util.Enumeration  216  0.6
java.util.Comparator  215  0.5
javax.swing.event.ChangeListener  211  0.5
java.awt.event.MouseListener  187  0.5
org.xml.sax.EntityResolver  173  0.4
java.security.PrivilegedAction  145  0.4
org.xml.sax.ErrorHandler  130  0.3
java.security.spec.AlgorithmParameterSpec  130  0.3
java.beans.PropertyChangeListener  114  0.3
java.awt.event.MouseMotionListener  113  0.3
org.xml.sax.ext.LexicalHandler  109  0.3
java.awt.event.KeyListener  109  0.3
org.xml.sax.ContentHandler  100  0.3
javax.swing.event.ListSelectionListener  99  0.3
java.io.Externalizable  99  0.3
java.security.spec.KeySpec  87  0.2
org.xml.sax.DocumentHandler  83  0.2
org.xml.sax.DTDHandler  82  0.2
java.awt.event.AdjustmentListener  81  0.2
javax.sql.DataSource  80  0.2
java.awt.event.WindowListener  80  0.2
java.awt.image.ImageObserver  76  0.2
java.awt.image.renderable.RenderedImageFactory  74  0.2
javax.naming.spi.ObjectFactory  72  0.2
java.sql.Connection  71  0.2
java.awt.event.FocusListener  71  0.2
org.w3c.dom.NodeList  70  0.2
org.xml.sax.AttributeList  59  0.2
javax.naming.Referenceable  58  0.1
java.io.FilenameFilter  55  0.1
org.xml.sax.Locator  52  0.1
java.util.Map$Entry  52  0.1
java.lang.reflect.InvocationHandler  52  0.1
javax.swing.event.DocumentListener  51  0.1
java.awt.event.ComponentListener  50  0.1
org.xml.sax.Attributes  48  0.1


Figure 22. Number of method overrides per class.

MIN: 0; MAX: 107; AVG: 0.7; MODE: 0; MEDIAN: 0; STD DEV: 2; SAMPLES: 90500; TOTAL: 59638; FAILED: 2767

common is that this is the signature of default constructors, especially that for java.lang.Object,

which must be called in the constructors of all classes that directly extend it.
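
The signature notation used in Table XII can be derived mechanically from the method descriptors stored in class files, which have the form '(II)Ljava/lang/String;'. The Python sketch below shows one way to do the conversion. The list of package prefixes treated as standard is an assumption made for this example (the paper does not enumerate one), and the function names are hypothetical.

```python
PRIMITIVES = {"B": "byte", "C": "char", "D": "double", "F": "float",
              "I": "int", "J": "long", "S": "short", "Z": "boolean",
              "V": "void"}

# Assumed set of standard-library prefixes; anything else becomes 'user class'.
STANDARD_PREFIXES = ("java.", "javax.", "org.omg.", "org.w3c.", "org.xml.")


def _parse_type(desc, i):
    """Parse one type starting at desc[i]; return (name, index after it)."""
    dims = 0
    while desc[i] == "[":          # each '[' adds one array dimension
        dims += 1
        i += 1
    if desc[i] == "L":             # reference type: Lsome/pkg/Name;
        end = desc.index(";", i)
        name = desc[i + 1:end].replace("/", ".")
        if not name.startswith(STANDARD_PREFIXES):
            name = "user class"
        i = end + 1
    else:                          # primitive type or void
        name = PRIMITIVES[desc[i]]
        i += 1
    return name + "[]" * dims, i


def signature(descriptor):
    """Convert a JVM method descriptor to the '(params)return' notation."""
    assert descriptor[0] == "("
    i, params = 1, []
    while descriptor[i] != ")":
        name, i = _parse_type(descriptor, i)
        params.append(name)
    ret, _ = _parse_type(descriptor, i + 1)
    return "(" + ",".join(params) + ")" + ret


examples = [signature(d) for d in
            ("()V", "(II)Ljava/lang/String;", "(Lcom/example/Foo;[B)V")]
```

Here `com/example/Foo` is a made-up application class, so its parameter is abstracted to 'user class'.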

6.3. CFGs

A method body can be converted into a CFG, where the nodes (the basic blocks) are straight-line

pieces of code. Control always enters the top of the basic block and exits at the bottom, either through

an explicit branch or by falling through to another block. There is an edge from basic block A to basic

block B if control can flow from A to B.
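
The definition above can be sketched with the classic leader-based construction. The Python example below splits a non-empty instruction list into basic blocks and adds an edge wherever control can flow from one block to another. The tuple encoding and the instruction names (if, goto, return) are assumptions for illustration, and exception edges are deliberately left out.

```python
def build_cfg(code):
    """Split `code` into basic blocks and return (blocks, edges).

    `code` is a non-empty list of tuples: ("goto", t) and ("if", t) branch
    to index t, ("return",) leaves the method, anything else falls through.
    blocks: list of (start, end) index ranges, end exclusive.
    edges:  set of (from_block, to_block) index pairs.
    """
    leaders = {0}
    for i, op in enumerate(code):
        if op[0] in ("goto", "if"):
            leaders.add(op[1])          # a branch target starts a block
        if op[0] in ("goto", "if", "return") and i + 1 < len(code):
            leaders.add(i + 1)          # so does whatever follows a branch
    starts = sorted(leaders)
    blocks = list(zip(starts, starts[1:] + [len(code)]))
    block_of = {start: n for n, (start, _) in enumerate(blocks)}
    edges = set()
    for n, (_, end) in enumerate(blocks):
        last = code[end - 1]
        if last[0] in ("goto", "if"):
            edges.add((n, block_of[last[1]]))   # explicit branch
        if last[0] not in ("goto", "return") and end < len(code):
            edges.add((n, block_of[end]))       # fall through
    return blocks, edges


code = [
    ("load", 1),    # 0
    ("if", 4),      # 1
    ("push", 1),    # 2
    ("goto", 5),    # 3
    ("push", 2),    # 4
    ("return",),    # 5
]
blocks, edges = build_cfg(code)
```

For this if-then-else shape the result is four blocks arranged as a diamond.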

Building CFGs for Java bytecode is not straightforward. A major complication is how to deal with

exception handling. Several instructions in the JVM can throw exceptions implicitly. This includes the

division instructions (which may throw a divide-by-zero exception), and the getfield, putfield,

and invokevirtual instructions (which may throw null-reference exceptions). These changes in

control flow can be represented by adding exception edges to the CFG, which connect a basic block

ending in an exception-throwing instruction to the CFG’s sink node. If every such instruction (and they are

very common in real code) is allowed to terminate a basic block, blocks become very small. Since some

analyses can safely ignore implicit exceptions, SandMark supports building the CFGs both with and

without implicit exception edges. The jsr and ret instructions used for Java’s finally clause

also cause problems. In general, a data flow analysis is necessary in order to correctly build CFGs

in the presence of complex jsr/ret combinations. SandMark currently does not support this and,

as a consequence, will sometimes introduce spurious edges out of blocks ending in ret instructions.

Since there are few such CFGs in our sample set this problem is unlikely to significantly affect our

data.

Figure 26 shows the number of basic blocks per method body (we make the distinction method

body to rule out methods with no instructions, such as native or abstract methods). We can see that

97% of all method bodies have fewer than 100 basic blocks. We can corroborate this information with

Figures 27(a) and 23(b), to see that the average basic block size is 2.0, and that 97% of all method

bodies have fewer than 200 instructions.
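
The averages quoted here can be cross-checked from the totals reported with Figures 23(b), 26, and 27. The short Python calculation below does this; the variable names are ours, and the numbers are the SAMPLES and TOTAL values printed with those figures.

```python
instructions = 26_597_868          # TOTAL in Figures 23(b) and 27
method_bodies = 801_117            # SAMPLES in Figures 23 and 26
blocks_with_edges = 13_416_883     # TOTAL in Figure 26, SAMPLES in Figure 27(a)
blocks_without_edges = 3_463_565   # SAMPLES in Figure 27(b)

# Average basic block size with implicit exception edges.
avg_block_size = instructions / blocks_with_edges
# Average basic block size without implicit exception edges.
avg_block_size_no_edges = instructions / blocks_without_edges
# Average number of basic blocks and instructions per method body.
avg_blocks_per_method = blocks_with_edges / method_bodies
avg_instructions_per_method = instructions / method_bodies

print(round(avg_block_size, 1), round(avg_block_size_no_edges, 1),
      round(avg_blocks_per_method, 1), round(avg_instructions_per_method, 1))
```

The rounded results agree with the AVG values given in the figures (2.0, 7.7, 16.7, and 33.2 respectively).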


Figure 23. Method sizes: (a) size in bytes; (b) number of instructions per method body.

(a) MIN: 1; MAX: 63019; AVG: 66.2; MODE: 5; MEDIAN: 17; STD DEV: 281; SAMPLES: 801117; TOTAL: 53057748

(b) MIN: 1; MAX: 42725; AVG: 33.2; MODE: 3; MEDIAN: 9; STD DEV: 156; SAMPLES: 801117; TOTAL: 26597868

As can be seen from Figure 27(a), the average number of instructions in a basic block is very small,

only 2.0, and 98% of all blocks have fewer than six instructions. Figure 28(a) shows that the average

out-degree of a basic block node is, predictably, low: only 1.2. Out-degrees higher than two can

only be achieved either when an instruction is inside an exception handler’s try block, or with the

JVM’s tableswitch and lookupswitch instructions. In the first case, edges are added from

each basic block inside a try block to the first basic block of the handler code. Therefore, if a

basic block is inside multiple nested try blocks its out-degree may be high. In the second case,


Figure 24. Local variables: (a) number of max locals per method body; (b) number of max stack weights per method body.

(a) MIN: 0; MAX: 157; AVG: 2.9; MODE: 1; MEDIAN: 2; STD DEV: 3; SAMPLES: 801117; TOTAL: 2297561

(b) MIN: 0; MAX: 262; AVG: 2.9; MODE: 2; MEDIAN: 2; STD DEV: 2; SAMPLES: 801117

the tableswitch and lookupswitch instructions are Java’s implementation of switch-statements,

which may have many possible branch targets. Higher in-degrees can occur when a try-catch block has many instructions inside it that could potentially trigger the exception. Each of these

instructions will end its block, and have an edge going from it to the handler block. Thus, the in-degree

of the handler block will become large.

Figure 27(b) shows the number of instructions per basic block when implicit exception edges have

not been generated. As can be seen, this increases the average number of instructions per block to 7.7.

A node x in a directed graph G with a single exit node dominates node y in G if every path from

the entry node to y must pass through x. The dominator set of a node y is the set of all nodes which

dominate y. Dominator information is used in code optimizations such as loop identification and code

motion. Figure 29 shows the number of dominator blocks per basic block.
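
Dominator sets can be computed with a simple iterative data-flow algorithm. The Python sketch below is one textbook formulation, offered as an illustration and not as a description of how SandMark computes them. It assumes every node is reachable from the entry node, and it treats a node as dominating itself.

```python
def dominators(succ, entry):
    """Return {node: set of nodes that dominate it}.

    `succ` maps each node to a list of its successor nodes.
    """
    nodes = set(succ) | {t for targets in succ.values() for t in targets}
    preds = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            preds[t].add(n)
    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            common = set.intersection(*incoming) if incoming else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# A diamond-shaped CFG: block 0 branches to 1 or 2, and both reach 3.
diamond = {0: [1, 2], 1: [3], 2: [3], 3: []}
dom = dominators(diamond, entry=0)
```

In the diamond, neither branch dominates the join block, so block 3 is dominated only by the entry block and by itself.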


Table XI. Most common slot types.

Register type  Count  %
int  614 910  16.2
java.lang.String  365 915  9.6
2 types  144 145  3.8
java.lang.Object  76 764  2.0
byte[]  50 658  1.3
long  49 903  1.3
java.lang.Throwable  38 046  1.0
double  25 541  0.6
3 types  23 426  0.6
java.lang.StringBuffer  21 716  0.6
java.lang.String[]  20 600  0.5
java.util.Iterator  16 036  0.4
float  15 595  0.4
java.lang.Class  15 129  0.4
java.util.Vector  14 795  0.4
int[]  14 604  0.4
java.lang.Exception  14 149  0.4
java.io.File  13 334  0.4
java.io.InputStream  11 686  0.3
java.util.List  11 615  0.3
java.lang.ClassNotFoundException  11 331  0.3
java.util.Enumeration  10 732  0.3
char[]  9534  0.3
java.lang.Object[]  9417  0.2

6.4. Subroutines and exception handlers

Java subroutines are implemented by the instructions jsr and ret. They are chiefly used to implement

the finally clause of an exception handler. This clause can be reached from multiple locations.

For example, a return instruction within the body of a try block will first jump to the finally clause before returning from the method. Similarly, before returning from within an exception handler,

the finally block must be executed. To avoid code duplication (inlining the finally block at every

location from which it could be called) the designers of the JVM added the jsr and ret instructions

to jump to and return from a block of code. This has caused much complication in the design of the

JVM verifier. See, for example, Stata et al. [7]. Figure 30 shows that 98% of methods have no more

than two exception handlers, and 98% of all methods have no subroutines. Figure 30(c) shows that

the average size of a subroutine is 7.5 instructions. The length of a subroutine was computed as the

number of instructions between a jsr’s target and its corresponding ret. Together, our data indicate

that jsr and ret could have been left out of the JVM’s instruction set without much code increase

from finally clauses being implemented by code duplication.


Figure 25. Number of formal parameters per method.

MIN: 0; MAX: 27; AVG: 1.0; MODE: 0; MEDIAN: 1; STD DEV: 1; SAMPLES: 874115; TOTAL: 861689

6.5. Interference graphs

An interference graph models the variables and live range interferences of a method. The live ranges

of a local variable are the locations in a method between where the variable is first assigned and where

it is last used. Since method parameters and the ‘this’ reference are in local variable slots, they

are considered to have their first assignment before the first instruction of the method. The graph has

one vertex per local variable and an edge between two vertices when the corresponding variables’

live ranges overlap (or interfere). As an example consider the sample code in Figure 31(a) and the

corresponding interference graph in Figure 31(b). Since the code has 5 variables, the graph has 5 nodes.

The graph has an edge v1 → v2 since variables v1 and v2 are live at the same time. An interference

graph is often used during the code generation pass of a compiler to perform register allocation.

Two variables with intersecting live ranges cannot be assigned to the same register. Figure 32 shows that

95% of the methods have nine or fewer nodes in their interference graphs. This means that methods

typically will need very few local variable slots. This analysis agrees with the data in Figure 24(a),

which show that methods declare their maximum number of slots to be small.
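
To make the construction concrete, the Python sketch below builds an interference graph from live ranges. It simplifies each live range to a single interval from first assignment to last use, treats intervals as inclusive at both ends (an assumption; a real allocator may treat the boundary differently), and uses made-up variables and ranges that are not those of Figure 31.

```python
from itertools import combinations


def interference_graph(live_ranges):
    """Return the set of undirected edges between interfering variables.

    `live_ranges` maps a variable name to (first_assignment, last_use),
    both given as instruction indices and both inclusive.
    """
    edges = set()
    for (u, (u_start, u_end)), (v, (v_start, v_end)) in combinations(
            sorted(live_ranges.items()), 2):
        if u_start <= v_end and v_start <= u_end:   # intervals overlap
            edges.add(frozenset((u, v)))
    return edges


# Hypothetical live ranges for a method with five local variables.
ranges = {"v1": (0, 4), "v2": (2, 6), "v3": (5, 8), "v4": (7, 9), "v5": (10, 12)}
graph = interference_graph(ranges)
```

The graph has one vertex per entry in `ranges`; here v1 and v2 interfere, while v5 interferes with nothing and could share a register or slot with any other variable.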

After examining these data, it appears that the designers of the Java instruction set were wise to make

the typical instruction use only 1 byte to refer to a local variable index. The wide prefix allows such an

instruction to use a 2-byte index, but as we can see this will almost never be necessary. Thus, had the

designers simply made all instructions use 2-byte indices, there would have been much wasted space

in the bytecode.

7. INSTRUCTION-LEVEL STATISTICS

In this section we present information regarding the frequency of individual instructions and patterns of

instructions. We also show the most common subexpressions and constant values found in the bytecode.


Table XII. Most common method signatures.

Method signature  Count  %
()void  120 997  13.8
(user class)void  57 762  6.6
()java.lang.String  53 047  6.1
()user class  44 098  5.0
(java.lang.String)void  43 810  5.0
()boolean  39 772  4.5
()int  35 064  4.0
(int)void  18 959  2.2
(boolean)void  11 461  1.3
(user class)user class  10 332  1.2
(user class,user class)void  9652  1.1
(java.lang.String)user class  7781  0.9
(java.lang.String)java.lang.String  7777  0.9
(user class)boolean  6880  0.8
(user class)java.lang.Object  6812  0.8
()java.lang.Object  6461  0.7
(java.lang.String)java.lang.Class  6258  0.7
(java.lang.String,java.lang.String)void  5561  0.6
(java.lang.Object)boolean  5373  0.6
(int)int  4776  0.5
(java.lang.Object)void  4697  0.5
(java.awt.event.ActionEvent)void  4479  0.5
(int)user class  4270  0.5
(int)boolean  4116  0.5
(java.lang.String[])void  4044  0.5
(java.lang.String)boolean  3933  0.4
(int,int)void  3726  0.4
(int)java.lang.String  3473  0.4
()byte[]  3380  0.4
()user class[]  3322  0.4
(user class,int)void  3251  0.4
()java.util.List  2970  0.3
(user class,java.lang.String)void  2821  0.3
(byte[])void  2697  0.3
()java.lang.String[]  2292  0.3
(user class,user class)user class  2289  0.3
(int,int)int  2023  0.2
(java.awt.event.MouseEvent)void  2008  0.2
(user class)int  1998  0.2
(java.lang.String)int  1993  0.2
(user class)java.lang.String  1951  0.2
()org.omg.CORBA.TypeCode  1941  0.2
(java.lang.String,user class)void  1900  0.2
(java.lang.Object)user class  1873  0.2


Figure 26. Number of basic blocks (in CFGs with implicit exception edges) per method body.

MIN: 1; MAX: 10699; AVG: 16.7; MODE: 2; MEDIAN: 4; STD DEV: 63; SAMPLES: 801117; TOTAL: 13416883

7.1. Instruction counts

There are 200 usable JVM instruction opcodes. Table XIII shows the frequency of each of those

bytecode instructions. The most frequently occurring instruction is aload_0, which is responsible for

pushing the local variable 0, the this reference of non-static methods. Even though this is the most

frequently occurring instruction it only has a frequency of 10%. The invokevirtual instruction,

which calls a non-static method, is also common, as are getfield, dup, and invokespecial, the

last two being used to implement Java’s new operator. These five instructions account for 33.8% of

all instructions. Our data indicate that the majority of the remaining instructions each occur with a

frequency of at most 1%, and that the jsr_w and goto_w instructions (used for long branches) do not

occur at all.

7.2. Instruction patterns

A k-gram is a contiguous substring of length k which can be comprised of letters, words, or, in our case,

opcodes. The k-gram is based on static analysis of the executable program. To compute the unique set

of k-grams for a method, we slide a window of length k over the static instruction sequence as it is laid

out in the class file.
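
The sliding-window computation can be sketched in a few lines of Python. The function name and the sample opcode sequence below are illustrative only. The sketch counts every occurrence of each window, and the keys of the result give the unique set of k-grams for the method.

```python
from collections import Counter


def kgrams(opcodes, k):
    """Count each window of k consecutive opcodes in one method body."""
    return Counter(tuple(opcodes[i:i + k])
                   for i in range(len(opcodes) - k + 1))


# A made-up opcode sequence in class-file order.
method = ["aload_0", "getfield", "aload_0", "getfield", "iadd", "ireturn"]
counts = kgrams(method, 2)
unique_2grams = set(counts)
```

A method with fewer than k instructions yields no k-grams at all.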

We computed data for k-grams where k = 2, 3, 4, which is shown in Tables XIV, XV, and XVI,

respectively. These tables show that as the value of k increases the percentages of the most frequently

occurring sequences decrease. For example, the most frequently occurring 2-gram, aload_0,getfield, has a frequency of only 4.7%. For 3- and 4-grams, the most frequently occurring

sequence has a frequency of less than 1%. This indicates that these sequences become quite unique for each individual

application.


Figure 27. Number of instructions per basic block in CFGs (a) with and (b) without implicit exception edges.

(a) MIN: 1; MAX: 209; AVG: 2.0; MODE: 1; MEDIAN: 1; STD DEV: 1; SAMPLES: 13416883; TOTAL: 26597868

(b) MIN: 1; MAX: 42725; AVG: 7.7; MODE: 3; MEDIAN: 3; STD DEV: 67; SAMPLES: 3463565; TOTAL: 26597868

7.3. Expressions

Figure 33 shows the size (number of nodes in the tree) and height (length of longest path from root

to leaf) of expression trees in our samples. As reported already by Knuth [1], expressions tend to be

small. We found that 61% of all expressions only have one node.

Expressions are constructed by performing a stack simulation over each method. For each instruction

that will produce a result on the stack the simulator determines which instructions may have put its

operands on the stack. This information is used to build up a dependency graph with instructions and

operands as nodes, and an edge from node a to node b if b is used by a. If the program contains certain


[Figure 28, panel (a): histogram of out-degree. MIN: 0; MAX: 536; AVG: 1.2; MODE: 1; MEDIAN: 1; STD DEV: 1; SAMPLES: 13416883.]

[Figure 28, panel (b): histogram of in-degree. MIN: 0; MAX: 1877; AVG: 1.2; MODE: 1; MEDIAN: 1; STD DEV: 2; SAMPLES: 13416883.]

Figure 28. CFGs: (a) out-degree of basic block nodes; (b) in-degree of basic block nodes.

types of loops, these graphs might have cycles, in which case they are discarded. The following code

segment is an example of such a loop:

    0: ICONST_2
    1: ICONST_3
    2: IADD
    3: DUP
    4: IFEQ <1>

In this example, the IADD instruction may become its own child. If the branch at offset 4 is taken,

then the result from the first iteration of the IADD will be used as the first operand in the second

iteration of the IADD.
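The stack simulation can be illustrated for the simple case of straight-line code. The sketch below is not SandMark's implementation: the names ExprNode, StackSim, and simulate are hypothetical, only a toy instruction subset (ICONST, ILOAD, IADD, IMUL) is handled, and branches, which are what give rise to the cycles described above, are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical expression-tree node (illustration only).
final class ExprNode {
    final String op;
    final List<ExprNode> operands;

    ExprNode(String op, List<ExprNode> operands) {
        this.op = op;
        this.operands = operands;
    }

    // Size: number of nodes in the tree.
    int size() {
        int n = 1;
        for (ExprNode c : operands) n += c.size();
        return n;
    }

    // Height: length of the longest path from root to leaf (a leaf has height 0).
    int height() {
        int h = -1;
        for (ExprNode c : operands) h = Math.max(h, c.height());
        return h + 1;
    }
}

final class StackSim {
    // Simulates the operand stack over straight-line code and returns the
    // expression tree left on top of the stack.
    static ExprNode simulate(String... code) {
        Deque<ExprNode> stack = new ArrayDeque<>();
        for (String insn : code) {
            if (insn.startsWith("ICONST_") || insn.startsWith("ILOAD_")) {
                // Pushes a value without consuming operands: a leaf.
                stack.push(new ExprNode(insn, Collections.<ExprNode>emptyList()));
            } else if (insn.equals("IADD") || insn.equals("IMUL")) {
                // Consumes two operands; the instructions that produced
                // them become the children of this node.
                ExprNode right = stack.pop();
                ExprNode left = stack.pop();
                stack.push(new ExprNode(insn, Arrays.asList(left, right)));
            } else {
                throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        return stack.peek();
    }
}
```

For x+y*2, with x and y assumed to be in local slots 1 and 2, the sequence ILOAD_1, ILOAD_2, ICONST_2, IMUL, IADD yields a tree of size 5 and height 2.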


[Figure 29: histogram of dominator blocks per basic block. MIN: 0; MAX: 10698; AVG: 105.3; MODE: 1; MEDIAN: 11; STD DEV: 437; SAMPLES: 13416883.]

Figure 29. Number of dominator blocks per basic block (in CFGs with implicit exception edges).

In Table XVII we show the most common subexpressions found in our sample method bodies.

Table XVIII explains the abbreviations used. L, for example, represents a local variable, d a double

constant, (α+α) addition, etc. Since we are counting subexpressions, the same piece of an expression

will be counted more than once. For example, if x and y are local variables, then the expression x+y*2
will generate subexpressions L, L, i, (L*i), and (L+(L*i)), each of which will increase the count

of its respective expression class.

To compute subexpressions we convert each expression tree into a string representation, classify

each subexpression into equivalence classes according to Table XVIII, and count each subexpression

individually.
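The counting procedure can be sketched as follows. This is an illustration under assumptions rather than the code used in the study: the Expr and SubexprCounter classes are hypothetical, leaves are assumed to be already classified with the symbols of Table XVIII, and only binary operators are modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical expression tree: a leaf carries its class symbol
// ("L", "i", ...); an inner node carries a binary operator.
final class Expr {
    final String symbol;     // leaf class or operator
    final Expr left, right;  // both null for leaves

    Expr(String symbol) {
        this(symbol, null, null);
    }

    Expr(String symbol, Expr left, Expr right) {
        this.symbol = symbol;
        this.left = left;
        this.right = right;
    }
}

final class SubexprCounter {
    // Maps each expression class string to the number of times it was seen.
    final Map<String, Integer> counts = new LinkedHashMap<>();

    // Returns the class string of e and increments the count of every
    // subexpression of e, including e itself.
    String count(Expr e) {
        String s = (e.left == null)
                ? e.symbol
                : "(" + count(e.left) + e.symbol + count(e.right) + ")";
        counts.merge(s, 1, Integer::sum);
        return s;
    }
}
```

Applied to the tree for x+y*2, the counter records L twice and i, (L*i), and (L+(L*i)) once each, matching the example in the text.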

What we see from Table XVII is that, unsurprisingly, local variable references, method calls,

integer constants, and field references make up the bulk of expressions. Somewhat more surprising

is that the expression ((Class)M()) is very frequent. Most likely this is the result of references

to ‘generic’ methods (particularly Java library functions such as java.util.Vector.get())

returning java.lang.Objects which then have to be cast into a more specific type.

7.4. Constant values

Tables XIX, XX, and XXI show the most common literal constants found in the bytecode. Constants

can occur in three different ways: as references to entries in the constant pool (instructions ldc, ldc_w,
and ldc2_w), as arguments to bytecode instructions (bipush n, sipush n, and iinc n,c), or
embedded in special instructions (iconst_n, etc.).
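To make the three encodings concrete for int values, the sketch below picks the shortest instruction form able to push a given constant. The class and method names are hypothetical, and the printed operand of ldc is the value itself for readability; in a real class file the operand is a constant-pool index (with ldc_w needed once that index exceeds 255).

```java
final class IntConstEncoding {
    // Returns a mnemonic for the shortest instruction form that pushes v.
    static String encode(int v) {
        if (v >= -1 && v <= 5) {
            // Value embedded in the opcode itself: iconst_m1, iconst_0 .. iconst_5.
            return v == -1 ? "iconst_m1" : "iconst_" + v;
        }
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            // One-byte signed immediate operand.
            return "bipush " + v;
        }
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            // Two-byte signed immediate operand.
            return "sipush " + v;
        }
        // Anything larger is loaded from the constant pool.
        return "ldc " + v;
    }
}
```

Under this scheme 0 becomes iconst_0, 100 becomes bipush 100, 1000 becomes sipush 1000, and 100000 is loaded with ldc.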

Figure 34 shows the distribution of integer constant values. It is interesting to note that 63% of

all literal integers are 0, powers of two, or powers of two plus/minus one. This has implications for,


[Figure 30, panel (a): histogram of exception handlers per method body. MIN: 0; MAX: 105; AVG: 0.1; MODE: 0; MEDIAN: 0; STD DEV: 0; SAMPLES: 801117; TOTAL: 116627.]

[Figure 30, panel (b): histogram of subroutines per method body. MIN: 0; MAX: 29; AVG: 0.0; MODE: 0; MEDIAN: 0; STD DEV: 0; SAMPLES: 801117; TOTAL: 9501; FAILED: 2.]

[Figure 30, panel (c): histogram of instructions per subroutine. MIN: 2; MAX: 125; AVG: 7.5; MODE: 4; MEDIAN: 4; STD DEV: 5; SAMPLES: 9501; TOTAL: 71634.]

Figure 30. (a) Number of exception handlers per method body; (b) number of subroutines per method body; (c) number of instructions per subroutine.


    v1 = a × a
    v2 = a × b
    v3 = 2 × v2
    v4 = v1 + v2
    v5 = b × v3
(a)

[Panel (b): interference graph over the nodes v1, v2, v3, v4, and v5.]

Figure 31. (a) Sample code and (b) corresponding interference graph.

[Figure 32, panel (a): histogram of interference graph nodes per method body. MIN: 0; MAX: 342; AVG: 3.1; MODE: 1; MEDIAN: 2; STD DEV: 4; SAMPLES: 801117; TOTAL: 2516760.]

[Figure 32, panel (b): histogram of interference graph edges per method body. MIN: 0; MAX: 5694; AVG: 6.2; MODE: 0; MEDIAN: 1; STD DEV: 42; SAMPLES: 801117; TOTAL: 4982823.]

Figure 32. (a) Number of interference graph nodes per method body; (b) number of interference graph edges per method body.


Table XIII. Instruction frequencies.

Opcode            Count        %       Opcode          Count      %
aload_0           2 672 134    10.0    aaload          108 016    0.4
invokevirtual     2 360 924    8.9     anewarray       106 780    0.4
dup               1 521 855    5.7     putstatic       105 900    0.4
getfield          1 447 792    5.4     astore_1        99 436     0.4
invokespecial     1 003 439    3.8     isub            93 852     0.4
ldc               936 890      3.5     if_icmplt       89 901     0.3
aload_1           909 356      3.4     if_icmpne       89 452     0.3
aload             876 138      3.3     iconst_3        85 851     0.3
bipush            865 346      3.3     arraylength     81 083     0.3
new               665 727      2.5     iaload          70 687     0.3
iconst_0          634 481      2.4     iconst_m1       67 600     0.3
iload             601 808      2.3     ldc2_w          66 717     0.3
putfield          552 241      2.1     iconst_4        64 544     0.2
goto              507 322      1.9     istore_3        58 529     0.2
iconst_1          495 114      1.9     istore_2        57 750     0.2
aload_2           494 004      1.9     iand            55 782     0.2
invokestatic      457 014      1.7     instanceof      50 049     0.2
getstatic         438 851      1.6     if_acmpne       48 379     0.2
return            433 081      1.6     if_icmpeq       47 866     0.2
astore            395 436      1.5     newarray        44 390     0.2
sipush            383 115      1.4     iconst_5        39 857     0.1
areturn           351 978      1.3     baload          39 482     0.1
aastore           332 112      1.2     castore         38 065     0.1
aload_3           314 398      1.2     istore_1        37 068     0.1
invokeinterface   300 563      1.1     if_icmpge       35 921     0.1
ifeq              286 898      1.1     sastore         30 690     0.1
iastore           285 979      1.1     ixor            29 811     0.1
ldc_w             281 190      1.1     if_icmple       28 212     0.1
pop               270 894      1.0     imul            27 174     0.1
istore            264 341      1.0     iload_0         26 837     0.1
ireturn           259 627      1.0     dload           26 640     0.1
iload_2           200 600      0.8     lastore         23 738     0.1
iload_1           197 241      0.7     ifle            23 025     0.1
checkcast         193 243      0.7     monitorexit     22 023     0.1
aconst_null       178 499      0.7     jsr             20 074     0.1
iload_3           172 820      0.6     lconst_0        19 617     0.1
bastore           171 902      0.6     nop             18 136     0.1
iadd              171 637      0.6     lload           17 515     0.1
ifne              167 878      0.6     ifge            17 494     0.1
iconst_2          163 348      0.6     i2b             17 432     0.1
athrow            151 515      0.6     ishl            17 233     0.1
astore_2          144 741      0.5     fload           16 966     0.1
iinc              132 890      0.5     ior             15 500     0.1
astore_3          121 477      0.5     ishr            15 363     0.1
ifnull            121 318      0.5     lcmp            15 033     0.1
ifnonnull         110 290      0.4     dstore          14 261     0.1


Table XIII. Continued.

Opcode          Count     %      Opcode           Count   %      Opcode      Count   %
dup_x1          14 202    0.1    fconst_0         4316    0.0    fneg        496     0.0
dmul            14 077    0.1    f2d              4086    0.0    pop2        378     0.0
idiv            13 833    0.1    laload           3781    0.0    fstore_1    374     0.0
if_icmpgt       12 477    0.0    dload_3          3538    0.0    d2l         273     0.0
iflt            11 944    0.0    lconst_1         3515    0.0    dup2_x1     263     0.0
caload          11 871    0.0    dload_2          3286    0.0    lneg        208     0.0
if_acmpeq       11 595    0.0    fload_1          3261    0.0    l2f         187     0.0
dastore         11 325    0.0    lsub             3212    0.0    dstore_0    168     0.0
astore_0        10 400    0.0    fload_2          3130    0.0    dup2_x2     164     0.0
tableswitch     10 197    0.0    dcmpg            3040    0.0    lstore_0    159     0.0
land            10 047    0.0    dup_x2           2984    0.0    drem        55      0.0
monitorenter    9961      0.0    fdiv             2930    0.0    f2l         42      0.0
daload          9782      0.0    saload           2867    0.0    fstore_0    20      0.0
fastore         9708      0.0    ineg             2577    0.0    frem        12      0.0
ret             9670      0.0    multianewarray   2500    0.0    jsr_w       0       0.0
dconst_0        9295      0.0    d2f              2402    0.0    goto_w      0       0.0
i2l             8832      0.0    freturn          2360    0.0
fstore          8765      0.0    fload_3          2357    0.0
lookupswitch    8738      0.0    lshl             2195    0.0
lload_1         8723      0.0    fconst_1         2122    0.0
faload          8634      0.0    dload_0          2093    0.0
iushr           8468      0.0    lload_0          1982    0.0
dadd            8407      0.0    fcmpl            1917    0.0
i2d             8403      0.0    istore_0         1787    0.0
ifgt            7976      0.0    lstore_3         1727    0.0
fmul            7674      0.0    lmul             1664    0.0
lstore          7512      0.0    lor              1520    0.0
dup2            7332      0.0    lstore_2         1475    0.0
lload_2         7125      0.0    f2i              1391    0.0
ddiv            6644      0.0    lshr             1386    0.0
lload_3         6514      0.0    lstore_1         1352    0.0
dsub            5882      0.0    fcmpg            1243    0.0
i2c             5839      0.0    dneg             1230    0.0
l2i             5680      0.0    dstore_3         1215    0.0
dload_1         5548      0.0    ldiv             1194    0.0
fadd            5515      0.0    lxor             1146    0.0
irem            5508      0.0    dstore_2         1085    0.0
dreturn         5159      0.0    lushr            993     0.0
dconst_1        5146      0.0    dstore_1         908     0.0
dcmpl           5065      0.0    l2d              878     0.0
i2f             5003      0.0    swap             849     0.0
lreturn         4935      0.0    fconst_2         839     0.0
ladd            4621      0.0    fstore_3         695     0.0
fsub            4463      0.0    fstore_2         650     0.0
i2s             4338      0.0    lrem             539     0.0
d2i             4331      0.0    fload_0          510     0.0


Table XIV. Most common 2-grams.

Opcode                         Count        %
aload_0,getfield               1 219 837    4.7
new,dup                        664 718      2.6
ldc,invokevirtual              353 412      1.4
invokevirtual,invokevirtual    332 487      1.3
dup,bipush                     330 887      1.3
putfield,aload_0               311 038      1.2
iastore,dup                    250 744      1.0
invokevirtual,aload_0          235 924      0.9
dup,sipush                     226 520      0.9
aload_1,invokevirtual          223 958      0.9
aload,invokevirtual            222 692      0.9
getfield,invokevirtual         219 107      0.8
aload_0,aload_1                214 369      0.8
aastore,dup                    208 247      0.8
dup,invokespecial              202 840      0.8
aload_0,invokevirtual          200 872      0.8
invokevirtual,pop              193 105      0.7
aload_0,invokespecial          159 742      0.6
astore,aload                   146 309      0.6
bastore,dup                    141 779      0.5
ldc,aastore                    133 994      0.5
getfield,aload_0               129 300      0.5
invokespecial,aload_0          122 168      0.5
ldc,invokespecial              120 935      0.5
dup,ldc                        116 043      0.4
invokespecial,athrow           115 994      0.4
aload_2,invokevirtual          115 394      0.4
goto,aload_0                   115 340      0.4
putfield,return                113 044      0.4
dup,iconst_0                   109 473      0.4
invokevirtual,ldc              109 060      0.4
invokevirtual,return           106 765      0.4
invokevirtual,ifeq             103 093      0.4
bipush,bastore                 102 969      0.4
invokevirtual,astore           100 355      0.4
ifeq,aload_0                   99 667       0.4
bipush,bipush                  98 715       0.4
ldc_w,iastore                  98 376       0.4
iconst_0,ireturn               98 199       0.4
invokevirtual,aload            93 325       0.4
aload_0,new                    90 040       0.3
anewarray,dup                  81 992       0.3
dup,aload_0                    80 579       0.3
aload_0,aload_0                80 329       0.3
aload,aload                    78 718       0.3


Table XV. Most common 3-grams.

Opcode                                       Count      %
new,dup,invokespecial                        202 836    0.8
aload_0,getfield,invokevirtual               194 765    0.8
iastore,dup,bipush                           132 759    0.5
invokevirtual,aload_0,getfield               125 019    0.5
new,dup,ldc                                  115 036    0.5
aload_0,getfield,aload_0                     111 950    0.4
getfield,aload_0,getfield                    111 002    0.4
iastore,dup,sipush                           102 667    0.4
bipush,bastore,dup                           100 197    0.4
invokevirtual,ldc,invokevirtual              98 964     0.4
ldc_w,iastore,dup                            97 826     0.4
dup,ldc,invokespecial                        91 303     0.4
aload_0,new,dup                              90 029     0.4
dup,bipush,bipush                            83 402     0.3
ldc,aastore,dup                              82 970     0.3
anewarray,dup,iconst_0                       81 984     0.3
aastore,dup,bipush                           80 740     0.3
new,dup,aload_0                              80 524     0.3
invokevirtual,invokevirtual,invokevirtual    80 161     0.3
invokespecial,ldc,invokevirtual              69 626     0.3
aload_0,getfield,aload_1                     68 922     0.3
bastore,dup,sipush                           67 634     0.3
aload_0,invokespecial,aload_0                66 723     0.3
dup,sipush,bipush                            64 736     0.3
aload_0,aload_1,putfield                     60 661     0.2
bastore,dup,bipush                           60 580     0.2
goto,aload_0,getfield                        60 205     0.2
aload_0,aload_0,getfield                     58 350     0.2
dup,invokespecial,ldc                        57 764     0.2
dup,sipush,ldc_w                             57 139     0.2
dup,bipush,ldc_w                             56 309     0.2
aastore,dup,iconst_1                         56 004     0.2
putfield,aload_0,getfield                    55 324     0.2
aastore,aastore,dup                          55 149     0.2
aload_0,getfield,areturn                     55 073     0.2
ldc,invokevirtual,invokevirtual              53 465     0.2
new,dup,aload_1                              52 185     0.2
ldc,invokevirtual,aload_0                    51 342     0.2
invokespecial,putfield,aload_0               50 827     0.2
sipush,bipush,bastore                        50 642     0.2
dup,bipush,ldc                               50 439     0.2
dup,iconst_0,ldc                             49 992     0.2
aload_0,getfield,getfield                    49 974     0.2
iconst_0,ldc,aastore                         49 056     0.2
sipush,ldc_w,iastore                         48 252     0.2


Table XVI. Most common 4-grams.

Opcode                                           Count     %
aload_0,getfield,aload_0,getfield                95 199    0.4
new,dup,ldc,invokespecial                        91 302    0.4
new,dup,invokespecial,ldc                        57 764    0.2
dup,invokespecial,ldc,invokevirtual              57 239    0.2
bipush,bastore,dup,sipush                        50 663    0.2
dup,sipush,bipush,bastore                        50 642    0.2
bastore,dup,sipush,bipush                        50 642    0.2
sipush,bipush,bastore,dup                        50 392    0.2
anewarray,dup,iconst_0,ldc                       48 862    0.2
dup,iconst_0,ldc,aastore                         48 697    0.2
iastore,dup,sipush,ldc_w                         48 252    0.2
dup,sipush,ldc_w,iastore                         48 252    0.2
ldc_w,iastore,dup,sipush                         48 198    0.2
sipush,ldc_w,iastore,dup                         47 875    0.2
iastore,dup,bipush,ldc_w                         47 528    0.2
dup,bipush,ldc_w,iastore                         47 528    0.2
ldc_w,iastore,dup,bipush                         47 520    0.2
bipush,ldc_w,iastore,dup                         47 384    0.2
dup,bipush,bipush,bastore                        44 682    0.2
bastore,dup,bipush,bipush                        44 680    0.2
bipush,bastore,dup,bipush                        44 674    0.2
bipush,bipush,bastore,dup                        44 209    0.2
aload_0,new,dup,invokespecial                    43 141    0.2
new,dup,new,dup                                  42 678    0.2
invokevirtual,ldc,invokevirtual,invokevirtual    42 594    0.2
ldc,aastore,aastore,dup                          41 430    0.2
aastore,aastore,dup,bipush                       40 443    0.2
dup,ldc,invokespecial,athrow                     40 441    0.2
new,dup,aload_0,getfield                         36 325    0.1
new,dup,invokespecial,putfield                   34 800    0.1
ldc,aastore,dup,iconst_1                         34 705    0.1
iconst_0,ldc,aastore,dup                         34 705    0.1
aastore,dup,iconst_1,ldc                         34 585    0.1
putfield,aload_0,new,dup                         34 499    0.1
dup,iconst_1,ldc,aastore                         34 191    0.1
invokevirtual,aload_0,getfield,invokevirtual     34 185    0.1
aload_0,iconst_0,putfield,aload_0                33 108    0.1
aload_0,aload_1,putfield,return                  32 147    0.1
aload_0,aconst_null,putfield,aload_0             31 719    0.1
aload_0,getfield,aload_1,invokevirtual           31 472    0.1
iconst_2,anewarray,dup,iconst_0                  30 710    0.1
ldc,invokevirtual,aload_0,getfield               30 470    0.1
putfield,aload_0,iconst_0,putfield               27 739    0.1
iastore,dup,bipush,ldc                           26 735    0.1
dup,bipush,ldc,iastore                           26 735    0.1


[Figure 33, panel (a): histogram of expression tree height. MIN: 0; MAX: 105; AVG: 0.8; MODE: 0; MEDIAN: 0; STD DEV: 1; SAMPLES: 19285712.]

[Figure 33, panel (b): histogram of expression tree size. MIN: 1; MAX: 1545; AVG: 3.6; MODE: 1; MEDIAN: 0; STD DEV: 10; SAMPLES: 19285712.]

Figure 33. (a) Height and (b) size of expression trees.

for example, software watermarking algorithms such as that by Cousot and Cousot [11], which hides a

watermark in unusual constants. Figure 34 tells us that in real programs most constants are small (93%

are less than 1000) or very close to powers of two, and hence hiding a mark in unusual constants is

likely to be unstealthy.
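The category "0, powers of two, or powers of two plus/minus one" can be tested with a standard bit trick. The following is a minimal sketch with hypothetical names; it assumes that only non-negative values fall into the category, which the text does not state explicitly.

```java
final class ConstClass {
    // A positive value is a power of two exactly when it has a single bit set.
    static boolean isPowerOfTwo(long v) {
        return v > 0 && (v & (v - 1)) == 0;
    }

    // True if v is 0, a power of two, or a power of two plus/minus one.
    static boolean nearPowerOfTwo(long v) {
        // Long.MAX_VALUE is 2^63 - 1, but 2^63 itself is not representable,
        // so it is handled explicitly rather than through v + 1.
        if (v == 0 || v == Long.MAX_VALUE) {
            return true;
        }
        return isPowerOfTwo(v) || isPowerOfTwo(v - 1) || isPowerOfTwo(v + 1);
    }
}
```

Under this definition 255, 256, and 257 are all "close to a power of two", whereas 10 and 1000 are not.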

7.5. Method calls

Table XXII reports the most frequently called Java library methods. To collect these data, we looked at

every INVOKE instruction to see what method it named. No attempt at method resolution was made.

Figure 35 measures the size of receiver sets of method calls. That is, for a virtual method invocation

o.M() we count the number of methods M() that might potentially be called. This depends on the


Table XVII. Most common subexpressions.

Expression                Count        %       Expression                Count   %
L                         6 574 522    34.9    L.F.F.F                   5087    0.0
M()                       4 119 506    21.9    (L.F+L)                   4607    0.0
i                         2 967 193    15.7    (L.F&i)                   4572    0.0
L.F                       1 366 327    7.3     (M()+i)                   4511    0.0
"                         1 039 672    5.5     (L.F+L.F)                 4326    0.0
N                         665 727      3.5     (L&l)                     4149    0.0
S                         438 851      2.3     ((L+M())+L.F[])           4128    0.0
A                         153 670      0.8     (M()+L)                   3997    0.0
L[]                       121 966      0.6     (L.length-i)              3994    0.0
((Class)M())              115 363      0.6     ((L>>i)&i)                3917    0.0
null                      95 366       0.5     (i*L)                     3792    0.0
L.F[]                     83 259       0.4     ((L&l)<>l)                3734    0.0
l                         67 088       0.4     (L/i)                     3689    0.0
L.F.F                     54 078       0.3     (L.F>>i)                  3654    0.0
L.length                  51 938       0.3     (((L+M())+L.F[])+i)       3392    0.0
((Class)L)                49 937       0.3     L.F[].F                   3367    0.0
(L+i)                     41 182       0.2     S[][]                     3351    0.0
(L instanceof Class)      39 081       0.2     (L.F instanceof Class)    3342    0.0
d                         37 202       0.2     ((L>>>i)&i)               3162    0.0
S[]                       29 776       0.2     (L+L.F)                   3051    0.0
(L+L)                     24 534       0.1     ((Class)L[])              2995    0.0
(L.F+i)                   24 296       0.1     (L.F-L)                   2977    0.0
L.F.length                23 898       0.1     (S[]&i)                   2898    0.0
(L-i)                     19 852       0.1     (L^L)                     2818    0.0
f                         17 746       0.1     (L.F[]&i)                 2773    0.0
((Class)S)                14 614       0.1     ((long)L)                 2748    0.0
(L-L)                     13 495       0.1     ((byte)L)                 2699    0.0
(L&i)                     13 436       0.1     (L.F*L.F)                 2690    0.0
(L.F-i)                   10 759       0.1     S.length                  2626    0.0
(L>>i)                    9050         0.0     (l&L)                     2618    0.0
(L[]&i)                   8285         0.0     ((l&L)<>l)                2612    0.0
((Class)L.F)              7715         0.0     ((double)L.F)             2509    0.0
M().F                     7518         0.0     ((L[]&i)<<i)              2473    0.0
(L+M())                   7451         0.0     L.F.F[]                   2437    0.0
(M()-i)                   7181         0.0     (L-L.F)                   2423    0.0
(L>>>i)                   7181         0.0     -(L)                      2423    0.0
(M() instanceof Class)    6305         0.0     (L<>l)                    2322    0.0
(L<<i)                    6013         0.0     (L&L)                     2285    0.0
(L*i)                     5818         0.0     ((L.F>>i)&i)              2275    0.0
L[][]                     5757         0.0     S.F                       2214    0.0
(L.F-L.F)                 5509         0.0     ((char)L)                 2201    0.0
((double)L)               5300         0.0     (S+i)                     2132    0.0
(L*L)                     5234         0.0     ((float)L)                2129    0.0
L.F[][]                   5230         0.0     ((int)M())                2081    0.0


Table XVIII. Abbreviations used in Table XVII.

null      ≡ ACONST_NULL           ((τ)α)           ≡ typecast, τ is a primitive type or Class
-(α)      ≡ negation              A                ≡ create new array
(α+α)     ≡ addition              S                ≡ static field
(α-α)     ≡ subtraction           F                ≡ non-static field
(α*α)     ≡ mult                  M()              ≡ method call
(α/α)     ≡ div                   (α instanceof κ) ≡ instanceof
(α%α)     ≡ mod/rem               "                ≡ string constant
(α&α)     ≡ and                   f                ≡ float constant
(α|α)     ≡ or                    d                ≡ double constant
(α^α)     ≡ xor                   l                ≡ long constant
(α<<α)    ≡ left shift            N                ≡ NEW
(α>>α)    ≡ signed right shift    (α<α)            ≡ DCMPL or FCMPL
(α>>>α)   ≡ IUSHR or LUSHR        (α>α)            ≡ DCMPG or FCMPG
α[]       ≡ array element         (α<>α)           ≡ LCMP
α.length  ≡ ARRAYLENGTH           L                ≡ load local variable
i         ≡ int constant

static type of o, and the number of methods in type(o)’s subclasses that override o’s M(). A static

class hierarchy analysis [12] is used to compute the receiver set.

The size of the receiver set has implications for, among other things, code optimization. A virtual

method call that has only one member in its receiver set can be replaced with a direct call. Furthermore,

if, for example, o.M()’s receiver set is {Class1.M(), Class2.M()}, then to expand o.M()
inline, the code if o instanceof Class1 then Class1.M() else Class2.M() has

to be generated. The larger the receiver set, the more type tests will have to be inserted.

To compute receiver sets for an INVOKEVIRTUAL instruction, we first resolve the method

reference. We then gather all of the subclasses of the resolved method’s parent class (including itself)

and for each one look to see whether it contains a non-abstract method with the same name and

signature as the resolved method. If so, we check to see whether the resolved method is accessible

from the given subclass. If this is true, then the INVOKEVIRTUAL instruction could possibly execute

the subclass’ method, and it is added to the receiver set for the INVOKEVIRTUAL instruction.

For an INVOKEINTERFACE instruction, we perform the same test but we look instead at

all implementors of the resolved method’s parent interface. This set will contain all classes

that directly implement the interface, as well as subclasses of those classes, and classes that

implement any subinterfaces of the interface (i.e. anything that could be cast to the interface type).

The INVOKESPECIAL and INVOKESTATIC instructions do not use dynamic method invocation;

the method they will call can always be determined statically. Thus, they all have receiver sets of

size 1.

Since we count only method bodies in the receiver sets, it is possible to have receiver sets of size 0.

This can occur if an abstract class has no subclasses to implement its abstract methods, yet code is

written to call its abstract methods with future subclasses in mind. Similarly, an INVOKEINTERFACE
call may have no receivers if no classes implement the given interface.
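The INVOKEVIRTUAL case of this procedure can be sketched over a toy class hierarchy. Everything named below (Hierarchy, addClass, receiverSet) is hypothetical and stands in for information that a real analysis would read from class files; the accessibility check and the INVOKEINTERFACE case described above are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy class hierarchy for illustration only.
final class Hierarchy {
    private final Map<String, List<String>> children = new HashMap<>();
    private final Map<String, Set<String>> concreteMethods = new HashMap<>();

    // Registers a class, its superclass (null for a root), and the
    // non-abstract methods it declares (name and signature as one string).
    void addClass(String name, String superName, String... methods) {
        children.computeIfAbsent(name, k -> new ArrayList<>());
        if (superName != null) {
            children.computeIfAbsent(superName, k -> new ArrayList<>()).add(name);
        }
        concreteMethods.put(name, new HashSet<>(Arrays.asList(methods)));
    }

    // Receiver set of a virtual call to `method` resolved in `declaringClass`:
    // every class at or below declaringClass that supplies a method body.
    Set<String> receiverSet(String declaringClass, String method) {
        Set<String> result = new TreeSet<>();
        collect(declaringClass, method, result);
        return result;
    }

    private void collect(String cls, String method, Set<String> out) {
        Set<String> declared = concreteMethods.get(cls);
        if (declared != null && declared.contains(method)) {
            out.add(cls + "." + method);
        }
        for (String sub : children.getOrDefault(cls, Collections.<String>emptyList())) {
            collect(sub, method, out);
        }
    }
}
```

With an abstract class Shape whose subclasses Circle and Square each define area(), a call resolved in Shape has a receiver set of size 2, a call resolved in Square has size 1 (a candidate for replacement by a direct call), and a call to an abstract method that no subclass implements has size 0.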


Table XIX. Common integer constants.

Most common int constants        Most common long constants

Value    Count      %            Value                   Count     %
0        634 484    20.5         0                       19 617    29.2
1        611 382    19.7         1                       3515      5.2
2        165 656    5.3          −1                      2320      3.5
3        86 253     2.8          1000                    1114      1.7
−1       78 187     2.5          287948901175001088      740       1.1
4        65 209     2.1          2                       722       1.1
8        45 619     1.5          255                     669       1.0
5        40 047     1.3          3                       387       0.6
10       31 762     1.0          100                     347       0.5
255      31 249     1.0          5                       344       0.5
6        29 798     1.0          8388608                 343       0.5
7        28 356     0.9          4294967295              331       0.5
9        25 497     0.8          10                      323       0.5
16       24 931     0.8          7                       304       0.5
32       19 401     0.6          71776119061217280       270       0.4
12       17 889     0.6          4                       269       0.4
13       17 228     0.6          60000                   217       0.3
11       15 763     0.5          9                       199       0.3
15       15 008     0.5          541165879422            196       0.3
14       13 733     0.4          9223372036854775807     190       0.3
24       13 607     0.4          64                      182       0.3
20       10 844     0.3          9007199254740992        170       0.3
48       9759       0.3          8                       167       0.2
17       9513       0.3          -9223372036854775808    160       0.2
63       8963       0.3          500                     159       0.2
46       8540       0.3          60                      156       0.2
47       8214       0.3          36028797018963968       148       0.2
18       8115       0.3          2147483647              144       0.2
34       8029       0.3          67108864                139       0.2
31       7681       0.2          3600000                 134       0.2
64       7602       0.2          17179869184             132       0.2
40       7187       0.2          144115188075855872      132       0.2
21       7044       0.2          140737488355328         130       0.2
100      6984       0.2          1024                    129       0.2
45       6970       0.2          10000                   125       0.2
23       6860       0.2          137438953504            123       0.2
19       6855       0.2          43980465111040          122       0.2
41       6631       0.2          1099511627776           122       0.2
30       6621       0.2          562949953421312         118       0.2
58       6551       0.2          17592186044416          118       0.2
128      6547       0.2          33554432                117       0.2
22       6510       0.2          268435456               117       0.2
25       6441       0.2          16384                   109       0.2


Table XX. Common real constants.

Most common float constants             Most common double constants

Value                 Count   %         Value                     Count   %
0.0                   4316    24.3      0.0                       9295    25.0
1.0                   2122    12.0      1.0                       5146    13.8
2.0                   839     4.7       2.0                       1920    5.2
0.5                   573     3.2       0.5                       1296    3.5
255.0                 319     1.8       100.0                     710     1.9
−1.0                  311     1.8       10.0                      689     1.9
4.0                   177     1.0       5.0                       585     1.6
100.0                 164     0.9       -Infinity                 467     1.3
10.0                  151     0.9       −1.0                      463     1.2
0.75                  146     0.8       1000.0                    454     1.2
64.0                  124     0.7       3.0                       409     1.1
3.0                   124     0.7       3.141592653589793 (π)     364     1.0
1000.0                114     0.6       NaN                       333     0.9
20.0                  109     0.6       4.0                       311     0.8
90.0                  79      0.4       0.25                      285     0.8
3.1415927 (π)         73      0.4       Infinity                  206     0.6
NaN                   68      0.4       8.0                       198     0.5
57.29578 (180/π)      68      0.4       1.797693···7E308 (MAX)    182     0.5
50.0                  68      0.4       180.0                     162     0.4
6.2831855 (2π)        64      0.4       1.5                       156     0.4
6.0                   64      0.4       0.1                       152     0.4
3.4028235E38 (MAX)    62      0.3       360.0                     145     0.4
1.0E-4                61      0.3       20.0                      120     0.3
180.0                 60      0.3       6.283185307179586 (2π)    118     0.3
5.0                   58      0.3       −2.0                      112     0.3
0.85                  58      0.3       0.01                      107     0.3
0.1                   58      0.3       255.0                     105     0.3
0.01                  52      0.3       0.2                       104     0.3
−10.0                 51      0.3       6.0                       103     0.3
0.0010                47      0.3       0.6                       92      0.2
8.0                   45      0.3       7.0                       91      0.2
1.5                   45      0.3       9.0                       88      0.2
0.8                   45      0.3       1.25                      83      0.2
0.3                   45      0.3       16.0                      77      0.2
0.25                  45      0.3       60.0                      75      0.2
-Infinity             44      0.2       31.0                      72      0.2
Infinity              44      0.2       26.0                      71      0.2
100000.0              42      0.2       0.75                      71      0.2
1.5707964 (π/2)       40      0.2       12.0                      69      0.2
0.70710677 (1/√2)     40      0.2       0.05                      67      0.2
−100.0                38      0.2       0.3                       66      0.2
200.0                 37      0.2       15.0                      65      0.2
65536.0               36      0.2       645.0                     64      0.2


Table XXI. Most common string constants.

Value                 Count     %
empty string          36 456    3.5
" "                   9003      0.9
newline               5281      0.5
")"                   4860      0.5
"."                   4718      0.5
"S"                   4540      0.4
"'"                   4201      0.4
"Q"                   4139      0.4
":"                   4083      0.4
","                   3885      0.4
"R"                   3796      0.4
"P"                   3663      0.4
", "                  3562      0.3
"/"                   3481      0.3
"""                   3113      0.3
"0"                   3024      0.3
"name"                2725      0.3
"("                   2561      0.2
"false"               2536      0.2
"true"                2461      0.2
"]"                   2115      0.2
": "                  2093      0.2
"Center"              1931      0.2
"-"                   1699      0.2
"BC"                  1658      0.2
"PvQ"                 1649      0.2
">"                   1634      0.2
"id"                  1450      0.1
"P->Q"                1373      0.1
"P&Q"                 1370      0.1
"java.lang.String"    1314      0.1
"line.separator"      1313      0.1
";"                   1307      0.1
"W"                   1237      0.1
"="                   1210      0.1
"shortDescription"    1207      0.1
tab                   1159      0.1
"}"                   1151      0.1
"RvS"                 1110      0.1
"null"                1107      0.1
"*"                   1084      0.1
"A"                   1074      0.1
"["                   1046      0.1
"class"               1012      0.1
"~P"                  996       0.1


[Figure 34, panel (a): histogram of integer constant values by order of magnitude. MIN: -9.22337e+18; MAX: 9.22337e+18; AVG: 467327912110143.2; MODE: 0; MEDIAN: 2; STD DEV: 4.20786e+17; SAMPLES: 3167173.]

Value             Count      %
0                 654101     20.7
1                 695404     22.0
2                 169760     5.4
2^n, n > 1        205877     6.5
2^n − 1, n > 1    198280     6.3
2^n + 1, n > 1    93544      3.0
other             1150207    36.3

(b)

Figure 34. Constant values: (a) distribution of integers (int and long); (b) integers (int and long) close to powers of two.

Figure 35(a) shows that 88% of all virtual method calls have a receiver set with size at most 2,

with the average size being 4.5. It is interesting to note the large number of methods with a receiver

size between 20 and 29. As can be expected, the average receiver set size is significantly larger for an

interface method call. Figure 35(b) shows an average set size of 16.5.

7.6. Switches

Figure 36(a) measures the number of case labels for each tableswitch and lookupswitch
instruction. We had to treat the tableswitch instruction specially, since it uses a contiguous range

of label values. Not all of the labels in the tableswitch instruction necessarily appeared in the

source code for the program. As a result, some of the branch targets for the cases will be the same as


Table XXII. Most common calls to methods in the Java library.

Method  Count  %

java.lang.StringBuffer.append(String)StringBuffer  340 044  15.8
StringBuffer.toString()String  143 985  6.7
StringBuffer.<init>()void  93 837  4.3
Object.<init>()void  52 597  2.4
StringBuffer.<init>(String)void  48 408  2.2
String.equals(Object)boolean  46 645  2.2
java.util.Hashtable.put(Object,Object)Object  42 629  2.0
java.io.PrintStream.println(String)void  42 594  2.0
StringBuffer.append(int)StringBuffer  31 702  1.5
StringBuffer.append(Object)StringBuffer  27 284  1.3
String.length()int  25 505  1.2
String.valueOf(Object)String  20 146  0.9
java.lang.IllegalArgumentException.<init>(String)void  15 737  0.7
StringBuffer.append(char)StringBuffer  15 116  0.7
String.substring(int,int)String  14 441  0.7
java.util.Vector.size()int  13 817  0.6
java.util.Vector.addElement(Object)void  12 705  0.6
java.util.Vector.elementAt(int)Object  12 087  0.6
java.lang.System.arraycopy(Object,int,Object,int,int)void  11 969  0.6
java.util.Iterator.hasNext()boolean  11 891  0.6
String.charAt(int)char  11 831  0.5
java.util.Iterator.next()Object  11 800  0.5
java.lang.Integer.<init>(int)void  11 658  0.5
java.lang.Throwable.getMessage()String  11 216  0.5
java.util.Vector.<init>()void  10 434  0.5
java.util.List.add(Object)boolean  10 166  0.5
Object.getClass()java.lang.Class  9641  0.4
java.util.Hashtable.get(Object)Object  9584  0.4
String.equalsIgnoreCase(String)boolean  9391  0.4
java.util.List.size()int  9226  0.4
java.util.Map.put(Object,Object)Object  8830  0.4
java.util.List.get(int)Object  8797  0.4
java.lang.Class.forName(String)java.lang.Class  8641  0.4
java.util.Map.get(Object)Object  8313  0.4
java.awt.Container.add(java.awt.Component)java.awt.Component  8270  0.4
String.substring(int)String  7862  0.4
java.io.PrintWriter.println(String)void  7767  0.4
java.util.Enumeration.nextElement()Object  7539  0.3
java.lang.Class.getName()String  7288  0.3
String.startsWith(String)boolean  7186  0.3
String.indexOf(String)int  6960  0.3
java.util.ArrayList.<init>()void  6705  0.3
java.lang.Integer.parseInt(String)int  6667  0.3
java.util.Enumeration.hasMoreElements()boolean  6526  0.3
java.lang.NullPointerException.<init>(String)void  6403  0.3


[Figure 35(a): histogram of receiver set sizes for virtual method calls, in buckets from 0 to 900-999. MIN: 0; MAX: 917; AVG: 4.5; MODE: 1; MEDIAN: 1; STD DEV: 39; SAMPLES: 2360924; FAILED: 413186.]

(a)

[Figure 35(b): histogram of receiver set sizes for interface method calls, in buckets from 0 to 600-699. MIN: 0; MAX: 652; AVG: 16.5; MODE: 1; MEDIAN: 7; STD DEV: 24; SAMPLES: 300563; FAILED: 82205.]

(b)

Figure 35. Number of receiver set sizes per (a) virtual method call (invokevirtual) and (b) interface method call (invokeinterface).

the default case target. Therefore, when computing the label set size and density of a tableswitch instruction, we ignore all of the labels whose branch targets are the same as the default case’s target. The figure shows that the average number of labels per switch is 12.8 and that 89% of the switches contain fewer than 30 labels.
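The special treatment of tableswitch described above can be sketched as follows. The code is a simplified illustration rather than the SandMark implementation: the class TableSwitchLabels and the representation of a tableswitch as a low label plus an array of branch targets are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that recovers the "real" case labels of a tableswitch. */
final class TableSwitchLabels {

    /**
     * Returns the labels whose branch target differs from the default target.
     *
     * @param low           the first label of the contiguous range
     * @param targets       targets[i] is the branch target for label low + i
     * @param defaultTarget the branch target of the default case
     */
    static List<Integer> effectiveLabels(int low, int[] targets, int defaultTarget) {
        List<Integer> labels = new ArrayList<>();
        for (int i = 0; i < targets.length; i++) {
            // Labels that jump to the default target were most likely
            // filler entries for gaps in the source-level case labels.
            if (targets[i] != defaultTarget) {
                labels.add(low + i);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Labels 1..5, where labels 3 and 4 share the default target 999.
        int[] targets = {100, 200, 999, 999, 300};
        System.out.println(effectiveLabels(1, targets, 999));
    }
}
```

The size of the returned list is the label set size counted for that switch; a lookupswitch needs no such filtering because it lists only explicit labels.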

Figure 36(b) shows the density of switch labels, computed as

\[
\frac{\text{number of case arms}}{\text{max label} - \text{min label} + 1} \qquad (1)
\]

This measure is important for selecting the most appropriate implementation of switch statements [13,14]. In the JVM, the tableswitch instruction is used when the density is high and the lookupswitch instruction is used when the density is low.
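Equation (1) is simple to compute from the set of case labels. The sketch below is illustrative only: the class SwitchDensity is hypothetical, the labels are assumed to be distinct (one label per case arm), and the threshold passed to chooseInstruction is an arbitrary parameter, not the rule an actual Java compiler applies.

```java
/** Hypothetical helper computing the label density of Equation (1). */
final class SwitchDensity {

    /** Number of case arms divided by (max label - min label + 1). */
    static double density(int[] labels) {
        if (labels.length == 0) {
            throw new IllegalArgumentException("a switch with no case arms has no density");
        }
        int min = labels[0];
        int max = labels[0];
        for (int label : labels) {
            min = Math.min(min, label);
            max = Math.max(max, label);
        }
        // long arithmetic avoids overflow when the labels span most of the int range
        long range = (long) max - (long) min + 1;
        return (double) labels.length / range;
    }

    /** Picks an instruction by comparing the density to an assumed threshold. */
    static String chooseInstruction(int[] labels, double threshold) {
        return density(labels) >= threshold ? "tableswitch" : "lookupswitch";
    }

    public static void main(String[] args) {
        int[] dense = {1, 2, 3, 4, 5};
        int[] sparse = {1, 100, 10000};
        System.out.println(density(dense) + " " + chooseInstruction(dense, 0.5));
        System.out.println(density(sparse) + " " + chooseInstruction(sparse, 0.5));
    }
}
```

A density of 1 means every value between the minimum and maximum label has its own arm, the case in which a jump table wastes no entries.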


[Figure 36(a): histogram of the number of case arms per switch, in buckets from 0 to 500-599. MIN: 0; MAX: 500; AVG: 8.7; MODE: 2; MEDIAN: 4; STD DEV: 18; SAMPLES: 18935; TOTAL: 164325.]

(a)

[Figure 36(b): histogram of switch label density, in buckets of width 0.05 from [0,0.05) to [1,1.05). MIN: 0; MAX: 1; AVG: 0.7; MODE: 1; SAMPLES: 18935; FAILED: 186.]

(b)

Figure 36. Switching statements: (a) number of case arms in tableswitch and lookupswitch; (b) label density of tableswitch and lookupswitch.

8. RELATED WORK

In a widely cited empirical study, Knuth conducted an analysis of 440 FORTRAN programs [1]. The study was conducted in an attempt to understand how FORTRAN was actually being used by typical programmers. By understanding how the language was being used, a better compiler could be designed. Each of the programs was subjected to static analysis in order to count common constructs such as assignment statements, ifs, gotos, do loops, etc. In addition, dynamic analysis was performed on 25 programs to examine the frequency of the constructs during a single execution of the program. The final analysis studied the effects of various local and global optimizations on the inner loops of 17 programs.

Knuth’s study was the first attempt to understand how programmers actually wrote programs. Since that initial study, many similar explorations have been conducted for a variety of languages. Salvadori et al. [3] and Chevance and Heidet [2] both examined the profile of COBOL programs. Salvadori et al. looked at the static profile of 84 COBOL programs within an industrial environment. In addition to examining the frequency of specific constructs, they also studied the development history by recording the number of runs per day and the time interval between the runs. Chevance and Heidet studied the static nature of COBOL programs through the number of occurrences of source-level constructs in more than 50 programs. The authors took their study a step further by computing the frequency of the constructs as the program executed. In this study, four categories of data were examined: constants, variables, expressions, and statements.

Other than Chevance and Heidet [2], most studies of programmer behavior have concentrated on the static structure of programs. It is equally important to examine how programs change over time. Collberg et al. [15] showed how to visualize the evolution of a program by taking snapshots of its development from a CVS repository and presenting these data using a temporal graph-drawing system.

Cook and Lee [4] undertook a static analysis of 264 Pascal programs to gain an understanding of how the language was being used. The analysis was conducted within 12 different contexts, e.g. procedures, then-parts, else-parts, for-loops, etc. In addition, they compared their results with those of other language studies. Cook [16] conducted a static analysis of the instructions used in the system software on the Lilith computer. An analysis of APL programs was conducted by Saal and Weiss [5,6].

Antonioli and Pilz [17] conducted the first analysis of the Java class file. The goal of their study was to answer three questions. (1) What is the size of a typical class file? (2) How is the size of the class file distributed between its different parts? (3) How are the bytecode instructions used? To answer these questions, they examined six programs with a total of 4016 unique classes. In contrast to the present study, they examined the size in bytes of each of the five parts of a class file (i.e. header, constant, class, field, and method). They also examined instruction frequencies to see what percentage of the instruction set was actually being used. They found that on average only 25% of the instruction set was used by any one program. Our analysis does not focus on the frequency of a particular instruction per program but instead looks at the frequency over all programs. Overall, their study differs from ours in that they were interested in answering a few very specific questions, whereas our analysis is focused on obtaining a complete understanding of JVM programs.

Gustedt et al. [18] conducted a study of Java programs that measures the tree width of CFGs. The tree width is affected by such constructs as goto usage, short-circuit evaluation, multiple exits, break statements, continue statements, and returns. The authors examined both Java API packages as well as Java applications obtained through Internet searches.

O’Donoghue et al. [19] performed an analysis of Java bytecode bigrams. Their analysis was performed on 12 benchmark applications. The only similarity between their data and ours is that we both found aload_0, getfield to be the most frequently occurring bigram. We attribute the differences to the small sample size used in their study.
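For readers unfamiliar with bigram analysis, a bigram is simply a pair of adjacent instructions, and the statistic of interest is how often each pair occurs. The sketch below counts bigrams over a list of opcode mnemonics. It is an illustration only: the class BigramCounter is hypothetical, and a real analysis would take the opcodes from decoded method bodies and would likely avoid pairing instructions across basic block or method boundaries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that counts adjacent opcode pairs in one instruction sequence. */
final class BigramCounter {

    /** Maps each bigram, written as "first, second", to its number of occurrences. */
    static Map<String, Integer> countBigrams(List<String> opcodes) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < opcodes.size(); i++) {
            String bigram = opcodes.get(i) + ", " + opcodes.get(i + 1);
            // merge adds 1 to an existing count or starts the count at 1
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList(
                "aload_0", "getfield", "aload_0", "getfield", "iadd", "ireturn");
        System.out.println(countBigrams(body));
    }
}
```

Summing such maps over every method in a corpus and sorting by count gives the kind of ranking in which aload_0, getfield appears first.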

One of the byproducts of our analysis is a large repository of publicly available data on Java programs. Appel [20] maintains a collection of interference graphs which can be used in studying graph-coloring algorithms. The availability of such repositories is highly useful in the study of compiler implementation techniques.


9. DISCUSSION AND SUMMARY

In this paper we have performed a static analysis of 1132 Java programs obtained from the Internet. Through the use of SandMark, we were able to analyze the structure of the Java bytecode. Our analysis ranged from simple counts, such as methods per class, instructions per method, and instructions per basic block, to structural metrics such as the complexity of CFGs.

Our main goal in conducting the study was to use the data in our research on software protection; however, we believe these data are useful in a variety of settings. These data could be used in the design of future programming languages and virtual machine instruction sets, as well as in the efficient implementation of compilers.

It would be interesting to perform a similar study of Java source code. Even though Java bytecode contains much of the same information as in the source from which it was compiled, some aspects of the original code are lost. Examples include comments, source code layout, some control structures (when translated to bytecode, for and while loops may be indistinguishable), some type information (Booleans are compiled to JVM integers), etc.

Owing to our random sampling of code from the Internet, it is possible that our set of Java jar-files is somewhat skewed. It would be interesting to further validate our results by comparing against a different set of programs, such as standard benchmark programs (for example, SpecJVM [21]), or programs collected from standard source code repositories (for example, sourceforge.net).

We would also welcome studies for other languages. It would be interesting to validate our results by performing a similar study for MSIL, the bytecode generated from C# programs, since MSIL and JVM (and C# and Java) share many common features. It would also be interesting to compare our results with languages very different from Java, such as functional, logic, and procedural languages. It might then be possible to derive a set of ‘linguistic universals’, programming behaviors that apply across a range of languages. Such information would be invaluable in the design of future programming languages.

Our experimental data and the SandMark tool that was used to collect them can be downloaded from http://sandmark.cs.arizona.edu/download.html.

REFERENCES

1. Knuth DE. An empirical study of FORTRAN programs. Software: Practice and Experience 1971; 1:105–133.
2. Chevance RJ, Heidet T. Static profile and dynamic behavior of COBOL programs. SIGPLAN Notices 1978; 13(4):44–57.
3. Salvadori A, Gordon J, Capstick C. Static profile of COBOL programs. SIGPLAN Notices 1975; 10(8):20–33.
4. Cook RP, Lee I. A contextual analysis of Pascal programs. Software: Practice and Experience 1982; 12:195–203.
5. Saal HJ, Weiss Z. Some properties of APL programs. Proceedings of 7th International Conference on APL. ACM Press: New York, 1975; 292–297.
6. Saal HJ, Weiss Z. An empirical study of APL programs. International Journal of Computer Languages 1977; 2(3):47–59.
7. Stata R, Abadi M. A type system for Java bytecode subroutines. Conference Record of POPL 98: The 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, CA, 1998. ACM Press: New York, 1998; 149–160.
8. Collberg CS, Thomborson C. Watermarking, tamper-proofing, and obfuscation - tools for software protection. IEEE Transactions on Software Engineering 2002; 28(8):735–746.
9. Collberg C, Myles G, Huntwork A. SANDMARK - A tool for software protection research. IEEE Magazine of Security and Privacy 2003; 1(4):40–49.
10. Lindholm T, Yellin F. The Java Virtual Machine Specification (2nd edn). Addison-Wesley: Reading, MA, 1999.
11. Cousot P, Cousot R. An abstract interpretation-based framework for software watermarking. Proceedings of the ACM Conference on Principles of Programming Languages. ACM Press: New York, 2004.
12. Dean J, Grove D, Chambers C. Optimization of object-oriented programs using static class hierarchy analysis. Proceedings of the 9th European Conference on Object-Oriented Programming. Springer: Berlin, 1995; 77–101.
13. Bernstein R. Producing good code for the case statement. Software: Practice and Experience 1985; 15(10):1021–1024.
14. Kannan S, Proebsting TA. Correction to ‘producing good code for the case statement’. Software: Practice and Experience 1994; 24(2):233.
15. Collberg C, Kobourov S, Nagra J, Pitts J, Wampler K. A system for graph-based visualization of the evolution of software. Proceedings of the ACM Symposium on Software Visualization, June 2003. ACM Press: New York, 2003.
16. Cook RP. An empirical analysis of the Lilith instruction set. IEEE Transactions on Computers 1989; 38(1):156–158.
17. Antonioli DN, Pilz M. Analysis of the Java class file format. Technical Report, University of Zurich, 1998. Available at: ftp://ftp.ifi.unizh.ch/pub/techreports/TR-98/ifi-98.04.ps.gz.
18. Gustedt J, Mæhle OA, Telle JA. The treewidth of Java programs. Proceedings of the 4th International Workshop on Algorithm Engineering and Experiments (ALENEX) (Lecture Notes in Computer Science, vol. 2409). Springer: Berlin, 2002.
19. O’Donoghue D, Leddy A, Power J, Waldron J. Bigram analysis of Java bytecode sequences. Proceedings of the 2nd Workshop on Intermediate Representation Engineering for the Java Virtual Machine. National University of Ireland: Ireland, 2002; 187–192.
20. Appel A. Sample graph coloring problems. http://www.cs.princeton.edu/~appel/graphdata/.
21. SpecJVM98. http://www.specbench.org/osg/jvm98.