Selecting, Quantifying, Optimizing, and
Understanding Visualization Techniques: A
Computational Intelligence-Based Approach
By
Tufail Muhammad
Supervised by
Dr. Zahid Halim
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
in Computer System Engineering
Faculty of Computer Science and Engineering
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology
Topi, Khyber Pakhtunkhwa, Pakistan
Fall, 2016
Dissertation examination committee:
Dr. Zahid Halim Advisor, Faculty of Computer Science and
Engineering, Ghulam Ishaq Khan Institute
of Engineering Sciences and Technology,
Topi, PAKISTAN.
Prof. Dr. Keith C.C. Chan Foreign Evaluator, Big Data Lab,
Department of Computing, The Hong
Kong Polytechnic University, Hung Hom,
Kowloon, HONG KONG.
QS World University Rank=116
Prof. Dr. Ivan Viola Foreign Evaluator, The Institute of
Computer Graphics and Algorithms,
Vienna University of Technology,
AUSTRIA.
QS World University Rank=197
Prof. Dr. Michael John Watts Foreign Evaluator, Academic Head of
Programme, Information Technology,
Auckland Institute of Studies, Auckland,
NEW ZEALAND.
Dr. Muhammad Tanvir Afzal External Examiner, Department of
Computer Science, Capital University of
Science & Technology Islamabad
Expressway, Kahuta Road, Zone-V,
Islamabad, PAKISTAN.
Dr. Muhammad Zohaib Zafar Iqbal External Examiner, Department of
Computer Science, National University of
Computer and Emerging Sciences, A.K
Brohi Road, Sector H-11/4, Islamabad,
PAKISTAN.
Dr. Ahmar Rashid Internal Examiner, Faculty of Computer
Science and Engineering, Ghulam Ishaq
Khan Institute of Engineering Sciences
and Technology, Topi, PAKISTAN.
Dr. Ghulam Abbas Internal Evaluator (Pre-screening), Faculty
of Computer Science and Engineering,
Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology,
Topi, PAKISTAN.
Dr. Rashad M Jillani Internal Evaluator (Pre-screening), Faculty
of Computer Science and Engineering,
Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology,
Topi, PAKISTAN.
The work in this dissertation has been carried out at the Faculty of Computer
Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences
and Technology (GIKI), Topi, Pakistan. The research was supported by the Higher
Education Commission of Pakistan (HEC) under the Indigenous Ph.D. Fellowship
Program.
Scholar PIN: PS3-254
Copyright©2015 by Tufail Muhammad
Declaration of authorship
The work presented in this dissertation, entitled “Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A Computational Intelligence-Based Approach”, has been undertaken in the Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, under the supervision of Dr. Zahid Halim. I herewith declare that the material presented in this dissertation has not previously been submitted, in whole or in part, for any kind of academic award elsewhere. Moreover, the results are produced from original research, except where otherwise properly acknowledged and referenced in the text.
Tufail Muhammad
Acknowledgements
Foremost, I thank the Almighty God, the Most Gracious, and the Most
Merciful. When I look back over the past few years, thinking of certain people and memorable events shared with them, I would like to thank them all for their support and encouragement throughout my doctoral studies.
I would like to express my utmost gratitude to my supervisor, Dr. Zahid Halim, whose invaluable support, help, patience, and encouragement enabled me to complete this dissertation. He deserves appreciation that cannot be expressed in words. Every meeting with him was a source of inspiration and great motivation for a PhD student with little hope. His guidance not only
helped me in building my technical knowledge about the research field, but
also in my non-academic matters. Surely, without his supervision, I would be
lost. I would like to convey special thanks to Prof. Dr. Khalid J. Siddiqui for
his proofreading/editing, which has improved the composition of this thesis.
I want to express my heartfelt gratitude to my colleagues at GIK Institute,
Dr. Fazal Wahab, Dr. Adam Khan, Dr. Rahim Khan, Mr. Ihsan Ali, and Mr.
Mehran Bashir for their constant support and motivation, which brought this dissertation into existence. I would like to thank Mr. Ali Abass, Mr.
Muhammad Sohaib, Mr. Muhammad Riaz, and Mr. Hilal Khan for their
good wishes and “khidmat” (services) during the last two years.
I would like to express my deepest gratitude to my family and my parents, especially my late father, though he is no longer here to see my success. I am grateful to my brothers and my wife, whose unconditional support and encouragement helped me throughout my Ph.D. study. I am indebted to my sons, Iqrash Ahmad Khan and Zawar Ahmad Khan, who always missed me.
I would like to convey special thanks to the Higher Education Commission
(HEC) of Pakistan for financially supporting my PhD study through the
indigenous PhD program. I am also thankful to GIK Institute for providing research facilities and an excellent environment. My special thanks go to all the faculty members and staff at the Faculty of Computer Science and Engineering for their support and cooperation.
List of Publications Extracted from
This Work
Chapter 3 is published in the Applied Soft Computing Journal
T. Muhammad and Z. Halim, "Employing Artificial
Neural Networks for Constructing Metadata-Based
Model to Automatically Select an Appropriate Data
Visualization Technique," Applied Soft Computing, Vol. 49, 2016, pp. 365-384. [ISSN: 1568-4946, Thomson Reuters JCR 2016, Impact factor 2.857, Elsevier]
Chapter 4 is published in the Information Sciences Journal
Z. Halim and T. Muhammad, "Quantifying and
Optimizing Visualization: An Evolutionary Computing-
Based Approach," Information Sciences, Vol. 385, 2017, pp. 284-313. [ISSN: 0020-0255, Thomson Reuters JCR 2016, Impact factor 3.364, Elsevier]
Chapter 5 is accepted in the Journal of Visual Languages and Computing
T. Muhammad, Z. Halim, and M. A. Khan,
“Visualizing Trace of Java Collection APIs by Dynamic Bytecode Instrumentation,” Journal of Visual Languages and Computing, Vol. --, 20--, in press. [ISSN: 1045-926X, Thomson Reuters JCR 2016, Impact factor 0.634, Elsevier]
Dedication
To, my father (late)
Abstract
Information visualization is a prominent technique for effectively exploring and analyzing large volumes of data visually. A visualization must be aesthetically appealing and perceptually pleasing to human cognition. This necessitates a framework to predict a visualization technique based on two aspects: the underlying dataset and the task to be performed on it.
Additionally, the resultant visualization must be optimal in the context of
aesthetics and human perception. This dissertation contributes from three perspectives that subsume information visualization aspects: automatic selection of a visualization technique, quantifying and optimizing the visualization layout, and visualizing software traces. The study provides a computational intelligence (CI) model to predict a visualization technique based on the metadata of the original dataset and the relevant tasks. Similarly, visualization metrics are formulated to objectively measure visualization quality. Based on these metrics, an evolutionary algorithm optimizes the visualization layout. Finally, a hierarchical visualization technique is used to study the usage of application programming interface (API) objects in program traces. The traces are collected using bytecode instrumentation.
This dissertation has three parts. The first part aims to predict an appropriate visualization technique for a specific dataset. A custom dataset is built using the knowledge available in the contemporary literature on various visualization techniques. The dataset comprises four metadata attributes, the relevant task, and the visualization techniques. The study develops an artificial neural network (ANN) to predict a visualization technique using five input and eight output neurons. The optimal neural network architecture is obtained by evaluating various structures with different network configurations. Several well-known performance metrics, i.e., the confusion matrix, accuracy, precision, and sensitivity of the classification, are used to compare the various neural network architectures. Additionally, the best ANN
model is compared with five other well-known classifiers: k-nearest neighbor
(k-NN), naïve Bayes (NB), decision tree (DT), random forest (RF), and
support vector machine (SVM).
The second part provides the design of an optimal visualization using visualization quality metrics. Initially, the study focuses on the design parameters that contribute to the quality of a visualization technique. Visualization metrics are proposed to measure the aesthetic and perceptual characteristics of a visualization: effectiveness, expressiveness, readability, and interactivity. An evolutionary algorithm (EA)-based framework to optimize the layout of a visualization technique is also proposed. The treemap visualization technique is used for layout optimization with the EA. The results are evaluated using controlled experiments and benchmark tasks.
The last part uses treemap-based visualization to analyze the API objects used in software, particularly to understand API objects during the runtime of Java programs. The work consists of two aspects: the extraction of API information using bytecode instrumentation, and the development of a visualization tool to analyse the traces using treemaps. Initially, a bytecode instrumentation tool is developed to probe and collect runtime information. The extracted information is logged into an extensible markup language (XML) file. The log file is then visualized as a treemap. The instrumentation part is evaluated using twenty benchmark and ten real-world applications. The results show that the instrumentation tool causes minimal runtime overhead.
Table of Contents
Declaration of authorship .....................................................................IV
Acknowledgements .............................................................................. V
Abstract ........................................................................................... IX
List of Figures .................................................................................. XV
List of Acronyms ........................................................................... XVIII
CHAPTER 1 : INTRODUCTION ....................................................................... 1
1.1 Motivation ..................................................................................... 3
1.2 Problem statement ............................................................................ 5
1.3 Primary research questions .................................................................. 8
1.4 Research Hypothesis ......................................................................... 9
1.5 Aims and objectives ......................................................................... 10
1.6 Research Methodology ..................................................................... 12
1.7 Assumptions .................................................................................. 15
1.8 Dissertation Outline ......................................................................... 15
1.9 Chapter Summary ........................................................................... 17
CHAPTER 2 : LITERATURE REVIEW ............................................................... 18
2.1 Dynamic code analysis and software visualization .................................... 19
2.2 Automatic visualization prediction ....................................................... 29
2.3 Visualization optimization ................................................................. 34
2.4 Chapter Summary ........................................................................... 43
CHAPTER 3 : ON SELECTING A DATA VISUALIZATION TECHNIQUE ............. 46
3.1 Proposed system, dataset, and visualization techniques .............................. 49
3.1.1. BUILDING THE DATASET ....................................................... 51
3.1.2. VISUALIZATION TECHNIQUES ................................................. 55
3.2 Artificial neural network preliminaries .................................................. 58
3.3 Experiments and results .................................................................... 60
3.3.1. ANN EXPERIMENTS ............................................................ 63
3.3.2. THE N-FOLD CROSS-VALIDATION ............................................. 71
3.3.3. PERFORMANCE ANALYSIS ..................................................... 76
3.4 Sensitivity analysis ........................................................................... 80
3.5 Comparison with other classifiers ......................................................... 81
3.6 Ranking three best visualizations ......................................................... 87
3.7 Comparison with state-of-the-art .......................................................... 93
3.8 Discussion ..................................................................................... 96
3.9 Chapter summary ............................................................................ 98
CHAPTER 4 : QUANTIFYING AND OPTIMIZING VISUALIZATION .......................... 100
4.1 The information visualization metrics ................................................. 102
4.1.1. EFFECTIVENESS ............................................................... 104
4.1.2. EXPRESSIVENESS .............................................................. 104
4.1.3. READABILITY .................................................................. 105
4.1.4. INTERACTIVITY ................................................................ 106
4.1.5. THE COMBINED FITNESS FUNCTION ........................................ 107
4.2 Proposed solution .......................................................................... 107
4.2.1 PROBLEM FORMULATION ..................................................... 109
4.2.2 CHROMOSOME ENCODING .................................................... 110
4.2.3 REPRODUCTION OPERATORS ................................................. 114
4.3 Experiments and Results ................................................................. 116
4.3.1 TREEMAP ........................................................................ 116
4.3.2. EA RESULTS.................................................................... 118
4.3.3. EVALUATION .................................................................. 122
4.3.3.1. USER STUDY ......................................................... 124
4.3.3.2. ANALYSIS OF VARIANCE AND POST HOC ANALYSIS ........... 136
4.3.4. DIRECT METHOD .............................................................. 141
4.4 Discussion ................................................................................... 143
4.5 Chapter Summary ......................................................................... 149
CHAPTER 5 : VISUALIZING TRACE OF JAVA COLLECTION APIS .......................... 152
5.1. Proposed System for Java Tree Visualization ....................................... 155
5.1.1 JAVA TRACES VISUALIZATION SYSTEM OVERVIEW ........................ 156
5.1.2 INSTRUMENTATION ........................................................... 158
5.1.3 DATA COLLECTION ............................................................ 159
5.1.4 VISUALIZATION AND USER INTERACTION .................................. 160
5.2. Case study .................................................................................. 163
5.3. Performance evaluation and comparison ............................................. 167
5.3.1 EXPERIMENT DESIGN .......................................................... 171
5.4. Performance evaluation .................................................................. 180
5.5. Chapter Summary ........................................................................ 184
CHAPTER 6 : CONCLUSIONS AND FUTURE WORK ........................................... 186
6.1. Primary research questions .............................................................. 187
6.2. Summary of the findings ................................................................. 197
6.3. Limitations ................................................................................. 199
6.4. Future work ................................................................................ 200
APPENDIX A ........................................................................................ 207
APPENDIX B ........................................................................................ 207
APPENDIX C ........................................................................................ 207
APPENDIX D ........................................................................................ 207
APPENDIX E ........................................................................................ 204
List of Figures
Figure 1.1 Block diagram listing system components ...................................................... 10
Figure 1.2 A visual roadmap of the dissertation .............................................................. 14
Figure 2.1 Related work taxonomy ................................................................................ 19
Figure 2.2 A tree with its corresponding treemap ............................................................. 23
Figure 2.3 Typical compositions of a Java program ......................................................... 24
Figure 2.4 Information visualization techniques classification .......................................... 30
Figure 3.1 System working for the visualization prediction .............................................. 50
Figure 3.2 Visualization, tasks and metadata mapping ..................................................... 53
Figure 3.3 The eight visualizations used as class label ...................................................... 56
Figure 3.4 Neural Network ............................................................................................. 58
Figure 3.5 Dataset larger values vs. smaller values ........................................................... 63
Figure 3.6 Single hidden layer network, 2-hidden layered network ................................... 67
Figure 3.7 no. of nodes vs. MSE ..................................................................................... 69
Figure 3.8 2-Hidden nodes .............................................................................................. 70
Figure 3.9 Two hidden layered structure analysis ............................................................ 74
Figure 3.10 Accuracy for different number of nodes in hidden layer ................................ 75
Figure 3.11 Hidden nodes vs. MSEs ............................................................................... 75
Figure 3.12 Hidden Nodes vs. MSE for 1 hidden layered ANN....................................... 77
Figure 3.13 Confusion matrix of the best ANN architecture ............................................ 79
Figure 4.4 Crossover and mutation operations. ............................................................... 115
Figure 4.5 A tree with its corresponding treemap ........................................................... 117
Figure 5.1 System overview of the system for Java traces visualization ............................ 158
Figure 5.2 Segment of log file ......................................................................................... 161
Figure 5.3 Visualization main view ................................................................................ 165
Figure 5.4 Visualization Package-wise view.................................................................... 166
Figure 5.5 Mutator methods view .................................................................................. 166
Figure 5.6 Search result for HashTable .......................................................................... 166
List of Tables
Table 3.1 Dataset description .......................................................................................... 54
Table 3.2 The eight visualization techniques used in recent literature ............................... 55
Table 3.3 Network structure and initial parameters .......................................................... 64
Table 3.4 NN performance ............................................................................................. 78
Table 3.5 Best ANN performance ................................................................................... 79
Table 3.6 Impact of various learning approaches on the ANN ......................................... 80
Table 3.7 Sensitivity analysis results ................................................................................ 81
Table 3.8 Overall accuracy for random forest .................................................................. 83
Table 3.9 SVM prediction accuracy ................................................................................ 84
Table 3.10 Per class accuracy of different classifiers ......................................................... 85
Table 3.11 Average accuracy and CPU time of classifiers ................................................ 86
Table 3.12 Accuracy comparison using Friedman test ..................................................... 89
Table 3.13 Sensitivity analysis using classifiers- error rate (%) .......................................... 89
Table 3.14 Dataset description ........................................................................................ 90
Table 3.15 Dataset with three best visualizations ............................................................. 91
Table 3.16 Benchmark datasets and the predicted visualization based on task ................... 92
Table 3.17 Comparison between the proposed system and state-of-the-art ........................ 95
Table 4.1 Aspects mentioned in literature for better visualization .................................... 104
Table 4.3 Description of the genes in a chromosome....................................................... 112
Table 4.4 EA parameter settings ..................................................................................... 116
Table 4.5 Various combinations of the objective function ................................................ 123
Table 4.6 The five benchmark tasks ................................................................................ 124
Table 4.7 Summary of user study scores ......................................................................... 130
Table 5.1 An example of collection API objects analysis using clustering ........................ 153
Table 5.2 Log File Detail .............................................................................................. 164
Table 5.3 Collection APIs per program .......................................................................... 168
Table 5.4 Null hypotheses with their alternatives ............................................................ 171
Table 5.6 Task description ............................................................................................. 173
Table 5.7 Experimental group statistics for time and usability score (0-5) ......................... 175
Table 5.8 Control group statistics for time and usability score (0-5) ................................. 176
Table 5.9 Per task comparison ....................................................................................... 178
Table 5.10 Results statistics ............................................................................................ 179
Table 5.11 Software time taken while loading ................................................................. 181
Table A.1 Single hidden layer NN structure time and MSE (split) ................................... 207
Table A.2 Two-Hidden layers NN structure time and MSE ............................................ 207
Table A.3 Network with Training and Test data ............................................................. 207
Table A.4 NN with validation check- Early stop ............................................................ 207
Table A.5 NN with stop with goal / Validation stop ..................................................... 207
Table A.6 Single hidden layer NN structure and MSEs (10Folds-CV) ............................. 207
Table A.7 Comparison of classification and prediction accuracy (10Folds-CV) ................ 207
Table A.8 Information visualization results for iris dataset .............................................. 207
List of Acronyms
1D One dimension
2D Two dimensions
ANN Artificial neural network
ANA Analytical style demonstrator
API Application programming interface
ART Artistic style demonstrator
AST Abstract syntax tree
AUC Area under curve
AWT Abstract window toolkit
BCEL Byte code engineering library
CI Computational intelligence
DT Decision tree
DVF Dense vector field
EA Evolutionary algorithm
EC Evolutionary computation
FFNN Feed-forward neural network
GA Genetic algorithm
HCI Human computer interaction
IDE Integrated development environment
JCF Java collection framework
JRE Java runtime environment
JVM Java virtual machine
JVMPI Java virtual machine profiler interface
JVMTI Java virtual machine tools interface
k-NN k-nearest neighbor
LBM Lifetime behavior model
LM Levenberg-Marquardt
MAG Magazine style demonstrator
MBs Megabytes
MLP Multilayer perceptron
MSE Mean square error
NB Naïve Bayes
nD n dimensions
RF Random forest
Rprop Resilient backpropagation
SVM Support vector machine
TTT Task by data type taxonomy
XML Extensible markup language
Introduction
“Imagination rules the world”
Napoleon Bonaparte
Information visualization is now ubiquitous in almost every discipline as a means to visually analyse large volumes of data effectively. Nevertheless, the selection of an appropriate visualization technique for a particular problem domain or dataset is still a non-trivial undertaking. Moreover, the prospective visualization needs to be perceptually appealing and aesthetically alluring to the human cognitive system. The work presented in this dissertation addresses
these problems using software instrumentation and
computational intelligence-based approaches. Using software
instrumentation, a visualization tool is presented to
comprehend the collection APIs usage in Java-based
applications through dynamic analysis. Computational
intelligence is used for predicting appropriate visualization
techniques for a particular dataset using the metadata.
Additionally, a set of visualization metrics is proposed to
quantify the given visualization technique. Later, the
proposed visualization metrics are used to optimize
visualization layout using evolutionary algorithms. This
chapter provides a synopsis for the research done in the
dissertation. The motivation for the work carried out in the
subsequent chapters is also covered. Various research
questions are formulated, the answers to which are sought
in this work. The research methodology, along with its major limitations and assumptions, is also discussed.
The advances in computing and related technologies have given birth to new avenues of information handling and exploration. These advances have generated large volumes of data. These volumes of data and information are produced and manipulated daily through various media, i.e., social media [1, 2], business and financial transactions [3], ever-growing large databases (also known as big data) [4], and large, intricate software systems running round the clock [5]. This has paved the way for approaches to exploit the hidden information underneath these piles of data. The three popular domains used to explore huge amounts of data include:
data mining, exploratory data analysis, and visualization. Each of these
can further be classified into various domains. For example, stream mining, uncertain data mining, and graph mining are a few of the subcategories of the data mining domain. Similarly, visualization has the subdomains of information visualization and software visualization. Since the work presented in this dissertation addresses the problem at hand using visualization, and applies computational intelligence with visualization as a test bed, the rest of this chapter focuses on these topics. For a further study of data mining and exploratory data analysis, the reader is referred to [6, 7].
Information visualization is a powerful method to visually explore large, complicated data and gain a thorough insight [8]. Information visualization uses visual computing to amplify human cognition with abstract information. It promises to enable expeditious understanding and action in a world of increasing data volumes. There has been a consistent demand for sophisticated tools and techniques with which visualization systems can explore and analyse information efficiently [9]. Nonetheless, for naïve business users, the appropriate
tool/technique selection with respect to the data at hand remains a non-trivial issue [10, 11]. Moreover, to explore and present the underlying information effectively and efficiently, visualizations suited to the specific dataset are needed [12]. The visualization must not be merely a pretty image; it should be perceptually and aesthetically appealing to the human cognitive system as well. This is also true for complex entities, such as software, which produce a vast amount of information, especially during execution [13].
Based on the discussion above, this dissertation aims to address the problem of gathering data from software-based systems for visual inspection. Computational intelligence methods are also developed to extract useful information, to predict an appropriate visualization technique for a given dataset, and to optimize the layout of a visualization technique.
1.1 Motivation
Effective information handling and manipulation is an important factor influencing strategic decision making, from primordial times to the present day. The current era, however, is marked as the information age, which requires processing bulk data rapidly to gain insight [6]. In addition to the various experiments, processes, and events that generate data, software systems, whether system software or application software, also generate data. This is especially relevant for software systems built using the object-oriented paradigm, owing to its inheritance, polymorphism, and overriding features. Collecting and understanding such data can give insight into the internal working of a software system, which can later be optimized for performance. However, the data generated by these software-based systems can be huge, ranging from megabytes (MBs) to gigabytes (GBs), depending on the size of the particular software. Additionally, object-oriented software makes vast use of objects of different data structures, such as the Java collection application programming interfaces (APIs), which makes the code very difficult to understand. Programmers are always
interested in optimizing their code to make it more efficient. However, when the code base is huge and objects from multiple other classes are instantiated, understanding the code becomes a difficult task.
Information visualization, on the other hand, is a powerful method for
visual representation of large datasets and gives instant stimulus to the
human cognitive system [14]. Advances in computing and information processing facilities have given rise to new tools and visualization techniques [15]. Selecting a particular visualization technique for a given problem requires domain expertise. Nevertheless, business users are always keen to build the most suitable visualization of the data at hand. Most of the time, these users are naïve, with little or no knowledge of the underlying dataset and the intrinsic relationships among its items. Although the field of information visualization has come a long way, a quantitative measure to evaluate how well a particular visualization represents the data is yet to be found. Such a quantitative measure would not only help in choosing an appropriate visualization, but could also be used to optimize the layout of a particular visualization technique.
Computational intelligence (CI), a set of nature-inspired computational
methodologies that can address complex real-world problems for which
traditional approaches are ineffective, can be used to optimize the layout of
a visualization technique. The core components of CI include evolutionary computation (EC), artificial neural networks (ANNs), and fuzzy logic, to name a few. However, to optimize a visualization layout, a CI-based solution requires an objective function that is to be either maximized or minimized.
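The role of such an objective function can be illustrated with a minimal sketch. The Python snippet below (illustrative only; the two design parameters and the readability score are invented stand-ins, not the metrics proposed in this dissertation) uses a simple hill climber to maximize a hypothetical layout-quality objective:

```python
import random

def readability_score(params):
    """Hypothetical objective: rewards spacing near 0.4 and contrast
    near 0.7. A real visualization metric would replace this stand-in."""
    spacing, contrast = params
    return -(spacing - 0.4) ** 2 - (contrast - 0.7) ** 2

def hill_climb(score, start, step=0.05, iters=500, seed=0):
    """Greedy local search: keep a random perturbation only if it
    improves the objective."""
    rng = random.Random(seed)
    best, best_score = list(start), score(start)
    for _ in range(iters):
        candidate = [p + rng.uniform(-step, step) for p in best]
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

params, s = hill_climb(readability_score, [0.0, 0.0])
```

Any CI method, from a simple local search like this to the evolutionary algorithms used later in this work, needs exactly such a function to distinguish better layouts from worse ones.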
Based on the aforementioned details, the work in this dissertation is motivated to use the data generated by software-based systems (although data from any other system can also be used) and software instrumentation, coupled with visualization, to gain insight into complex software-based systems. It also aims to exploit the search capabilities of CI for predicting and optimizing a visualization technique.
1.2 Problem statement
Based on the motivation, the problem is twofold: firstly, there is a need for an appropriate visualization selection mechanism that effectively presents the data and supports the tasks to be accomplished with that particular dataset. Secondly, the selected visualization should fulfil the aesthetic and perceptual requirements of the users [12]. Conversely, an inappropriate visualization will lead to inadequate decision making based on incorrectly visualized information. In addition, the visualization must not overwhelm the user, whilst conveying the intended information effectively and efficiently [16]. Remedying these situations raises many challenges: What type of information is needed from the dataset to predict an appropriate visualization? How are tasks and visualizations related? Is a benchmark dataset available? Which perceptual design parameters characterize a visualization, which metrics evaluate it, and how can these metrics be computed? Several areas, such as human-computer interaction (HCI), interface design, perceptual studies, and cognitive science, already offer abundant foundational work addressing these imperfections. Similarly, advances in computational intelligence techniques, such as ANNs and genetic algorithms (GAs), bring powerful methods for classification, prediction, and optimization.
Large software systems are an example of the complex entities that human beings develop [17]. The behaviour of software systems during execution is always a subject of interest to developers with respect to maintenance and application optimization [18, 19]. This holds particularly for Java-based applications, where programs use collection APIs, e.g., Hashtable and ArrayList, to store runtime data. Program performance may degrade with inefficient usage of these APIs [20]. For program comprehension in general, and maintenance in particular, developers need to understand where the large API objects are created. The runtime information about API usage may be recorded through binary instrumentation, although this may be subject to runtime overheads [21]. Thus, there is a need for an effective method to analyse the large amount of information collected during software execution.
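The core idea of recording where collection objects are created can be sketched in Python, even though this dissertation targets Java bytecode; the class and function names below are purely illustrative and not part of the proposed tool. A wrapped constructor logs the call site of every instantiation, which is, in spirit, what an instrumentation probe does:

```python
import collections
import inspect

# Maps (filename, function) -> number of objects created there.
allocation_log = collections.Counter()

def traced(cls):
    """Wrap a class so every instantiation records its call site,
    mimicking what a bytecode instrumentation probe would log for
    Java collection APIs."""
    class Traced(cls):
        def __init__(self, *args, **kwargs):
            caller = inspect.stack()[1]  # the frame that invoked the constructor
            allocation_log[(caller.filename, caller.function)] += 1
            super().__init__(*args, **kwargs)
    Traced.__name__ = f"Traced{cls.__name__}"
    return Traced

TracedDict = traced(dict)  # stand-in for an instrumented collection class

def build_cache():
    return TracedDict(a=1)

for _ in range(3):
    build_cache()
```

A real tool would write such records to a log file for post-mortem analysis, as the instrumentation module described later in this dissertation does.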
Visualization is a powerful method for exploring and confirming underlying hypotheses. Modern visualization tools and techniques are developed in abundance, both in research and in industry. This situation demands a method for selecting a visualization technique for given data. Nevertheless, the underlying data may have complex intrinsic relationships and characteristics that influence the visualization selection process [22], which must cope with the complexity of the underlying data or system. Business users with insufficient skills and knowledge about the data and/or visualization are an example of such situations. Knowledge of both the data and the visualization technique is highly desirable to build a suitable visualization [6]. This perspective necessitates an automatic visualization framework that predicts a visualization technique for specific data with high accuracy. The potential system must have knowledge not only of the data but also of the visualization tasks. This will make the tasks to be executed by the user easier and enhance productivity and decision making.
As an intimately connected problem, the selected visualization must also be perceptually pleasing and aesthetically appealing to human cognition. However, creating such a visualization is not a trivial task. Generally, these quality aspects of visualization are inherently subjective and vary with context [12, 23]. In the visualization community, controlled experiments and user studies remain the main evaluation methods for comparing different visualization types and tools [24]. In the literature, various theoretical visualization metrics have been discussed and proposed to evaluate visualization techniques [25, 26]. Such situations need an automated framework to evaluate and create an optimal visualization based on the existing knowledge. The computational model must be established on the basis of existing theories from the empirical research conducted over the years. Advancements in several divergent yet related fields may be utilized, e.g., HCI, cognitive science, interface design, computer graphics, and operations research. The intended system will then be able to display better visualizations created without the intervention of human beings.
To explore and comprehend a complex entity such as software, visualization is the more appealing option [13, 27-29]. Information visualization techniques are already used to explore different aspects of software from a static perspective [30] and to analyse the runtime behaviour of software [13]. Most object-oriented applications are built from reusable components known as libraries and APIs. Java-based application developers use APIs to store program data and variables during execution [31]. Since the efficient usage of these APIs has an impact on the performance of a program, developers need to understand the runtime behaviour of their programs. Usage patterns and source-code location information, e.g., the packages or classes responsible for creating a large number of objects, help in program comprehension and maintenance [32, 33]. Dynamic bytecode instrumentation is a method to extract information about the state of a program during execution [34]. However, this method has its own limitations, since performance degrades due to the instrumentation code [35]. Additionally, the runtime information needs to be properly analysed for insightful patterns. The situation demands a twofold solution: first, a lightweight instrumentation to extract runtime information about the APIs used by a Java application; second, a suitable visualization that assists developers in effectively analysing the large amount of traced information with minimum effort.
Hence, this work provides solutions covering the above-mentioned aspects of the problem statement. For appropriate visualization selection, a metadata-based ANN approach is provided. The proposed approach relieves the user of the complexities of the dataset and the visualization selection process. The model is trained and tested for the accuracy of visualization selection using a dataset and the tasks to be performed. This makes the users' tasks easier and enhances productivity by augmenting decision making. Furthermore, the proposed visualization metrics and computational intelligence-based framework provide better visualizations. The proposed visualization metrics are based on existing theories and knowledge, which provide a strong basis for the computational model. The proposed solution can be utilized to evaluate visualizations computationally and obtain a better visualization for human decision making. Moreover, the selective instrumentation and visualization-based tool enables developers to extract Java collection API information and provides a visualization to gain insight. The visualization tool helps developers analyse a large amount of data effectively.
1.3 Primary research questions
The research presented in this dissertation has three aspects. The first is the automatic prediction of a visualization for a particular dataset based on metadata and the tasks that the user needs to perform. The second is to build perceptually better visualizations using optimal design attributes. The third is to use information visualization techniques and bytecode instrumentation to investigate Java API usage during program execution. Based on this
background and the motivation, the following research questions are
formulated to carry out this work:
RQ-1: What are the important characteristics of a dataset that influence the selection of a visualization technique?
RQ-2: How can metadata and a particular task related to the data be used to predict a visualization for a dataset?
RQ-3: What is the best CI model to predict a visualization based on metadata?
RQ-4: Which aesthetic and perceptual design parameters are important for a specific visualization?
RQ-5: How do visualization features and design parameters map to the visualization metrics?
RQ-6: How can the visualization metrics be computationally evolved to optimize the layout of a visualization technique?
RQ-7: Which types of Collection APIs are frequently used by a given/target Java program during execution?
RQ-8: Which packages/classes/methods of the target program are responsible for instantiating Collection API objects?
RQ-9: How can dynamic bytecode instrumentation be used to extract API object traces with minimal runtime overheads?
RQ-10: Can treemap-based visualization be utilized for the analysis of the Collection API objects of a particular Java program?
1.4 Research Hypotheses
Research hypotheses are high-level general statements about the research. In this dissertation, the following research hypotheses are formulated:
H1: A computational intelligence-based model can be used to automatically predict a visualization for a specific dataset with a relatively high degree of accuracy.
Figure 1.1 Block diagram depicting the system components
H2: An evolutionary computation-based optimization framework can provide perceptually and aesthetically better visualization design parameters.
H3: Selective bytecode instrumentation with treemap visualization can be used for understanding Java API usage.
1.5 Aims and objectives
This dissertation aims to contribute towards information visualization
techniques using computational intelligence (CI) and software
instrumentation from two perspectives. The first is the automatic prediction of a visualization for a particular dataset by utilizing its metadata and the tasks to be performed on the data. This is followed by the optimization of the visualization, to build a perceptually appealing and aesthetically pleasing result. This part of the work is accomplished using computational intelligence techniques, specifically artificial neural networks (ANNs) and evolutionary algorithms (EAs). An ANN is tuned and tested for the prediction of a visualization based on metadata and the visualization tasks, and an EA-based framework is developed to determine the optimal set of design parameter values to build comparatively better visualizations. The second perspective uses a visualization technique along with bytecode instrumentation to analyse and comprehend API usage in Java applications. A lightweight bytecode instrumentation is developed to extract the runtime usage of API objects in Java programs with minimal overheads. A treemap visualization is then devised to display the vast amount of information on a single screen. Figure 1.1 shows the working and interaction of the various components of this work using a block diagram. The following objectives are formulated to achieve the research aims:
- Investigate the existing literature to find the intrinsic properties of a dataset that may be used to select an appropriate visualization method.
- Review information visualization techniques for specific types of tasks on a dataset.
- Develop a novel dataset based on metadata, tasks, and visualization techniques.
- Develop a CI-based model to classify and predict a visualization for a dataset and intended task with a high degree of accuracy.
- Test the CI model against other well-known classifiers and state-of-the-art approaches.
- Develop and formulate visualization metrics from the existing knowledge.
- Investigate the perceptual and aesthetic design parameters for a particular visualization.
- Design an EA-based framework to determine the design parameters of a visualization technique.
- Develop a bytecode instrumentation tool to extract Java API object usage information during the execution of a program.
- Develop a treemap-based visualization interface to analyse Java API objects on the basis of their types, packages, classes, and methods.
- Design a controlled experiment and case studies with benchmark tasks to evaluate the instrumentation and visualization results.
1.6 Research Methodology
A stepwise methodology is adopted to carry out this research and to answer the formulated questions. Several areas of research are investigated to establish a strong foundation for the proposed techniques; the broader areas include information visualization, automated visualization selection, intrinsic properties of datasets, and computational intelligence-based classifiers. This investigation informed the experiments for research questions RQ-1 to RQ-3. Theories on visualization metrics, on the perceptual and aesthetic design aspects of visualization and computer graphics, and on soft-computing techniques are carefully investigated to form a foundation for research questions RQ-4 to RQ-6. Further, to address RQ-7 to RQ-10, an exhaustive review of research on software visualization tools, dynamic analysis, Java API usage analysis, and evaluation methods is carried out.
For the visualization prediction module, a novel dataset is built based on the knowledge presented in the literature, as previously no benchmark dataset was available for visualization classification. However, already established visualization techniques, e.g., line charts, parallel coordinates, and treemaps, are known to be more suitable for certain tasks. Commonly used tasks are extracted with their intended visualization techniques and combined with the metadata of the datasets to be visualized. The metadata consists of intrinsic properties, i.e., dimensions, number of instances, number of attributes, and data types. The newly created dataset is then classified using an ANN-based classifier. Several ANN models and training methods are tested. The ANN model is compared with well-known classifiers, including support vector machines (SVM), random forests (RF), and decision trees (DT), using benchmark performance metrics. The proposed system is also compared with state-of-the-art systems.
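To make the metadata-to-visualization mapping concrete, the following Python sketch classifies invented metadata records with a simple nearest-neighbour rule; the features, tasks, and labels are illustrative stand-ins for the actual dataset and the ANN classifier described above:

```python
import math

# Illustrative records: (rows, attributes, is_hierarchical, task) -> label.
# These examples are invented, not the dissertation's benchmark dataset.
TRAIN = [
    ((100, 2, 0, "trend"), "line chart"),
    ((5000, 8, 0, "correlation"), "parallel coordinates"),
    ((1200, 3, 1, "part-to-whole"), "treemap"),
    ((300, 2, 0, "trend"), "line chart"),
]

TASKS = {"trend": 0, "correlation": 1, "part-to-whole": 2}

def encode(meta):
    """Turn a metadata tuple into a numeric feature vector; categorical
    features are scaled up so they dominate the distance."""
    rows, attrs, hier, task = meta
    return [math.log10(rows), attrs, hier * 10, TASKS[task] * 10]

def predict(meta):
    """1-nearest-neighbour over encoded metadata: a stand-in for the
    trained ANN classifier."""
    x = encode(meta)
    nearest = min(TRAIN, key=lambda item: sum(
        (a - b) ** 2 for a, b in zip(x, encode(item[0]))))
    return nearest[1]
```

The real classifier is an ANN trained on the constructed dataset; the point here is only that a metadata vector plus an encoded task can index into a set of visualization labels.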
The next step involves the exploration of metrics for the optimization of a visualization layout. Various theories on quantifying visualization are presented in the information visualization literature; however, the subjective nature of these theories makes it challenging to devise a set of metrics. The perceptual and aesthetic design parameters of a particular visualization are mapped to the proposed visualization metrics. The mapping process is based upon the theories and knowledge presented over the years within the domains of information visualization, human-computer interaction (HCI), interface design, and psychology. An EA-based technique is developed to derive optimal design parameter values. The EA is fed with a random initial population, and the fitness function and genetic operators are used to search for the best solution. In addition, the outcome needs to be compared with contemporary research; for this purpose, the internal metrics are combined with external evaluation criteria. The evaluation process is followed by user studies and statistical analysis of the results collected during this process.
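The EA loop just described (random initial population, fitness evaluation, selection, crossover, mutation) can be sketched as follows; the two design parameters and the fitness function are invented placeholders, not the metrics proposed in this dissertation:

```python
import random

rng = random.Random(42)

def fitness(genome):
    """Stand-in aesthetic metric: prefers a font size near 12 and a
    node spacing near 0.5. A real system would plug in the proposed
    visualization metrics here."""
    font, spacing = genome
    return -(font - 12.0) ** 2 - 50 * (spacing - 0.5) ** 2

def evolve(pop_size=30, gens=60):
    # Random initial population of (font size, spacing) genomes.
    pop = [[rng.uniform(6, 24), rng.uniform(0, 1)] for _ in range(pop_size)]
    for _ in range(gens):
        def pick():
            # Tournament selection: best of three random individuals.
            return max(rng.sample(pop, 3), key=fitness)
        nxt = []
        while len(nxt) < pop_size:
            a, b = pick(), pick()
            child = [(x + y) / 2 for x, y in zip(a, b)]  # arithmetic crossover
            if rng.random() < 0.3:  # Gaussian mutation on one gene
                i = rng.randrange(len(child))
                child[i] += rng.gauss(0, 0.5)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve()
```

The same skeleton applies regardless of how many design parameters are evolved; only the genome length and the fitness function change.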
Figure 1.2 A visual roadmap of the dissertation
The final step in this methodology is the development of a visualization system and a bytecode instrumentation tool to extract and analyse the Java API objects instantiated during program execution. A modular strategy is adopted for this purpose; the bytecode instrumentation tool is developed with the aim of avoiding runtime overheads, which would otherwise degrade the performance of the targeted applications. The instrumentation module adds probes to the Java application, and API object information is stored in a log file. The log file is then used for post-mortem analysis of the APIs based on a treemap visualization. The treemap visualization module takes an extensible markup language (XML)-based tree representation of the log file as input. The visualization interface provides several sub-views to explore the log file for API package, class, method, and data-type information. The evaluation of the tool is also twofold: the instrumentation module is evaluated through case studies of large Java applications as well as a benchmark suite, while the visualization method is evaluated using a controlled experiment. Figure 1.2 shows a visual roadmap of this dissertation.
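The way a hierarchical API log maps onto screen space can be illustrated with the classic slice-and-dice treemap layout, sketched below in Python; the nested dictionary imitates a log (package -> class -> object count) with invented numbers and is not the tool's actual XML format:

```python
def weight(node):
    """Total object count beneath a node (leaf counts are numbers)."""
    return node if isinstance(node, (int, float)) else sum(
        weight(v) for v in node.values())

def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Minimal slice-and-dice treemap: children split the parent
    rectangle along alternating axes, proportionally to their weight."""
    if out is None:
        out = []
    if isinstance(node, (int, float)):
        return out
    total = weight(node)
    offset = 0.0
    for name, child in node.items():
        frac = weight(child) / total
        if depth % 2 == 0:  # split horizontally at even depths
            rect = (x + offset * w, y, frac * w, h)
        else:               # split vertically at odd depths
            rect = (x, y + offset * h, w, frac * h)
        out.append((name, rect))
        slice_and_dice(child, *rect, depth + 1, out)
        offset += frac
    return out

# Hypothetical log: two packages, three classes, 250 objects in total.
log = {"java.util": {"ArrayList": 120, "HashMap": 80}, "app": {"Cache": 50}}
rects = slice_and_dice(log, 0.0, 0.0, 1.0, 1.0)
```

Each rectangle's area is proportional to the number of objects beneath it, which is what lets a treemap show an entire log on a single screen; the actual tool builds on the same principle.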
1.7 Assumptions
Every research effort has certain limitations, and the work is usually carried out under several logical assumptions. The work presented in this dissertation is likewise subject to certain assumptions and a few limitations. The controlled experiment and user study are performed under the assumption that all participants are volunteers and honest in their judgment about the tasks. Additionally, it is assumed that the participants have the required experience and are truthful in providing the required information. For the development of the dataset and the visualization metrics, and for exploring visualization design parameters for optimality, it is assumed that the existing knowledge and experiments were produced even-handedly. The development of optimal visualizations is carried out from a general perspective, not subject to individual preferences.
1.8 Dissertation Outline
The remainder of the dissertation consists of five more chapters. Chapter 2 reviews the contemporary work related to this dissertation from several aspects, i.e., software visualization using dynamic analysis, tool design, visualization evaluation methods, automatic visualization selection, visualization optimization, visualization metrics, and computational intelligence approaches to automatic prediction, dataset classification, and design optimization.
Chapter 3 deals with the concept of automatic prediction of visualization techniques using computational intelligence. The chapter presents the methodology for the creation of the prospective dataset used for training and testing an ANN model. The training and testing of several ANN models are elaborated in detail, including different network parameter settings. The chapter also covers an extensive evaluation of several classifiers in comparison with the ANN model. The classifiers are evaluated using well-known performance metrics, i.e., accuracy, F-measure, and the confusion matrix. The discussion and comparison with the state of the art elucidate the advantages of the proposed approach.
Chapter 4 describes a major contribution of this dissertation: information visualization optimization using bio-inspired evolutionary algorithms. The theories and existing literature on information visualization are formalized to explore and devise metrics that quantify visualization quality. The proposed visualization metrics and their mapping to a specific visualization technique are presented. The chapter also includes an extensive study of EAs and their main components. A detailed experimental case study and comparison show the effectiveness of the proposed solution.
Chapter 5 presents details about the first contribution of this work: the visual analysis and comprehension of API usage in large Java-based applications. A comprehensive study on the subject is presented. It covers the basic methodology to build the instrumentation and visualization tool, its key components, and its modules. A case study on large Java applications is devised to verify the tool's results on real applications. This is followed by an empirical evaluation of the visualization part through a controlled experiment. Several statistical tests are applied to validate the results of the experiment and confirm the hypotheses. The effectiveness of the instrumentation tool regarding runtime overhead is checked on standard benchmark suites. The results are tested and compared against state-of-the-art approaches.
Chapter 6 draws conclusions from this work and articulates the major contributions and limitations of the study. The chapter revisits the research questions and summarizes to what extent the dissertation succeeds in answering them. Promising future directions are also elucidated to extend this work along several lines.
1.9 Chapter Summary
This chapter has laid the foundation for the rest of the dissertation. The
motivation and problem statement highlighted the context in which the
actual problem is being solved using the proposed work. Following these, hypotheses and research questions were formulated to fill the knowledge gap through their answers. Moreover, the aims and objectives set for the dissertation, and the methodology to achieve them, were elaborated. The basic assumptions made while carrying out this research were also described. The last section outlined the dissertation chapter-wise, with a short description of each chapter. The next
chapter presents a comprehensive review of contemporary theories and
past research closely relevant to this dissertation.
Chapter 2
Literature Review
“There is always a better way.”
Thomas Edison
The work in this dissertation focuses on three aspects of visualization: automatic prediction, optimization, and the use of visualization for the comprehension of dynamically collected data about Java application programming interfaces (APIs). This chapter reviews the literature pertaining mainly to these three areas of research. An exhaustive investigation of the relevant work is presented, with a focus on contributions, opportunities, and the identification of knowledge gaps. Further, the major contributions of this dissertation are discussed, with their key distinctions from the contemporary work.
Information visualization is a powerful technique for the visual representation of data with the help of computer-based systems. The field, which combines several related areas, is rich with new tools and well-established theories. Visualization tools are built keeping various aspects in view, including the ability to cope with large volumes of data, effective data handling, support for various data-related tasks, and an easy-to-use interface. Likewise, theories are developed to provide a basis for new techniques and algorithms consistent with the existing knowledge. Advances in information visualization have given birth to new aspects and research avenues, such as visual analytics, visual data mining, and visual data
classification.

Figure 2.1 Related work taxonomy: dynamic instrumentation and trace visualization (bytecode instrumentation, software visualization, API analysis); automatic visualization selection (theoretical visualization taxonomies, automated visualization systems); and visualization optimization (soft computing-based optimization, visualization metrics, treemaps).

The work presented in this dissertation is concerned mainly
with three areas of information visualization, i.e., automatic visualization prediction, visualization optimization, and software visualization with dynamic bytecode instrumentation. Additionally, the dissertation covers related work in program component analysis, API usage analysis, computational intelligence techniques, dataset creation, visualization metrics, and optimization using evolutionary algorithms. A comprehensive analysis of the relevant theories and contemporary approaches is presented, along with their limitations.
The rest of the chapter is organized as follows: Section 2.1 covers related work on dynamic analysis and software visualization. Automatic visualization selection is discussed in Section 2.2, while the literature on visualization optimization and visualization metrics is presented in Section 2.3.
2.1 Dynamic code analysis and software visualization
This section covers the literature on dynamic code analysis and software visualization from the perspective of this dissertation. The literature that combines aspects of dynamic analysis and visualization is the most important for building the basis of this work. Additionally, within the domain of dynamic code analysis, only bytecode instrumentation is discussed, since this work performs only bytecode instrumentation to extract the runtime information of a Java program. Work relevant to API usage analysis is also analysed and reviewed.
Software visualization is a technique to visually depict software at different levels for effective understanding and to manage the code's complexity. It is an important activity for the maintenance, reverse engineering, and re-engineering of software [36]. Software has different aspects, i.e., the development lifecycle activities, the source code view, the software architecture, and the runtime behaviour. In the literature, a considerably large body of work has been proposed to visualize these different aspects of software. Cornelissen et al. [13] analyse the importance of dynamic analysis in program comprehension and discuss the growing importance of visualization for understanding the runtime behaviour of software. Caserta et al. [4] present a comprehensive survey on visualization related to the static aspects of software and its evolution. More recently, Shahin et al. [28] conducted a comprehensive review of software architecture visualization based on articles published between 1999 and 2011.
Pauw et al. propose Jinsight [27], a software visualization tool to explore the runtime behaviour of Java programs. The tool provides two visualization views, i.e., a histogram view and an execution view, to gain insight into a Java program. Jinsight captures several program events, including object construction/destruction and method invocation, using a Java virtual machine (JVM) instrumentation technique. However, Jinsight [27] is directed more towards performance analysis than program comprehension, and as the number of traces increases, the visualization becomes difficult for the user to understand and patterns become hard to comprehend.
Gammatella is a visualization tool presented by Jones et al. [37] to explore program data visually. The tool synthesizes program trace data for visualization, based on the treemap technique, where program execution data is shown at several granularity levels, i.e., the system level and the file level. The tool provides an interaction facility through which a user may visualize regions of the program's trace data. However, Gammatella has a limitation when it comes to visually showing objects per package or per class, since large applications require an information hierarchy to visualize. The tool also needs to be evaluated for effectiveness and usability; at present, only a case study is presented in its support. Wu et al. propose a novel
visualization based on their object lifetime behaviour model for Java program objects [18]. The lifetime behaviour model (LBM) is presented to capture and model the events of a specific Java object in a temporal manner. The authors present a prototype tool that gathers Java program traces at the object level using the Java virtual machine profiling interface (JVMPI) and then uses several visual views to gain insight. The visualization is based on a tree-type structure and consists of three views: a thread-oriented view, a method-oriented view, and an interaction-oriented view. The authors also propose an object performance measurement model and a visualization based on the several states a program object passes through during execution. Nonetheless, the visualization technique needs to be evaluated to determine its effectiveness and usability, and the approach does not provide a global view of a Java program, which could help users view complete information on a single screen. Moreover, the object performance model in [18] needs a complete demonstration, and its visualization may be further examined. Reiss et
demonstration and its under visualization may further be examined. Reiss et
al. [38] present an online visualization system for an executing program
based on various program components. Their visualization scheme uses
different abstraction levels to depict program behavior while it is
executing. The technique consists of two visualization views, JIVE and JOVE,
used for program comprehension, debugging, and performance analysis. The
JIVE visualization presents program behavior in terms of classes, packages,
and threads in a single view. The JOVE part focuses on program code and
how it executes at runtime. The major limitation of this approach is the
difficulty of visualizing the entire course of execution. However, the
visualization is effective in showing the program component hierarchy. At
the same time, it has not been evaluated empirically. Vasco, an
interactive visualization tool, has been presented by Duseau et al. [39].
The tool assists developers in discovering program behavior and in
understanding temporary-object issues, known as object churn. The tool
provides a flexible and scalable approach that enables developers to quickly
identify the source of object churn in framework-intensive applications.
Their visualization is based on Sunburst, a popular technique using a
tree-like structure. The developer is provided with different views of the
trace data with little cognitive strain. They also report the tool's
application to three framework-intensive software applications. However,
their work does not utilize a formal approach to evaluate the visualization,
and the hierarchical information about the source of object churn is not
clear in the depicted visualization. Heapviz [40] is another offline
approach that uses dynamic
analysis to extract program data structures and interactive visualization to
gain insight effectively. The technique captures snapshots of the program
heap and uses graph-based visualization to present a global view. The
technique is evaluated using case studies and benchmark tools. The tool
summarizes the objects created and uses this information for program
comprehension and debugging. However, graph navigation is not trivial,
particularly when the graph becomes too large. Moreover, some works, such as
[41] and [42], present visualization tools for pedagogical purposes, to
teach how program data structures are used at execution time. However, such
tools cannot be applied to large software applications.
Treemap [43] is a popular space-filling visualization technique for
hierarchical information. Since its inception in 1991 by Shneiderman [43],
the treemap has been used to visualize hierarchical information in a variety
of domains, including business [44] [45] [46], news media [47], software
[48], and medicine [49]. Originally, the treemap was proposed for hard disk
content visualization, to quickly analyze the disk space occupied by various
users. Several variations of the treemap's original slice-and-dice layout
algorithm have been proposed over the years [50] [51] [48] [52]. Treemap
visualization divides the entire screen into nested rectangles, which may be
rendered as squares. Each rectangle corresponds to a node in the data
hierarchy. The innermost rectangles represent leaf nodes, while the other
nodes are represented by enclosing rectangles, as shown in Figure 2.2.
Several dimensions of the original data are conveyed through the color and
size of the rectangular regions.

Figure 2.2 A tree with its corresponding treemap
In the literature, several treemap-based applications for user information
have been presented over the last two decades. In [24], the authors propose
a treemap-based visualization and user interface, called ResultMaps, for
digital library repository search. The technique has been tested in two lab
experiments, both of which showed improved repository-search visualization.
TreeCovery, another treemap-based visualization interface, has been proposed
by Rios et al. [53]. The tool helps government agencies to effectively
monitor the distribution of funds among various departments. The authors
have further enhanced the treemap with zooming, feature highlighting, and
item filtering.
Figure 2.1 Typical compositions of a Java program
The Java collections framework (JCF) was introduced with Java 2 to provide
an efficient mechanism for handling program data structures such as
Hashtable, List, Map, and ArrayList [54]. The JCF is part of the core Java
API package java.util and consists of a group of classes and interfaces
that standardize the handling of data as a single unit, the collection.
Most of the collections in Java are derived from the Collection interface
provided in the java.util package. Java developers heavily use these
collection APIs, since they provide a convenient way to handle different
data structures without the burden of going into the details. All the
operations a programmer typically performs on data, such as searching,
sorting, and insertion, are available through the JCF.
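The operations just listed can each be performed through the JCF in a line or two. The following minimal sketch uses only java.util types; the class and method names are our own, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Minimal sketch of insertion, sorting, and searching through the JCF.
public class JcfDemo {
    // Insertion and sorting: copy the elements into an ArrayList
    // (insertion via the Collection-copy constructor), then sort the copy
    // with the Collections utility class.
    public static List<Integer> sortedCopy(Collection<Integer> data) {
        List<Integer> list = new ArrayList<>(data);
        Collections.sort(list);
        return list;
    }

    // Searching: a Map lookup by key, the JCF's associative search.
    public static boolean hasKey(Map<String, Integer> index, String key) {
        return index.containsKey(key);
    }
}
```

The point of the sketch is the one made above: the programmer works against the Collection and Map interfaces and never touches the underlying data-structure implementation.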
Another aspect of the proposed work is identifying program components and
their usage during execution, at different granularity levels. In the
contemporary literature, a large volume of work has been proposed to
extract and analyze runtime program components [55] [56] [57]. A
comprehensive survey has been conducted by Robillard et al. [58] to
systematically categorize API usage and analysis techniques. The
authors include over sixty techniques in their study and organize them into
five categories based on API usage, migration, and behavior specification.
Zaidman et al. [59] propose a mining technique for the automatic
identification of a program's key classes using dynamic coupling and web
mining approaches. Case studies show that the technique is able to identify
most of the key classes of a program. The basic objective of the work is
code comprehension, so that developers can easily get a snapshot of the
complete software. However, the technique is limited to key classes only,
and it is not supported by a proper visualization through which the
information could be gained effectively. Kawrykow et al. present a novel
technique [55] to automatically detect API imitation using static code
analysis. The technique analyzes the target program to identify code
segments that imitate library methods. Moreover, the imitations are grouped
to identify potential API usage patterns. They use this approach to
discover API usage that needs improvement and to provide recommendations. A
key limitation of this technique is that the user cannot detect all APIs;
only those that need improvement are reported.
Souza et al. [60] adopt a different approach to evaluate API usage in a
program. The tool presented in their work is twofold: it assesses API
usability using complexity metrics and presents the information to the user
through a visualization module. They use software metrics instead of
usability methods to compute API complexity. The visualization module is
based on the treemap technique, where each API's complexity is shown in a
different color, while the hierarchical information depicts the method,
class, and package of an API. Additionally, they use a star plot to show
the classes in a package; each star depicts several metrics, and a longer
arm of a star indicates that the corresponding class is more complex.
However, the tool still needs proper evaluation via testing with standard
benchmark tools.
Mileva et al. [61] utilize an approach based on the "wisdom of crowds" to
analyze the popularity and usage patterns of APIs. They carry out a static
analysis of over 200 open source projects from SourceForge and Apache to observe the
frequency with which an API is used across projects. They developed a
prototype tool to analyze the information and plot the results. This type
of information would assist a developer in discovering whether a particular
API is used by other developers or whether its usage has declined due to
other issues. Nevertheless, their approach is based on static analysis; the
API's behavior and performance during program execution are not considered,
so the developer cannot learn the actual reason for an API's abandonment.
Lämmel et al. [56] propose a scalable technique for API usage analysis of
open source Java projects with a large code corpus. Based on the abstract
syntax tree (AST), they present an automatic approach whose steps include
checkout, tagging with metadata, analysis, and synthesis. The authors are
more motivated towards API migration with a high degree of relevance and
applicability. However, the presented technique is yet to be tested on
large commercial programs. Yet another static approach has been presented by
Bauer et al. [62] to analyze the dependencies of a project on externally
used third-party APIs. The technique extracts information from the source
code and then uses visualization to help the user gain insight. The project
searches for API usage through an AST-based approach using the Eclipse Java
compiler, where each class of a project is inspected for a number of API
calls. Moreover, a visualization module is used to support the user's
decision making about API dependencies during software maintenance. The
visualization shows the total number of calls from a project's source code
to various external APIs, with the packages depicted in decreasing order of
API calls. The tool is evaluated using a case study of three software
applications with around one thousand lines of source code. The approach is
effective in quantifying the APIs of a software system; however, the actual
behavior of APIs is visible only from execution-time snapshots.
Additionally, the visualization part needs to be evaluated through user
studies.
Khan et al. propose an object invocation-based model and a clustering
technique to find inefficient API usage in Java programs [20]. Their
approach extracts runtime information from a Java program using
instrumentation. Moreover, hierarchical agglomerative clustering is
utilized to classify the trace data and identify the code locations
responsible for inefficient API usage. The approach is evaluated with a
single case study only. A different approach to this topic has been adopted
by Moritz et al. [63] to automate the process of API usage identification
and visualization. The major theme of their work is the representation of
software as a relational topic model; moreover, a document network is used
to depict the API calls. ExPort is an interactive tool based on this model
that lets the developer search the code visually according to their needs.
A call-graph view of the visualization assists the developer in searching
APIs effectively and efficiently, and the technique enables the developer
to find API usage across several functions of the software. The authors
claim that a database was created using over 13000 open source projects,
while the prototype tool was demonstrated on two software systems
consisting of over 3700 methods. However, their research has some
limitations, as the tool was not evaluated for usability or effectiveness.
The tool also requires an understanding of the software's architecture and
structure to formulate queries. Furthermore, the tool is based on static
analysis and does not reveal the runtime behavior of these APIs. Recently,
Said et al. [64] proposed a generalized technique for mining multi-level
API usage patterns based on source code analysis. The objective of this
technique is to analyze the different methods of a particular API that a
client program collectively uses. A clustering algorithm is used to form a
hierarchical structure for the API's documentation. The technique is
evaluated with four APIs, using 22 client programs for each API. However,
the technique is limited by its static approach.
Recently, Caserta et al. [21] proposed JBInsTrace, a tool for Java program
profiling and analysis based on bytecode instrumentation. They use a
fine-grained technique to trace the classes of a Java program, including
the Java runtime environment (JRE) classes. The tool traces the program at
the basic-block level and extracts runtime information. Furthermore, this
runtime information is then combined with statically collected data to perform
a detailed analysis of a program. The authors claim that the tool runs with
reasonable runtime overhead. Nevertheless, the tool has some limitations:
including the JRE classes in trace extraction makes the data huge, which
must be handled for full program analysis. Moreover, the authors did not
focus on how to effectively analyze the information to perform detailed
analysis. The tool is demonstrated on only five Java programs to evaluate
its performance, and the actual trace information is not reported. A more
recent work in this area has been presented by Lengauer et al. [33]. They
present a detailed report on the design and implementation of an
instrumentation technique to extract runtime information about Java
objects. The authors propose a lightweight memory monitor to trace object
allocations and de-allocations; to this end, the Java virtual machine
itself is monitored. A novel technique keeps the runtime overhead minimal
through a compact binary trace. Moreover, they also propose a method to
reconstruct the omitted runtime information for a specific trace. An
offline analysis of the collected trace builds the layout of the Java heap
at different timestamps. The performance is evaluated using over 30
benchmark tools. However, the tool is not evaluated for usefulness.
Additionally, visualization support for effectively gaining insight into
the large volume of information is also missing. Yin et al. [65] present an
instrumentation tool called PAST. They formulate an approach based on
probes at the abstract syntax tree (AST) level to identify the locations
that need to be tracked in the fully optimized target program. Their
technique works in contrast to instruction-level instrumentation, where
debugging information is used to trace the program. A prototype tool
implements their technique using both offline and online instrumentation.
The work presented in this dissertation focuses on capturing data related
to object instantiation during program execution, and it also provides a
visualization of the runtime data for a better understanding of the
software's code. The previous work discussed in this section has two major
limitations: first, most of it was not based on dynamic analysis, and second, no
suitable visualization was provided. In contrast, this work aims to capture
the dynamic data of programs and to provide a visualization that assists
the programmer more effectively. The core difference from the prior work is
that the object-creation hierarchy of APIs was previously overlooked. The
specific goal of this work is to help programmers visually understand which
packages and classes of a Java program are responsible for creating
collection-type objects at runtime, and to further help them in program
understanding and performance analysis. Instrumentation is used to collect
the runtime information of the software, without degrading the execution
time of the software being traced.
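The kind of per-creator counts such instrumentation yields can be illustrated with a plain wrapper. This is only a stand-in sketch, with all class and method names invented for the example; the actual approach uses bytecode-level instrumentation and does not require routing allocations through a factory like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for an instrumentation probe: it records which
// caller created how many collection objects, the raw material from which
// a per-package/per-class creation hierarchy can be visualized.
public class AllocTracker {
    private static final Map<String, Integer> counts = new ConcurrentHashMap<>();

    // Record which class requested the collection, then hand one back.
    public static <T> List<T> newList(String callerClass) {
        counts.merge(callerClass + "->ArrayList", 1, Integer::sum);
        return new ArrayList<>();
    }

    // Sorted view of the counts, e.g., for building a treemap hierarchy.
    public static Map<String, Integer> snapshot() {
        return new TreeMap<>(counts);
    }
}
```

After a traced run, the snapshot holds entries such as a hypothetical "com.example.Foo->ArrayList" mapped to its creation count, which is exactly the shape of data a treemap over packages and classes needs.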
2.2 Automatic visualization prediction
Visualization techniques refer to the creation of visual images and
animations to communicate information effectively and efficiently.
Visualization techniques can be divided into two types: scientific
visualization and information visualization. Scientific visualization
depicts some physical phenomenon, e.g., surface visualization [66], flow
visualization [67], or volume visualization [68]. Information visualization
techniques focus on abstract data, e.g., software visualization [35] [69],
security visualization [70], and network visualization [71]. In the
literature, several information visualization techniques have been
presented, including parallel coordinates, treemaps, and maps. The focus of
this research is on how to select an appropriate visualization technique
for a specific dataset. This section provides a comprehensive overview of
the related literature on information visualization, its classification,
and automatic prediction. Past research in the information visualization
domain shows several categorizations and taxonomies established on various
criteria. Shneiderman et al. [72] formulate a task by data type (TTT)
taxonomy of information visualization techniques. The taxonomy is based on
the data types of the dataset to be visualized, along with the tasks that may be
performed on the data using that particular visualization technique. The
author incorporates seven data types for the classification of
visualizations: 1-, 2-, and 3-dimensional data, temporal and
multidimensional data, and data having network or tree relationships.
Furthermore, seven tasks to which the data types are mapped have been used,
i.e., overview, zoom, filter, details-on-demand, relate, history, and
extract, known as the information seeking mantra. These tasks are defined
at an abstract level, and there may be additional tasks based on these
seven. However, their work is more theoretical and has not been used to
classify visualization techniques.

Figure 2.2 Information visualization techniques classification
Chi et al. [73] take a different approach to the classification of
visualization techniques by using the data state reference model. The data
state reference model consists of four data stages and three transformation
operators. This taxonomy is process-centric and operates along the
visualization pipeline, covering all the operators needed for a design. The
author shows the transformations involved in each state along the
visualization pipeline, from data values to visualization design, and
argues that the taxonomy is helpful in understanding the design space as
well as the application of visualization techniques in a broader sense across the
pipeline. Chi's taxonomy and reference model are more general and can be
utilized in both the scientific and information visualization domains. A
large number of examples from several visualization domains are provided to
explain the proposed taxonomy. Nevertheless, the taxonomy needs to be
updated as new visualization techniques are introduced. Keim et al. [74]
use other guidelines to classify information visualization and visual data
mining techniques. They suggest three criteria for classification, using
related examples: the data to be visualized, the visualization technique,
and the interaction and distortion techniques. The underlying dataset to be
visualized has a significant impact on the visualization to be used.
Consequently, the author categorizes datasets into six types:
one-dimensional, two-dimensional, multidimensional, text/web, hierarchical,
and software data. The visual display techniques, in turn, are divided into
five types: standard 2D/3D, geometrically transformed, dense pixel,
icon-based, and stacked displays. The user interaction techniques are also
utilized to classify visual techniques into six categories: standard,
dynamic projection, interactive filtering, interactive zooming, interactive
distortion, and interactive linking and brushing. However, despite the
clear explanation, the proposed rationale is more theoretical and needs to
be implemented in a real working system to automatically suggest
visualizations for future use. Marty [70] uses the relevant task, along
with the dataset and visualization technique, to identify an appropriate
visualization in a given context. The author argues that selecting a
visualization for a dataset with some intended task is not a trivial
undertaking. The selection depends on several factors, including the number
of instances in the dataset, the total number of attributes, and the data
types of the variables. Furthermore, the author identifies four types of
tasks (relationships, distribution, trends, and comparison) that also
contribute to the selection of a suitable visual technique. Marty's
contribution is directed more toward network security related data and
visualization tasks. However, the idea can be extended to other domains and
visualization techniques through rigorous study.
Furthermore, several visualization selection techniques have been suggested
in the literature. Lange et al. propose a prototype tool, Vis-Wizz [75], to
assist the user in the visualization selection process. Although the tool
targets scientific visualization of multi-dimensional datasets, the idea is
equally applicable to information visualization. The visualization
recommendation is based on data characteristics combined with the
visualization goals a user needs to accomplish. Furthermore, the system
also provides an evaluation of the resultant visualization against the
user's intended aims. The limitations of this system include the difficulty
of its user interface and the fact that the user is overwhelmed by the
mapping of data to visual attributes. Grinstein et al. [76] present a basis
for the benchmarking and evaluation of visualization techniques for
knowledge discovery and data mining. They provide general rules for further
advancement in the area of automatic visualization selection for a specific
dataset. The authors empirically evaluate five visualization techniques on
nine datasets to formulate general criteria for standard evaluation in data
mining using visual representations. Moreover, the nine datasets are
evaluated against different intrinsic features, considered as tasks
existing in each dataset, i.e., outliers, clusters, class clusters,
important features, possible rules, and exact rules. Similarly, the
selection of datasets for the study is based on complexity, attributed to
factors such as dimension, number of records, cardinality of dimensions,
and number of independent variables.
Nevertheless, the proposed benchmark rules need further improvement to
incorporate new visualizations, tasks, and datasets. Guettala et al. [77]
present an automatic process for visualization selection and optimization
in visual data mining. The selection of a suitable visualization is based
on a model of the underlying dataset combined with the user's objectives. A
prototype assistant tool provides several mappings from data attributes to
potential visualizations. The interface allows users to interactively
select a visualization based on simple heuristics. Once the visualization
is selected, the system uses an interactive genetic algorithm to optimize the visualization for
effectiveness. This allows the user to customize the selected visualization
and improve the mapping through visually supported interactive interfaces.
However, the system is non-trivial for a user with limited knowledge of
visual mapping.
Laaser et al. [78] propose a rule-based system for automatic visualization
selection based on data and corresponding metadata. The system uses
predefined rules to select the most suitable visualization, and its
properties, for a specific dataset. The representation of rules in the
system is flexible and perspective-dependent; consequently, the
visualization selected for the same dataset may differ. Several properties
are incorporated as rules in the system, i.e., the number of columns, the
total number of values to be visualized, and the data types in the dataset.
Furthermore, some other characteristics of the dataset, i.e., data
properties, dimensions, and relationships among the data values, are also
utilized. The system allows the user to override all rules or to select any
other visualization technique. However, the system requires domain
knowledge to design the rules, and designing and incorporating rules is not
trivial. Additionally, the system is provided with business-related
visualizations only; the addition of a new visualization will necessitate a
change in the mapping mechanism between rules and visualization techniques.
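To make the mechanism concrete, a toy rule set in the spirit of such a system might look as follows. The rules and technique names below are invented for illustration and are far simpler than the metadata-driven rules of [78].

```java
// Illustrative rule-based chooser: predefined if-then rules map dataset
// metadata (column count, hierarchy, temporality) to a candidate technique.
// The rules and technique names are toy assumptions for this example.
public class RuleBasedSelector {
    public static String select(int numColumns, boolean hierarchical, boolean temporal) {
        if (hierarchical) return "treemap";              // hierarchical data fits space-filling nesting
        if (temporal) return "line chart";               // time-indexed values fit a trend line
        if (numColumns > 4) return "parallel coordinates"; // many attributes fit parallel axes
        return "scatter plot";                           // default for few-attribute tabular data
    }
}
```

A user override, as described above, would simply replace the returned choice with the user's preferred technique; the fragility the text notes is visible even here, since adding a new technique means editing the rule chain itself.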
Recently, a visualization ranking-based approach has been presented by
Chronister et al. [79]. The basic idea is to put a human in the loop: users
select among a set of visualizations that the system provides. The set of
visualizations is ranked according to a fitness score, a measure of
suitability for the present context. The user then selects the most
appropriate visualization technique to apply to the dataset. The fitness
score is computed for a visualization using the metadata of both the
visualization and the dataset. The visualization metadata may include
properties such as the supported data types, while the dataset's metadata
includes characteristics such as the data types, the number of attributes,
and the number of instances.
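A metadata-matching score of this kind can be sketched in a few lines. The fields, the scoring rule, and the penalty below are invented assumptions for illustration; the actual scoring in [79] differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative metadata-matching ranker: each visualization advertises the
// data types it supports and a rough attribute limit; the fitness counts
// matches against the dataset's column types. All fields are assumptions.
public class VisRanker {
    public static class VisMeta {
        public final String name;
        public final Set<String> supportedTypes;
        public final int maxAttributes;
        public VisMeta(String name, Set<String> types, int maxAttributes) {
            this.name = name; this.supportedTypes = types; this.maxAttributes = maxAttributes;
        }
    }

    // +1 per dataset column type the visualization supports; a penalty if
    // the dataset has more attributes than the technique comfortably handles.
    public static int fitness(VisMeta v, List<String> columnTypes) {
        int s = 0;
        for (String t : columnTypes) if (v.supportedTypes.contains(t)) s++;
        if (columnTypes.size() > v.maxAttributes) s -= 2;
        return s;
    }

    // Rank candidates by descending fitness, as the system would present them.
    public static List<String> rank(List<VisMeta> candidates, List<String> columnTypes) {
        List<VisMeta> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> fitness(b, columnTypes) - fitness(a, columnTypes));
        List<String> names = new ArrayList<>();
        for (VisMeta v : sorted) names.add(v.name);
        return names;
    }
}
```

The human-in-the-loop step then happens on top of this ranking: the system proposes the ordered list, and the user picks from it.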
As evident from the above survey, and to the best of our knowledge, neural
networks have not yet been used for the classification and prediction of
visualization techniques. The most closely related work on finding an
appropriate visualization is available in [78, 79], where rule-based
systems are used. It has also been noted that selecting an appropriate
visualization technique beforehand is a great help in understanding the
data being visualized [80]. Based on this, the system proposed in this work
utilizes an ANN to select a particular visualization technique for a given
dataset.
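The core idea of feeding dataset metadata to a network and reading off a technique can be sketched as a single linear layer with an argmax readout. The feature choice, class labels, and hand-set weights below are toy assumptions standing in for a trained network; this is not the ANN developed in this dissertation.

```java
// Toy sketch: dataset metadata features go through one linear layer and the
// highest-scoring class is returned. The class labels are illustrative, and
// the weights would come from training in a real ANN, not be hand-set.
public class VisPredictor {
    static final String[] CLASSES = {"treemap", "parallel coordinates", "scatter plot"};

    // One linear layer (scores = W * features + b) followed by argmax.
    public static String predict(double[] features, double[][] w, double[] b) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < w.length; k++) {
            double s = b[k];
            for (int j = 0; j < features.length; j++) s += w[k][j] * features[j];
            if (s > bestScore) { bestScore = s; best = k; }
        }
        return CLASSES[best];
    }
}
```

With two metadata features, say a has-hierarchy flag and an attribute count, a weight row that responds to the hierarchy flag makes the hierarchical class win whenever that flag is set; training replaces such hand-tuning with weights learned from labeled dataset-visualization pairs.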
2.3 Visualization optimization
The study of information visualization is a major research area and has
been applied to an assortment of applications, ranging from biomedical data
to physics experiments. Quantifying a visualization and optimizing its
layout is an emerging subfield of information visualization. Since the
quantification of a visualization is subjective, the problem is complex.
This section reviews recent work on quantifying visualizations, information
visualization using treemaps, and techniques for optimizing visualizations.
Tanahashi et al. [81] propose an optimization technique to produce
aesthetically pleasing and legible storyline visualizations. Their
technique consists of two main parts: an algorithmic part, which performs
the layout design for the visualization, and a rule part, which improves
the aesthetics by adjusting the line geometry using commonly agreed rules.
To optimize the layout of a storyline visualization, the authors use three
types of quality metrics, i.e., line wiggles, line crossovers, and
white-space gaps between lines. Additionally, the visual quality metrics
are combined with the general design principles of storyline
visualization. A genetic algorithm-based
computation approach is implemented to find the optimal layout, using the
quality metrics as the fitness function. The output of this process is a
legible storyline visualization. Two properties of a storyline
visualization, the flow of the lines and their clarity, influence how
aesthetic and legible it is perceived to be; the techniques are therefore
applied to improve aesthetics and legibility by enhancing visual flow and
clarity. However, the technique is focused on the optimization of storyline
visualizations only.
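A fitness function of this kind can be written as a weighted penalty sum that the genetic algorithm minimizes. The weights and the way defects are counted here are illustrative assumptions, not the actual metrics of [81].

```java
// Illustrative weighted-penalty fitness for a storyline layout: a GA would
// minimize this over candidate layouts. Weights and defect counts are toy
// assumptions standing in for the paper's quality metrics.
public class StorylineFitness {
    // Lower is better: each term penalizes one kind of layout defect
    // (line wiggles, line crossovers, white-space gaps between lines).
    public static double fitness(int wiggles, int crossovers, double whitespaceGap,
                                 double wWiggle, double wCross, double wGap) {
        return wWiggle * wiggles + wCross * crossovers + wGap * whitespaceGap;
    }

    // Selection step of the GA prefers the layout with the lower penalty.
    public static boolean better(double fitnessA, double fitnessB) {
        return fitnessA < fitnessB;
    }
}
```

In a GA loop, candidate layouts would be mutated and recombined, each scored with this function, and the lowest-penalty layouts kept for the next generation.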
House et al. [82] propose another approach for optimizing complex
visualizations for perceptual and aesthetic properties. The method has two
stages. The first stage utilizes a genetic algorithm with a
human-in-the-loop strategy to explore a large visualization parameter
space; this stage ends with a large database of visualizations rated
according to an objective-function score. The second stage is devised to
extract optimal visualization parameters from the database built during the
first stage. Principal component analysis, neural networks, and clustering
techniques are used at this stage for data mining purposes. The second
stage results in guidelines for producing visualizations and strategies for
making them more aesthetic.
A semi-automated assistant tool for creating perceptually appealing
visualizations of complex multidimensional datasets is presented by Healey
et al. [83]. The tool, called VIA, consists of two engines: a search engine
and an evaluation engine. The search engine is based on a real-time search
algorithm that collects basic information about a dataset from the user and
utilizes it to generate a suitable mapping between data values and visual
attributes. A prospective user may guide the search engine to evolve the
current result toward a more expedient mapping. Furthermore, the evaluation
engine, actually a collection of several engines, is used to build a more
appropriate visualization for a dataset and its related tasks.
The evaluation engine uses knowledge from studies of human vision to weight
candidate mappings and find the optimal result. The main disadvantage of
this work is its limited set of visual features. Similarly, domain
knowledge is required to get started with the search engine, so the process
may be limited by the user's capability and knowledge of the problem.
The work presented by Marrero et al. [84] systematically reviews and
classifies several visualization metrics. The classification is based on
the dataset structure, the purpose of the visualization, and the user's
expertise and related tasks. Three types of visualization metrics are
discussed in their work: mathematical, user-centric, and visual-efficiency
metrics. Each of these metrics is based on features related to various
aspects of visualization. Such quantification criteria help in building
better visualizations, both subjectively and objectively.
Rigau et al. [85] proposed an aesthetic measure based on Shannon's
information theory [86] and Kolmogorov complexity [87]. The basic
concept is to relate the quality of art to its level of complexity. The work
revisits Birkhoff's aesthetic measure from an information-theoretic
perspective. Three measures from Bense's concept are taken into
consideration: the initial repertoire (the basic states), the palette used, and
the range of colors selected by the artist. This work also focuses on
quantifying aesthetic quality in paintings [85]. An analogous work by Li
et al. [88] evaluates the visual quality of paintings. Their work approaches
the problem with machine learning and a data-driven approach. The
technique starts by extracting features describing the visual quality of the
artistic work. An experiment is conducted to compare human survey
results with computational models that classify aesthetic paintings from
non-aesthetic ones. The authors argue that there is a relationship between
subjective human perception and the computational model of the problem.
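Birkhoff's classical measure relates aesthetic value to the ratio of order to complexity, and Rigau et al. estimate such terms with information-theoretic quantities. The sketch below is illustrative only, not their implementation: it uses the Shannon entropy of a discrete color palette as a simple complexity proxy. The `palette_entropy` helper and the toy color lists are assumptions.

```python
import math
from collections import Counter

def palette_entropy(pixels):
    """Shannon entropy (in bits) of a discrete color palette.

    Higher entropy indicates a richer repertoire of colors, which an
    information-theoretic aesthetic measure can treat as complexity.
    """
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A flat, single-color image has zero entropy; a varied palette has more.
print(palette_entropy(["red"] * 8))                     # 0.0
print(palette_entropy(["red", "green", "blue", "red"]))  # 1.5
```

A full aesthetic measure would combine such a complexity term with an order term; this fragment shows only how an entropy-based ingredient can be computed.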
An interesting work is presented by Huang et al. [12] to measure the
effectiveness of graph visualization in the context of human cognitive load
and visual efficiency. The main objective is to overcome the limitations of
performance-based measures in graph visualization. The model presents
two new measures of effectiveness: mental effort and visualization
efficiency. The model reveals that, in a visualization environment, there is
a strong relationship among the mental effort required for a task, its
memory requirements, and overall task performance. User studies have
been conducted to evaluate the model's cognitive measures on tasks of
various complexities.
Matvienko et al. [89] propose an evaluation method for dense vector field
(DVF) visualization based on an image visual quality metric. The
technique rests on the average similarity between the DVF visualization
image and the underlying dataset, i.e., the vector field. For a given vector
field, images are evaluated automatically using different parameter sets, a
variety of DVF methods, and quality-improvement strategies. The local
image gradient is taken as the visual quality measure to compare two
images. A user survey with 53 subjects has been conducted to evaluate
the effectiveness of the quality metric. Nevertheless, the technique focuses
on scientific visualization.
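As an illustration of a gradient-based quality measure, the sketch below averages local gradient magnitudes of a grayscale image using forward differences. This is an assumption for illustration, not Matvienko et al.'s actual metric; the `mean_gradient_magnitude` helper and the toy images are hypothetical.

```python
def mean_gradient_magnitude(img):
    """Average local gradient magnitude of a 2-D grayscale image
    (forward differences), used here as a rough stand-in for a
    gradient-based visual quality measure."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x + 1] - img[y][x]  # horizontal difference
            gy = img[y + 1][x] - img[y][x]  # vertical difference
            total += (gx * gx + gy * gy) ** 0.5
            count += 1
    return total / count

flat = [[0.5] * 4 for _ in range(4)]                     # featureless image
stripes = [[x % 2 for x in range(4)] for _ in range(4)]  # high-contrast image
print(mean_gradient_magnitude(flat))     # 0.0
print(mean_gradient_magnitude(stripes))  # 1.0
```

A featureless image scores zero, while a high-contrast image scores higher, matching the intuition that gradient strength correlates with visible structure.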
More recently, a study has been conducted by Lehmann et al. [90] to
examine the relationship between human perception and quality metrics.
The study applies seven quality metrics to various high-dimensional
datasets. More than 100 subjects performed the experiment, carrying out
tasks on three types of visualization: parallel coordinates, scatterplots,
and radial visualization.
Aydin et al. [91] propose a framework for the automatic rating and
evaluation of photographic images. The work provides an objective
assessment of images by considering meaningful aesthetic attributes.
Five aesthetic attributes are utilized, based on general selection criteria:
sharpness, depth, clarity, colorfulness, and tones. Computationally
measured metrics are then defined on the basis of these aesthetic
attributes; together they form an objective analysis collectively depicted
by an image signature. The system is evaluated by comparing the
subjective and objective prediction results. However, the work is limited
to the expressiveness aspect of images only.
Moere et al. [92] carried out another study, using an online survey, to
evaluate how visualization style impacts the insight gained from an
information visualization. Three demonstrators are used in their
experiment: an analytical style demonstrator (ANA), a magazine style
demonstrator (MAG), and an artistic style demonstrator (ART). The study
shows that, despite clear differences in usability, usefulness, and
enjoyability, no variation was found in insight in terms of self-reported
depth, expert-rated depth, and resulting difficulty. The work presented by
Demiralp et al. [93] introduces distance matrices, called perceptual
kernels, for the evaluation of visualization design. The kernels are used in
automatic visualization design that is perceptually pleasing. The
perceptual kernels are derived from visual parameters, i.e., shape, size,
and color, combined with their aggregate perceptual effect, and are
compared with existing perceptual models.
Earlier, a metric for aesthetic label layouts had been proposed by
Hartmann et al. in [94]. Their approach considers both internal and
external labels for optimization using aesthetic attributes. The automated
system finds all labels in a visual item and classifies them as internal or
external. The classification scheme and layout algorithm mitigate the side
effects of conflicting requirements and produce a more readable and
aesthetic visual layout. The system focuses on pedagogy.
Pargnostics, a model presented in [95], quantifies the visual structure of
the parallel coordinates visualization technique. The model intends to fill
the gap between the visual representation and the user's tasks. Pargnostics
focuses on screen-space metrics, i.e., pixels, rather than on design issues.
The system provides users with ranked parallel-coordinates views from
which to choose an appropriate visualization; the selected visualization is
then optimized using the metrics. Another work on perception-based
quality assessment, using scatterplot visualization, is presented in [96]. A
perceptual model is constructed for various projections in a perceptual
space. The approach draws on studies in psychophysics and
multidimensional scaling. The projections are then ranked using an
estimated suitability value for the specific user task on the dataset.
Moreover, distances in the ranking space are optimized against a
scatterplot quality metric to make it comparable with the perceptual
model. However, the technique is human-centric and needs proper
training, as well as further evaluation to examine its applicability across
visualization applications.
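Line crossings between adjacent axes are one screen-space quantity a pargnostics-style model can measure: two polylines cross exactly when the order of their values is inverted between the two axes. A minimal sketch (the `line_crossings` helper is a hypothetical illustration, not the authors' code):

```python
def line_crossings(left, right):
    """Count crossings between polylines spanning two adjacent
    parallel-coordinates axes. Items i and j cross iff their order on
    the left axis is inverted on the right axis."""
    n = len(left)
    crossings = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (left[i] - left[j]) * (right[i] - right[j]) < 0:
                crossings += 1
    return crossings

# Perfectly correlated axes produce no crossings; fully inverted order
# produces the maximum n*(n-1)/2.
print(line_crossings([1, 2, 3], [10, 20, 30]))  # 0
print(line_crossings([1, 2, 3], [30, 20, 10]))  # 3
```

Such counts can then drive axis reordering, since fewer crossings generally mean a less cluttered plot.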
Kong et al. [97] present a set of perceptual guidelines for creating
treemap visualizations. The study conducts a series of controlled
experiments to produce guidelines for designing perceptually better
treemap-based visualizations. The experiments were deployed via
crowdsourcing through Amazon's Mechanical Turk (MTurk) and examine
the impact of perceptual parameters, i.e., aspect ratio, luminance, and data
density, on value-estimation tasks. The study empirically demonstrates
that the aspect ratio is correlated with area judgment, although this is not
the case for luminance. Moreover, treemap-based visualizations have a
higher data density than hierarchical bar charts. The authors conclude
with guidelines on aspect ratio, density, and luminance for creating
perceptually better treemaps.
In the literature, several comprehensive surveys of visualization quality
metrics for high-dimensional data are presented, such as [98] and [16].
Both articles use a systematic approach to categorize the quality metrics
and related concepts. However, Lin et al.'s work [98] focuses more on the
prediction of perceptual quality in photographs, while Bertini et al.'s work
[16] reviews data visualization research.
Lam et al. [24] derive seven guiding scenarios for evaluating information
visualization from a comprehensive literature analysis. Their work
incorporates an exhaustive study of over 800 articles published in various
venues to find the common linkage between evaluation goals and adopted
approaches. This work was subsequently extended by Isenberg et al. [99]
to general visualization scenarios. Their systematic review of literature
spanning over a decade shows that trends in evaluation techniques change
over the years.
A different methodology is presented by Harrison et al. [100] to model
the perception of correlation in nine common visualizations using
Weber's law. The experiment was carried out online using crowdsourcing
techniques. The study shows that models based on Weber's law provide a
concise technique for quantifying, ranking, and comparing the perceptual
effectiveness of visualizations. The models also expose symmetries and
asymmetries arising from performance differences during the evaluation
process; these characteristics relate to the visual features of the
visualizations. Furthermore, the models lay a foundation for developing
benchmarks to explore how perceptual laws impact the design elements of
visualization.
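Weber's law states that the just-noticeable difference (JND) in a stimulus grows in proportion to the stimulus magnitude. A minimal illustration, where the Weber fraction `k` is a hypothetical value:

```python
def weber_jnd(stimulus, k):
    """Weber's law: the just-noticeable difference is proportional
    to the stimulus magnitude, JND = k * S."""
    return k * stimulus

# With a Weber fraction of 0.1, a stimulus of 50 units needs a change
# of about 5 units before the difference is perceived; a stimulus of
# 200 units needs about 20.
print(weber_jnd(50, 0.1))   # 5.0
print(weber_jnd(200, 0.1))  # 20.0
```

Fitting such a proportional model to discrimination data is what allows a single fitted constant to summarize, and therefore rank, how precisely each visualization conveys a quantity.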
Cawthon et al. [101] investigate the relationship between a visualization's
aesthetics and its usability through empirical analysis. The study presents
two metrics, task abandonment and erroneous response time, to capture
aesthetics. An online survey with 285 participants and eleven
visualization techniques was performed to measure the perceived aesthetic
quality of the presented visualizations. The participants were asked to
label each visualization with one of three ranks: ugly, neutral, or
beautiful. The survey also asked the users about the dataset and the
visualization's effectiveness for a particular task.
In contrast to aesthetics and effectiveness, another important aspect is
insight, a literal purpose of visualization [102]. North et al. [103] focus on
the insight provided by a visualization and how to evaluate it
quantitatively. The limitations of task-centric evaluation based on
controlled experiments are elaborated. The authors state that the insight
gained from a visualization depends highly on the user's motivation,
domain knowledge, and interest in the underlying data. Consequently,
insight is difficult to measure and needs a direct method for its evaluation.
Moreover, the article lists some general characteristics of insight, e.g.,
complex, deep, qualitative, unexpected, and relevant.
Recently, Borkin et al. [104] presented another metric for information
utility and effective visualization design: the memorability of a
visualization. A systematic, large-scale online study was conducted over
hundreds of visualizations, which were categorized by type and
investigated for the factors that make a visualization memorable. The
results reveal that design attributes such as color, the inclusion of human-
recognizable visual shapes, a low data-to-ink ratio, and a high visual
density enhance the memorability of a visualization. Moreover, unique
and unexpected visualizations are more memorable than common
visualization types. The findings of the study are a first step toward
effective visualization designs that provide users with greater information
utility.
Modern datasets are large and complex; therefore, the visual exploration
of such massive data is tedious. Schneidewind et al. [105] present an
automated approach to optimize pixel-based visualizations. Cluster
analysis is used to partition the data and find important parameters in the
dataset, followed by image-analysis techniques and a ranking process.
The system determines the optimal parameter settings for visual data
exploration.
Fuchs et al. [106] propose interactive visual analysis combined with
machine learning techniques to support insight from the user's
perspective. The main objective is to automatically generate and verify
hypotheses for visual exploration of massive datasets. A set of fitness
criteria is used with a genetic algorithm (GA) to find the best hypothesis
in a large search space. These visual hypotheses are formalized using
fuzzy logic and an evolutionary algorithm to investigate and explain
features of the data. The fitness function used with the GA considers the
influence of feature similarity, individuality, and complexity. Random
selection is used to add individuals to the GA population.
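A minimal sketch of such a GA fitness evaluation follows, assuming hypothetical hypothesis features and weights; the real criteria and weights in [106] are not reproduced here.

```python
import random

def fitness(hypothesis, w_sim=0.5, w_ind=0.3, w_cmp=0.2):
    """Illustrative fitness combining feature similarity and
    individuality (rewarded) with complexity (penalized).
    All weights here are assumed values."""
    return (w_sim * hypothesis["similarity"]
            + w_ind * hypothesis["individuality"]
            - w_cmp * hypothesis["complexity"])

def random_hypothesis():
    # Randomly generated individual added to the GA population,
    # mirroring the random selection described above.
    return {k: random.random()
            for k in ("similarity", "individuality", "complexity")}

population = [random_hypothesis() for _ in range(20)]
best = max(population, key=fitness)  # selection step of one generation
print(round(fitness(best), 3))
```

A complete GA would additionally apply crossover and mutation over many generations; this fragment shows only how a weighted multi-criteria fitness ranks candidate hypotheses.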
A computational model of human vision for visualization evaluation and
optimization is presented by Pineo et al. [8]. A quality metric called
effectiveness is simulated based on a model of the perceptual processes of
the human retina and primary visual cortex. The effectiveness metric is
then used as the fitness function in a hill-climbing optimization. The
model is evaluated using two different flow visualizations. Its utility is
twofold: first, the model bridges the gulf between perceptual theories and
visualization design guidelines; second, an automated assistance tool for
visualization production may be built on these recommendations.
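Hill climbing with an effectiveness metric as the objective can be sketched generically as follows. This is a toy illustration, not Pineo et al.'s implementation; the effectiveness function and neighborhood are hypothetical.

```python
import random

def hill_climb(layout, effectiveness, neighbors, steps=1000):
    """Generic hill climbing: repeatedly try a random neighbor and
    move to it whenever it scores higher on the effectiveness metric."""
    best, best_score = layout, effectiveness(layout)
    for _ in range(steps):
        cand = random.choice(neighbors(best))
        score = effectiveness(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Toy example: a 1-D "effectiveness" surface peaking at x = 7.
eff = lambda x: -(x - 7) ** 2
nbrs = lambda x: [x - 1, x + 1]
print(hill_climb(0, eff, nbrs))  # converges toward (7, 0)
```

The same loop applies to visualization layouts once `effectiveness` is replaced by a perceptually grounded metric and `neighbors` by small layout perturbations; like any hill climber, it can stall in a local optimum.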
Elmqvist et al. [107] discuss visualization optimization from the
perspective of color schemes for perceptual issues. The technique
dynamically optimizes color schemes based on a set of sampling lenses
over a user-specified region. It also provides a visual search facility for
the ubiquitous low-level user task of effectively gaining insight. The
technique is implemented in two prototypes using OpenGL and
programmable graphics processing units, and several case studies have
been conducted for both information visualization and image inspection
applications. More recently, Lee et al. [108] present the concept of class
visibility to measure the utility of color optimization for data
visualization. The algorithm quantitatively enhances the composition of a
color palette and comparatively reduces the visual weight of large groups.
Experiments with the technique on two real-world datasets confirm the
effectiveness of the visibility metric for visual representations.
The work in [81] focuses on storyline visualization; however, its
approach is generic and can be applied to any visualization technique.
The work in [81] uses classical GAs, which risk getting stuck in a local
optimum. Similarly, the work in [85] focuses on quantifying the aesthetic
quality of paintings. The approach presented in this work quantifies a
visualization technique and uses an EA to find an optimal layout of a
particular visualization technique. The metric in [89] applies only to
images generated via DVF methods. From the aforementioned literature
survey, and to the best of our knowledge, there is no work on a set of
metrics that can measure the quality of multiple visualization techniques.
The major aim of this work is to introduce a generic measure to quantify a
visualization technique and later to use the same measure to optimize the
layout of a visualization technique. Figure 2.5 summarizes this chapter in
tabulated form.
2.4 Chapter Summary
In this chapter, the major related work from the literature was discussed.
The work was categorized into three aspects: dynamic code analysis and
software visualization, automatic visualization, and visualization
optimization. The first part of the chapter focused on several techniques
and tools from the area of dynamic code analysis through bytecode
instrumentation and software visualization. Furthermore, the work on
program runtime analysis and API analysis was reviewed and discussed
along with its limitations. This was followed by a review of past work on
visualization classification and the automatic prediction of suitable
visualizations. The last part of the chapter comprehensively analyzed
previous work on visualization optimization, visualization metrics, and
the aesthetic and perceptual properties of visualization. The following
chapter presents the first part of the proposed work: a solution to
automatically select an appropriate visualization technique based on
given metadata about the data and the task that a user is required to
perform. The appropriate visualization is predicted by an artificial neural
network (ANN)-based model which classifies the input data into one of
eight predefined classes.
Figure 2.5 Overview of the literature review
Chapter 3 Data visualization technique selection
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 46
On selecting a data visualization
technique
“The best way to predict the future is to study the past, or
prognosticate.” Robert Kiyosaki
Advances in computing technology have been instrumental in
creating an assortment of powerful information visualization
techniques. However, the selection of a suitable and effective
visualization technique for a specific dataset and a data
mining task is not trivial. This work automatically selects an
appropriate visualization technique based on the given
metadata and the task that a user intends to perform. The
appropriate visualization is predicted based on an artificial
neural network (ANN)-based model which classifies the input
data into one of the eight predefined classes. A purpose-built
dataset extracted from the existing knowledge in the discipline
is utilized to train the neural network. The dataset covers
eight visualization techniques, including: histogram, line
chart, pie chart, scatter plot, parallel coordinates, map,
treemap, and linked graph. Various architectures using
different numbers of hidden units, hidden layers, and input
and output data formats have been evaluated to find the
optimal neural network architecture. The performance of
neural networks is measured using: confusion matrix,
accuracy, precision, and sensitivity of the classification.
Optimal neural network architecture is determined by
convergence time and number of iterations. The results
obtained from the best ANN architecture are compared with
five other classifiers, k-nearest neighbor, naïve Bayes, decision
tree, random forest, and support vector machine. The
proposed system outperforms four classifiers in terms of
accuracy and all five classifiers based on execution time. The
trained neural network is also tested on twenty real-world
benchmark datasets, where the proposed approach also
provides two alternate visualizations, in addition to the most
suitable one, for a particular dataset. A qualitative
comparison with the state-of-the-art approaches is also
presented. The results show that the proposed technique
assists in selecting an appropriate visualization technique for
a given dataset with high accuracy.
Information visualization is a common computer-based interactive
technique to graphically represent large volumes of data efficiently to
reinforce human cognition. The data is constantly generated in a diverse set of
fields ranging from satellite systems and physics experiments to software tools
running over an operating system. This produces billions of terabytes being
stored daily, which requires approaches to handle this flood of information
effectively [109, 110]. The problem of managing large sets of data can be
addressed from different perspectives depending on its type at hand. This may
include optimizing memory usage [111], designing hardware capable of
storing large amounts of data [112], extracting valuable information from data
[113], and visualizing the data [112]. Each of the aforementioned domains
can further be classified into many open problems. The work in this chapter
deals with the software visualization domain. Software visualization may be
a static or dynamic representation of software, based on its size,
structure, history, and behavior. With advances in the speed of computer
processing and graphics tools, new and powerful visualization techniques are
being developed. These contemporary techniques may be applied in data
mining, knowledge discovery applications, and other tasks [75, 114]. Selecting
a particular visualization technique for a pictorial representation of data has
always been subjective. An inappropriate visual representation may lead to
inadequate decision making [76, 115]. It is therefore very useful to know
which visualization technique is most appropriate for the particular task at
hand.
Deciding on a particular visualization technique is mainly guided by the
problem statement, the dataset and the tasks that are to be accomplished. A
dataset may be characterized by its properties referred to as the metadata.
Some of these characteristics include data dimensions, data types, number of
attributes, multivariate attributes, etc. Similarly, various types of tasks, e.g.,
relationship, distribution, and trends need to be accomplished on the data [76,
116]. Frequently, users are not interested in such system peculiarities; they
simply want visualizations built automatically from the data characteristics
and the desired tasks. However, no explicit mapping between datasets and
particular visualization techniques exists. The literature does suggest that
some visualization techniques are more suitable than others for a specific
data type, data dimension, or task [115, 117, 118]. Hence, visualization
techniques may be classified on the basis of the data they visualize in
conjunction with the tasks to be performed. This classification provides the
basic knowledge with which future data can be mapped to the best-suited
visualization technique. The remedy is thus twofold: a metadata-driven
classification of existing visualization techniques, and an accurate
prediction system that assigns new data to the appropriate classes.
An approach to select an appropriate visualization technique based on the
metadata and the task that a user intends to perform is presented. The
appropriate visualization is selected through an artificial neural network
(ANN)-based model which classifies the input data into one of the given
eight classes. The
major bottleneck in this problem is the unavailability of such datasets with
sufficient training samples. With a careful investigation of the literature a
dataset consisting of the metadata and corresponding visualization technique
was custom-built. A single record of the dataset has the attributes: data
dimensions, number of attributes, type of primary attribute to be visualized,
and the task. After pre-processing of the dataset, different neural network
architectures are trained and tested with supervised learning techniques. The
results are validated using neural network performance evaluation metrics.
Further, the performance of ANN is compared with k-nearest neighbor (k-
NN), naïve Bayes, decision tree, random forest, and support vector machine
(SVM). The results are also compared with the current state-of-the-art
approaches [78, 79]. Along with the single best choice, the proposed approach is
extended to provide more flexibility. The system automatically selects the three
best visualizations based on the ranking of the ANN output layer neurons’
values. This aspect of the proposed system is experimentally checked with
twenty real-world benchmark datasets. The proposed system can be utilized
as an intelligent assistant in current word/data-processing software packages
to help the user select an appropriate visualization technique. At present,
such packages simply provide a list of visualization techniques to the user
without taking into consideration the actual data and its metadata. The
proposed solution, in contrast, considers both the actual data and its
metadata before recommending a visualization technique, which helps in
taking an informed decision.
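The three-best selection from the ANN output layer can be sketched as follows, assuming hypothetical output activations and a softmax ranking; the class order and values are illustrative only.

```python
import math

CLASSES = ["histogram", "pie chart", "map", "treemap",
           "parallel coordinates", "scatter plot", "linked graph", "line chart"]

def top_k_visualizations(outputs, k=3):
    """Rank the ANN output-layer values with a softmax and return the
    k most suitable visualization techniques."""
    m = max(outputs)
    exps = [math.exp(o - m) for o in outputs]  # subtract max for stability
    total = sum(exps)
    probs = [(e / total, c) for e, c in zip(exps, CLASSES)]
    return [c for _, c in sorted(probs, reverse=True)[:k]]

# Hypothetical output activations for one metadata record.
print(top_k_visualizations([2.1, 0.3, -1.0, 0.8, 1.5, -0.2, 0.1, 0.4]))
# → ['histogram', 'parallel coordinates', 'treemap']
```

Returning a ranked short list instead of a single class is what gives the user the two alternate visualizations alongside the most suitable one.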
3.1 Proposed system, dataset, and visualization techniques
To classify and predict a visualization technique, the proposed approach
consists of a series of steps. Initially, a dataset is created that later trains the
ANN. A wide range of literature from an assortment of disciplines is included
to create the dataset. Using this corpus, a dataset comprising six columns
and 400 rows is created. The last column of the dataset is the class label,
i.e., the visualization technique. The data is classified into one of eight
classes, each representing a particular visualization technique (details on
the dataset are covered in Section 3.1.1). The dataset is pre-processed to
remove irregularities and to provide uniformly scaled input variables to the
network. The ANN is trained using a supervised learning method, with the
input presented as a vector of elements yielding one of the eight
visualization techniques explained in Section 3.1.2. The classifier's results
are validated using the test data. Several network architectures, varying in
the number of layers and hidden nodes, are evaluated on network
performance to determine an optimal classifier structure. The system's
components and structure are shown in Fig. 3.1.

Figure 3.1 System working for the visualization prediction
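The evaluation measures used to validate the classifiers (confusion matrix, accuracy, precision, and sensitivity) can be computed as sketched below. The 3-class matrix here is hypothetical; the real system has eight classes.

```python
def metrics_from_confusion(cm, cls):
    """Accuracy, precision, and sensitivity (recall) for one class of
    a multi-class confusion matrix cm[actual][predicted]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))       # diagonal hits
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(n)) - tp     # column minus diagonal
    fn = sum(cm[cls][c] for c in range(n)) - tp     # row minus diagonal
    accuracy = correct / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, sensitivity

# Hypothetical 3-class confusion matrix (rows: actual, cols: predicted).
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 1, 9]]
print(metrics_from_confusion(cm, 0))  # → (0.8, 0.8, 0.8)
```

Accuracy is shared across classes, while precision and sensitivity are computed per class from the matrix column and row, respectively.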
The rest of this section discusses the dataset built for the proposed system
and reviews the eight visualization techniques, i.e., line chart, pie chart,
parallel coordinates, scatter plot, histogram, linked graph, map, and
treemap. These techniques were chosen for their widespread use.
3.1.1. Building the dataset
Literature on information visualization shows that there is no standard
dataset consisting of metadata and corresponding data. The only
information available is how various datasets are visualized, or which
visualization techniques are suitable for a given data type [76]. A dataset
is therefore formed on the basis of the knowledge available about data
characteristics and the underlying visualization techniques. The dataset
comprises the metadata utilized in recently published work and the
corresponding visualization techniques. Knowledge is also gathered about
the tasks relevant to specific data [70, 76, 78]. The metadata consists of:
data dimensions, primary variable type, number of attributes in the
dataset, number of records/rows to be visualized, and the relevant task to
be carried out. This chapter focuses on four tasks, namely relationship,
trends, distribution, and comparison. The dataset consists of 400 items,
each having five metadata attributes and a sixth attribute referring to the
visualization technique suitable for that record; the sixth attribute
becomes the class label for the classifiers. Eight visualization techniques
are considered in constructing the dataset, and these become the eight
classes considered by the classifier. The classes are labelled histogram,
pie chart, map, treemap, parallel coordinates, scatter plot, linked graph,
and line chart, having 105, 55, 29, 81, 56, 23, 25, and 26 samples,
respectively. The numbers of samples for a particular visualization
technique differ because they depend on its prevalence in contemporary
work. The dataset1 is mathematically shown below. Fig. 3.2
shows the mapping between the metadata and the tasks.

Let D = {d_1, d_2, …, d_n} be the set of metadata instances, T = {t_1, t_2, …, t_m} the set of tasks, and V = {v_1, v_2, …, v_k} the set of visualization techniques. The mapping is

f : D × T → V, with each record expressed as the tuple <d, t, v>.  (3.1)

Each instance d_i ∈ D is represented by its metadata features, each t_j ∈ T is one of the tasks, and each v_l ∈ V is one of the visualization techniques. This makes n × m × k possible combinations of D, T, and V:

D × T × V = S_1 ∪ S_2, where S_1 and S_2 are two disjoint sets,
S_1 = {<d, t, v> : f(d, t) = v}, i.e., the valid mappings, and
S_2 = {<d, t, v> : f(d, t) ≠ v}, i.e., the invalid ones,
where d = d_1, d_2, …, t = t_1, t_2, …, and v = v_1, v_2, …

1 http://ming.org.pk/datasets.htm
Figure 3.2 Visualization, tasks and metadata mapping
D represents the set of metadata, where each d_i has its own domain, T is
the set of tasks, and V is the set of visualization techniques. A relation is
then found between D and T using a suitable mapping into set V, given by
Eq. 3.1. A simple instance of the dataset is: dimension = 1, number of
attributes = 1, up to 100 instances, and a continuous primary variable;
given the distribution task, the target visualization is a histogram.
Standard pre-processing techniques (listed in Section 5) are applied to the
dataset before it is submitted to the neural network. Table 3.1 shows the
characteristics of the dataset. Further detail on the attributes and tasks can
be found in [6, 70, 116].
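For illustration, the example instance above can be encoded as a single labeled record. The field names and values here are assumptions about the encoding, not the published dataset format.

```python
# One record of the purpose-built dataset: five metadata attributes
# plus the class label (the suitable visualization technique).
record = {
    "dimension": "one dimension",
    "num_attributes": 1,
    "num_instances": 100,
    "primary_variable": "continuous",
    "task": "distribution",
    "visualization": "histogram",   # class label, the sixth attribute
}

# The mapping f : D x T -> V reads: metadata and task in, technique out.
features = tuple(v for k, v in record.items() if k != "visualization")
label = record["visualization"]
print(features, "->", label)
```

Four hundred such records, one per reviewed article, form the training corpus for the classifiers.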
Table 3.1 Dataset description

Attribute name        | Values/Range
--------------------- | ------------------------------------------------------
Dimension             | One dimension, Two dimensions, Three or more dimensions, Hierarchical
Primary variable      | Ordinal, Continuous, Categorical, Geographical
Tasks                 | Relationship, Trends, Distribution, Comparison
No. of attributes     | Numerical value
No. of instances      | Numeric value
Target/Visualization  | Histogram, Pie Chart, Line Chart, Parallel Coordinates, Scatter Plot, Linked graph, Map, Treemap
To create the test data, published articles from diverse fields utilizing the
eight visualization techniques considered in this work were reviewed.
Pertinent information, such as the dimensions of the data, primary
variables, dataset size, attributes used for visualization, and the task to be
accomplished through visualization, was extracted to construct the
dataset. This required an extensive search over a large number of articles,
selected on the basis of domain diversity, relevance to the subject, and
citation count. Table 3.2 summarizes the key articles considered for
constructing the dataset, along with their domains and cumulative
citations.
Table 3.2 Key articles used in dataset construction

Visualization        | Key articles | Domains                                  | Cumulative citations (as of July 2016)
-------------------- | ------------ | ---------------------------------------- | --------------------------------------
Histogram            | [67-69]      | Swarm Intelligence, Soft Computing       | 524
Pie Chart            | [70-72]      | Soft Computing, Genomics                 | 4908
Map                  | [63, 73, 74] | Scientific visualization, Bioinformatics | 828
Treemap              | [75-77]      | Software Engineering                     | 816
Parallel Coordinates | [39, 78, 79] | Databases, Clustering                    | 693
Scatter Plot         | [80-82]      | Life Sciences, Behaviour Analysis        | 11949
Linked graph         | [83-84]      | Physics, Mathematics                     | 205
Line Chart           | [85-87]      | Chemistry, Biology, Medicine             | 826
3.1.2. Visualization techniques
Several visualization techniques have been presented over the years. As
technology advances, older techniques are replaced by newer and more
sophisticated approaches. However, the baseline remains the same: the use of
a particular visualization technique is subject to the task at hand. Therefore,
none of the techniques is universal [6, 14]. In this work, eight visualization
techniques are selected to perform classification. All of these techniques are
well known and characterized to visualize a particular type of data. The
selection of these eight visualization techniques is based on their wide use in
information visualization and other multidisciplinary domains. Table 3.3 lists
various applications of these techniques for information visualization and
interdisciplinary usage.
Table 3.3 The eight visualization techniques used in recent literature
Visualization Type Reference
Histogram [120-122]
Pie Chart [121, 123, 124]
Map [121, 125]
Treemap [43, 126]
Parallel Coordinates [127-130]
Scatter plot [121, 131]
Linked graph [121, 131, 132]
Line Chart [132, 133]
More specific visualization techniques also exist in contemporary work
[115]. Due to their specialized nature, these visualizations are not considered
here. This section discusses the eight selected visualization techniques.
Histogram: A histogram, shown in Figure 3.3a, is normally used with
continuous variables and shows their distribution. The histogram is used in
many statistical analyses and other applications [134]. In the current
perspective, if the intended task is distribution and the primary variable is
continuous, the target visualization should be a histogram. It is a common
visualization found in many statistical applications and in software packages
supporting programming languages.

Figure 3.1 The eight visualizations used as class labels
Pie chart: The pie chart, developed in the early 19th century [135], is another
common technique to visualize the distribution of quantities as parts of an
overall share (Figure 3.3b). Pie charts are well suited to categorical variables
and support the distribution task in many statistical and drawing
applications. A major limitation of pie charts is that they can visualize data in
only one or two dimensions.
Line chart: To find trends in data, the line chart (Figure 3.3c) is one of the
best options available and is suitable for ordinal variables. It also behaves
well for small continuous and discrete datasets. Line charts are commonly
used for business data mining tasks [136, 137], and the chart is also available
in Google's online visualization library [133].
Parallel Coordinates: Parallel coordinates is a visualization technique [127,
128] used for multidimensional and multivariate data, where the data may
comprise hundreds of attributes, as shown in Figure 3.3d. Parallel
coordinates are useful for comparison and for showing relationships across
dimensions. The technique also shows the distribution of different variables
along several dimensions. The original technique is enhanced in [129, 130].
Scatter Plot: The scatter plot is a visualization technique built on the
Cartesian plane. A point on the plane shows the values of two variables, and
typically both variables are continuous or ordinal [6]. If the data has more
than two dimensions, additional dimensions can be shown using a different
color or size of the point in the 2D plane (Figure 3.3e).
Linked graph: The linked graph, shown in Figure 3.3f, is a widely used
visualization for data having hierarchical or network relationships. One
advantage of the linked graph is that it can visualize many types of data. The
main difficulty is laying out large graphs on a single screen. The graph
consists of a set of nodes and a set of edges, where edges link the
corresponding nodes. Several dimensions of the data may be shown using the
position, size, and color of a node [6, 138].
Map: The map is a special type of visualization technique used to display
spatial or physical distributions in two-dimensional space (Figure 3.3g).
Mostly, the data has only one dimension and can be used with different
variables. We choose the geographical type of data to be classified with the
class label map. Color and size of the distribution area are used
simultaneously to encode different information [70].
Treemap: The treemap is a popular space-filling approach used to visualize
hierarchical data on a single screen [43]. Multidimensional data of any type
can be visualized with treemaps to show hierarchical relationships in the
data. Treemaps cover the entire screen with nested rectangles to show the
hierarchical structure (Figure 3.3h). Several applications and online facilities
are available for treemap visualization [139, 140].

Figure 3.2 (a) Neural Network (b) Neuron
3.2 Artificial neural network preliminaries
The artificial neural network (ANN) is a soft computing technique inspired
by the biological neural system and a model of human brain cognition. The
ANN is an adaptive system. Its functionality differs from that of traditional
digital computers, and it works in parallel, as human neurons do [141, 142].
ANNs have applications in diverse areas such as engineering, medicine,
computer games [114], and banking, for classification and control [30]. A
neural network identifies and learns from the patterns presented to the
network in the form of inputs and corresponding target outputs. Typically,
ANNs consist of one input layer and one output layer, and may have one or
more hidden layers depending on the network architecture [143]. The number
of neurons in each layer may differ and depends on the problem, and is often
determined empirically by trial and error. One of the many types of neural
networks is the feed-forward neural network (FFNN). Data in an FFNN
travels only in the forward direction from one layer to the next, in contrast to
feedback or recurrent networks, where a signal may travel backward or
between neurons of the same layer. Multilayer neural networks are those
consisting of one or more hidden layers, known as the multilayer
perceptron (MLP) [144, 145].
The basic structure of a neural network is shown in Figure 3.4a, and the
basic structure of a neuron is shown in Figure 3.4b. As shown in Figure 3.4a,
each neuron receives a weighted input from the predecessor layer and gives
an output to the neurons in the successor layer. The weights of each layer are
adjusted to achieve the network goal. Each processing neuron consists of an
input function and an activation function, as shown in Figure 3.4b. The input
function sums the weighted inputs (other input functions are possible), and
the activation function introduces non-linearity and sends the result as an
output to the next layer. If a neuron has an input vector X and a weight
vector W,

X = [x1, x2, x3, …],  W = [w1, w2, w3, …],

then the neuron output O can be calculated as in Eq. (3.5), where tanh is the
activation function:

O = tanh( Σi wi xi ).  (3.5)
Several activation functions can be used with a neural network, including
linear, sigmoid, and hyperbolic tangent, where sigmoid and tanh are used for
non-linear neurons. The mathematical form of these functions is listed in
Eq. (3.6) - Eq. (3.8), where x is the net input and f(x) is the output:

f(x) = x  (3.6)

f(x) = sigmoid(x) = 1 / (1 + e^(−x))  (3.7)

f(x) = tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x))  (3.8)
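As an illustrative sketch (not part of the MATLAB implementation used in this work), the three activation functions of Eq. (3.6) - Eq. (3.8) can be written in Python; the function names are ours:

```python
import math

# Activation functions from Eq. (3.6)-(3.8).
def linear(x):
    return x                                  # Eq. (3.6), identity

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))         # Eq. (3.7), output in (0, 1)

def tanh_act(x):
    # Eq. (3.8); algebraically equivalent to math.tanh(x), output in (-1, 1)
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

print(linear(0.5), sigmoid(0.0), tanh_act(1.0))
```

The sigmoid and tanh forms are the standard closed-form expressions; tanh_act agrees with the library tanh to machine precision.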
Broadly, there are two learning methods, i.e., supervised learning and
unsupervised learning. The supervised learning method provides an input
pattern as well as the target output to the network. A common supervised
learning algorithm is backpropagation, where the network error propagates in
the backward direction to adjust the weights and minimize the error. Another
modern algorithm is Levenberg-Marquardt (LM), which tries to minimize the
mean square error. LM is the fastest training algorithm for a regular-sized
network, although it consumes more memory than other methods. The basic
function of the algorithm is to minimize the error E defined in Eq. (3.9),
where ti is the target output and oi is the network output:

E = Σi (ti − oi)².  (3.9)
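A minimal sketch of the error of Eq. (3.9); the function and variable names are illustrative:

```python
# Sum-of-squared-errors E from Eq. (3.9), computed over paired target and
# output values. Names (targets, outputs) are ours, not from the text.
def sse(targets, outputs):
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

print(sse([1.0, 0.0], [0.9, 0.2]))  # small residual error for a near-correct output
```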
3.3 Experiments and results
This section describes the experiments conducted and the results obtained.
The data was pre-processed first. Later, different architectures of the ANN are
tested for better accuracy. Additionally, to compare the ANN accuracy with
other classifiers, we use five other classifiers: k-nearest neighbor (k-NN),
naïve Bayes, decision tree, random forest, and SVM. We have also compared
our results with two state-of-the-art approaches in [78, 79]. The class labels
available in the dataset are used to measure the classifiers' accuracy.
Measuring accuracy against the dataset's class labels helps evaluate each
classifier's results against the earlier, published usage of the predicted
visualization. This assures that the choice of a predicted visualization is not
random. However, to avoid any bias due to the custom-built dataset,
Section 5.6 shows the classifiers' performance on 20 benchmark datasets.
As mentioned earlier, pre-processing is required before the data is used for
training the ANN. The pre-processing step is important to clean the data of
noise and irregularities before it is presented to the network. Pre-processing
includes input/output coding, normalization, and scaling. Some data are in a
form not suitable for the neural network to operate on, since a neural
network only accepts data in numerical form and produces output in a
specific range depending on the activation function. The dataset used in this
chapter consists of six attributes and 400 instances. Each instance of the
dataset represents the metadata of a dataset used for visualization. The first
attribute shows the number of dimensions using the alphanumeric values 1D,
2D, n-D, or hierarchical. The second attribute is numerical, showing the
number of attributes. The third attribute is also numeric, indicating the
number of items in the dataset. The fourth attribute is alphabetic, indicating
the primary variable type; possible types include ordinal, categorical,
continuous, and geographical. The fifth attribute is also alphabetic, showing
the task to be accomplished through visualization; its possible values are
distribution, relationship, comparison, and trends. Finally, the sixth attribute
shows one of the possible visualization techniques: histogram, pie chart,
map, treemap, parallel coordinates, scatter plot, linked graph, or line chart.
This leaves only two of the five attributes (excluding the last attribute, i.e.,
the class label) with numeric values. Thus, the non-numeric values must be
encoded as numeric so that they can be presented to the ANN. Additionally,
once the alphanumeric attributes are
converted to their numeric equivalents, the range of values may differ across
the five attributes. Figure 3.5 shows that some variables in the original
dataset have much larger values than others, indicating that the input values
are not on the same scale. To handle this, scaling is required. We scale all the
attributes using Eq. (3.10).
x_s,i = (x_i − x_min) / (x_max − x_min)  (3.10)

where x_i, x_s,i, x_max, and x_min denote the original value, the scaled
value, the maximum, and the minimum, respectively. The attributes are first
scaled using Eq. (3.10) and later the data is normalized. Normalization
confines all the input values to a range suitable for the activation function.
The formula used for normalization is listed in Eq. (3.11), where [a, b] is the
target range:

x_n,i = (b − a) × (x_i − x_min) / (x_max − x_min) + a.  (3.11)
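Eq. (3.10) and the reconstructed Eq. (3.11) can be sketched as below; defaulting the target range [a, b] to [−1, 1] to match the tan-sigmoid activation is our assumption, not a detail stated in the text:

```python
# Min-max scaling (Eq. 3.10) and range normalization (Eq. 3.11).
# Note that Eq. (3.11) reduces to Eq. (3.10) when a = 0 and b = 1.
def scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def normalize(xs, a=-1.0, b=1.0):
    # Assumed default range [-1, 1], suitable for a tanh-style activation.
    lo, hi = min(xs), max(xs)
    return [(b - a) * (x - lo) / (hi - lo) + a for x in xs]

print(scale([1, 2, 3]))      # [0.0, 0.5, 1.0]
print(normalize([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```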
A further requirement concerns nominal input variables, since NNs favour
numerical or binary values [45, 46]. To transform nominal values into
numeric ones, we first used simple integer coding, e.g., tasks are coded as
{distribution = 1, relationship = 2, trend = 3, comparison = 4}.
Figure 3.5 Dataset larger values vs. smaller values
The results of experiments with the aforementioned encoding scheme were
not promising, so we later used another encoding scheme known as 1-of-n
encoding. To illustrate, a simple encoding for the variable task is
{distribution = 0001, relationship = 0010, trend = 0100, comparison = 1000}.
Other nominal input values are processed similarly before they are fed into
the ANN.
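The 1-of-n scheme can be sketched as follows; the category order produced here is illustrative and need not match the bit strings listed above:

```python
# 1-of-n (one-hot) encoding for the nominal "task" attribute, following the
# scheme described in the text. Category order is illustrative.
TASKS = ["distribution", "relationship", "trend", "comparison"]

def one_hot(value, categories=TASKS):
    # Exactly one position is set to 1, the one matching the nominal value.
    return [1 if value == c else 0 for c in categories]

print(one_hot("relationship"))  # one bit set per category
```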
3.3.1. ANN Experiments
At this stage, the training dataset is ready to be classified through the
neural network. For the experiments we used MATLAB [144]. There are five
neurons in the input layer, each corresponding to one attribute of the dataset,
and eight neurons in the output layer, one for each of the eight class labels
(the visualization techniques). The activation function is tan-sigmoid and the
connection weights are in the range −2 to 2. In the experimental phase,
several ANN structures with one and two hidden layers are first tried.
Experiments using five other classifiers are also performed, and the results
obtained with these classifiers and the best ANN are compared. This section
also provides a qualitative comparison of the state-of-the-art approaches
presented in [78, 79] and the best ANN.
The initial parameters selected for the various ANN structures, including the
learning rate, activation function, performance function, and goal, are listed
in Table 3.5. The dataset consists of 400 items, of which 70% is used for
training and 30% for testing. However, in cases where a validation set is
used, the dataset is randomly divided into 60% for training, 20% for testing,
and 20% for validation. The dataset is shuffled before training. For training,
input patterns and the desired outputs are presented to the network, whereas
for testing, only input patterns are presented. All experiments are carried out
on an Intel Core i5 machine with a 2.5 GHz processor and 4 GB RAM.
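One forward pass through such a 5-14-8 structure (tan-sigmoid hidden layer, linear output layer) can be sketched as below; the random stand-in weights are ours for illustration, not the trained MATLAB network:

```python
import math
import random

# Sketch of a forward pass through a 5-14-8 feed-forward network.
random.seed(0)
N_IN, N_HID, N_OUT = 5, 14, 8
# Connection weights drawn from the range [-2, 2] mentioned in the text.
w_hid = [[random.uniform(-2, 2) for _ in range(N_IN)] for _ in range(N_HID)]
w_out = [[random.uniform(-2, 2) for _ in range(N_HID)] for _ in range(N_OUT)]

def forward(x):
    # tan-sigmoid hidden layer followed by a linear output layer.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

scores = forward([0.2, -1.0, 0.5, 1.0, -0.3])      # one pre-processed instance
predicted_class = scores.index(max(scores))        # winning visualization index
print(len(scores), predicted_class)
```

The highest-scoring of the eight output neurons is taken as the predicted visualization technique.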
Initially, only a single hidden layer structure is used (Figure 3.6a), with one
to 25 hidden layer neurons and the input and output layer neurons fixed at
five and eight, respectively. At the same time, we perform the experiments with and
Table 3.5 Network structure and initial parameters

1 hidden layer
  Hidden neurons: 1-25
  Activation function, hidden layer: tan-sigmoid
  Activation function, output layer: linear
  Performance function: MSE
  Learning rate: 0.01
  Train/validation/test ratio: 60% / 20% / 20% (random division)
  Maximum epochs: 500
  Performance goal: 0.01

2 hidden layers
  Structures: 05-05-02-8, 05-05-05-8, 05-07-05-8, 05-08-05-8, 05-08-07-8,
  05-12-11-8, 05-15-10-8, 05-20-14-8, 05-24-16-8, 05-30-20-8
  Activation function, hidden layers 1 and 2: tan-sigmoid
  Activation function, output layer: linear
  Performance function: MSE
  Learning rate: 0.01
  Train/validation/test ratio: 60% / 20% / 20% (random division)
  Maximum epochs: 500
  Performance goal: 0.01
without the validation checks, with the same parameters as shown in Table
3.5. In the first case, we trained the network for 200 epochs by providing the
training and test sets of the data. As the number of hidden neurons increases
from 1 to 25, the network takes a different amount of time to complete 200
epochs. The training and test mean squared errors (MSE) are reported with
the time taken for each network; see the Appendix. It is observed that
increasing the number of neurons gives a better MSE. However, beyond 16
neurons, further additions have a negative effect on the network, as shown in
Figure 3.7d. Figures 3.7a and 3.7b show the training and testing results of
two network structures with 10 and 23 hidden neurons, respectively. The
solid line shows the training MSE and the dashed line the test MSE. With 10
neurons, the network is well trained and both curves remain close. Although
the second network's training MSE shows improvement, its testing results
are poor, due to overfitting of the network with a large number of neurons.
Complete experimental results for the various structures are presented in the
Appendix. Figure 3.7c shows the best case with 12 neurons, where the test
MSE drops below the training MSE, with an overall accuracy of 98%.
To validate the network, another experiment is performed with the data
randomly divided into three sets: training, validation, and test, with 60%,
20%, and 20%. The goal (MSE) threshold is set to 0.01, and all other
parameters, including the ANN structure, remain the same. The validation
check forces training to stop early, before reaching the maximum number of
epochs, and thus prevents the situation discussed in the previous case. With
too few neurons (only two, Figure 3.8a), the MSEs for training, validation,
and testing are very close and remain constant after the first epoch. At epoch
45, the validation error crosses the threshold (6 errors per epoch), and hence
the training stops. Since the MSE in this case is 0.0794, higher than the set
goal, the network is not adopted. In the other case, as the network grows
larger (as shown in Figure 3.8c), training stops after 18 epochs. At this stage
the training MSE is 0.0097 and the validation MSE is 0.0077. This implies
that the network is
properly trained at this point. As the number of hidden layer neurons
increases to 25, the gap between training and validation increases, as shown
in Figure 3.8b. Training in this case stops at epoch 9, as the validation error
increases and the testing curve starts deviating from the training curve.
Figure 3.6 (a) Single hidden layer network (b) 2-hidden layered network
Figure 3.7 (a) 10 neurons (b) 23 neurons (c) best case with 12 neurons (d) no. of nodes vs. MSE
Figure 3.8 (a) 2 hidden nodes (b) 25 hidden nodes (c) 14 hidden nodes
Chapter 3 Data visualization technique selection
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 71
Two hidden layered models of the ANN are also implemented. For most
applications, networks with one hidden layer are sufficient. However, to find
the best model, we experimented with two hidden layers as well. The two
hidden layer network architecture is shown in Figure 3.6b, and the detailed
results of the experiment are given in the Appendix. We maintained the
initial configuration described in Table 3.5. Validation errors are higher for
fewer neurons in the hidden layers, as Figure 3.9a shows for the MSE of the
05-08-07-8 NN. Figure 3.9b shows the MSE for the 05-24-16-8 NN
structure; in this case, the three curves are adjacent to each other. A
comparison of the different two hidden layered ANN structures is shown in
Figure 3.9c. As the number of neurons in the hidden layers increases, the
validation error declines. However, neural networks with a very large
number of neurons take more computation time while showing no significant
improvement in MSE. The NN structure with 24 neurons in the first hidden
layer and 16 in the second shows the best average MSE. As the number of
neurons increases beyond the 05-30-20-8 structure, the network computation
time increases with no significant improvement in MSE.
3.3.2. The n-fold cross-validation
n-fold cross-validation is another popular technique for training a
classifier. In this experiment, n-fold cross-validation is used with n set to 10.
Cross-validation is considered more robust and reliable and is well suited to
generalized error estimation. The n-fold cross-validation technique is also
more effective for a small number of training samples. The basic property of
this method is the randomness added to the training samples, as the method
does not use fixed partitions of the training and testing sets. In contrast to the
fixed split strategy, the whole dataset is divided into n disjoint sets known as
folds; each time, n−1 folds are used for training and the remaining fold is
used for testing the model. The model's accuracy is computed by averaging
over all runs. The 10-fold cross-validation method on several architectures
with one and two hidden layers
having various numbers of neurons is evaluated. The training, testing, and
validation MSEs are reported in the Appendix, together with the training
and testing accuracies for the one hidden layer structures.
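The fold construction described above can be sketched as follows for the 400-item dataset; the shuffling seed is arbitrary:

```python
import random

# Sketch of n-fold cross-validation splitting (n = 10): shuffle the indices,
# cut them into n disjoint folds, then train on n-1 folds and test on the rest.
def kfold_indices(n_items, n_folds=10, seed=0):
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)                    # randomness per run
    folds = [idx[i::n_folds] for i in range(n_folds)]   # n disjoint folds
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

splits = list(kfold_indices(400))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 rounds, 360 train, 40 test
```

The model's accuracy would then be averaged over the ten rounds, as described above.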
Figure 3.9 Two hidden layered structure analysis
Figure 3.10 Accuracy for different number of nodes in hidden layer
Figure 3.11 Hidden nodes vs. MSEs
Figure 3.10 compares the training and testing accuracies of the one hidden
layer NN structures with one to 25 hidden nodes using the 10-fold cross-
validation method. The maximum training accuracy achieved is 98.61%,
while on the test data the highest accuracy of 97.50% is achieved for several
hidden layer sizes. Figure 3.11 compares the training, testing, and validation
MSEs of the model across several numbers of hidden neurons. As the
number of neurons in the hidden layer increases, the error rate decreases to a
level beyond which the change is insignificant.
3.3.3. Performance analysis
Experiments with the various one hidden layered ANN structures produce
different results. The network MSE declines as more neurons are added to
the hidden layer in the one hidden layered NN architecture. However,
adding too many neurons has the opposite effect, causing the gap between
the training and test MSE to increase due to growing validation errors. As
shown in Figure 3.12, the training MSE quickly reduces with the addition of
hidden nodes. The validation curve shows spikes due to the varying number
of validation errors of the network. The network with 14 hidden nodes
shows better training, validation, and test MSEs than the others. To evaluate
the classification performance, evaluation metrics based on the confusion
matrix are used. The confusion matrix defines the parameters on the basis of
correctly classified and misclassified items: true positives (TP), true negatives
(TN), false positives (FP), and false negatives (FN). From these, accuracy
(Eq. 3.12), sensitivity (Eq. 3.13), precision (Eq. 3.14), and the correlation
coefficient (R²) are used to measure the performance of the NN model.
Figure 3.12 Hidden Nodes vs. MSE for 1 hidden layered ANN
Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100  (3.12)

Sensitivity = TP / (TP + FN)  (3.13)

Precision = TP / (TP + FP)  (3.14)
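Eq. (3.12) - Eq. (3.14) can be computed directly from the confusion-matrix counts; the example counts below are made up for illustration:

```python
# Evaluation metrics from Eq. (3.12)-(3.14), computed from TP/TN/FP/FN counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn) * 100   # percentage, Eq. (3.12)

def sensitivity(tp, fn):
    return tp / (tp + fn)                          # Eq. (3.13)

def precision(tp, fp):
    return tp / (tp + fp)                          # Eq. (3.14)

# Illustrative counts only, not the dissertation's actual confusion matrix.
tp, tn, fp, fn = 39, 355, 2, 4
print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), precision(tp, fp))
```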
Since the NN outputs eight classes, the average performance over all classes
is taken for each metric. Table 3.4 presents the performance evaluation of
the different one hidden layered neural networks having an MSE of less than
0.01. The one hidden layered NN structure consisting of 14 hidden neurons
has the best performance, with 97.50% accuracy.
Table 3.4 NN performance
Network Structure Accuracy % Sensitivity % Precision % R2
05-12-8 96 95 91 0.967
05-13-8 96 96 90 0.956
05-14-8 97.50 98 92 0.978
05-15-8 95 97 92 0.953
05-16-8 95 95 90 0.931
05-08-07-8 96 97 91 0.95
05-12-11-8 94 95 88 0.957
05-15-10-8 94 95 89 0.948
05-20-14-8 96 97 92 0.943
05-24-16-8 95 97 93 0.967
05-30-20-8 97 98 92 0.968
The confusion matrix is also used for the two hidden layered structures to
extract the network evaluation parameters, i.e., accuracy, sensitivity,
precision, and correlation coefficient. The results for these parameters are
also shown in Table 3.4. Based on these experiments, the one hidden layered
ANN structure consisting of 14 hidden neurons is the best-performing
architecture on the given dataset. Figure 3.13 shows the confusion matrix of
the best NN architecture.
Output Class vs. Target Class (columns in the same order as the rows):

Output Class      Histogram  Pie Chart  Map    Treemap  Par. Coord.  Scatter  Linked  Line   Precision
Histogram         100%       20%        0      0        0            0        0       0      93%
Pie Chart         0          80%        0      0        0            0        0       0      100%
Map               0          0          100%   0        0            0        0       0      100%
Treemap           0          0          0      100%     0            0        0       0      100%
Parallel Coord.   0          0          0      0        100%         0        0       0      100%
Scatter Plot      0          0          0      0        0            100%     0       0      100%
Linked Graph      0          0          0      0        0            0        100%    0      100%
Line Chart        0          0          0      0        0            0        0       100%   100%
Per-class         100%       80%        100%   100%     100%         100%     100%    100%   97.50%
Figure 3.13 Confusion matrix of the best ANN architecture
The values of the performance metrics true positive rate (TP-rate), true
negative rate (TN-rate), false positive rate (FP-rate), and false negative rate
(FN-rate) for the best ANN model are listed in Table 3.6.
Table 3.6 Best ANN performance
Metric Value
TP-Rate 0.9700
TN-Rate 0.9956
FP-Rate 0.0043
FN-Rate 0.0300
To study the impact of various learning approaches on the ANN,
experiments are conducted with different training algorithms: Levenberg-
Marquardt, Rprop (resilient backpropagation), BFGS quasi-Newton,
GD-adaptive learning, scaled conjugate gradient, conjugate gradient with
Powell/Beale restarts, and Polak-Ribiére conjugate gradient. The comparison
is made based on average accuracy, training CPU time, and training MSE.
Table 3.7 shows that the Levenberg-Marquardt algorithm performs best in
terms of accuracy.
Table 3.7 Impact of various learning approaches on the ANN

Learning algorithm                              Average accuracy (%)   Training MSE   Training CPU time
Levenberg-Marquardt                             97.5                   0.008          2.3
Rprop                                           95                     0.021          2.7
BFGS quasi-Newton                               89                     0.023          4.5
GD-adaptive learning                            78                     0.053          2.2
Scaled conjugate gradient                       83                     0.034          1.9
Conjugate gradient with Powell/Beale restarts   96                     0.024          2.4
Polak-Ribiére conjugate gradient                95.5                   0.025          2.5
3.4 Sensitivity analysis
Sensitivity analysis is used to explore the relative importance of the model's
inputs and to check how they impact the output. In this study, sensitivity
analysis is performed using the stepwise method: the trained network is
presented with the input parameter set while omitting one parameter at a
time. The resulting behavior of the model is taken as the output, and network
measures such as the MSE, error rate, correlation coefficient (R²), and
accuracy are recorded. The combined effect of various parameters is also
reported. As shown in Table 3.8, the first two input parameters, i.e., the
dimension and the primary variable of the dataset, are the most influential in
selecting a visualization technique. After removing the
dimension parameter, the error rate increases to 21% and the MSE to 0.055.
For the primary attribute, the error rate reaches 29.50% and the MSE 0.058.
The other three input parameters have lower error rates when omitted from
the input and have almost the same influence on the output. Along with their
individual importance, the combined influence of the most important
parameters, i.e., dimension and primary attribute, is also evaluated in
comparison with the less important parameters. When the dimension and
primary attribute inputs are both left out, the error rate rises to 41%, with
only 68% correlation between the output and target values.
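The omit-one-input procedure can be sketched generically; `evaluate` here is a toy stand-in for the full train-and-test cycle, not the actual network code:

```python
# Sketch of stepwise sensitivity analysis: re-evaluate the model with one
# input feature left out and record the increase in error.
def sensitivity_analysis(features, evaluate):
    baseline = evaluate(features)
    impact = {}
    for f in features:
        reduced = [g for g in features if g != f]
        impact[f] = evaluate(reduced) - baseline   # error increase when f is omitted
    return impact

# Toy evaluate: error grows as informative features are dropped; the weights
# below are illustrative stand-ins, not the dissertation's measured values.
weights = {"dimension": 0.21, "primary": 0.29, "attrs": 0.12, "items": 0.12, "task": 0.12}
evaluate = lambda fs: 1.0 - sum(weights[f] for f in fs)
impact = sensitivity_analysis(list(weights), evaluate)
print(max(impact, key=impact.get))  # the most influential input
```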
Table 3.8 Sensitivity analysis results

Input parameter                               Error rate %   MSE     R²
Dimension                                     21.00          0.055   0.82
No. of attributes                             12.75          0.023   0.90
No. of instances                              12.75          0.021   0.93
Primary attribute                             29.50          0.058   0.73
Task                                          12.75          0.023   0.90
Dimension + Primary attribute                 41.00          0.073   0.68
No. of attributes + No. of instances + Task   14.25          0.024   0.89
3.5 Comparison with other classifiers
The best ANN architecture’s results are compared with k-nearest neighbor
(k-NN), naïve Bayes, decision tree, random forest, and SVM.
The k-NN is a non-parametric supervised learning algorithm commonly used
for classification problems [146]. In contrast to other classifiers, k-NN does
not build a generalized model from the training set; instead, the stored
training instances serve as prototypes against which new instances are
classified. The training of k-NN is fast, since the training set is only mapped
into a feature space with different regions. The k-NN classifier generally uses
the Euclidean distance to classify new instances. We used k-NN to classify
the dataset, and the overall accuracy turns out to be 84%. The individual
class performance of k-NN is shown in Table 3.11.
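A minimal k-NN with Euclidean distance, in the spirit of the description above; the toy training points and class labels are illustrative, not the 400-instance dataset:

```python
import math
from collections import Counter

# Minimal k-nearest-neighbor classifier with Euclidean distance.
def knn_predict(train, query, k=3):
    # train is a list of (feature_vector, label) pairs.
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]              # majority vote among k neighbors

train = [([0, 0], "histogram"), ([0, 1], "histogram"), ([1, 0], "histogram"),
         ([5, 5], "treemap"), ([5, 6], "treemap")]
print(knn_predict(train, [0.5, 0.5]))
```

Note the lazy-learning property: all computation happens at query time, which is why training itself is fast.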
Naïve Bayes is a probabilistic classifier based on Bayes' theorem and is used
in many complex real-world problems. It uses maximum likelihood to assign
an instance to a class, where the probability of each feature is taken
independently. However, the classifier can have low classification
performance, particularly for features with high interdependence. We used a
kernel-based naïve Bayes classifier, achieving a maximum accuracy of 81%.
The naïve Bayes classifier's individual class performance is given in
Table 3.11.
Decision tree is a hierarchical classifier. A decision tree consists of three different types of nodes, i.e., root, internal, and leaf nodes [147]. The leaf nodes are assigned class labels, while the root and internal nodes encode the decision rules used for classification. A decision tree-based classifier is applied to the dataset; Table 3.11 lists its performance.
Random forest is an ensemble classifier consisting of a large number of tree-based classifiers [148]. Each tree is built using an independent random vector of features selected from the training dataset. To classify unknown data with a random forest, every tree in the forest casts a unit vote, and the most popular class is chosen. We used five features in the input dataset and eight output classes. The random forest is trained and tested using the 10-fold cross-validation method described earlier. Forests with various numbers of trees are evaluated, and their respective overall accuracies are shown in Table 3.9. The class-wise performance of the classifier is also reported in Table 3.11.
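The tree-count experiment of Table 3.9, growing forests of different sizes and scoring each by 10-fold cross-validation, can be sketched as follows. The synthetic data and scikit-learn are assumptions for illustration, and the tree counts are a subset of those in the table.

```python
# Illustrative sketch: random forests with increasing numbers of trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)

results = {}
for n_trees in (5, 10, 50, 100):            # subset of the counts in Table 3.9
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    results[n_trees] = cross_val_score(rf, X, y, cv=10).mean()

for n_trees, acc in results.items():
    print(f"{n_trees:4d} trees: mean accuracy {acc:.3f}")
```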
Table 3.9 Overall accuracy for random forest

No. of trees   Accuracy (%)   CPU time (seconds)
5              92.50          0.302801 (average)
10             92.50
50             95.00
100            95.00
200            95.00
500            95.50
1000           95.00
SVM is a classifier originally developed to solve binary classification problems. The technique was later adapted to handle multiclass problems using various methods, i.e., one-against-all and one-against-one [149,150]. SVM is used with four different kernels: linear, polynomial, RBF, and sigmoid. The experiment is carried out with different ratios of training and testing data, as given in Table 3.10. The table shows the prediction accuracy of SVM for the four kernels with respect to the different training and test splits. Table 3.11 shows the class-wise performance of SVM.
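A minimal sketch of the kernel comparison in Table 3.10, assuming scikit-learn's SVC (whose multiclass handling is one-against-one by default) and synthetic stand-in data; the kernels and train/test ratios follow the table, everything else is illustrative.

```python
# Illustrative sketch: multiclass SVM with the four kernels of Table 3.10.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)

accuracies = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):   # poly = polynomial
    for test_size in (0.30, 0.20, 0.10):              # splits from Table 3.10
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=0, stratify=y)
        clf = SVC(kernel=kernel).fit(X_tr, y_tr)
        accuracies[(kernel, test_size)] = clf.score(X_te, y_te)

for (kernel, test_size), acc in sorted(accuracies.items()):
    print(f"{kernel:8s} test={test_size:.0%}: accuracy {acc:.3f}")
```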
Table 3.10 SVM prediction accuracy

SVM kernel             Accuracy          Accuracy          Accuracy          CPU time (seconds)
                       (70% training,    (80% training,    (90% training,
                       30% testing)      20% testing)      10% testing)
Linear (207 SVs)       85.83%            77.50%            87.50%            0.0312002
Polynomial (272 SVs)   41.67%            42.50%            40.00%
RBF (245 SVs)          76.67%            77.50%            82.50%
Sigmoid (253 SVs)      53.33%            55.00%            57.50%
Table 3.11 Per-class accuracy of different classifiers

Class label            ANN     SVM     RF      DT      NB      k-NN
Histogram              100     92.96   95.67   98      85.83   73
Pie Chart              92.3    82.41   96.5    99      80.83   69
Map                    100     100     99      99      93.33   94.75
Treemap                100     98.99   98      98      92.5    96
Parallel Coordinates   100     92      100     98      93.33   92.5
Scatter Plot           100     94.47   98.16   98      93.33   75.57
Linked graph           100     93.97   100     97      92.33   68
Line Chart             100     95      100     100     94.83   87
Average accuracy       99.03   93.73   98.79   98.37   90.79   81.98
Table 3.12 Average accuracy and CPU time of classifiers

Performance          ANN      SVM      RF         DT       NB       k-NN
Accuracy (%)         97.5     92       98         95       81       84
F-Measure            0.9634   0.9085   0.9798     0.9156   0.7867   0.8067
AUC                  0.9820   0.9619   0.9923     0.9590   0.8996   0.8698
CPU time (seconds)   0.0243   0.0312   0.302801   0.0468   0.0468   0.073601
Table 3.12 lists the overall accuracy of each classifier, along with the CPU time taken, the F-measure, and the AUC (area under the curve).
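The F-measure and AUC reported alongside accuracy could be computed as below. This is a hedged sketch: macro averaging and one-vs-rest AUC are assumptions, since the averaging scheme is not stated, and the data is synthetic.

```python
# Hedged sketch: macro F-measure and one-vs-rest AUC for a multiclass model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te), average="macro")          # assumption
auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr")
print(f"F-measure: {f1:.3f}  AUC: {auc:.3f}")
```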
The Friedman test, a non-parametric statistical test, is applied to determine whether the differences among the classifiers are significant. The significance threshold for the Friedman test is set to 0.01; p-values below this threshold indicate a statistically significant difference in favour of the ANN. Table 3.13 lists the results of the Friedman test. All comparisons are made at the 99% confidence level. As shown in Table 3.13, the ANN's accuracy is significantly better than that of the other classifiers except RF; in the case of RF, the difference in accuracy is not significant.
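A small illustration of applying the Friedman test to per-fold accuracies. SciPy's friedmanchisquare needs at least three groups, so three classifiers are compared at once here (the dissertation reports pairwise comparisons), and the fold scores are made up for illustration.

```python
# Illustrative Friedman test over hypothetical per-fold accuracies.
from scipy.stats import friedmanchisquare

# Made-up accuracies of three classifiers on the same 10 folds.
ann = [0.98, 0.97, 0.99, 0.96, 0.98, 0.97, 0.98, 0.99, 0.96, 0.97]
svm = [0.92, 0.90, 0.93, 0.91, 0.92, 0.89, 0.93, 0.92, 0.90, 0.91]
knn = [0.85, 0.83, 0.86, 0.82, 0.84, 0.83, 0.85, 0.86, 0.81, 0.84]

stat, p_value = friedmanchisquare(ann, svm, knn)
# p-values below the 0.01 threshold indicate a significant difference
# among the classifiers at the 99% confidence level.
print(f"p = {p_value:.5f}, significant: {p_value < 0.01}")
```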
The sensitivity analysis performed for the ANN in Section 3.4 is repeated for the five additional classifiers. Table 3.14 shows the results of the sensitivity analysis for each classifier. The models are more sensitive to some variables than to others: every classifier's error rate increases when the primary attribute is removed from the input data, and dropping both the primary attribute and the dimension parameter decreases the accuracy of all classifiers. This has the most pronounced effect on NB, which reaches an error rate of almost 50%. The tree-based classifier DT is comparatively less sensitive to the removal of these input parameters.
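The leave-parameters-out sensitivity analysis can be sketched as retraining with selected feature columns removed and comparing error rates. The feature names, the choice of a decision tree, and the synthetic data are illustrative assumptions.

```python
# Illustrative sensitivity analysis: drop one input parameter at a time.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["dimension", "n_attributes", "n_instances",
            "primary_attribute", "task"]   # names follow the text

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)

def error_rate(columns):
    """Mean 10-fold cross-validation error using only the given columns."""
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, columns], y, cv=10).mean()
    return 1.0 - acc

baseline = error_rate(list(range(len(FEATURES))))
results = {}
for i, name in enumerate(FEATURES):
    kept = [j for j in range(len(FEATURES)) if j != i]   # drop one parameter
    results[name] = error_rate(kept)
    print(f"without {name:17s}: error {results[name]:.3f} "
          f"(baseline {baseline:.3f})")
```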
3.6 Ranking the three best visualizations
The visualization selection methodology has been discussed throughout this chapter. However, a single recommended visualization does not always satisfy the user, especially when no option has an obvious advantage. To provide a more flexible approach, the system is enhanced to automatically select the three best visualizations out of eight. The three selected visualizations are ranked based on the neural network output; the single best visualization corresponds to the output-layer neuron with the maximum value. The modified version of the proposed approach is tested on twenty real-world benchmark datasets, selected for their diversity in terms of domain, number of attributes, number of instances, and primary variable type. These datasets are summarized in Table 3.15.
The results of this experiment are listed in Table 3.17. The trained neural network is simulated with the metadata of the 20 datasets mentioned in Table 3.15. The system provides the three best visualizations, where the first visualization in each case is the most suitable one and the other two are ranked 2nd and 3rd, respectively.
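Ranking the three best visualizations from the output layer reduces to taking the three largest activations. A minimal sketch, with class names matching the eight techniques and made-up activations for one record:

```python
# Sketch: rank the three best visualizations from output-layer activations.
import numpy as np

CLASSES = ["Histogram", "Pie Chart", "Map", "Treemap",
           "Parallel Coordinates", "Scatter Plot", "Linked graph",
           "Line Chart"]

# Hypothetical output-layer activations for one input record.
output = np.array([0.05, 0.12, 0.02, 0.08, 0.03, 0.04, 0.01, 0.65])

# Indices of the three largest activations, best first.
top3 = [CLASSES[i] for i in np.argsort(output)[::-1][:3]]
print(top3)  # → ['Line Chart', 'Pie Chart', 'Treemap']
```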
To further verify the performance of the best ANN in an unknown environment, we create a new dataset using the twenty benchmark datasets mentioned in Table 3.15. Metadata of these twenty benchmark datasets are used to predict the appropriate visualization. The attributes and their types for this newly created dataset are the same as in the original dataset explained in Section 3.1.
The dataset is then used as input to the already trained ANN model. The ANN predicts the correct output with 95.4% accuracy. Table 3.18 shows the visualizations predicted by the ANN, where each visualization is predicted for the particular task specified against the case.
Table 3.13 Accuracy comparison using Friedman test

Comparison   p-value   Confidence level (%)
ANN-SVM      0.001     99
ANN-RF       0.011     99
ANN-DT       0.001     99
ANN-NB       0.002     99
ANN-KNN      0.005     99
Table 3.14 Sensitivity analysis using classifiers: error rate (%)

Input parameter                               ANN     SVM    RF     DT     NB     k-NN
Dimension                                     21      23.7   17.4   11     30     30
No. of attributes                             12.75   18     12.3   14.5   37.5   10.5
No. of instances                              12.75   20     11.5   10.4   27.4   9
Primary attribute                             29.5    28.5   28     27     38     38
Task                                          12.75   11.4   14.2   11     21     13.5
Dimension + Primary attribute                 41      45     42.3   25     50     48.4
No. of attributes + No. of instances + Task   14.25   24     11.8   30     32     28
Table 3.15 Dataset description

Dataset                            No. of attributes   No. of instances   Variable type
Iris [2]                           4                   150                Real
Web Page Ranking [3]               5                   332                Categorical
Chess [4]                          6                   28056              Categorical
Wine [5]                           13                  178                Integer
Breast Cancer [6]                  32                  569                Real
Planning Relax [7]                 13                  182                Real
Online social media keywords [8]   35                  51                 Integer
Abalone [9]                        8                   4177               Categorical
Heart Disease [10]                 75                  303                Categorical
Cvd3 [11]                          6                   117                Categorical
Car [12]                           19                  428                Real
Flags [13]                         30                  194                Categorical
Ozone Level Detection [14]         73                  2536               Real
Bird [15]                          26                  12                 Categorical
World oil production [16]          27                  220                Categorical
P2P [17]                           6                   187                Categorical
Doctorate student by state [18]    21                  53                 Categorical
Journal articles [19]              2                   1256               Real
Internet usage [20]                72                  10104              Categorical
Image segmentation [21]            19                  2310               Integer

Dataset sources:
2  https://archive.ics.uci.edu/ml/datasets/Iris
3  http://archive.ics.uci.edu/ml/datasets/Syskill+and+Webert+Web+Page+Ratings
4  http://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29
5  http://archive.ics.uci.edu/ml/datasets/Wine
6  http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
7  http://archive.ics.uci.edu/ml/datasets/Planning+Relax
8  http://archive.ics.uci.edu/ml/datasets/Predict+keywords+activities+in+a+online+social+media
9  http://archive.ics.uci.edu/ml/datasets/Abalone
10 http://archive.ics.uci.edu/ml/datasets/Heart+Disease
11 https://www.quandl.com/data/DMDRN/
12 http://www.amstat.org/publications/jse/jse_data_archive.htm
13 http://archive.ics.uci.edu/ml/datasets/Flags
14 http://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection
15 http://ec.europa.eu/eurostat/product?code=tsien170
16 https://datamarket.com/data/set/17tl/total-world-oil-production-barrels#!ds=17tl!kqb=6&display=line
17 https://snap.stanford.edu/data/
18 trends.collegeboard.org
19 https://www.oclc.org/data/data-sets-services.en.html
20 http://archive.ics.uci.edu/ml/datasets/Internet+Usage+Data
21 http://archive.ics.uci.edu/ml/datasets/Statlog+%28Image+Segmentation%29
Table 3.17 Dataset with three best visualizations

Dataset                        Best              2nd best          3rd best
Iris                           Line Chart        Pie Chart         Histogram
Web Page Ranking               Histogram         Line Chart        TreeMap
Chess                          Linked graph      Map               TreeMap
Wine                           Line Chart        Histogram         TreeMap
Breast Cancer                  Line Chart        Histogram         Linked graph
Planning Relax                 Parallel Coord.   Map               TreeMap
Online social media keywords   Histogram         Line Chart        TreeMap
Abalone                        Parallel Coord.   Pie Chart         Map
Heart Disease                  TreeMap           Histogram         Line Chart
Cvd3                           TreeMap           Pie Chart         Scatter Plot
Car                            Line Chart        Histogram         Parallel Coord.
Flags                          Parallel Coord.   Linked graph      Pie Chart
Ozone Level Detection          Histogram         Line Chart        TreeMap
Bird                           TreeMap           Histogram         Line Chart
World oil production           Parallel Coord.   Scatter Plot      Pie Chart
P2P                            TreeMap           Histogram         Linked graph
Doctorate student by state     TreeMap           Scatter Plot      Pie Chart
Journal articles               Line Chart        Histogram         Map
Internet usage                 Pie Chart         Parallel Coord.   Map
Image segmentation             Parallel Coord.   Linked graph      Line Chart
Table 3.18 Benchmark datasets and the predicted visualization based on task

Dataset                        Task           Visualization
Iris                           Relationship   Scatterplot
Webpage Ranking                Comparison     Line Chart
Chess                          Comparison     Parallel Coordinates
Wine                           Distribution   Histogram
Breast Cancer                  Relationship   Map
Planning Relax                 Distribution   Map
Online social media keywords   Comparison     Histogram
Abalone                        Distribution   Parallel Coordinates
Heart Disease                  Relationship   Histogram
Cvd3                           Distribution   Line Chart
Car                            Comparison     Histogram
Flags                          Distribution   Histogram
Ozone Level Detection          Distribution   Histogram
Bird                           Relationship   Histogram
World oil production           Distribution   Histogram
P2P                            Relationship   Parallel Coordinates
Doctorate student by state     Trend          Parallel Coordinates
Journal articles               Distribution   Map
Internet usage                 Trend          Line Chart
Image segmentation             Distribution   Line Chart
We demonstrate the information visualization results for the iris dataset. Table A.8 in Appendix-A shows seven of the eight visualizations using the sample dataset, i.e., iris. The map visualization is not displayed for the iris dataset, since it requires spatial coordinates; we refer the reader to Figure 3.3(g) for the map visualization.
Table 3.18 lists the performance of the ANN and the five other classifiers in the unknown environment created using the 20 benchmark datasets.
3.7 Comparison with state-of-the-art
As discussed in Section 3.2, the work in [78] presents a visualization selection system based on fixed rules. The system automatically selects the visualization type and its properties from the data and the corresponding metadata, which are checked against the fixed rules. It supports the basic charts commonly used in business. In contrast, the proposed system uses a neural network to select a visualization based on the metadata. There are some fundamental differences between the neural network-based and rule-based systems. A rule-based system works on rules built by humans, whereas a neural network learns from the training data and is more efficient. A comparative study of various intelligent and expert systems is reported in [151]. The proposed approach uses metadata to select a visualization method, and the task the user needs to perform on the data is added to this metadata as well. Another limitation of a rule-based system is the domain knowledge required to build the rules, which a neural network does not need. The rules in [78] are fixed, and managing or adding new rules is itself an issue. The proposed system is more general: only training is required to predict a specific visualization. Eight visualization techniques are used in the current work, and more can be added to the system. Another advantage is that, once trained, a neural network works faster than a rule-based system.
Another closely related work on automatic visualization selection is presented in [79]. That system uses fitness scores to select the visualization type for a dataset. The fitness value is computed from the metadata of the original data as well as the metadata of each visualization type. A rule-based engine determines the fitness value for each visualization type available in the system, and the user is then offered more than one visualization option, ordered by fitness value. The implementation supports only chart visualizations, which is a limitation of the system. The system's complexity depends on the rules engine, which has to maintain both the data-mapping rules and the chart-selection rules. At the same time, the system needs to store the metadata for both the dataset and the visualization techniques. Since a quantitative comparison between the proposed system and the state-of-the-art in [78, 79] is not possible, Table 3.20 qualitatively compares the three.
3.8 Dataset extensibility
To demonstrate the extensibility of the dataset, the best performing ANN is customized and additional articles are added to the dataset. A plain data table and infographics are added as new classes. In addition, the total number of visualizations in the document is added to the dataset as an attribute. With this new attribute, all dataset instances were revised, and the records for which the total number of visualizations in the document was not available were removed to avoid noise. The representative articles used for the plain data table and infographics entries include [88], [89], and [90]. The revised dataset contains 80 records. Two neurons are added to the output layer of the best performing ANN to accommodate the newly added classes, and the customized ANN is trained on the revised dataset. This experiment resulted in an average ANN accuracy of 84%. Table 3.19 shows the average accuracy of all the classifiers on the revised dataset.
Table 3.19 Classifiers performance using revised datasets
Performance ANN SVM RF DT NB k-NN
Accuracy (%) 84 72 70 80 65 70
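One way to picture the output-layer extension described above: append weights for the two new classes to the trained output matrix, then retrain on the revised dataset. The weight shapes below are illustrative assumptions (the text does not specify the implementation), using the adopted 14-neuron hidden layer.

```python
# Conceptual sketch: extend a trained output layer from 8 to 10 classes.
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 14                            # best architecture: 14 hidden neurons
W_out = rng.normal(size=(8, n_hidden))   # stand-in for trained weights
b_out = rng.normal(size=8)

# Append randomly initialized weights for the 2 new classes (plain data
# table, infographics); the network is then retrained on the revised data.
W_out = np.vstack([W_out, rng.normal(scale=0.1, size=(2, n_hidden))])
b_out = np.concatenate([b_out, np.zeros(2)])

print(W_out.shape, b_out.shape)   # now 10 output neurons
```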
Table 3.20 Comparison between the proposed system and state-of-the-art

Feature                    Proposed system                  Rule-based system in [78]    Rank-based system in [79]
Basic working              Learning                         Human/business rules         Rules engine
Visualization selection    Metadata, tasks on data          Data, metadata               Metadata of visualization and data
Domain knowledge           Not required                     Required                     Required
Visualization types        8 different types                Common business types        Charts, maybe others
Rules management           General/learning on input data   Fixed                        Fixed
Adding new visualization   Easy/only training required      Relatively difficult         Relatively difficult
Complexity                 Number of neurons/layers         Rules/number of statements   Complexity of rules engine
3.9 Adding new visualization techniques
The dataset is also extended by adding a few modern visualization techniques, including cone tree, sunburst, storyline, bubble chart, radial chart, and area graph. The ANN-based model is then evaluated using this enhanced dataset.
3.10 Discussion
The selection of an appropriate visualization method for a particular dataset is not trivial. Normally, visualization selection proceeds either through trial and error or from one's experience. There is a complex relation between a visualization technique and the dataset to be visualized: the suitable technique may change with the attribute types, the number of attributes, and the primary task. This work has investigated an ANN-based model for the automatic selection of a visualization technique based on the metadata of the dataset. Various ANN architectures were tested with different configuration-parameter values, as summarized in Table 3.4. ANN models with one and two hidden layers and different numbers of neurons were designed and tested using training, testing, and validation sets. The ANN model takes the metadata of the dataset as input and gives a visualization technique as output. Although the one-hidden-layer model gives an accuracy of 97.50%, we also tested two-layer models to check whether accuracy improves. The experiment shows that adding a second hidden layer keeps the accuracy around 97% while increasing the computational complexity of the model. The ANN model with one hidden layer of 14 neurons is adopted, as it gave the highest accuracy with the fewest neurons. The performance results are compared using different metrics, as given in Table 3.5. The one-hidden-layer model with 14 neurons performs better, since it has higher accuracy, sensitivity, precision, and R². The results show that the ANN model captures the ambiguous relation between a dataset and its visualization technique with high accuracy.
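For concreteness, the adopted architecture (one hidden layer, 14 neurons) can be expressed with scikit-learn's MLPClassifier. This is a sketch under stated assumptions: the solver, activation, iteration budget, and synthetic data are illustrative choices, not the original training setup.

```python
# Hedged sketch of the adopted one-hidden-layer, 14-neuron architecture.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

ann = MLPClassifier(hidden_layer_sizes=(14,), max_iter=2000, random_state=0)
ann.fit(X_tr, y_tr)
acc = ann.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```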
Training with the 10-fold cross-validation method is also performed to check the reliability of the model. The accuracy did not improve much, but 10-fold cross-validation gave a more robust model. The results of the ANN are compared with five other classifiers: k-NN, naïve Bayes, decision tree, random forest, and SVM. All classifiers were implemented with their optimum structure and executed on the same machine. Each classifier was tested for different parameter configurations, e.g., the kernel function for SVM, the number of trees for RF, and the number of neighbours for k-NN. The accuracy for ANN, SVM, RF, DT, NB, and k-NN was 97.5%, 92%, 98%, 95%, 81%, and 84%, respectively, as shown in Table 3.12. The ANN model outperformed all the classifiers except RF, where both classifiers took almost the same CPU time, while other classifiers such as k-NN took more time to complete. NB comparatively shows the worst performance, with an accuracy of only 81% while taking almost 100% more time than the ANN. The individual class comparison shows that ANN and RF have close results and are better than the other classifiers (see Table 3.11). RF gives 100% accuracy for three classes in the dataset, while ANN gives 100% accuracy for seven classes.
The discussion above shows that only RF's performance was close to that of the ANN. The ANN model's basic configuration and complexity depend on the number of neurons, while the RF model's complexity depends on the number of trees in the forest. The RF's accuracy also depends on how uncorrelated the trees in the forest are, whereas the ANN model depends on the learning algorithm and the activation function of its neurons. Looking closely at the range of accuracies for both models, the ANN's lowest accuracy is 96% for the one-layer architecture with 12 neurons, while the RF's lowest accuracy is 92% with a five-tree forest. The performance of RF is better only when there are more than 500 trees in the forest, which adds to its complexity.
The proposed approach was also compared with the related state-of-the-art approaches, which were two rule-based systems. The study shows that the proposed approach has several advantages over the existing ones. Flexibility is an appealing aspect of the proposed system: the ANN-based system does not need direct domain knowledge and can handle complex relations in the data, whereas building rules for a rule-based system requires domain expertise.
3.11 Chapter summary
Selecting an appropriate visualization technique for specific data is essential. This study presented the automatic selection of an information visualization technique based on the metadata of a dataset. The proposed solution was based on an artificial neural network (ANN), whose performance was compared with five other classifiers and the two most closely related works. The ANN-based prediction model for automated visualization selection outperformed the k-nearest neighbour, naïve Bayes, decision tree, random forest, and support vector machine classifiers in terms of accuracy and/or time consumed. The dataset used consisted of eight classes, and the proposed ANN-based model provides a generic framework that can accommodate more than eight classes: appropriate data may be added to the dataset, along with neurons in the input/output layers. In contrast to the current state-of-the-art approaches, which are rule-based, the proposed solution is generic and can accommodate new input patterns. The work brings a new perspective to the field of visualization, where new visualizations may be added to the dataset in order to build a comprehensive database. The dataset will then provide a foundation for an expert system with a knowledge base.
This work can be extended in several directions. We used eight visualization techniques; more may be added to the current dataset. Along with this, visualization weights may be added to increase the selection probability of a particular visualization technique. We used four tasks in our dataset, and one task is considered for the selection of a visualization; the work can be extended to incorporate more tasks, particularly user-specific ones. Another interesting future aspect would be developing a library based on this work and integrating it with various development environments, e.g., data mining packages, electronic worksheets, and online services. A further direction is the optimization or configuration of the selected visualization according to user requirements. A user study, such as a controlled experiment, may be performed to evaluate the output of the system, and user feedback on the selected visualization could be added to the system in the future.
Chapter 4 Visualization Optimization: An EC-Based Approach
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 100
Quantifying and Optimizing
Visualization
“Measure what is measurable, and make measurable what is not so”
Galileo Galilei
Advances in computing technology and computer graphics, coupled with huge collections of data, have introduced new visualization techniques. This gives users many choices of
visualization techniques to gain an insight about the dataset
at hand. However, selecting the most suitable visualization
for a given dataset and the task to be performed on the data is
subjective. The work presented here introduces a set of
visualization metrics to quantify visualization techniques.
Based on a comprehensive literature survey, we propose
effectiveness, expressiveness, readability, and interactivity as
the visualization metrics. Using these metrics, a framework
for optimizing the layout of a visualization technique is also
presented. The framework is based on an evolutionary
algorithm (EA) which uses treemaps as a case study. The EA
starts with a randomly initialized population, where each
chromosome of the population represents one complete
treemap. Using the genetic operators and the proposed
visualization metrics as an objective function, the EA finds
the optimum visualization layout. The visualizations that
evolved are compared with the state-of-the-art treemap
visualization tool through a user study. The user study
utilizes benchmark tasks for the evaluation. A comparison is
also performed using direct assessment, where internal and
external visualization metrics are used. Results are further
verified using analysis of variance (ANOVA) test. The results
suggest better performance of the proposed metrics and the
EA-based framework for optimizing visualization layout. The
proposed methodology can also be extended to other
visualization techniques.
The increased use of computing devices in almost every field of life is
generating data at a much higher pace than ever before. This is mostly due to
the advances in data processing speed, communication technology, and
storage capacity. Such burst of emerging data paved the need for techniques
that could extract useful information hidden in the data. The techniques such
as predictive analysis, descriptive analysis, or visual inspection can address
such issues. Like predictive and descriptive analysis, multiple options are
available for the visual inspection of data [152]. Over the years, various
information visualization techniques have been proposed ranging from a
simple pie chart and line graph to more sophisticated techniques such as
treemaps, boxplot, and parallel coordinates. Selection of a particular
visualization technique depends on the type of data, its size, the task to be
performed, and the problem domain. Visualization provides humans with an
ability to explore and analyse large volumes of data visually using cognition
and perception. Therefore, establishes the augments for the decision making
capabilities of humans. However, due to its interdisciplinary nature, often
interpreting visualization can be tedious for the domain experts. A domain
expert needs answers to complex questions by visualizing their data. During
this process, one basic question a domain expert may ask, “is this a better
visualization?” or “can the visualization quality be further enhanced with respect to
aesthetic and perception aspects?” At the same time there is a need for
aesthetically better and perceptually pleasing visualizations [91] which are, at
the same time, helpful in extracting useful patterns in the data. These issues
can be addressed if there were metrics that could quantify a particular visualization technique or its layout. Initial work on the quantification of visualization can be seen in [81, 82, 85, 155]. However, these works either give only initial definitions [81, 82] or are limited to a few aspects of a specific visualization technique [85, 155]. Further, to automate the optimization of a visual display, techniques are needed that can autonomously present an optimal layout of a visualization from the viewer's perspective.
This work presents an approach for the automated optimization of visualization layout. For this purpose, a set of visualization metrics consisting of four components, namely effectiveness, expressiveness, readability, and interactivity, is proposed. These metrics are selected and formulated based on a literature survey. To create visualizations and optimize their layout, an evolutionary algorithm (EA) is used, with the proposed set of visualization metrics as its objective function. Treemaps are adopted as the visualization method in the experiments. The effectiveness of the proposed approach is evaluated using a controlled experiment based on benchmark tasks, and it is also compared with a state-of-the-art treemap visualization tool using both internal and external evaluation metrics.
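The EA loop described above can be sketched at a high level: chromosomes encode candidate treemap layouts, and a weighted combination of the proposed metrics serves as the fitness. Everything below is a placeholder sketch; the layout encoding, metric functions, and genetic operators are illustrative stand-ins, not the dissertation's actual formulation.

```python
# High-level EA skeleton for layout optimization (placeholder metrics).
import random

random.seed(0)
N_ITEMS, POP_SIZE, GENERATIONS = 12, 20, 30

def random_layout():
    # A chromosome: one candidate treemap, here reduced to an item ordering.
    layout = list(range(N_ITEMS))
    random.shuffle(layout)
    return layout

def fitness(layout):
    # Placeholder objective: stand-ins for effectiveness and readability;
    # the real metrics (effectiveness, expressiveness, readability,
    # interactivity) would replace these stubs.
    effectiveness = 1.0 / (1 + sum(abs(a - b)
                                   for a, b in zip(layout, sorted(layout))))
    readability = 1.0 / (1 + layout[0])
    return 0.5 * effectiveness + 0.5 * readability   # weighted combination

def mutate(layout):
    child = layout[:]
    i, j = random.sample(range(N_ITEMS), 2)
    child[i], child[j] = child[j], child[i]          # swap mutation
    return child

population = [random_layout() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]           # truncation selection
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POP_SIZE - len(survivors))]

best = max(population, key=fitness)
print(f"best fitness: {fitness(best):.3f}")
```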
4.1 The information visualization metrics
The work in [156] suggests moving beyond user studies and controlled experiments to find newer evaluation methods for information visualization. Over the years, different metrics have been proposed to measure the quality of visual information based on various aspects. Most of the work in this area is inspired by the original work of Tufte [157] and Mackinlay [158]. Huang et al. [159] and Bennett et al. [160] discuss several metrics related to graph visualization. A perceptual metric for scatter plots is presented by Albuquerque et al. [96], while treemap-related metrics are proposed in [161] and [162]. Aritra et al. [95] suggest screen-space metrics for the effective use of parallel coordinates. This section presents the proposed metrics for quantifying information visualization techniques. They are formulated based on a comprehensive literature survey covering previously published work and theories on the various aspects of an appealing visualization technique.
Visualization represents data using shapes, colors, and their patterns so that the information hidden in the data becomes obvious to the viewer. It is an alternative way of presenting raw data to communicate the information contained therein. This makes effectiveness an
important aspect of any visualization technique. Visualization techniques
(or layouts of a particular technique) that fail to adequately
highlight and communicate the hidden information in the data are less
effective than those that do so more precisely. Effectiveness is especially
important when visualizing high-dimensional data. The effectiveness of a
visualization can be measured by its ability to communicate the intended
message [164]. The visualization technique should take into account all the
components of the data. Depending on the particular visualization technique,
the data is either presented directly or shown indirectly using aggregated
quantities. Visualizations that consider all data during the rendering
process are more expressive and thus show a better picture of the data.
To communicate the desired information using visualization, the depicted
visual patterns should be readable. Distorted, blurred, or overlapping
regions in the visualization make it less readable and reduce its
effectiveness and expressiveness [95]. A suitable choice of colors can
further increase the readability of a visualization technique. In addition
to coloring, the choice of shapes, their layout, size, and the position of
visualization components also influence readability. Readability can be
further enhanced using labels and tooltips. Readability is often influenced
by the size of the data: some visualization techniques, such as parallel
coordinates, become less readable on huge datasets, while others, for
instance treemaps, are more appropriate for visualizing large datasets. To
allow various operations on a visualization, it needs to be interactive. Not
all visualizations are interactive; however, interactivity adds value to a
visualization technique by allowing the information to be displayed at
various levels using options such as filtering, zooming, and querying.
Based on the discussion above, four metrics to quantify a visualization
technique are identified, namely, effectiveness, expressiveness, readability, and
interactivity. Table 4.1 lists the works that mention these aspects.
Table 4.1 Aspects mentioned in literature for better visualization
Aspects Previous works
Effectiveness [45], [51], [13], [42], [52], [53], [54], [55], [38], [56], [57], [4], [21],
[25], [58], [45], [59], [60], [46], [24], [26], [61], [21]
Expressiveness [45], [51], [56], [54], [50], [4], [58], [60], [62], [61]
Readability [48], [54], [32], [56], [63], [64], [65]
Interactivity [50], [54], [66], [67], [68], [69], [62], [61]
4.1.1. Effectiveness
For a given visualization and a dataset (that is to be visualized) X = (X0, X1,
…, Xn-1) with n attributes, the mathematical formulation of the effectiveness is
shown as:
effectiveness(X, Ww, VWb) = (∑_{i=0}^{n-1} Ww_i · VWb_i) / (∑_{i=0}^{n-1} Ww_i) (4.1)

effectiveness = { 0, if ∑_{i=0}^{n-1} Ww_i = 0
               { (∑_{i=0}^{n-1} Ww_i · VWb_i) / (∑_{i=0}^{n-1} Ww_i), otherwise (4.2)

where Ww = [Ww_0, Ww_1, …, Ww_{n-1}] is a weight matrix containing a weight
in the range [0, 1] for each item in X, and VWb = [VWb_0, VWb_1, …, VWb_{n-1}]
is a visualization bit matrix corresponding to each item in X. An item in VWb
is assigned a value of 1 if its corresponding item in X is being visualized;
otherwise it is assigned 0. The minimum value of effectiveness is 0, and a
higher value indicates a better visualization.
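Computationally, Eq. (4.1)–(4.2) amount to a weighted coverage ratio over the data items. A minimal Python sketch (the function name and the plain-list encoding of Ww and VWb are illustrative assumptions, not part of the proposed system):

```python
def effectiveness(ww, vwb):
    """Weighted fraction of the data items shown by the visualization.

    ww:  per-item importance weights in [0, 1] (the matrix Ww)
    vwb: per-item visualization bits, 1 = item is visualized (the matrix VWb)
    """
    total = sum(ww)
    if total == 0:  # degenerate case: effectiveness is defined as 0
        return 0.0
    return sum(w * b for w, b in zip(ww, vwb)) / total

# Showing only the two most important of four items:
# (0.9 + 0.8) / (0.9 + 0.8 + 0.1 + 0.2) = 0.85
print(effectiveness([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```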
4.1.2. Expressiveness
For a given visualization and a dataset (that is to be visualized) X = (X0, X1,
…, Xn-1) with n attributes, the mathematical formulation of the expressiveness is:
expressiveness(X, Wb, VWw) = (∑_{i=0}^{n-1} Wb_i · VWw_i) / (∑_{i=0}^{n-1} Wb_i) (4.3)

expressiveness = { 0, if ∑_{i=0}^{n-1} Wb_i = 0
                { (∑_{i=0}^{n-1} Wb_i · VWw_i) / (∑_{i=0}^{n-1} Wb_i), otherwise (4.4)

where Wb = [Wb_0, Wb_1, …, Wb_{n-1}] is a bit matrix that contains a value of
1 if the corresponding element in X is being visualized and 0 otherwise, and
VWw = [VWw_0, VWw_1, …, VWw_{n-1}] is a visualization weight matrix
corresponding to each item in X, with weights in the range [0, 1]. The
minimum value of expressiveness is 0, and a higher value indicates a better
visualization.
4.1.3. Readability
For a given visualization and a dataset X = (X0, X1, …, Xn-1) with n
attributes, the mathematical formulation of the readability is:
readability(X, Ww, VWb, RWw) = ∑_{i=0}^{n-1} Ww_i · VWb_i · RWw_i + σ(RWw) (4.5)

where Ww = [Ww_0, Ww_1, …, Ww_{n-1}] is a weight matrix that contains a
weight (0 or 1) for each item in X, VWb = [VWb_0, VWb_1, …, VWb_{n-1}] is a
visualization bit matrix corresponding to each item in X (an item in VWb is
assigned 1 if its corresponding item in X is being visualized and 0
otherwise), and RWw = [RWw_0, RWw_1, …, RWw_{n-1}] is a readability weight
matrix corresponding to each item in X, based on the color assigned during
the visualization process. Each value of RWw lies in [0.2, 1]; it can be 0.2,
0.4, 0.6, 0.8, or 1. This gives the flexibility to assign weights to colors
based on a range of RGB combinations. The value of readability is greater
than 0, where a lower value indicates a less readable visualization. Table
4.2 lists the possible ranges of weights for RWw. The value σ(RWw) is the
standard deviation of the values in RWw.
Table 4.2 Value range for RWw
R or G or B R or G or B R or G or B RWw
0-50 25% less than first column’s value 25% less than first column’s value 0.2
51-100 25% less than first column’s value 25% less than first column’s value 0.4
101-150 25% less than first column’s value 25% less than first column’s value 0.6
151-200 25% less than first column’s value 25% less than first column’s value 0.8
201-255 25% less than first column’s value 25% less than first column’s value 1
1 1 1 1
0 0 0 1
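Table 4.2 can be read as a lookup from an RGB triple to its readability weight RWw. The following Python sketch is one interpretation: it assumes the first column refers to the dominant (largest) channel, treats the last two rows as pure white and pure black, and does not enforce the "25% less" constraint on the other two channels:

```python
def color_weight(r, g, b):
    """Readability weight RWw for an RGB color, per one reading of Table 4.2.

    Interpretation (assumption): the table's first column is the dominant
    (largest) channel; the rows for pure black and pure white both map to
    the maximum weight of 1.
    """
    if (r, g, b) in ((0, 0, 0), (255, 255, 255)):
        return 1.0
    dominant = max(r, g, b)
    if dominant <= 50:
        return 0.2
    if dominant <= 100:
        return 0.4
    if dominant <= 150:
        return 0.6
    if dominant <= 200:
        return 0.8
    return 1.0
```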
4.1.4. Interactivity
For a given visualization and a dataset X = (X0, X1, …, Xn-1) with n
attributes, the mathematical formulation of the interactivity is:

interactivity(X, VWb, IWb) = ∑_{i=0}^{n-1} VWb_i · IWb_i (4.6)

where VWb = [VWb_0, VWb_1, …, VWb_{n-1}] is a visualization bit matrix
corresponding to each item in X (an item in VWb is assigned 1 if its
corresponding item in X is being visualized and 0 otherwise), and
IWb = [IWb_0, IWb_1, …, IWb_{n-1}] is an interactivity bit matrix
corresponding to each item in X. An element of IWb is set to 1 if the
visualized component is interactive; otherwise it is 0. A higher value of
interactivity indicates a more interactive visualization.
4.1.5. The combined fitness function
The combined fitness function linearly adds four metrics. An overall fitness
function can, therefore, be represented as a sum of effectiveness, expressiveness,
readability, and interactivity values. To influence the weight of each value, a
constant multiplier is used. The combined fitness function (CFF) would then
be:
CFF = α·effectiveness(X, Ww, VWb) + β·expressiveness(X, Wb, VWw)
    + γ·readability(X, Ww, VWb, RWw) + δ·interactivity(X, VWb, IWb) (4.7)

The values of α, β, γ, and δ range between [0, 1]. In the experiments these
weights are set to 1. However, Section 5.2.1 empirically shows their effect
on the visualizations evolved.
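Since the combined fitness function is a plain weighted linear sum, it can be sketched in a few lines of Python (the helper name and the tuple encodings are illustrative assumptions):

```python
def combined_fitness(metrics, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted linear sum of the four visualization metrics.

    metrics: (effectiveness, expressiveness, readability, interactivity)
    weights: (alpha, beta, gamma, delta), each in [0, 1]
    """
    return sum(w * m for w, m in zip(weights, metrics))

# All weights set to 1, as in the experiments:
print(combined_fitness((0.85, 0.60, 2.40, 3.0)))
```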
4.2 Proposed solution
The problem at hand is to quantify how well a visualization technique
displays the data and then, based on these quantifiable aspects, optimize
the particular visualization technique for a better visual representation.
The proposed solution uses an EA-based
approach to optimize the layout of a visualization technique. To quantify
visualization, four metrics: effectiveness, expressiveness, readability, and
interactivity are used. Figure 4.1 shows the complete structure of the proposed
system. It starts with an initial population of candidate solutions (also called
chromosomes or individuals), where each individual represents a complete
visualization. As a case study, the treemap22 visualization technique is
used. A treemap is a space-constrained visualization of hierarchical
structures. A
mapping function converts the given chromosome into a complete
visualization. In order to evaluate the fitness of the chromosome a combined
fitness function (also called the objective function) is used. The objective
function consists of the proposed visualization metrics presented in Section
4.1 (Eq. (4.7)).

22 http://www.cs.umd.edu/hcil/treemap/

Figure 4.1 Proposed system working

Once the fitness of each chromosome is evaluated, those with the
worst objective function values are discarded. The remaining chromosomes are
used to fill-in the population using genetic reproduction operators of
crossover and mutation. Various combinations of crossover and mutation are
used. This procedure is repeated for a fixed number of iterations or until
the solution converges.
In experiments we used various combinations of the objective functions,
mutation rate, and crossover. The EA converges to the optimum visualization
based on the combined fitness function. In addition, the EA also finds the
optimum visualization using the individual visualization metric of
effectiveness, expressiveness, readability, and interactivity. This results in
five visualizations, which are evaluated and compared with a
state-of-the-art treemap visualization tool using a user study and direct
assessment. For the user study, visualization benchmarks are used. The direct
assessment is performed using internal and external visualization metrics.
4.2.1. Problem formulation
The visualization layout optimization problem is formulated using three
sets: the visualization metrics, the design parameters, and the
visualization methods. A particular visualization is expressed in terms of
the design parameters to optimize its layout and aesthetic properties. These
parameters are then mapped to the visualization metrics. Let M be a set
consisting of the four visualization metrics.
M = {m1, m2, m3, m4} (4.8)
Let D be the set of n design parameters that serve as attributes to a
visualization technique, each having a specific range of values from the
domain of real numbers R.
D = {p1, p2, …, pn} (4.9)
Every visualization technique has design parameters from set D and can be
measured by M, where each member of M is described by the set of elements
in D. Similarly, a set V represents the visualization techniques. The mapping
of these three sets is shown in Figure 4.2. The top layer represents
visualization type (i.e., space) where different visualization kinds are
instantiated (e.g., v1, …, vn). The middle layer depicts the design
variables (p1, …, pn), where each design variable may be used with any of
the visualization types. The bottom layer represents the four visualization
metrics (m1, m2, m3, m4), where each metric comprises one or more design
variables. The
objective is to find the optimal design parameter set D′ ⊆ D for a
particular visualization vi that satisfies the metric mj.

D′ = {p′1, p′2, …, p′k} (4.10)
Figure 4.2 Mapping of visualization parameters and metrics
M ← V × D (4.11)
A quality visualization is one having optimal design parameter values for
all its metrics. This quality function can be expressed in terms of the
visualization metrics as:

Q = m1 + m2 + m3 + m4 (4.12)

where each mi consists of the design parameters required for the particular
metric.
4.2.2. Chromosome encoding
The chromosome used in this work is a one-dimensional array having N
cells, where N is the number of features available in a visualization
technique. The chromosome is of fixed size; however, the chromosome length
may vary for different visualization techniques. In the case of treemaps,
each chromosome has twelve genes. These genes represent the following
attributes of the treemap visualization: layout, border, border size,
border color, label, label size, label
location, color, hierarchy, leaf node, size, and interactive. The RGB color
scheme is used for its wide application [174] in the literature related to the
current proposal. However, the use of HSV space may also result in
perceptually more attractive visualizations. Table 4.3 lists the twelve
genes, their possible values (alleles), and their descriptions.
The structure of a sample chromosome and its corresponding treemap is
illustrated in Figure 4.3. Each cell of the chromosome represents one
component of the visualization technique. The cells can have an integer
value within the range mentioned in Table 4.3. The first cell of the
chromosome contains 2, indicating the squarified layout of the treemap, and
the second cell indicates that no border is applied. Although the third cell
of the chromosome contains a border size of 3, it is ignored since the
border option is disabled for this treemap. Similarly, the color scheme for
the border is also ignored. Labels for the sample treemap in Figure 4.3 are
enabled, with size 8 and center alignment. The RGB color scheme is used for
the treemap, with hierarchy and leaf nodes enabled and interactivity
switched off. A mapping function is used to convert the given chromosome
into a complete treemap.
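The encoding described above can be sketched as follows; the per-gene value ranges follow Table 4.3, while the gene names, the palette-index simplification for the two color genes, and the binary "size" gene are illustrative assumptions (hypothetical Python):

```python
import random

# Permissible values (alleles) per gene, following Table 4.3. The gene
# names, the palette-index simplification for the color genes, and the
# binary "size" gene are assumptions for illustration.
ALLELES = {
    "layout":         range(3),      # 0-slice&dice, 1-strip, 2-squarified
    "border":         range(2),      # 0-no, 1-yes
    "border_size":    range(1, 51),  # 1-50
    "border_color":   range(256),    # index into an RGB palette
    "label":          range(2),
    "label_size":     range(6, 21),  # 6-20
    "label_location": range(3),      # 0-right, 1-center, 2-left
    "color":          range(256),
    "hierarchy":      range(2),
    "leaf_node":      range(2),
    "size":           range(2),      # relative to display area / full screen
    "interactive":    range(2),
}

def random_chromosome():
    """One individual of the initial EA population: a fixed-length
    integer array with one gene per treemap attribute."""
    return [random.choice(list(r)) for r in ALLELES.values()]
```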
Table 4.3 Description of the genes in a chromosome
Genes Alleles Description
Layout 0-Slice dice, 1-Strip, 2-Squarified Three layout algorithms are considered for Treemap visualization; each layout
has its own pros and cons
Border 0-No, 1-Yes Treemap may have a border or the border may not be applied
Border size 1-50 If a border is applied to a treemap, sizes from 1 to 50 can be used
Border color RGB color/256 Each color is represented by the RGB color palette
Label 0-No, 1-Yes The treemap may use labels
Label size 6-20 The labels may vary in size
Label Location 0-Right, 1-Center, 2-Left The label has three location options
Color RGB color/256 Label color is selected from the 256 RGB colors
Hierarchy 0-No, 1-Yes The hierarchy information between the item may or may not be shown
Leaf Node 0-No, 1-Yes Leaf node in the treemap may or may not be shown
Size Relative to display area/full screen Treemap size is relative to the screen
Interactive 0-No, 1-Yes Treemap interactivity can be enabled or disabled
Figure 4.3 A sample chromosome and its corresponding treemap
4.2.3. Reproduction operators
The reproduction operators used in this work are the crossover and
mutation operations. In our experiments, we tested their different variations
on EA. The experiments are conducted both with and without crossover.
However, mutation is used in both experiments. A single point crossover
mechanism is used and six mutation rates are tested to obtain the optimum
one. Once the fitness of all chromosomes in the population is evaluated, the
crossover and mutation operations are performed. For crossover, the process
is designed as follows:
1. Randomly select two chromosomes from the existing population.
2. Generate a random number, Cp, between 0 and N, where N is the length of
the chromosome. Cp becomes the crossover point and the data is swapped to
generate one child chromosome.
3. The above two steps are repeated until the number of required individuals
in the population is complete.
Only one child is produced from each selected parent pair.
For mutation, the process is designed as follows:
1. Once all child chromosomes are generated through crossover, a random
number mpi, called the mutation probability, is generated between 0 and 100
for each child chromosome.
2. For each child, if the value of mpi is less than the mutation rate, a
randomly selected cell's value is changed to some other permissible one.
The crossover and mutation operations are illustrated in Figure 4.4 for a
chromosome of length twelve.
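The two operators described above can be sketched in Python as follows (hypothetical helpers; drawing the crossover point from [1, N-1], so that both parents contribute at least one gene, is an assumption):

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover producing one child.

    A crossover point Cp is drawn in [1, N-1] (assumption: both parents
    contribute); the child takes the first Cp genes from parent_a and
    the remaining genes from parent_b.
    """
    cp = random.randint(1, len(parent_a) - 1)
    return parent_a[:cp] + parent_b[cp:]

def mutate(child, mutation_rate, alleles):
    """With probability mutation_rate (in percent), replace one randomly
    chosen gene with another permissible value.

    alleles: per-gene sequences of permissible values (cf. Table 4.3).
    """
    if random.uniform(0, 100) < mutation_rate:
        i = random.randrange(len(child))
        child = child[:i] + [random.choice(alleles[i])] + child[i + 1:]
    return child
```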
Figure 4.4 Crossover and mutation operations.
After the reproduction operators are applied, the new population is
generated and tested for fitness using the objective function. EAs offer
many options for selecting the population for the next generation. The first
selection method is random selection, where no reference is made to the
fitness of the chromosome: each chromosome, regardless of its fitness, has
an equal chance of selection into the next generation. The second method is
proportional selection, in which chromosomes are selected based on their
fitness relative to the fitness of all other chromosomes. Another popular
selection method is rank-based selection, which uses the rank ordering of
the fitness values instead of the actual fitness.
The proportional selection method represented in Eq. (4.13) is used. In Eq.
(4.13), P(C′_i) is the probability that the individual C′_i will be selected
and F_EA(C′_i) is the fitness value of the individual as computed in Eq.
(4.7). Later, in Section 5.2.5, experiments are also conducted using
proportional selection. Out of m chromosomes, m/2 are selected as parents
based on their fitness value, and the rest are generated by applying the
reproduction operators to the top m/2 chromosomes in the population.

P(C′_i) = F_EA(C′_i) / ∑_{j=1}^{m} F_EA(C′_j) (4.13)
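Proportional selection as in Eq. (4.13) corresponds to classic roulette-wheel selection, which can be sketched as (hypothetical Python helper):

```python
import random

def proportional_select(population, fitness):
    """Select one chromosome with probability proportional to its
    fitness (roulette-wheel selection, cf. Eq. (4.13))."""
    total = sum(fitness)
    pick = random.uniform(0, total)
    acc = 0.0
    for chromosome, f in zip(population, fitness):
        acc += f
        if pick <= acc:
            return chromosome
    return population[-1]  # guard against floating-point round-off
```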
4.3 Experiments and results
The experiments conducted and the results obtained are described in this
section. The treemap visualization, which is used as a case study, is already
explained in Section 2.4. The simulations are run using the combined fitness
function, and the EA is also applied to each visualization metric
individually. This produces five sets of results, which are compared using
the internal metrics. The results are also compared with a randomly created
visualization and a visualization created using the state-of-the-art tool
for treemaps. The EA is run with various mutation rates to select the most
suitable one. A controlled
experiment using twenty participants is performed for the visualizations
evolved. The benchmark tasks are used for the controlled experiment and the
results are verified using an analysis of variance (ANOVA) test. All
experiments were carried out on an Intel Core i5 machine with a 2.5 GHz
processor and 4 GB RAM. Table 4.4 summarizes the EA parameter settings for
various
simulations.
Table 4.4 EA parameter settings
Parameter                Value(s)
Number of populations    1
Initial population size  500
Reproduction operators   Crossover and mutation; mutation only
Crossover type           Single point
Mutation rate            0.05%, 0.5%, 1%, 5%, 10%, 15%, 20%, and 25%, with 25% and 50% elitism
Crossover                Single-point crossover, where each chromosome contributes 50% of its attributes to the new child
Stopping criterion       1000 iterations / convergence
4.3.1. Treemap
Treemap is a visualization technique used for scientific visualization as
well as for information visualization, with many variations. Treemaps are
used to visualize hierarchical data in many domains, including business,
news, and software visualization.

Figure 4.5 A tree with its corresponding treemap

Since its inception, different variations of the original
treemap have been proposed. Initially, the treemap was proposed for
visualizing hard disk contents comprising thousands of files. Since then, a
variety of treemap layouts have been proposed to visualize large volumes of
data on a single screen. Treemaps are based on a tree-like hierarchical
structure, where each attribute corresponds to a level in the tree. While
building a treemap, the screen is divided into nested rectangles. Each
rectangle corresponds to a node in the tree. The smallest and innermost
rectangles represent leaf nodes, while the enclosing rectangles represent
the non-leaf nodes. The color and size of each rectangle show different
dimensions or attributes of the data, as shown in Figure 4.5. The treemap
has many advantages, which makes it the primary choice in our approach as a
case study. When size, color, and dimensions are associated with a tree-like
structure, hidden patterns become prominent. The EA searches for the optimum
combination of size, color, and dimensions for a treemap using the proposed
visualization metrics. Another advantage of using treemaps is the optimum
utilization of space, visualizing thousands of items on the screen
simultaneously.
Figure 4.6 Convergence of the EA under six mutation rates (x-axis: iterations; y-axis: normalized fitness)
4.3.2. EA results
The EA is used to evolve the population using the combined fitness function.
The best chromosomes based on the combined fitness function, effectiveness,
expressiveness, readability, and interactivity are archived. This generates
five chromosomes to be saved after each iteration. During the evolution
process six mutation rates are tested, namely, 0.1%, 0.5%, 1%, 5%, 10%, and
15%. The mutation rate of 10% appears to converge fastest. Figure 4.6 shows
the convergence speed for the six mutation rates. These results represent
the average values obtained from ten EA runs. Each run of the EA evolved the
population for 1000 iterations.
Figure 4.7 shows the combined fitness function-based convergence graph of
EA using 10% mutation for 1000 iterations. It also shows the convergence of
the population using the four metrics of visualization, i.e., effectiveness,
expressiveness, readability, and interactivity. The x-axis in Figure 4.7 shows the
number of iterations and y-axis represents the normalized fitness values of
four individual fitness criteria and the combined fitness function. Since all
fitness metrics for each chromosome will have varying range of values, we
Chapter 4 Visualization Optimization: An EC-Based Approach
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 119
normalized those to the same scale to compare as shown in Figure 4.7. Each
line in Figure 4.7 shows the best fitness of its relevant criterion achieved till
the particular iteration. For the experiment in Figure 4.7 the population is
evolved using the combined fitness function. This is the reason to the
inconsistent peaks and dips for the individual metrics, except for the
combined fitness function.
In addition to the evaluation of EA population using the combined fitness
function, four other populations are evolved using the effectiveness,
expressiveness, readability, and interactivity as an objective function, respectively.
During this experiment the population was evolved using one of the four
individual metrics. However, the best individuals based on other three criteria
were also saved for further analysis. Figure 4.8 to Figure 4.11 show the
convergence of the EA using effectiveness, expressiveness, readability, and
interactivity as an objective function, respectively. Table 4.5 shows the
normalized fitness value and the iteration number where the best solution
was found. This shows the relation between the various fitness criteria and
their effect on each other. The fitness values remain almost the same across
the various objective functions; however, the number of iterations varies as
the objective function changes.
Figure 4.7 Convergence of the EA using the combined fitness function
Figure 4.8 Convergence of the EA using effectiveness as fitness function
Figure 4.9 Convergence of the EA using expressiveness as fitness function
Figure 4.10 Convergence of the EA using interactivity as fitness function
Figure 4.11 Convergence of the EA using readability as fitness function
4.3.3. Evaluation
The experiments mentioned in Section 4.3.2 give a total of five
visualizations, evolved using the combined fitness function, effectiveness,
expressiveness, readability, and interactivity, respectively. To evaluate the
results, the empirical method (user study) is used, as it is an effective
form of quantitative evaluation for information visualization. In addition,
the evolved visualizations are assessed using the direct method, based on
internal and external metrics. The internal metrics are the four
visualization quantification measures represented in Eq. (4.1), Eq. (4.3),
Eq. (4.5), and Eq. (4.6), and the combined fitness function as presented in
Eq. (4.7). The
external evaluation metrics are listed below.
E = (Z_A − Z_T − Z_ME) / √3 (4.14)

where E is the visualization efficiency as suggested in [12], and Z_A, Z_T,
and Z_ME are the standard z-scores for accuracy, time, and mental effort,
respectively.
The second external evaluation metric is the quality of visualization (Q):

Q = (Z_A + Z_VS − Z_T − Z_ME) / √4 (4.15)

where Z_A, Z_VS, Z_T, and Z_ME are the standard z-scores of accuracy,
visualization score, time, and mental effort, respectively. Eq. (4.15)
defines the quality of visualization in terms of four dependent variables:
response time, mental effort, accuracy, and visualization score. It captures
the difference between the z-scores of accuracy and visualization score and
the z-scores of response time and mental effort.
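Both external metrics are differences of standard z-scores divided by a normalizing constant. A hedged Python sketch (the helper names, and the assumption that each divisor is the square root of the number of terms, are mine):

```python
import math
from statistics import mean, pstdev

def z_scores(values):
    """Standard z-scores of a list of raw measurements."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def efficiency(z_a, z_t, z_me):
    """Visualization efficiency E from z-scores of accuracy, response
    time, and mental effort (divisor assumed to be sqrt(3))."""
    return (z_a - z_t - z_me) / math.sqrt(3)

def quality(z_a, z_vs, z_t, z_me):
    """Visualization quality Q; z_vs is the z-score of the visualization
    score (divisor assumed to be sqrt(4))."""
    return (z_a + z_vs - z_t - z_me) / math.sqrt(4)
```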
Table 4.5 Best fitness values for various combinations of the objective
function (iteration number of the converged solution in parentheses)
Objective function | Best fitness using the objective function | Effectiveness | Expressiveness | Readability | Interactivity | Combined
Combined       | 11 (138) | 03 (50) | 02 (200) | 03 (138) | 01 (30) | 11 (138)
Effectiveness  | 03 (30)  | 03 (30) | 02 (80)  | 03 (100) | 01 (40) | 09 (650)
Expressiveness | 02 (190) | 03 (32) | 02 (190) | 03 (260) | 01 (25) | 11 (190)
Readability    | 02 (290) | 03 (38) | 02 (300) | 02 (290) | 01 (40) | 10 (100)
Interactivity  | 01 (25)  | 03 (70) | 03 (100) | 03 (150) | 01 (25) | 11 (140)
4.3.3.1. User study
The user study is carried out to establish that the evolved visualizations
are better than a randomly created visualization and one created using a
state-of-the-art treemap tool. Additionally, the user study investigates the
usefulness of the visualizations evolved using various combinations of the
proposed metrics. Twenty volunteers, graduate and postgraduate students of
the university, participated. All participants were familiar with computers
and with the basic concepts and usage of visualization techniques. All had
normal vision during the course of the experiment.
Table 4.6 The five benchmark tasks
Task no. Description
1 Which Java program has the maximum number of objects?
2 Which collection API is used the most?
3 How many Java programs use more than 3 collection APIs?
4 Which Java program uses a large number of ArrayList objects?
5 What is the total number of APIs used by each Java program?
The participants were asked to perform five benchmark tasks listed in Table
4.6. These tasks were designed specifically for treemap-based visualization
[175,176]. The tasks reflected data collected from Java programs through
dynamic analysis. The dataset contained information about the collection
API usage, i.e., type, instance name, package, class, and method name.
Moreover, the tasks were general in nature, and knowledge of collection APIs
or Java was not required to perform them. Participants were provided with
multiple-choice questions with four possible options for each task. They
were asked to perform each of these tasks using the six visualizations (one
random and five evolved). These visualizations are described in the
Appendix. The major goal of the user study was to evaluate the quality of
various visualization layouts in terms of their perceptual properties.
The five visualizations, i.e., those evolved with the combined fitness
function, effectiveness, expressiveness, readability, and interactivity,
respectively, were compared with a randomly created visualization and a
visualization created using the state-of-the-art tool.
Response time, mental effort, accuracy, and visualization score were
recorded for each task. The 5 tasks × 6 visualizations × 20 participants gave
a total of 600 responses. Response time for each task was recorded in
seconds. Mental effort was rated on a scale from 1 to 5 (1: minimum effort,
5: maximum effort). Similarly, accuracy was rated on a scale from 1 to 5 (1:
all wrong, 5: all correct). Other values were assigned according to the
subjects' responses. Participants were also asked to rank the visualizations
for each task on a scale of 1 to 5 (1: lowest rank, 5: highest rank).
Moreover, the efficiency and quality metrics were computed from the
recorded responses for all visualizations and tasks using the equations
discussed earlier, i.e., Eq. (4.14) and Eq. (4.15).
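The general recipe behind such external metrics is to standardize each dependent variable across all recorded responses and then average a combination of the z-scores within each visualization group. The sketch below illustrates this recipe only; the particular combination used (accuracy minus time minus effort, so that faster, lower-effort, more accurate responses score higher) is an assumption for illustration, and the exact aggregations are those defined by Eq. (4.14) and Eq. (4.15) in the text.

```python
from statistics import mean, pstdev

def zscores(xs):
    """Standardize values across all recorded responses."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def metric_by_visualization(labels, time, effort, accuracy):
    """Average a z-score combination within each visualization group.

    The combination below (accuracy minus time minus effort) is an
    assumed form for illustration; Eq. (4.14) defines the exact one.
    """
    combined = [za - zt - ze for za, zt, ze in
                zip(zscores(accuracy), zscores(time), zscores(effort))]
    groups = {}
    for label, value in zip(labels, combined):
        groups.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in groups.items()}
```

Because the z-scores are pooled over all responses before being averaged per visualization, each group's value is a deviation from the grand mean of the whole study, which is why the reported metric values can all lie close to or below zero.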
The user study results were analyzed formally, and the perceptual quality of
the visualization evolved using the combined fitness function was
investigated through hypothesis testing. The null and alternate hypotheses
are:
H0: Any correlation between human ratings of the evolved visualization
(using the combined fitness function) and its perceptual quality is due to
randomness.
H1: The visualization evolved using the combined fitness function is
perceptually better.
The user study was completed in different sessions using the same material
and stimuli. Initially, the participants were introduced to the tasks, the
visualizations, and the administrative procedure to be followed during the
experiment. All participants were given a questionnaire to record the
necessary data. Each participant was shown the six visualizations on a
computer screen. This procedure took 20 minutes on average per participant.
Based on the four parameters, a summary of the study is presented in Table
4.7, showing minimum, maximum, average, and standard deviation (SD)
values for each of the evolved visualizations, the random visualization, and
the visualization created using the state-of-the-art (SoTA) tool23.
As the results show, the visualization evolved using the combined fitness
function is perceptually better than the random visualization. Table 4.7
summarizes the experiment's statistics for all visualizations with respect to
the dependent variables, i.e., time, mental effort, accuracy, and
visualization score. For each dependent variable, the mean value of the
visualization evolved with the combined fitness function is the best.
Comparing numerical values, the evolved visualizations take less time
(M=16.72, SD=5.08) than the random visualization (M=19.4, SD=5.2). The
evolved visualizations also require less mental effort (M=2.4, SD=0.97)
than the random visualization. Similarly, the accuracy and score for the
evolved visualizations are higher, at (M=4.12, SD=0.77) and (M=4.1,
SD=0.71), respectively, compared with the random visualization's accuracy
(M=3.67, SD=0.99) and score (M=3.02, SD=0.88). Thus, the values of all
dependent variables are better for the visualization evolved with the
combined fitness function.
Table 4.8 lists the mean values of the dependent parameters: time, mental
effort, accuracy, and visualization score. The table also lists mean values
for the external evaluation criteria (Eq. (4.14) and Eq. (4.15)). The results
indicate that the evolved visualizations are better than the random one and
the visualization created using the SoTA tool. On average, the participants
took more time and achieved lower accuracy with the random and SoTA
visualizations than with the evolved visualizations. Furthermore,
participants exerted more mental effort in the case of the random
visualization. Figure 4.12 summarizes the results for the dependent
parameters, i.e., time, mental effort, accuracy, and score, for the six
visualizations across the five
23 http://www.cs.umd.edu/hcil/treemap/
benchmark tasks. The visualization evolved using the combined fitness
function shows better results for all four factors. The random visualization
took a larger average time and was less accurate.
Figures 4.13-4.16 plot the participants against each dependent parameter,
i.e., accuracy, effort, score, and time, for all visualizations. Each scatter
plot shows the participants on the horizontal axis and a dependent
parameter on the vertical axis, with the visualization types distinguished by
color. The relationships in these figures show the visualizations' effect on
the users' performance in accomplishing the tasks. In Figure 4.13, the
accuracy of the visualization evolved using the combined fitness function is
better for most of the participants. Moreover, participants using this
visualization require comparatively less mental effort to perform the tasks,
whereas the random visualization requires more effort (see Figure 4.14).
Furthermore, Figure 4.15 shows that the visualization evolved using the
combined fitness function and the SoTA visualization obtain higher scores
than the random visualization. The time taken by the participants to
perform their tasks with the random visualization is much larger than with
the combined visualization, as shown in Figure 4.16.
Figure 4.17 shows boxplots of the response time, effort, accuracy, and score
for the five evolved visualizations, the random visualization, and the
visualization created using the SoTA tool. As the response-time boxplots
show, the random visualization requires more time than the evolved
visualizations, and its time durations vary widely across subjects. The
random visualization also requires more mental effort to perform the tasks
than the evolved visualizations. As far as accuracy is concerned, the
visualization evolved using the combined fitness function performs best.
A comparison of the visualizations used in this study, based on the mean
values of the dependent variables, is shown in Figure 4.18, where each
coloured line represents one of the visualizations. As shown in the figure,
participants take more time to perform the tasks with the random
visualization than with the other visualizations. Moreover, the
visualizations evolved using the combined fitness function and the SoTA
visualization require less effort while achieving better accuracy and higher
scores. The random visualization needs more mental effort, and its
resulting accuracy is lower than that of the other visualizations.
To study the effect of the weights in Eq. (4.7), twenty-four visualizations
were evolved with various combinations of weights taking the values 0, 0.2,
0.5, 0.8, or 1. The results of this experiment are shown in Figure 4.19,
where the major x-axis lists the weight values for effectiveness, the minor
x-axis the weight values for expressiveness, the major y-axis the weight
values for readability, and the minor y-axis the weight values for
interactivity. The value in each cell is the participants' subjective rating, in
percent, of the visualization evolved with the weights set to the values
given by the major and minor x/y-axes; a higher percentage indicates a
better evolved visualization. The results suggest that the visualizations
evolved with all weights set to an equal value are rated best.
                 Expressiveness (minor x) / Effectiveness (major x)
Readability /      0      0.2     0.5     0.8     1
Interactivity
1                  0%     1%      1%      1%      90%
0.8                90%    3%      2%      95%     11%
0.5                20%    15%     88%     80%     13%
0.2                30%    88%     45%     90%     88%
0                  N/A    30%     40%     30%     20%
Figure 4.19 Weight analysis of the combined fitness function
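The weight analysis varies the contribution of each metric to the overall fitness. A plain weighted sum is assumed in the sketch below for illustration (Eq. (4.7) defines the exact form used during evolution), and the metric values shown are hypothetical:

```python
def combined_fitness(metrics, weights):
    """Weighted linear combination of the four visualization metrics.

    A plain weighted sum is assumed here for illustration; Eq. (4.7)
    defines the exact form used during evolution.
    """
    assert metrics.keys() == weights.keys()
    return sum(weights[name] * metrics[name] for name in metrics)

# Equal weights, as in the best-rated setting of the weight analysis;
# the metric values below are hypothetical.
score = combined_fitness(
    {"effectiveness": 0.8, "expressiveness": 0.6,
     "readability": 0.7, "interactivity": 0.5},
    {"effectiveness": 1.0, "expressiveness": 1.0,
     "readability": 1.0, "interactivity": 1.0})
```

Setting a weight to 0 removes that metric from the search entirely, which is consistent with the lower ratings observed along the zero-weight rows and columns of Figure 4.19.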
Table 4.7 Summary of user study scores
                        Visualization evolved through
Factor        Combined  Effectiveness  Expressiveness  Readability  Interactivity  Random  SoTA
Time
  Min.        8         8              8               10           10             10      10
  Max.        29        29             30              30           30             32      29
  Mean        16.72     16.84          17.57           17.78        17.31          19.4    18.17
  SD          5.08      4.83           4.82            3.6          4.85           5.2     4.46
Mental efforts
  Min.        1         1              1               1            1              1       1
  Max.        4         5              5               5            5              5       4
  Mean        2.4       2.68           2.67            2.93         2.67           3.08    2.60
  SD          0.97      0.86           0.99            0.83         0.99           0.97    0.95
Accuracy
  Min.        2         2              2               2            2              2       2
  Max.        5         5              5               5            5              5       5
  Mean        4.12      3.98           3.95            3.75         3.95           3.67    4.01
  SD          0.77      0.89           0.94            0.88         0.94           0.99    0.74
Score
  Min.        3         2              2               2            2              1       2
  Max.        5         5              5               5            5              5       5
  Mean        4.1       3.75           3.71            3.49         3.71           3.02    3.95
  SD          0.71      0.9            0.91            0.88         0.91           0.88    0.78
Table 4.8 Mean values for the dependent parameters
                        Visualization evolved through
Factor          Combined  Effectiveness  Expressiveness  Readability  Interactivity  Random  SoTA
Time            16.72     16.84          17.57           17.78        17.31          19.4    18.17
Mental efforts  2.4       2.68           2.67            2.93         2.67           3.08    2.60
Accuracy        4.12      3.98           3.95            3.75         3.95           3.67    4.01
Score           4.1       3.75           3.71            3.49         3.71           3.02    3.95
Efficiency      -0.46     -0.47          -0.85           -0.89        -0.99          -1.16   -0.51
Quality         -0.02     -0.01          0.02            0            -0.03          -0.07   -0.06
Figure 4.12 Dependent variables summaries
Figure 4.13 Scatterplot for mean accuracy
Figure 4.14 Scatterplot for mean effort
Figure 4.15 Scatterplot for mean score
Figure 4.16 Scatterplot for mean time
Figure 4.17 Box-plots of the dependent variables for each visualization
Figure 4.18 Mean values of the dependent variables
4.3.3.2. Analysis of variance and post hoc analysis
To confirm statistically significant variation between these visualizations, a
repeated-measures analysis of variance (ANOVA) is performed. The
ANOVA verifies the significant difference among the evolved, random, and
SoTA visualizations with respect to time, mental effort, accuracy, and
visualization score. The results in Table 4.9 suggest a statistically
significant difference among the visualizations regardless of other factors
(F(30, 2646) = 8.57, p < 0.0005, η² = 0.07). Similarly, when the
visualizations are combined with the tasks, there is still a significant
difference (F(120, 3253) = 5.20, p < 0.0005, η² = 0.16).
This statistical analysis shows a significant difference among the
visualizations for the dependent variable time, F(6, 665) = 13.3, p < 0.0005
(M=17.61, SD=4.82); for mental effort, F(6, 665) = 7.13, p < 0.0005
(M=2.7, SD=0.96); for accuracy, F(6, 665) = 7.15, p < 0.0005 (M=3.90,
SD=0.90); and for visualization score, F(6, 665) = 18.18, p < 0.0005
(M=3.67, SD=0.91). Taking the combination of the seven visualizations
and the five tasks, the results in Table 4.9 again suggest a significant
difference. The combined effect (interaction) of visualization and task,
shown in Table 4.9, is also significant with respect to all four dependent
variables: for response time, F(24, 665) = 5.91; for mental effort, F(24, 665)
= 3.0; for accuracy, F(24, 665) = 1.68; and for visualization score, F(24,
665) = 2.4.
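An F statistic of this kind compares between-group variance to within-group variance. The study uses a repeated-measures design; the sketch below shows the simpler between-groups one-way F from first principles, on hypothetical data, purely to illustrate how such statistics arise:

```python
from statistics import mean

def one_way_anova_f(*groups):
    """One-way ANOVA F statistic from first principles:
    F = (between-group mean square) / (within-group mean square)."""
    grand = mean(v for g in groups for v in g)
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total observations
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In practice one would use a statistics package (for example, a repeated-measures ANOVA routine) rather than computing this by hand; the sketch only makes the reported F values concrete.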
A post hoc analysis using the least significant difference (LSD) test is
performed for each dependent variable to investigate where the significant
differences between visualizations lie. The multiple-comparison test shows
a significant difference in mean time between the visualization evolved
using the combined fitness function and the random visualization
(p=0.0005). There is also a significant difference in mean time between the
visualization evolved using the combined fitness function and the SoTA
visualization (p=0.002). However, no significant difference is observed
between the visualization evolved using the combined fitness function and
those evolved using effectiveness (p=0.061), expressiveness (p=0.071), and
interactivity (p=0.15). There is no significant difference in mean time
between SoTA and the visualizations evolved using expressiveness (p=0.19),
readability (p=0.45), and interactivity (p=0.6).
Considering mental effort, the visualization evolved using the combined
fitness function differs significantly from all other visualizations at p < .05.
Moreover, the random visualization requires more mental effort to perform
the tasks than all other visualizations (p < .05), as shown in Table 4.10. The
mean mental effort exerted in the case of SoTA does not differ significantly
from that of the visualizations evolved using effectiveness (p=.41) and
expressiveness (p=.49). For accuracy, the visualization evolved using the
combined fitness function differs significantly in its mean value from the
random visualization (p=.0005). There is no significant difference in mean
accuracy between SoTA and the other evolved visualizations, except for
those evolved using the combined fitness function and readability.
The visualization evolved using the combined fitness function is also
statistically better than the random visualization based on mean efficiency
(p=.0005). However, the difference between the combined and the other
evolved visualizations is not significant (p > .05), except for the one evolved
for readability (p=.0005). With quality as a factor, the mean quality of the
visualization evolved using the combined fitness function differs
significantly from that of the random and SoTA visualizations (p=.040 and
p=.0005, respectively). These results lead us to reject the null hypothesis H0.
Table 4.11 shows the effect of visualization on the dependent variables. The
results suggest that the visualization type correlates positively with time
and effort and negatively with accuracy and visualization score. The
visualizations explain 55%, 19%, 15%, and 21% of the variance in time,
mental effort, accuracy, and score, respectively. All the aforementioned
results are significant at p < 0.0005, with the F statistics given in Table 4.11.
In addition to the F statistics, a non-parametric (distribution-free)
Kruskal-Wallis H test is used to check for statistically significant
differences in our experiment, and post hoc analysis is performed using the
Wilcoxon signed-rank test to find the statistical differences between the
visualizations in the case study. There is a statistically significant difference
in time among the visualizations, χ²(6) = 31.23, p < 0.001, with mean time
ranks of 280.34 for the visualization evolved using the combined fitness
function, 278.51 for effectiveness, 348.79 for expressiveness, 384.10 for
readability, 336.24 for interactivity, 415.19 for the random visualization,
and 376.20 for SoTA. For effort there is also a statistically significant
difference, χ²(6) = 28.68, p < 0.001, with mean effort ranks of 287.49 for
the combined fitness function, 378.19 for effectiveness, 340.60 for
expressiveness, 391.79 for readability, 340.48 for interactivity, 418.49 for
the random visualization, and 318.34 for SoTA. For accuracy the test shows
a statistically significant difference, χ²(6) = 36.46, p < 0.001, with mean
accuracy ranks of 418.50 for the combined fitness function, 365.94 for
effectiveness, 361.79 for expressiveness, 272.35 for readability, 361.79 for
interactivity, 306.47 for the random visualization, and 366.55 for SoTA.
Table 4.12 lists these results.
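The Kruskal-Wallis H statistic is computed from the ranks of the pooled responses rather than their raw values, which is what makes it distribution-free. The sketch below implements the statistic without the tie correction that a full statistics package would apply, so it is an illustration of the mechanics rather than a drop-in analysis routine:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic, without the tie correction that a
    full implementation would apply."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Assign each distinct value the mean of its (1-based) tied ranks.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        i = j
    # H = 12 / (n (n + 1)) * sum_i (R_i^2 / n_i) - 3 (n + 1)
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
```

With the study's 600 responses split over seven visualization groups, this statistic is referred to a chi-square distribution with six degrees of freedom, which is how the reported χ²(6) values and p-values are obtained.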
Table 4.9 ANOVA test result for dependent variables
                           Time             Mental efforts   Accuracy         Vis-score
Source              df     F       η²       F       η²       F       η²       F       η²
Visualization       6      13.33   0.13     7.13    0.06     7.15    0.06     18.18   0.14
Error               665    133.75           5.54             5.56             12.8
Tasks (only)        4      156.7   0.48     12.27   0.07     9.01    0.05     5.07    0.03
Error               665    1719             8.65             6.62             4.25
Visualization*Task  24     5.91    0        3       0.1      1.68    0.06     2.4     0.08
Error               665    63.03            2.35             1.3              1.68
*All results are taken at p < .001.
Table 4.10 Post hoc analysis
Factor F-statistics Visualizations* pairs with significant difference**
Time 13.3
(1,4)(.001), (1,6 )(.0005), (1,7 )(.002)
(2,3)(.0005), (2,4)(.0005), (2,5)(.002), (2,6)(.0005), (2,7)(.0005)
(3,4)(.0005), (3,6)(.0005)
(4,5)(.040),(4,6)(.014)
(5,6)(.0005)
(6,7)(.008)
Mental efforts 7.13
(1,2)(.026), (1,3 )(.032),(1,4)(.0005), (1,5)(.0032), (1,6 )(.0005),
(1,7 )(.041)
(2,4)(.045),(2,6)(.002), (2,7)(.020)
(3,4)(.035), (3,6)(.001)
(4,5)(.035),(4,6)(.044)(4,7)(.026)
(5,6)(.001)
(6,7)(.001)
Accuracy 7.15
(1,2)(.040), (1,3 )(.021), (1,4 )(.0005), (1,5)(.021), (1,6)(.0005)
(2,4)(.0005),(2,6)(.010)
(3,4)(.001), (3,6)(.021)
(4,5)(.001),(4,6)(.043)(4,7)(.0005)
(5,6)(.021)
(6,7)(.005)
Score 18.2
(1,2)(.004), (1,3 )(.001), (1,4 )(.0005),(1,5)(.001) (1,6)(.0005)
(2,4)(.033),(2,6)(.0005)
(3,6)(.0005)
(4,6)(.0005)(4,7)(.0005)
(5,6)(.0005)
(6,7)(.0005)
Efficiency 7.8
(1,4)(.0005),(1,6 )(.041), (1,7 )(.005)
(2,4)(.012),(2,7)(.0005)
(3,4)(.015), (3,7)(.0005)
(4,5)(.014),(4,6)(.001)
(5,7)(.0005)
(6,7)(.0005)
Quality 15.7
(1,4)(.0005), (1,5 ),(1,6 )(.040), (1,7 )(.0005)
(2,4)(.004), (2,7)(.0005)
(3,4)(.003), (3,7)(.0005)
(4,5)(.002),(4,6)(.0005),(4,7)(.0005)
(5,7)(.0005)
(6,7)(.0005)
* Combined (1), Effectiveness (2), Expressiveness (3), Readability (4), Interactivity (5), Random (6), SoTA (7)
** All tests are significant at alpha = 0.05
Table 4.11 Visualization’s effect on dependent variables
Predictor Dependent aspect Beta (β) R2 F-value*
Visualization Time 0.18 0.55 20.23
Mental efforts 0.2 0.19 21.04
Accuracy -0.17 0.15 18.36
Score -0.3 0.21 60.04
* F statistics significant at p < .005
Table 4.12 Non-parametric test (mean ranks)
                           Visualization type
          Chi-square* (χ²)  Combined  Effectiveness  Expressiveness  Readability  Interactivity  Random  SoTA
Time      31.23             280.34    278.51         348.79          384.1        336.24         415.19  376.2
Effort    28.68             287.49    378.19         340.6           391.79       340.48         418.49  318.3
Accuracy  36.46             418.5     365.94         361.79          271.35       361.79         306.47  366.6
Score     80.12             438.48    365.51         357.13          312.19       357.13         219.44  403.6
* For all cases p < 0.001
Table 4.13 Wilcoxon signed-rank test
                        Visualization type
Parameter  Test value  Combined  Effectiveness  Expressiveness  Readability  Interactivity  SoTA
Time       Z           -4.45     -5.98          -3.19           -1.61        -3.74          -1.8
           p           <.001     <.001          0.001           0.1          <.001          0.071
Effort     Z           -4.9      -3             -2.76           -1.6         -2.76          -3.08
           p           <.001     0.003          0.006           0.19         0.006          0.002
Accuracy   Z           -4.89     -2.51          -1.99           -1.01        -1.99          -2.87
           p           <.001     0.01           0.04            0.2          0.04           0.004
Score      Z           -6.57     -5.05          -4.83           -3.3         -4.83          -6.42
           p           <.001     <.001          <.001           0.001        <.001          <.001
4.3.4. Direct method
The evolved visualizations, the random visualization, and SoTA are also
compared using the direct method. Table 4.14 lists the values obtained for
the external evaluation criteria (Eq. (4.14) and Eq. (4.15)) for all these
visualizations. The efficiency of the visualizations is calculated using Eq.
(4.14), which takes into consideration the accuracy, time, and mental effort
required for these visualizations. A higher efficiency value indicates a more
efficient visualization. The results in Table 4.14 show that the visualization
evolved using the combined fitness function scores best on the external
criterion of visualization efficiency; the second best is the one evolved
using effectiveness, the visualization created using the SoTA tool is third
best, and the worst performing visualization on efficiency is the random
visualization. The second external metric is visualization quality, calculated
using Eq. (4.15), which takes into consideration the accuracy, visualization
score, time, and mental effort required by the visualization under
consideration. On the quality metric, the visualization evolved using the
combined fitness function performs second best, the visualization evolved
using effectiveness performs best, and the visualization evolved using
expressiveness performs worst.
Table 4.14 External metric values for the visualizations
Visualization Efficiency Quality
Combined -0.46 -0.02
Effectiveness -0.47 -0.01
Expressiveness -0.85 0.02
Readability -0.89 0
Interactivity -0.99 -0.03
Random -1.16 -0.07
SoTA -0.51 -0.06
Table 4.15 User study results for the evolved, SoTA, and seven other non-treemap visualizations
Subject ID   A      B      C      D      E      F      G      H      I
1            4      2      2      2      4      0      0      6      6
2            2      2      4      2      4      0      4      4      6
3            4      4      2      2      4      2      2      6      6
4            2      2      4      2      2      2      2      4      4
5            6      4      2      2      2      0      0      2      4
6            2      2      2      2      2      0      4      4      4
7            4      4      2      2      4      0      4      6      6
8            2      2      4      0      4      0      0      2      6
9            2      2      2      2      2      2      4      6      4
10           2      2      2      2      2      2      2      4      4
11           0      0      2      2      4      0      4      6      6
12           4      4      4      0      4      2      4      4      6
13           2      4      2      0      4      0      0      2      6
14           2      4      4      2      4      0      0      4      6
15           6      6      2      4      4      2      2      6      6
16           6      6      0      0      4      2      4      6      6
17           2      2      4      0      4      0      0      2      4
18           2      2      4      0      2      0      0      2      4
19           6      4      2      2      4      2      2      4      6
20           4      4      2      2      4      2      2      4      4
%-liked      53.33  51.67  43.33  25.00  56.67  15.00  33.33  70.00  86.67
Average      3.20   3.10   2.60   1.50   3.40   0.90   2.00   4.20   5.20
Median       2.00   3.00   2.00   2.00   4.00   0.00   2.00   4.00   6.00
Std. dev.    1.77   1.52   1.14   1.10   0.94   1.02   1.72   1.58   1.01
Visualization codes: A: Parallel coordinates, B: Sunburst, C: Circular packing, D: Line chart,
E: Pie chart, F: Scatterplot, G: Bar chart, H: Evolved, I: SoTA.
Rating scale: 6: very much useful, 4: useful, 2: neutral, 0: not useful.
4.4 Discussion
To study the usefulness of the evolved visualization against
non-treemap-based visualizations, a comparison is made with seven other
visualization techniques: parallel coordinates, sunburst, circular packing,
line chart, pie chart, scatterplot, and bar chart. The same data is used with
all the visualization techniques in this experiment. These seven
visualizations are created using a tool provided by datapine24 and are listed
in Appendix-B. A user investigation is performed to identify the most and
the least useful visualization. A total of 20 participants, chosen from target
groups, were engaged in this experiment. The participants were
graduate-level students with background knowledge of computer
programming, data visualization, and Java collection APIs. They
participated on a voluntary basis, were motivated to perform the task, and
had varying programming experience in terms of years. Each participant
was shown the seven other visualizations, the evolved visualization, and
the visualization created using the SoTA tool for an equal amount of time,
i.e., 18 minutes (2 minutes per visualization). They were then asked to rate
these on a scale of 0-6, where 0 indicates "not useful", 2 "neutral", 4
"useful", and 6 "very much useful". The visualizations were labelled with
character identifiers instead of their names. Table 4.15 lists the results of
this experiment. The results support the usefulness of the evolved
visualization compared with the other seven visualizations; the
visualization created using the SoTA tool is rated second best. A Friedman
test is performed on the user evaluation results to find a statistical
difference among the nine visualizations, and post hoc analysis is done
with the Wilcoxon signed-rank test to find the statistical differences
between the visualizations. The test shows a statistically significant
difference among the visualizations with respect to likeness, χ²(8) = 92.98,
p < 0.001, with median likeness of 2 for parallel coordinates, 3 for sunburst,
2 for circular packing, 2 for line chart, 4 for pie chart, 0 for scatter plot, 2
for bar chart, 4 for the evolved visualization, and 6 for SoTA. The
Wilcoxon signed-rank test results in Table 4.16 also suggest differences
between the evolved and the other visualizations; the table shows the z and
p values for each visualization in comparison with the evolved one. All
results are checked at a confidence level of 95% (alpha = 0.05). Figure 4.19
shows the boxplots for the user evaluation.
24 https://www.datapine.com/
Table 4.16 Wilcoxon signed-rank test results
Test value  Parallel coordinates  Sunburst  Circular packing  Line chart  Pie chart  Scatterplot  Bar chart  SoTA
Z           -1.93                 -2.23     -2.26             -3.95       -3.74      -3.99        -3.82      -2.48
p           0.043                 0.2       0.009             <.001       <.001      <.001        <.001      0.013
Figure 4.19 Boxplot for the user evaluation results
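The Friedman statistic used above ranks each subject's ratings across the nine visualizations and tests whether the rank sums differ more than chance would allow. The sketch below computes the statistic without the tie correction that a complete implementation would apply (relevant here, since the 0/2/4/6 ratings contain many ties), so it illustrates the mechanics rather than reproducing the reported χ² exactly:

```python
def friedman_chi2(rows):
    """Friedman chi-square from within-subject ranks (no tie correction).

    Each row holds one subject's ratings, one entry per visualization.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        ordered = sorted(row)
        for j, value in enumerate(row):
            first = ordered.index(value) + 1        # lowest tied rank
            last = k - ordered[::-1].index(value)   # highest tied rank
            rank_sums[j] += (first + last) / 2.0    # mean rank for ties
    # chi^2_F = 12 / (n k (k + 1)) * sum_j R_j^2 - 3 n (k + 1)
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
    return stat - 3.0 * n * (k + 1)
```

Applied to the twenty subject rows of Table 4.15 (k = 9 visualizations), the statistic is referred to a chi-square distribution with k - 1 = 8 degrees of freedom, matching the reported χ²(8).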
4.5 Variation in the EA settings
The EA can be executed with a variety of settings for each of its
components, such as the population size, mutation rate, crossover strategy,
and elitism scheme. The results in the preceding sections were obtained
with the optimal EA settings; this section covers some alternative settings
and their effect on the results. Because fitness evaluation is stochastic, each
individual is executed ten times and the average is taken as the
chromosome's fitness. Figure 4.20 shows the standard deviation (SD) of the
normalized fitness value when the fitness is averaged over six different
numbers of samples. For two, five, and eight samples the SD is higher; it
settles after ten samples, so we opt for ten samples for averaging. In the
experiments the number of
iterations for the EA is fixed to 1000. However, Figure 4.21 shows the
normalized (combined) fitness for around 1500 iterations using three
populations. The best solution is found before 1000 iterations, justifying
the stopping criterion in this case. The current solution uses an EA-based
approach; to be specific, a genetic algorithm (GA) is employed for
optimizing the visualization layout. Other optimization approaches could be
utilized for this task, Evolution Strategy (ES) being one. To demonstrate
this, Figure 4.22 shows the convergence results for the EA-GA and a
(1+1)-ES. Both approaches converge; however, the EA-GA converges more
quickly than the (1+1)-ES. The reason is the reliance of the ES on mutation
only; additionally, the ES replaces the parent chromosome only if the
mutated solution performs better. The aforementioned experiments use
random parent selection for the reproduction operation. There are other
options: a probabilistic procedure such as roulette-wheel selection can also
be utilized. Figure 4.23 shows the convergence results with random and
with fitness-proportional selection of individuals for the reproduction
operation. For proportional selection, the fitness values of all chromosomes
in the current population are summed and each chromosome is assigned a
relative fitness, calculated by dividing the chromosome's fitness by the total
fitness of the current population. A roulette wheel is then spun, giving
chromosomes with larger fitness a higher selection probability. The results
in Figure 4.23 indicate that with probabilistic selection the EA finds its best
solution quickly, i.e., within 450 iterations, as compared to random
selection. However, this decreases the selective pressure during the
remaining iterations, hindering the EA from finding even better solutions.
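The roulette-wheel procedure described above can be sketched as follows; the function and variable names are illustrative, not taken from the thesis implementation:

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Fitness-proportional (roulette-wheel) parent selection.

    Each chromosome occupies a slice of the wheel equal to its fitness
    divided by the population's total fitness; a random spin then picks
    a parent, so fitter chromosomes are chosen more often.
    """
    total = sum(fitnesses)
    spin = rng.uniform(0.0, total)
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if spin <= cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off
```

The bias toward high-fitness parents is what accelerates early convergence in Figure 4.23, and also what reduces selective pressure later on, once the population's fitness values become similar and the wheel's slices approach equal size.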
Figure 4.20 Number of samples for fitness averaging and their standard deviation
Figure 4.21 Number of iterations vs. convergence
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
2 5 8 10 15 20
Sta
nd
ard
de
via
tio
n
No. of samples for averaging
[Chart data omitted — Figure 4.22 plots normalized fitness vs. iterations for EA-GA and (1+1)-ES; Figure 4.23 plots normalized fitness vs. iterations for the EA with random and with probabilistic selection.]
Figure 4.22 Convergence using EA-GA and (1+1)-ES
Figure 4.23 EA with random and probabilistic selection
4.6 Discussion
This study used an EA for the optimization of visualization based on
quantitative assessments, i.e., the visualization metrics. The individual metrics
included effectiveness, expressiveness, interactivity, and readability. A combined
fitness function was built from the linear combination of these metrics; this
combined fitness function was then used to evolve a population of the EA. A user
study was designed to evaluate the effectiveness of the resultant visualizations
using benchmark tasks. The user study recorded several parameters, including
response time, mental effort, accuracy, and visualization score for each
participant on every task. The same experiment was performed for the five
evolved visualizations, the random visualization, and SoTA. Furthermore, two
external metrics, efficiency and quality, were computed from the standardized
z-scores of the dependent parameters using Eq. (4.14) and Eq. (4.15).
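The z-score standardization underlying these external metrics is the usual (x − mean)/σ transformation. A minimal sketch follows; the exact combination of z-scores into efficiency and quality, as defined by Eq. (4.14) and Eq. (4.15), is not reproduced here, and the class name is illustrative.

```java
// Standardized z-scores for one dependent parameter across participants:
// z_i = (x_i - mean) / standardDeviation. Assumes the values are not all
// equal (otherwise the standard deviation is zero).
public class ZScore {
    public static double[] standardize(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length); // population standard deviation
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) z[i] = (x[i] - mean) / sd;
        return z;
    }
}
```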
Analysis of the user study shows that there is a significant difference among
the visualizations, both numerically and statistically. This is depicted in Table
4.7 and in the boxplot in Figure 4.17, which shows the mean values of the
dependent variables. The visualization evolved using the combined fitness
function performs better in all aspects when compared with the random
visualization. Moreover, the visualization built with the SoTA tool also
performed well compared to the random visualization. However, this is not
the case for all the visualizations evolved using an individual metric in
isolation. Nevertheless, the time taken by the random visualization (M=19.4)
is greater than that of SoTA (M=18.17) and of the visualization evolved using
the combined fitness function (M=16.72).
The statistical tests' results also show a significant difference between the
visualizations evolved using the fitness functions proposed in this work and
the visualization created using the state-of-the-art tool. As shown by the
ANOVA test, the visualization built from the combined fitness function
performs better on the dependent variables. Its significant difference from
the random visualization is greater than that of the other evolved
visualizations. The visualizations evolved with fitness functions other than
the combined one also perform better in some aspects when compared with
the random visualization and SoTA. The results further show that differences
also exist between the evolved and random visualizations across the
benchmark tasks.
The post hoc analyses reveal that the visualization evolved using the
combined fitness function performs better than all other visualizations
considered in the experiment. It achieves higher accuracy with less mental
effort and requires less time. The visualizations evolved with fitness
functions other than the combined one perform better than the random
visualization; however, they perform comparatively worse than SoTA in
various aspects.
The proposed visualization metrics and the optimization approach can also
be utilized for visualization techniques other than the treemap. The individual
metrics (effectiveness, expressiveness, readability, and interactivity) are devised
keeping in view generic visualization aspects and can thus be adopted
without significant modification. However, for other visualization techniques,
the chromosome encoding and the mapping function require customization
depending on the particular technique's features.
4.7 Chapter Summary
The work in this chapter proposed a set of visualization metrics to evaluate
visualization techniques. The proposed metrics were based on a
comprehensive literature survey and included effectiveness, expressiveness,
readability, and interactivity. Experiments demonstrated the metrics' impact
on the aesthetic and perceptual aspects of visualization. The work employed
an evolutionary algorithm (EA) to optimize the layout of a visualization
technique. The aforementioned visualization metrics were combined to form
a fitness function for the EA. The treemap visualization was employed as a
case study for the layout optimization task. The EA evolved five
visualizations using effectiveness, expressiveness, readability, interactivity, and the
combined fitness function. These five evolved visualizations were compared
with a randomly created visualization and a visualization created using a
state-of-the-art treemap visualization tool. The comparison was made using
internal and external evaluation metrics. A user study was also conducted on
the evolved visualizations using benchmark tasks, followed by an analysis of
variance test. The results suggest the effectiveness of the proposed
visualization metrics and of the EA-based approach for optimizing treemap
layouts. The visualization evolved using the combined fitness function was
more effective than the visualizations optimized for effectiveness, expressiveness,
readability, and interactivity in isolation. All evolved visualizations performed
better than the randomly created one, since the randomly created
visualization made no reference to aesthetic and perceptual features.
Analysis of the user study showed a significant difference among the
visualizations, both numerically and statistically. The visualization evolved
using the combined fitness function also achieved higher accuracy with less
mental effort and required less time. It is observed that each individual
criterion for gauging visualization quality plays an important role; however,
when combined together they produce better visualizations. This produces a
visual layout that can be effective, expressive, readable, and interactive at the
same time. There can be situations where the problem domain may require a
visualization to be more effective than expressive, or may not require
interactivity. In such cases, the weights for effectiveness, expressiveness,
readability, and interactivity in the combined fitness function can be set to a
lower value, or a metric can be ignored altogether by assigning it a weight of
zero. This provides a general framework for quantifying and optimizing
visualization layouts. The next chapter will present the third contribution of
this dissertation by discussing and elaborating dynamic code analysis and
visualization of collection APIs for Java programs.
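The weighting scheme described above can be sketched as a simple linear combination. The weight values in the example are illustrative only, not the ones used in the experiments.

```java
// Combined fitness as a weighted linear combination of the four metrics.
// Setting a weight to zero drops that metric from consideration, as
// discussed in the summary. Illustrative sketch.
public class CombinedFitness {
    public static double combine(double effectiveness, double expressiveness,
                                 double readability, double interactivity,
                                 double w1, double w2, double w3, double w4) {
        return w1 * effectiveness + w2 * expressiveness
             + w3 * readability + w4 * interactivity;
    }
}
```

For a domain that does not require interactivity, for instance, the last weight can simply be set to zero.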
Chapter 5 Visualizing of Traces of Java Program
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 151
Visualizing Trace of Java Collection
APIs by Dynamic Bytecode
Instrumentation
“We're entering a new world in which data may be more important than
software.” — Tim O'Reilly
Object-oriented languages help software developers use
dynamic data structures that evolve during program
execution. However, program comprehension and
performance analysis necessitate an understanding of the
data structures used in a program, particularly of which
application programming interface (API) objects are used
during a program's runtime. This chapter aims to give a
concise visualization of a program's code and to provide the
user with an interactive environment to explore details. It
presents an interactive visualization tool. A given program is
tracked during execution and data is recorded into a log
file. The log file is then converted to XML format, which
proceeds to the visualization component. The visualization
provides a global view of the usage of collection API
objects at different locations during program execution. An
empirical study is conducted to evaluate the impact of the
proposed visualization on program comprehension. The
experimental group, on average, completes the tasks in
approximately 45% less time than the control group.
Results show that the proposed approach enables
programmers to comprehend more information with less effort
and time. Performance of the proposed approach is also
evaluated using twenty benchmark software tools. The
proposed approach helps the developer understand Java
collection API object usage and assists in program comprehension and maintenance.
Table 5.1 An example of collection API objects analysis using clustering

Cluster No. | Objects created | Objects destroyed | Objects invoked | Total invocations | Code locations
1 | 12 | 10 | 10 | 991 | 5
2 | 329 | 328 | 328 | 74928 | 6
3 | 2867 | 2867 | 2867 | 10001 | 3
4 | 2767 | 2767 | 2767 | 8301 | 2
5 | 2424 | 2418 | 2 | 27 | 2
6 | 24 | 16 | 16 | 3130 | 17
6 24 16 16 3130 17
Modern software tools have become more complex due to ever-increasing
functionality and the complex interaction of components. Static analysis of
software no longer presents the best picture, since an object's runtime
behavior may be substantially different. This is especially true for
object-oriented software due to its intrinsic nature, i.e., inheritance,
polymorphism, and dynamic binding. These properties make object-oriented
programs understandable; however, the behavior of the program becomes
complex [30]. The performance of objects during runtime shapes the behavior
of a program. Developers are keen to know how objects evolve during
program execution. Object-oriented software makes vast use of objects of
different data structures, such as the Java collection APIs, which makes the
code difficult to understand. Programmers are interested in optimizing their
code to make it efficient. Cases where the code size is huge and objects from
multiple other classes are instantiated make it difficult to understand the
code. Additionally, programmers want to know the number of objects created
for a particular class, and their hierarchy, while the code executes. This is
useful in optimizing the code of frequently used classes. Modern
object-oriented software is built using different components, like third-party
libraries accessible through APIs. Inefficient data structures and API usage
degrade program performance [167]. Developers must know the APIs used in
a program during execution and how their objects evolve.
Large program code normally uses collection APIs to handle various
features. Although the use of these collection APIs has the advantage of
extensibility, a large number of collection API objects may cause the code to
consume more memory and/or time, which may degrade program
performance. Therefore, for better performance, the programmer must know
the locations in the source code where objects are created, in order to
optimize API usage [20].
The work in [20] identifies performance issues of a program based on the
usage, location, and relevance of API objects. Table 5.1 (extracted from an
experiment in [20]) shows that clusters 1, 2, and 4 behave normally, based on
their object creation and usage. However, cluster 5 creates a large number of
objects with almost no methods invoked on these objects during their
lifetime. This makes cluster 5 an ideal example of object locations on which
the developer needs to focus to optimize the code.
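A check of the kind applied to Table 5.1 can be sketched as follows. The thresholds are illustrative assumptions, not criteria taken from [20].

```java
// Flags clusters like cluster 5 in Table 5.1: many objects created but
// (almost) none ever invoked, i.e. candidate locations for optimization.
// The thresholds ("large" creation count, "tiny" usage ratio) are
// illustrative only.
public class ClusterCheck {
    public static boolean suspicious(int created, int invoked) {
        return created > 100 && invoked < created / 100;
    }
}
```

Applied to Table 5.1, cluster 5 (2424 created, 2 invoked) is flagged, while cluster 2 (329 created, 328 invoked) is not.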
To support program comprehension, most of the existing approaches [18,
30] are based on static analysis and thus do not provide a complete picture.
Visualization is an effective technique used for program comprehension [168,
169]. In this context, treemap [43] is a powerful visualization method used for
hierarchical information visualization [45]. The use of visualization can
simplify the evaluation and detection of API usage. In particular, such a
visualization can give the developer the best picture for optimizing the code
when it is based on runtime information about the software. This chapter
addresses the problem of program comprehension using visualization.
The chapter presents a treemap-based visualization tool to visualize Java
collection framework objects based on dynamic data. The tool presents a
global view to the developer of the objects' locations and states. Using the
proposed approach, the developer can inspect where an object was created in
a program. The information can be viewed at different levels, i.e., package,
class, or method. The approach is evaluated using twenty benchmark software
tools. The evaluation is based on the delay caused by instrumentation and on
internal evaluation metrics, where the results show good performance.
5.1. Proposed System for Java Tree Visualization
This section elaborates the proposed system. An overview of the system is
given first; it is then elaborated in three steps. 1) Instrumentation: this step
deals with the development of instrumentation code to extract runtime
information from a Java program. 2) Data collection and analysis: this phase
records the significant data collected during program execution under the
instrumentation code. 3) Visualization: this step deals with the final data
visualization using a treemap, providing different views of the recorded data.
A mathematical formulation of the proposed solution is developed as follows.
The set of collection API objects in a particular Java application is
represented using Eq. (5.1):

$O = \{(p, c, m, t) \mid p \in Package,\ c \in Class,\ m \in Method,\ t \in Collection\}$  (5.1)

where $p$ is an item from the set of Packages, $c$ from the set of Classes, $m$
from the set of Methods, and $t$ from the set of Types. The sets Package, Class,
Method, and Collection are represented using Eq. (5.2), Eq. (5.3), Eq. (5.4), and
Eq. (5.5), respectively:

$Package = \bigcup_{i=1}^{n} p_i$, where $p_i$ is a package of the application  (5.2)

$Class = \bigcup_{j=1}^{|Package|} c_j$, where $c_j \in p_i,\ p_i \in Package$  (5.3)

$Method = \bigcup_{k=1}^{|Class|} m_k$, where $m_k \in c_j,\ c_j \in Class$  (5.4)

$Collection = \bigcup_{l=1}^{r} t_l$, where $t_l$ is a collection API type  (5.5)

Having all these sets, we can represent all the collection objects in a
particular Java application using Eq. (5.6) and Eq. (5.7):

$O_{App} = \bigcup_{p \in Package} \bigcup_{c \in Class} \bigcup_{m \in Method} O(p, c, m)$  (5.6)

$|O_{App}| = \sum_{i=1}^{|Package|} \sum_{j=1}^{|Class|} \sum_{k=1}^{|Method|} |O_{ijk}|$  (5.7)
5.1.1. Java traces visualization system overview
As stated earlier, the proposed system consists of three steps, shown in
Figure 5.1. In the instrumentation phase, we have written code to extract the
required information from a Java program during runtime. Instrumentation is
an effective technique for dynamic analysis. A selective instrumentation
approach is utilized to minimize the performance overhead: not all sections of
the program are tracked, but only those methods and lines where an object is
instantiated. The original bytecode of a targeted program remains unchanged,
and the probe is inserted only during load time of the class. This approach
minimizes the instrumentation overhead. As a code block of the target
program executes, it generates information about its runtime behavior. This
information is handled by the utility code and recorded into a log file. Various
key features are collected in the log file, including object type, method name,
package name, thread, timestamp, and line number. The log file is then
converted to an XML tree format and proceeds as input to the visualization
component. The XML tree structure consists of a dummy root node, branches,
and leaf nodes. The transformation of the text file into an XML tree starts by
building the tree with the dummy root node and then adding each branch to
the root, down to the leaf level. The visualization part is used to depict the
treemap-based visualization. The system is implemented in Java using
Eclipse IDE 4.3.

For instrumentation, we focus on the locations where an object is
instantiated, through analysis of runtime data. During program execution,
the information is generated and stored in the log file, which is later used to
build the visualization. Selective instrumentation is used to avoid any major
degradation of performance in the targeted program.
Figure 5.1 Overview of the system for Java traces visualization
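The log-to-tree transformation described above can be sketched as follows, assuming the record layout of the log excerpt shown with Figure 5.2. The TraceTree class and the nested-map representation are illustrative stand-ins for the XML tree actually produced by the system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds the hierarchy (dummy root -> package -> class -> method -> type)
// from log records. Field order follows the log excerpt of Figure 5.2:
// event,thread,timestamp,objectId,objectType,owningClass,method,line.
// Simplified sketch: a nested map stands in for the XML tree.
public class TraceTree {
    public final Map<String, TraceTree> children = new LinkedHashMap<>();

    public TraceTree child(String name) {
        return children.computeIfAbsent(name, k -> new TraceTree());
    }

    // Insert one object-creation record (event code 1) under the dummy root.
    public void addRecord(String line) {
        String[] f = line.split(",");
        if (f.length < 8 || !f[0].equals("1")) return; // creation records only
        String objectType = f[4];                      // e.g. java.util.Vector
        String fqcn = f[5];                            // e.g. org.gjt.sp.jedit.jEdit
        String method = f[6];
        int dot = fqcn.lastIndexOf('.');
        String pkg = dot < 0 ? "(default)" : fqcn.substring(0, dot);
        String cls = dot < 0 ? fqcn : fqcn.substring(dot + 1);
        child(pkg).child(cls).child(method).child(objectType);
    }
}
```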
5.1.2. Instrumentation
The proposed system first requires instrumentation of the classes. The
instrumentation code is built using the Byte Code Engineering Library (BCEL)
[170] to parse the class files. Some alternatives for this purpose are also
available, including third-party tools like ASM [171], the Java virtual machine
profiler interface (JVMPI), and the Java virtual machine tool interface (JVMTI).
BCEL provides flexible control over the instrumentation process and may also
be used with different JVM implementations. There are two key packages in
the instrumentation code built on BCEL, instrumentator and utility. These
packages work together but have different functions. The instrumentator
package contains classes used to add probes to the bytecode prior to execution
and to report when the events of interest occur. The utility package classes do
the supplementary work of monitoring the interesting events and recording
the information reported by the instrumentator classes. The utility classes are
also responsible for generating the log file that stores the collected information.

One of the problems with dynamic analysis is the large amount of data
generated during program execution. We use selective tracking to minimize
the runtime overhead and restrict the size of the log file. Our tracking classes
insert probes only at specific locations, like object creation, object destruction,
and method entry. Only two kinds of collection API functions, mutators and
accessors, are tracked. Mutator methods change object state, either by
modifying the values of private fields or, in the case of the add and remove
methods, by changing the number of elements (size) of a collection object.
Accessor methods, in contrast, read the private fields of an object when a
method is invoked on it. Since the focus is on the object instantiation
location, these two kinds of methods capture the changes of state of the
objects.
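What an inserted probe reports can be sketched as a call into a utility-style logger like the following. This is an illustrative stand-in for the utility-package code the probes call, not the actual BCEL-generated bytecode; the class name and formatting details are assumptions.

```java
import java.util.UUID;

// Illustrative stand-in for the logger called by probes inserted at
// class-load time. A probe added after each 'new' of a collection type
// would invoke something like logCreation(...) to emit one log record.
public class Probe {
    public static String logCreation(String thread, String objectType,
                                     String ownerClass, String method, int line) {
        String id = UUID.randomUUID().toString(); // unique id per object
        // Event code 1 = object creation, matching the log format of Figure 5.2
        // (a real implementation would format the timestamp as a date string).
        return "1," + thread + "," + System.currentTimeMillis() + "," + id + ","
             + objectType + "," + ownerClass + "," + method + "," + line;
    }
}
```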
5.1.3. Data Collection
Once the instrumentation classes are ready, the next step is data collection
and analysis. A target application is executed under the control of the
instrumentation code. The code adds the additional probes to the Java
bytecode and the program then executes. The interesting events are written
into the log file. A probe is inserted at each respective site; for example, after
each new operator a probe reports the object creation, and a unique hashcode
is written to the log file for the newly created object. The specific features we
extract through instrumentation are: 1) object unique identifier, 2) object
owner thread, 3) object type, 4) time of event, and 5) object status. Some static
information is also collected, which includes: 1) source code line number, 2)
package, 3) class, and 4) method name for the object. Figure 5.2 provides a
snapshot of the log file. Extracting a program's runtime information has the
advantage of precision; however, the information may not be complete and
may only show some particular aspects. The information can be extracted
only for those classes which are loaded at the time and whose code is being
executed.
5.1.4. Visualization and user interaction
The key component of the proposed system is the visualization of collection
APIs. As mentioned earlier, we utilize the treemap visualization technique.
The treemap's space-filling approach makes it possible to show a large
amount of information (millions of objects) on a single screen. The system is
implemented in Java, where the graphical user interface (GUI) is built using
the Java Swing toolkit. The treemap visualization is implemented using the
layout presented in [51] with the Prefuse toolkit [172]. Prefuse is a powerful
tool for interactive visualization, with support for different file formats.
Among the advantages of using Prefuse are its object-oriented design and its
support for various layout algorithms. The treemap is used with labels, where
each treemap node is decorated with a label at the top corner, except for leaf
nodes. The leaf nodes carry no label, which gives the treemap a clear look.
For a better look and distinction between the nodes, a border is applied to
each node. The treemap starts with a dummy root, shown at the top (root) of
the treemap. This visualization has two objectives: first, to give a compact
view of all information at a single glance, and second, to provide the user
with an interactive environment to explore further details. The color shows
the level of a node: it gets lighter from the root toward the leaf nodes. The
RGB color scheme is used for coloring. The same base color is used for all
nodes, as our aim is to reveal the hierarchical structure of object creation.
Figure 5.2 Segment of log file
Since the developer is interested in the program locations where objects are
instantiated, the treemap-based visualization presents this information
hierarchically. The hierarchy descends from package to class to method. A
dummy root node is at level 0, followed by package at level 1, class at level 2,
method at level 3, and type at level 4. Finally, the objects are at the leaf
nodes of the treemap.
5.1.4.1 Global view
The system shows a big picture of the whole information to the user at a
glance. The user may interact with this view to find details, and the data can
be filtered into different levels. The APIs can be viewed in a hierarchy from
package to method level. In this view the user can overview the entire
information on one screen. Figure 5.3 shows the global view. The root node
in Figure 5.3 represents a dummy node to start with and is the parent of all
other nodes. The leaf nodes shown in this view represent the objects of each
collection type. The objects are grouped in the view by their type.

(Figure 5.2, excerpt of the log file; each object-creation record lists event
code, thread, timestamp, object id, object type, owning class, method, and
line number:)

1,main,2011-09-13 17:48:02,ab014ef2-9672-4638-a856-80e705035f09,java.util.Vector,org.gjt.sp.jedit.jEdit,<clinit>,3091
3,main,2011-09-13 17:48:02,ab014ef2-9672-4638-a856-80e705035f09,org.gjt.sp.jedit.jEdit,main,115
1,main,2011-09-13 17:48:02,ea33cf16-4a41-4835-9017-eba563cae8ad,java.util.Vector,org.gjt.sp.jedit.io.VFSManager,<clinit>,460
1,main,2011-09-13 17:48:02,dafbc723-26a7-4cad-afdb-200da8d6982e,java.util.LinkedList,org.gjt.sp.jedit.EditBus$HandlerList,<init>,399
1,main,2011-09-13 17:48:02,b9e1c609-4caf-472b-8698-702b2f225372,java.util.LinkedList,org.gjt.sp.jedit.EditBus$HandlerList,<init>,400
1,main,2011-09-13 17:48:02,f0093e4d-d135-4d89-b461-a4f19dd6fd15,org.gjt.sp.jedit.EditBus$HandlerList,org.gjt.sp.jedit.EditBus,<clinit>,197
4,main,2011-09-13 17:48:02,f0093e4d-d135-4d89-b461-a4f19dd6fd15,org.gjt.sp.jedit.EditBus,addToBus,140
4,main,2011-09-13 17:48:02,f0093e4d-d135-4d89-b461-a4f19dd6fd15,org.gjt.sp.jedit.EditBus$HandlerList,addComponent,394
1,main,2011-09-13 17:48:02,15a96c0a-aab7-4f71-a8a4-b8f16accd1c8,java.util.LinkedList,org.gjt.sp.jedit.EditBus$HandlerList,safeGet,297
3,main,2011-09-13 17:48:02,15a96c0a-aab7-4f71-a8a4-b8f16accd1c8,org.gjt.sp.jedit.EditBus$HandlerList,addComponent,394
1,main,2011-09-13 17:48:02,586eb2e9-75e1-4a61-9d34-a7cf15f70106,java.util.Hashtable,org.gjt.sp.jedit.io.VFSManager,<clinit>,463
1,main,2011-09-13 17:48:02,76271a65-e332-4cda-95b5-2eb60a51e3d2,java.util.Hashtable,org.gjt.sp.jedit.io.VFSManager,<clinit>,464

The view also supports tooltips, which appear as the mouse moves over a
treemap node and show the important information about that node. The user
interface always shows each collection API with the number of objects
created for it. One issue with this treemap is that larger nodes cover more
area and can be seen more clearly than the smaller nodes. As shown in
Figure 5.3 (a), with an increase in the number of inner nodes the respective
rectangles get smaller and hence produce a space-efficient visualization.
Although we have inserted probes at both the object creation and object
destruction sites, the visualization shows all objects created during program
execution, because the developer is primarily interested in the objects, and
their locations, that remain in memory for a longer time.
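The per-type object counts shown next to each collection API in the global view can be sketched as a simple aggregation. The class and method names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

// Counts objects per collection type: the figure shown next to each
// API in the global view. Illustrative sketch over an array of type
// names taken from the log records.
public class TypeCounts {
    public static Map<String, Integer> count(String[] objectTypes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : objectTypes) counts.merge(t, 1, Integer::sum);
        return counts;
    }
}
```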
5.1.4.2 Interactivity and sub views
The system also supports interactivity. Interaction is available through the
provided interface to explore detailed information and gain insight. The user
can zoom into a node by clicking it. For example, to overview the different
collection API objects in a particular package, the treemap will, on a mouse
click, show only that particular node and its sub-nodes; the information of all
other nodes becomes invisible. Similarly, the user can see information about a
node by bringing the mouse over it. Figure 5.4 presents a package-wise view
of the visualization. The user can explore the information on the basis of its
type, package, class, and thread. A user exploring the hashtable collection,
for instance, is able to see information about this collection, such as the
number of objects created so far and the packages creating these objects.
Using this information, the user can deduce the frequently used collection
APIs and the classes responsible for the creation of these objects. The
developer may use this information to improve program performance and
address maintenance issues. Figure 5.5 shows another view of collection API
objects, based on the mutator methods called by different objects during
program execution. Each small inner rectangle represents the objects that call
a particular method. The user interface provides a text-based search facility to
find specific information in a particular visualization. The search facility is
available in various views of the system. Figure 5.6 shows the search result
for HashTable objects, in medium purple, which were used by different
classes of jEdit [173].
5.2. Case study
We take several open-source Java software tools, execute them under our
instrumentation program, and collect the execution traces of each. Table 5.2
lists the key features of the log file generated by each program run for 200
seconds on a 2.93 GHz Intel Core i3 system. Figure 5.7 shows a visualization
generated from all programs collectively, with the number of objects created
for each collection API. Appendix-B contains the visualizations of the ten
software tools listed in Table 5.2. Collecting events over a longer time is also
possible, provided large storage is available. Table 5.2 shows that the ten
programs, executed for the same duration, produce varying log file sizes,
object counts, and other attributes. This is because the internal
implementation of each program is different and has no direct correlation
with the running time.
Chapter 5 Visualizing of Traces of Java Program
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 162
Table 5.2 Log file details

Program | Running time (Sec.) | Log file size (MBs) | Objects created (Approx.) | Accessor methods called (Approx.) | Mutator methods called (Approx.)
Prefuse | 200 | 110 | 54 | 1045600 | 16636
Browser | 200 | 63 | 230 | 523530 | 63290
Eclipse | 200 | 20 | 7765 | 78400 | 37790
JMoney | 200 | 15 | 11935 | 54200 | 31965
JEdit | 200 | 12 | 2170 | 59800 | 4500
freemind | 200 | 10 | 11900 | 55000 | 936
Fire | 200 | 10 | 4660 | 38100 | 18957
M3D | 200 | 06 | 40 | 71 | 57200
JHotdraw | 200 | 03 | 1670 | 8770 | 24058
jImage | 200 | 0.4 | 80 | 2540 | 680
Figure 5.3 Visualization main view
Figure 5.4 Visualization Package-wise view
Figure 5.5 Mutator methods view
Figure 5.6 Search result for HashTable
Using the proposed visualization, a comparison of different programs
based on their use of APIs is performed. Figure 5.8 (a) shows the visualization
for the M3D program, which uses only two types of collection objects, i.e.,
vector (19 objects) and stack (17 objects). Figure 5.8 (b) shows that the
JMoney program creates over 7000 arraylist objects and almost 3000 objects
of hash types. Other rectangles show the HashMap and HashSet collection
APIs, respectively, with over 800 leaf nodes inside each rectangle. Figure 5.9
shows the objects created in Eclipse and the mutator methods. Through the
visualization it is found that Eclipse creates over 7000 collection objects; this
program therefore makes extensive use of such collection objects. Table 5.3
summarizes the collection API objects created during the execution of the
test cases.
5.3. Performance evaluation and comparison
To evaluate how well the system supports the programmer in
understanding collection API usage, we conducted multiple experiments.
Since the empirical method is an effective form of quantitative evaluation of
information visualization tools [55], we used the same. A controlled
experiment was designed to evaluate the effectiveness of the visualization
tool in tasks related to understanding API usage in large programs. The
objective of the experiment is to measure the impact of the proposed
visualization on better software/program comprehension in less time. We
evaluated the proposed approach using twenty benchmark software tools.
The evaluation is based on the delay due to instrumentation and an internal
evaluation metric. The overhead time and the slowdown factor are measured
to investigate the performance of the instrumentation code. These
experiments are performed for both the target applications and the
benchmark tools.
Table 5.3 Collection APIs per program (number of objects created)

Program   Hashtable  ArrayList  HashMap  LinkedList  HashSet  Vector  Stack  StringBuffer  Other
JEdit     908        310        280      300         -        250     201    -             -
JHotdraw  -          1616       -        -           -        -       -      -             50
Prefuse   -          24         6        -           -        -       -      -             30
jImage    12         2          10       -           -        37      -      -             12
Browser   -          139        -        -           -        -       -      -             60
JMoney    1240       7880       890      -           980      -       -      -             2
Eclipse   -          1235       2378     -           950      -       -      -             100
Fire      -          1270       2320     -           957      -       -      -             70
freemind  -          -          -        -           -        -       -      10550         50
M3D       -          -          -        -           -        19      17     -             -
Figure 5.7 All programs visualization
Figure 5.8 Collection APIs objects usage (a) M3D (b) JMoney
Figure 5.9 Eclipse objects vs. mutators calls
5.3.1. Experiment Design
This experiment is used to quantitatively evaluate the effectiveness of the
visualization, which helps the developer identify the various collection APIs
used during program execution. At the same time, the evaluation also
examines the insight provided by the tool for finding new information. To
this end, the following research questions were formulated.
1. Does the use of the proposed visualization reduce the time needed
to find collection APIs in a program?
2. Does the use of the proposed visualization reveal the hierarchy
(package, class, method) for objects of the respective APIs?
3. Can a developer determine the APIs and the number of objects for
each API?
4. Does this information help the developer in program comprehension?
Two null hypotheses are devised. The hypotheses with their respective
alternates are given in Table 5.4.
Table 5.4 Null hypotheses with their alternatives

Null Hypothesis                                      Alternate Hypothesis
H10: The tool does not reduce the time to            H1: The tool reduces the time to understand the
understand the collection APIs usage in a program    collection APIs usage in a program
H20: The tool is not useful in comprehension tasks   H2: The tool is useful in comprehension tasks
A. Subjects
A total of 24 subjects participated in the controlled experiment; they were
chosen from the target groups. The demographics of these subjects are listed
in Table 5.5. The subjects were master- and bachelor-level students with
background knowledge of computer programming and the Java collection
APIs; however, they had no experience with the proposed visualization. The
subjects were selected on a voluntary basis and were motivated for the
tasks. They were divided into two groups, an experimental group and a
control group. The experimental group was provided with the visualization
tool, while the control group was not.
Table 5.5 Demographics of the subjects
Characteristics Control Group Experimental Group
Age (years) 21-29 21-29
Education (years) 15-18 15-18
Programming experience (years) 1-5 1-5
B. Object System and Tasks
A large number of open-source software tools is available; the ten tools
listed in Table 5.2 were selected. One of these ten tools is the popular text
editor jEdit. It consists of around 900 classes in 32 packages, with
approximately 5,000 methods. The primary rationale behind the selection of
jEdit is its popularity among Java developers as well as the availability of
its source code. In addition, nine other tools are also used in evaluating the
proposed visualization. The subject systems (mentioned in Table 5.2) are
open-source tools/applications, and we put our tracking classes in their
application path before executing them. The probe is inserted into an
application's classes at load time and does not change the application's
original bytecode. As the program executes, the probe also runs and
generates event-based information, which is stored in a log file.
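Load-time probe insertion of this kind is typically done with a `java.lang.instrument` agent registered through the `-javaagent` flag. The skeleton below is a hypothetical sketch, not the dissertation's actual instrumentation code: this transformer only records which classes of interest are loaded and returns null, which leaves the original bytecode unchanged; real probe insertion would rewrite the class bytes (e.g., with a bytecode library such as ASM or Javassist).

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Skeleton of a load-time instrumentation agent (hypothetical names).
// Launch with: java -javaagent:probe.jar -jar application.jar
public class CollectionProbeAgent {

    // Called by the JVM before the application's main method runs.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer());
    }

    public static class LoggingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // Only classes of interest (e.g., java/util collections) would be
            // rewritten; this sketch merely emits an event for them.
            if (className != null && className.startsWith("java/util/")) {
                System.out.println("event: loaded " + className);
            }
            return null; // null means the original bytecode is left unchanged
        }
    }
}
```

Returning null from `transform` is the standard way to opt out of modifying a class, which matches the text's point that untargeted classes keep their original bytecode.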
The basic concern of the tasks in our experiment is whether the subject is
able to find and identify the collection APIs used in a software tool. The
subject should also be able to understand the hierarchy of package, class,
and method for a particular collection API object. The respective tasks with
descriptions are listed in Table 5.6. The tasks are presented to both
(experimental and control) groups as open questions. These tasks are mostly
related to program understanding and analysis through collection API
objects. The tasks include: a) finding the collection APIs used in a program,
b) identifying the most used collection type, and c) identifying the
package/class in a program where a particular type of collection object was
created.
Table 5.6 Task description
Task Description
T1 Identify the collection APIs used in a program
T2 Identify the packages and classes of collection APIs in a program
T3 List the 3 classes responsible for creating the most objects of a particular API
T4 Identify the methods that create the maximum number of collection API objects
C. Experiment Procedure
The experiments were performed in different sessions with both groups,
on computer systems with similar specifications. The subjects were given
time to familiarize themselves with the task and the environment. The
experimental group was provided with the tool, while the control group
used only the IDE and the textual data to extract the information. Thus, the
visualization was available to the experimental group, while the control
group had no such facility. The subjects were asked to find the collection
API usage; with the tool, they could easily see the collections and their
usage across various packages and classes. The experiment's data were
collected and recorded for further analysis and evaluation. We performed
the experiments for both hypotheses simultaneously.
The first experiment is related to the hypothesis that "the user can
understand collection API usage easily using the proposed tool and
reduce the time required for this task." The subjects were given the task to
understand collection API usage. The independent variable in the
experiment was the visualization tool, while the dependent variables were
the time taken to complete the task and its accuracy. The experimental
group quickly identified the collection APIs used in the program through
the visualization. The control group took more time and was not able to
identify all the collection API usages. To evaluate the second hypothesis,
we recorded the score given by each subject after a particular task was
performed. The score is on a scale of 0-5, where 0 stands for "not useful"
and 5 stands for "very useful".
D. Variables Analysis
The experiment involves several independent and dependent variables.
The independent variables are the availability of the proposed visualization
tool and the size of the system under consideration. The dependent variables
are the completion time of a particular task and the per-task usefulness
score provided by the subjects. Table 5.7 shows the time taken and the
usability points for each subject of the experimental group. Table 5.8 shows
the same for each subject of the control group. The task analysis, shown in
Table 5.9, describes per-task statistics for the experiment.
Table 5.7 Experimental group statistics for time and usability score (0-5)
Subject  Task1 time (min.)  Task2 time (min.)  Task3 time (min.)  Task4 time (min.)  Total time (min.)  Task1 usability (0-5)  Task2 usability (0-5)  Task3 usability (0-5)  Task4 usability (0-5)  Total usability
Subject 1 17 13 5 11 46 5 3 4 0 12
Subject 2 15 16 6 12 49 4 2 1 3 10
Subject 3 10 9 4 11 34 5 2 4 3 14
Subject 4 9 12 5 8 34 5 1 4 4 14
Subject 5 12 8 7 8 35 3 2 5 4 14
Subject 6 8 9 4 11 32 3 4 4 1 12
Subject 7 11 8 6 6 31 4 3 2 4 13
Subject 8 8 9 7 6 30 4 3 0 3 10
Subject 9 17 14 5 8 44 3 2 4 3 12
Subject 10 16 9 4 9 38 4 2 2 1 9
Subject 11 13 11 6 8 38 5 1 3 1 10
Subject 12 11 9 7 10 37 4 1 3 0 8
Average 12.30 10.60 5.50 9.00 37.30 4.10 2.20 3.00 2.30 11.50
Median 11.50 9.00 5.50 8.50 36.00 4.00 2.00 3.50 3.00 12.00
Std. dev. 3.30 2.60 1.20 2.00 6.10 0.79 0.94 1.48 1.50 2.10
Table 5.8 Control group statistics for time and usability score (0-5)
Subject  Task1 time (min.)  Task2 time (min.)  Task3 time (min.)  Task4 time (min.)  Total time (min.)  Task1 usability (0-5)  Task2 usability (0-5)  Task3 usability (0-5)  Task4 usability (0-5)  Total usability
Subject 1 19 18 13 11 61 2 3 1 0 6
Subject 2 18 16 11 12 57 2 3 2 2 9
Subject 3 17 19 10 14 60 3 2 3 3 11
Subject 4 18 12 13 11 54 2 3 0 0 5
Subject 5 17 15 11 16 59 3 1 3 3 10
Subject 6 19 17 7 12 55 2 2 1 2 7
Subject 7 16 11 9 11 47 3 1 3 3 10
Subject 8 15 14 15 9 53 2 3 1 0 6
Subject 9 18 16 8 10 52 3 2 2 1 8
Subject 10 14 13 9 12 48 2 1 2 2 7
Subject 11 18 15 11 9 53 1 3 3 2 9
Subject 12 17 13 8 13 51 2 3 1 0 6
Average 17.2 14.9 10.4 11.7 54.2 2.30 2.30 1.80 1.50 7.80
Median 17.5 15.0 10.5 11.5 53.5 2.00 2.50 2.00 2.00 7.50
Std. dev. 1.5 2.4 2.4 2.0 4.5 0.62 0.87 1.03 1.20 1.90
Figure 5.10 Boxplot representation of the results from the control and experimental groups: (a) total time (minutes) and (b) total usability score (points)
Table 5.9 Per task comparison
E. Analysis of time completion (hypothesis H10)
The first null hypothesis states that the tool does not reduce the time
required to understand collection API usage in a program. The mean time
for the experimental group is 37.3 minutes, while for the control group it is
54.2 minutes. As the results in Table 5.10 show, the experimental group on
average completed the tasks in approximately 45% less time than the
control group. The Shapiro-Wilk test was performed to verify the normality
of the samples in both groups. The results are shown in Table 5.10; if the
value of W is greater than the critical value, the null hypothesis (of
normality) cannot be rejected and the samples are considered normal. The
Levene test gives a value greater than 0.05, indicating equal variance in
both samples. For the 12 samples in each group, the t-statistic was
calculated at the significance level α = 0.05, which gives a p-value of
0.000009. The p-value is much lower than α; thus, the null hypothesis H10
is rejected. It is concluded that there is a considerable difference between
the groups: in most cases, the tool helped the user understand API usage.
Figure 5.10 depicts the time taken by both groups as a boxplot showing the
total time taken by each subject across all tasks. As shown in the figure, the
range of time in minutes is lower for the experimental group than for the
control group.
        Time (minutes)                           Usability (points)
Task    Exp. Group  Control Group  % Diff.      Exp. Group  Control Group  % Diff.
Task1   147         206            -28.64       49          27             81.48
Task2   127         179            -29.05       26          27             3.7
Task3   66          125            -47.2        36          22             63.63
Task4   108         140            -22.85       27          18             50
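The reported t statistic can be reproduced from the per-subject totals in Tables 5.7 and 5.8. The dissertation does not name the exact test variant, so the use of Welch's two-sample t-test below is an assumption, chosen because it matches the t ≈ 7.7 and df ≈ 20 reported in Table 5.10; the class and method names are illustrative.

```java
// Welch's two-sample t-test applied to the total-time columns of
// Tables 5.7 (experimental group) and 5.8 (control group).
public class TimeTTest {

    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    static double variance(double[] xs) { // sample variance, n - 1 denominator
        double m = mean(xs), s = 0;
        for (double x : xs) s += (x - m) * (x - m);
        return s / (xs.length - 1);
    }

    // Welch's t statistic for two independent samples.
    static double welchT(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    // Welch-Satterthwaite approximation of the degrees of freedom.
    static double welchDf(double[] a, double[] b) {
        double va = variance(a) / a.length, vb = variance(b) / b.length;
        return (va + vb) * (va + vb)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
    }

    public static void main(String[] args) {
        // Total-time columns from Tables 5.8 and 5.7, respectively.
        double[] control      = {61, 57, 60, 54, 59, 55, 47, 53, 52, 48, 53, 51};
        double[] experimental = {46, 49, 34, 34, 35, 32, 31, 30, 44, 38, 38, 37};
        System.out.printf("t = %.1f, df = %.0f%n",
                welchT(control, experimental), welchDf(control, experimental));
        // Consistent with Table 5.10: t = 7.7, df = 20.
    }
}
```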
Table 5.10 Results statistics

                                                 One-tail Student t-test      Levene test  Shapiro-Wilk test
            Group       Mean  Diff.    Max  Min  SD   p-value    t    df      p-value      p     W
Time        Control     54.2  -45.30%  61   47   4.5  0.0000019  7.7  20      0.73         0.82  0.97
            Experiment  37.3           49   30   6.1                                       0.23  0.91
Usability   Control     7.3   37.17%   4    0    1.9  0.000192   4.2  21      0.28         0.16  0.90
            Experiment  11.5           5    0    2.1                                       0.23  0.912
F. Analysis of the tool's usability score (hypothesis H20)
In the case of the second hypothesis, the experimental group's score is
37% better than the control group's mean point score, as shown in Table
5.10. We also calculated the t-statistic for both groups. After verifying
normality through the Shapiro-Wilk test and variance equality through the
Levene test, the t-test was performed on the usability results. For the
significance level α = 0.05, the p-value is calculated as 0.00019, which is
much less than α. Therefore, the null hypothesis H20 is rejected.
Figure 5.10 (b) shows the boxplot of the scores, in points, for both
groups, taking the sum of the total score for each subject. The range of
points for the experimental group is higher than for the control group;
hence, the usability of the proposed tool is better.
G. Task analysis
This section presents an analysis of the tasks for both types of
measurements, i.e., time taken and usability score. Figure 5.11 (a) shows
the average time taken for each task by the experimental and control
groups. In general, the experimental group performs better on all tasks
compared to the control group. Since the control group had to examine text
files of several megabytes, they took more time. In contrast, the
experimental group visually analyzed the data and quickly performed their
tasks. Similarly, Figure 5.11 (b) compares the average usability score for
both groups. For all tasks, the proposed tool scored better than the scores
assigned by the control group.
5.4. Performance evaluation
This section reports the performance of the proposed tool over the ten
open-source Java software tools used in the case study. The evaluation of
the proposed approach using twenty benchmark software tools is also
presented.
The instrumentation process always introduces a performance overhead
for the target software. Due to selective instrumentation, the increase in
program execution time and trace-writing time is negligible. Table 5.11
lists the time taken by each of the ten software tools with and without the
instrumentation code. It also lists the size of the log file, in megabytes
(MB), generated during a session of 200 seconds. The instrumentation
overhead degrades the startup of a particular program by some factor. The
time taken by a program to write trace data to the hard disk is ignored.
During startup, Eclipse takes
Figure 5.11 Per-task analysis: (a) average time (minutes) and (b) average usability score (points) per task for the experimental and control groups
more time since it has a larger number of classes to load. In the case of
instrumentation for Eclipse, the program takes almost twice as much time to
load due to the extra code that the tool inserts at runtime. However, after
startup, Eclipse runs smoothly and the instrumentation does not slow down
the application.
Table 5.11 Software loading time with and without instrumentation

Software  Time without instrumentation (s)  Time with instrumentation (s)  Log file size (MB)
Prefuse 10 17 110
Browser 4 8 63
Eclipse 25 49 20
JMoney 3 7 15
JEdit 5 12 12
Freemind 3 5 10
Fire 5 11 10
M3D 4 9 6
JHotdraw 6 11 3
jImage 4 7 0.4
The slowdown factor, Sf, is calculated using Eq. (5.8), where logSize is
the log file size, t' is the time taken without instrumentation, and t is the
time taken with instrumentation:

Sf = (logSize / t') - (logSize / t)     (5.8)
Figure 5.12 shows the slowdown for the ten software tools used in this
chapter. It is clear from Figure 5.12 that for eight out of ten software tools
the slowdown is less than 3; for five it is even less than 1, which is negligible.
However, for two software tools (Browser and Prefuse) the slowdown is
high, i.e., 4.5 and 7.8, respectively. This is because these two software tools
instantiate fewer objects during start-up. The slowdown decreases when the
log file size is large and the software tool instantiates a large number of
objects in its start-up phase, which amortizes the instrumentation overhead.
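The printed form of Eq. (5.8) is partly garbled, so the reading Sf = logSize/t' - logSize/t used below is an interpretation; under that reading, the values in Table 5.11 reproduce the magnitudes discussed in this section. The class and method names in this sketch are illustrative.

```java
// Slowdown factor per the reconstructed Eq. (5.8):
// Sf = logSize/t' - logSize/t, where t' is the load time without
// instrumentation and t the load time with instrumentation. This is an
// interpretation of the partly garbled printed equation.
public class SlowdownFactor {

    static double slowdown(double logSizeMb, double tWithout, double tWith) {
        return logSizeMb / tWithout - logSizeMb / tWith;
    }

    public static void main(String[] args) {
        // Rows from Table 5.11: name, t without (s), t with (s), log size (MB).
        Object[][] rows = {
            {"Prefuse", 10.0, 17.0, 110.0},
            {"Browser",  4.0,  8.0,  63.0},
            {"Eclipse", 25.0, 49.0,  20.0},
            {"JMoney",   3.0,  7.0,  15.0},
            {"JEdit",    5.0, 12.0,  12.0},
        };
        for (Object[] r : rows) {
            double sf = slowdown((double) r[3], (double) r[1], (double) r[2]);
            System.out.printf("%s: Sf = %.2f%n", r[0], sf);
        }
    }
}
```

Under this reading, Eclipse's slowdown comes out below 1 (negligible) while Browser and Prefuse come out well above 3, matching the pattern described in the text.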
Figure 5.12 Slowdown for the ten software tools due to instrumentation
Figure 5.13 Runtime overhead (time in seconds, with and without instrumentation) for the benchmark tools: avrora, batik, eclipse, fop, h2, jython, luindex, lusearch, pmd, sunflow, tomcat, xalan, compiler.compiler, compiler.sunflow, derby, serial, sunflow, helloworld, xml.transform, and xml.validation
Figure 5.14 Normalized slowdown factor and log file size (MB) for the same twenty benchmark tools
To further verify the performance of the proposed approach, we
calculated the time delay caused by instrumentation using the DaCapo and
SPECjvm2008 benchmark tools. This gives a total of twenty software tools
used in verifying the performance of the proposed approach. Figure 5.13
shows the runtime overhead due to the instrumentation code. Each
benchmark application ran with and without instrumentation, and the
runtime of each application is normalized to that of the original code.
Applications with fewer classes have lower overhead considering their load
time, while others with a large number of classes show a higher impact on
runtime. As shown in Figure 5.13, serial, compress, derby, and transform
take more time, which is mainly due to their larger code base.
The slowdown factor, Sf, was also calculated using Eq. (5.8) for the
twenty benchmark applications; the results are shown in Figure 5.14. The
slowdown factor is determined by the instrumentation code and the output
file size. Each benchmark's output file size is normalized before the
slowdown factor is computed. Figure 5.14 also shows the relative
comparison of output file size and slowdown factor for the twenty
benchmark tools. The graph in Figure 5.14 shows that, as the file size
increases, the slowdown factor also increases.
5.5. Chapter Summary
This chapter presented a treemap-based tool to support the comprehension
of Java collection API usage in a software program. The proposed tool
enables the developer to comprehend API usage at different levels of
abstraction. Dynamic data are collected during program execution and
visualized using a treemap. The tool provides an interactive facility to
evaluate API usage at the method, class, and package levels while also
providing a global view. The proposed system is helpful in general program
comprehension and API usage evaluation, and it supports program
maintenance activities. The evaluation results confirm the tool's
effectiveness. The proposed tool can be used on a
software corpus to analyze the frequency of collection API usage. It has also
been evaluated using twenty benchmark software tools, and the results
showed good performance.
Chapter 6 Conclusion and Future work
Conclusions and Future Work
"Nothing ever becomes real until it is experienced." John Keats
This dissertation contributes to the domain of
information visualization from three aspects: automatic
selection of a visualization technique, quantifying visualization
and optimizing layout, and the use of visualization for
dynamically collected software data. The main objective of
this work was to build a computational intelligence-based
framework for visualization technique prediction and
optimization. The visualization technique prediction for a
particular dataset is based on the characteristics of the dataset
and the related tasks that are to be performed on the data.
Furthermore, the study analysed and formulated
visualization metrics computed using the existing knowledge
of human perceptual theories. The visualization metrics were
used to automatically build a perceptually and aesthetically
appealing visualization. Moreover, a visualization-based tool
was utilized to understand and gain insight into dynamically
collected data on collection APIs in Java programs. A
bytecode instrumentation framework was developed to collect
a trace of Java collection API objects from a program and
visualize it using a treemap-based visualization technique. This
chapter revisits the research questions posed at the beginning
of this dissertation. The chapter also lists some limitations of
this work, followed by prospective directions.
Information visualization is becoming ubiquitous in every field,
especially in domains where large volumes of data need to be analyzed for
pattern recognition, data mining, and knowledge discovery. Visualization,
being an efficient and effective approach, helps the user gain insight into
data quickly. Visual analysis expedites the decision-making process by
providing early insight into the data. However, visualization is not merely a
pretty picture: an inappropriate visualization can lead to erroneous
decisions. The work presented in this dissertation tackles information
visualization from three aspects: appropriate visualization selection,
quantifying and optimizing visualization, and using treemap-based
visualization to analyse Java collection API usage data collected through
bytecode instrumentation. A careful investigation of existing research in the
field of visualization shows that the selection of an appropriate
visualization for a particular dataset is mainly influenced by the metadata
and the related tasks. Furthermore, using the existing knowledge and
empirical research findings, the study devised computationally measurable
visualization metrics to build aesthetic visualizations. It utilized a
computational intelligence-based framework for automatic visualization
selection and used evolutionary computation for better visualization layout.
Moreover, bytecode instrumentation and treemap-based visualization were
used as an effective method to understand Java collection API usage during
the course of a program's execution. In the next section, the research
questions posed in the first chapter are revisited along with the findings.
6.1. Primary research questions
The introductory chapter of this dissertation posed a few primary
research questions, the answers to which were investigated in the rest of the
dissertation. The research questions covered the three aspects of
information visualization this research undertook: automatic visualization
selection, visualization optimization, and using treemap visualization for
the comprehension of runtime Java API usage. This section revisits those
primary research questions one by one and discusses the answers provided
in different sections of this dissertation.
RQ-1: What are the important characteristics of a dataset that
influence the selection of a visualization technique?
This research question is addressed in chapter 3 by
carefully investigating the existing literature and through
experience developing visualization tools. It was found that
there is no standard dataset or set of rules available to guide
a naïve user in selecting an appropriate visualization
technique for his/her dataset. Nevertheless, some
visualization types are more suitable in a particular context.
Each dataset has its own characteristics, known as metadata,
and this metadata plays an important role in the selection of
a visualization technique. This study identified and
formulated four main metadata attributes: data dimension,
number of instances, number of attributes, and primary
attribute type, which are important for the selection of an
appropriate visualization. The dimension of a dataset can be
1D, 2D, or 3D, and the data can be hierarchical too.
Additionally, the four primary attribute types considered
were: ordinal, continuous, categorical, and geographical.
Furthermore, the study found that the task to be
accomplished with a dataset is also a key factor in the
selection of a visualization technique. After a careful
treatment of the available literature, four tasks, i.e.,
relationship, trends, distribution, and comparison, were
taken into account. Therefore, along with the metadata
attributes, these tasks also influence the selection of an
appropriate visualization technique for a specific dataset.
A sensitivity analysis of each parameter was performed to
show its relative importance and individual influence. It was
found that the most important characteristics of the dataset
for visualization selection are the dimension, the primary
attribute, and the task. Hence, the removal of each input
parameter from the model influences the accuracy.
However, when the dimension and primary attribute are
both omitted from the input, the error rate increases by more
than 41%.
RQ-2: How can metadata and a particular task related to the data be
used to predict a visualization for a dataset?
While addressing the second research question, chapter 3
focused on a systematic way to handle metadata, relevant
tasks, and visualization. Using the contemporary knowledge
in the literature, a novel dataset was built, mapping metadata
and tasks that are to be accomplished through visualization.
The newly created dataset comprised metadata about the
original dataset, the relevant tasks that need to be performed,
and the visualization technique used for that particular
dataset. One main issue with building such a dataset is the
limited availability of constituent instances. A dataset with
almost four hundred instances, consisting of records on
eight different visualization techniques, was built. This
dataset was then utilized for training and testing an artificial
neural network (ANN)-based model to classify the current
instances and predict a visualization for unseen data
instances. The input to the ANN model consists of four
metadata attributes and the relevant task, while the output is
an appropriate visualization from a set of eight visual
techniques. An exhaustive training experiment was
performed with different ANN models in combination with
various training parameters. The empirical results showed
that the ANN-based computational intelligence model
accurately predicts a visualization by incorporating the
metadata and relevant tasks. The details of various aspects
of these experiments may be seen in chapter 3.
RQ-3: What is the best CI model to predict visualization based on
metadata?
This question has been addressed by putting emphasis on
the selection of the best possible computational
intelligence-based model. In chapter 3, the experiment and
discussion section comprehensively covers the various steps
and techniques used to achieve this objective. Initially,
several feed-forward neural network (FFNN)-based models
were deployed with various numbers of neurons in the
hidden layer, while keeping the input and output neurons
fixed. Each model was provided with five input neurons and
eight neurons in the output layer. The deployed models
were tested with different combinations of training methods
and neural network parameters to make sure all possible
scenarios were evaluated. Performance parameters, i.e.,
accuracy, sensitivity, precision, and correlation coefficient,
were used to evaluate each model. The detailed analysis
shows that the ANN model with 14 neurons in the hidden
layer achieves the highest accuracy of almost 98%, making
it the best possible model for the problem at hand.
To compare the best ANN model with five other well-
known classifiers, another exhaustive experiment was
carried out to ensure the suitability of the best model. The
relative performance of these classifiers on common
parameters showed that the ANN-based model is the most
suitable CI-based approach in the context of automatic
visualization prediction.
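Concretely, the selected architecture corresponds to a forward pass of the shape sketched below: five inputs (the four metadata attributes plus the task), 14 hidden neurons, and eight outputs, one per candidate visualization technique. The weights here are random placeholders rather than the trained model, and the class name, seed, and sample input are illustrative assumptions.

```java
import java.util.Random;

// Forward pass of a 5-14-8 feed-forward network, matching the best
// configuration reported in chapter 3. Weights are random placeholders,
// not the trained model; this sketch only illustrates the classifier's shape.
public class VisualizationFfnn {
    static final int IN = 5, HIDDEN = 14, OUT = 8;
    final double[][] w1 = new double[HIDDEN][IN + 1];  // +1 for bias
    final double[][] w2 = new double[OUT][HIDDEN + 1]; // +1 for bias

    public VisualizationFfnn(long seed) {
        Random r = new Random(seed);
        for (double[] row : w1)
            for (int i = 0; i < row.length; i++) row[i] = r.nextGaussian() * 0.1;
        for (double[] row : w2)
            for (int i = 0; i < row.length; i++) row[i] = r.nextGaussian() * 0.1;
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Returns a softmax distribution over the eight visualization techniques.
    public double[] predict(double[] input) {
        double[] hidden = new double[HIDDEN];
        for (int h = 0; h < HIDDEN; h++) {
            double s = w1[h][IN]; // bias term
            for (int i = 0; i < IN; i++) s += w1[h][i] * input[i];
            hidden[h] = sigmoid(s);
        }
        double[] out = new double[OUT];
        double sum = 0;
        for (int o = 0; o < OUT; o++) {
            double s = w2[o][HIDDEN]; // bias term
            for (int h = 0; h < HIDDEN; h++) s += w2[o][h] * hidden[h];
            out[o] = Math.exp(s);
            sum += out[o];
        }
        for (int o = 0; o < OUT; o++) out[o] /= sum; // softmax normalization
        return out;
    }
}
```

The predicted technique would be the output neuron with the highest probability; training such a network is what the exhaustive experiments in chapter 3 explored.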
RQ-4: What aesthetic and perceptual design parameters are
important for a specific visualization?
The field of information visualization is concerned with
evaluating and creating optimized visualizations that are
aesthetically better and perceptually pleasing. However,
building such visualizations is a non-trivial task,
particularly for the naïve user. The existing knowledge in
the fields of empirical visualization research, human-
computer interaction (HCI), cognitive science, and
perceptual theories can be formulated to find the basic
characteristics of a better visualization. In answering this
question, chapter 4 provides a rigorous study evaluating
and investigating all such contemporary theories and
knowledge. Each visualization technique has some
apparently unique properties, e.g., border size and colour
for treemaps, number of crossing lines for parallel
coordinates, and node colours for graphs. However, there
are also some common design attributes that contribute to
the creation of better visualizations. The detailed studies
show that these parameters are of both types: common to
all visualizations and unique to a particular visualization
technique. Moreover, optimal or sub-optimal values of
these parameters can be found by careful investigation of
the existing knowledge. Chapter 4 analyses and explores
various design parameters for the treemap visualization
technique.
RQ-5: How the visualization features and design parameters map
to the visualization metrics?
Over the years, many metrics have been presented in the
literature to evaluate and compare visualizations.
However, these visualization metrics are mostly theoretical
and, due to the subjective nature of the problem, automatic
visualization optimization is difficult. Chapter 4 proposed
four visualization metrics: effectiveness, expressiveness,
readability, and interactivity. These metrics were exploited
computationally to automatically optimize a visualization.
The basic idea is to map visualization-specific attributes to
the visualization metrics; each visualization attribute may
map to more than one metric at the same time. The
mapping process was based on the domain knowledge
available in the literature for each visualization metric.
Additionally, a combined metric assigned a weight to each
attribute, enabling the customization of a visualization with
specific attributes.
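The attribute-to-metric mapping and the weighted combined metric can be sketched as follows. This is an illustrative sketch only: the attribute names, normalized values, and weights below are assumptions, not the dissertation's actual parameters.

```python
# Illustrative sketch (not the dissertation's implementation): mapping
# visualization design attributes to the four metrics of Chapter 4 and
# combining them into a single weighted score. Attribute names,
# normalized values, and weights are assumed for demonstration.

# Each attribute may contribute to more than one metric.
ATTRIBUTE_TO_METRICS = {
    "border_size":     ["readability", "effectiveness"],
    "colour_contrast": ["effectiveness", "expressiveness"],
    "zoom_support":    ["interactivity"],
}

def metric_scores(attributes):
    """Average the normalized attribute values mapped to each metric."""
    totals, counts = {}, {}
    for attr, value in attributes.items():
        for metric in ATTRIBUTE_TO_METRICS.get(attr, []):
            totals[metric] = totals.get(metric, 0.0) + value
            counts[metric] = counts.get(metric, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

def combined_metric(attributes, weights):
    """Linear weighted sum of the visualization metrics."""
    scores = metric_scores(attributes)
    return sum(weights.get(m, 0.0) * s for m, s in scores.items())

attrs = {"border_size": 0.8, "colour_contrast": 0.6, "zoom_support": 1.0}
w = {"effectiveness": 0.4, "expressiveness": 0.2,
     "readability": 0.2, "interactivity": 0.2}
print(round(combined_metric(attrs, w), 3))
```

Changing the weight vector `w` is what allows a user to customize the optimization towards, say, readability over interactivity.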
RQ-6: How do the visualization metrics computationally evolve to
optimize the layout of a visualization technique?
The question related to the optimization and computational
evolution of the metrics is also addressed in Chapter 4. A
CI technique was used to formulate a framework that
computes and evolves a set of candidate solutions using the
visualization metrics. The visualization metrics were
combined into a common fitness function as a linear sum of
the four metrics, each weighted by a coefficient that could
vary. The problem was formulated by encoding the
visualization design parameters into fixed-length
chromosomes, and a random population of such
chromosomes was created to provide an initial seed to the
system. The framework used several evolutionary operators
and the combined fitness function to evolve the initial
population in search of the best possible solution, and was
evaluated with different configurations of operators and
related parameter values to obtain optimum results. A
treemap-based case study was presented to validate the
effectiveness of the proposed framework. The exhaustive
experimental results show that the framework provides a
computationally sound method to optimize a visualization
for better aesthetic and perceptual properties. Furthermore,
the optimized visualization was evaluated using user
surveys and statistical analysis of the results.
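The evolutionary loop described above can be sketched roughly as follows. The chromosome length, the weights, and the placeholder fitness internals are assumptions made for illustration; the actual framework's operators and encodings are detailed in Chapter 4.

```python
# Illustrative sketch: an evolutionary loop over fixed-length
# chromosomes of visualization design parameters, scored by a linear
# weighted sum of four metrics. The fitness internals are placeholders.
import random

WEIGHTS = (0.4, 0.2, 0.2, 0.2)  # effectiveness, expressiveness, readability, interactivity
CHROMOSOME_LEN = 6              # number of encoded design parameters (assumed)

def fitness(chromosome):
    """Placeholder: pretend each gene contributes equally to every metric."""
    metrics = [sum(chromosome) / len(chromosome)] * 4
    return sum(w * m for w, m in zip(WEIGHTS, metrics))

def evolve(pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(CHROMOSOME_LEN)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]               # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, CHROMOSOME_LEN)   # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:         # mutation
                child[rng.randrange(CHROMOSOME_LEN)] = rng.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(round(fitness(best), 3))
```

Because the top half of each generation is carried over, the best fitness in the population never decreases from one generation to the next.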
RQ-7: Which types of collection APIs are frequently used by a
given/target Java program during execution?
Java developers utilize different kinds of application
programming interfaces (APIs), e.g., Swing, Abstract
Window Toolkit (AWT), and util, for various
functionalities in their programs. Java collection APIs are
mostly used as program data structures to handle data and
variables. The efficient usage of these APIs is critical for
program performance, particularly in the case of large
commercial applications. Chapter 5 mainly focused on
answering this question. The solution is twofold: first, the
extraction of relevant information from a program during
execution, where no source code is available in advance;
second, the effective presentation of this information to the
developer, to clearly identify the APIs used and assist in
program comprehension. Initially, a bytecode
instrumentation tool was developed to extract a snapshot
of collection APIs usage in a program. The proposed tool
traced each class with probe code. The probe was
responsible for collecting APIs usage information, which
was handled and stored in a log file by the utility module.
The log file was then given as input to the treemap-based
visualization module to effectively gain insight. An
investigation was made by evaluating ten Java-based
applications for collection APIs usage. The detailed study
showed that these applications used more than eight
collection APIs, including Hashtable, ArrayList, LinkedList,
and Vector. Moreover, JEdit, a Java-based text editor, used
six collection APIs, while others, such as JHotDraw, used
only ArrayList. The respective visualization and collection
APIs usage of each application may be found in Chapter 5.
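The log-to-insight step can be illustrated with a small parsing sketch. The semicolon-separated record layout below is an assumed format for demonstration, not necessarily the actual layout written by the utility module.

```python
# Illustrative sketch (assumed log format): counting collection API
# instantiations per API type from a trace log of the kind the
# utility module writes. Field layout and package names are invented.
from collections import Counter

# One record per newly instantiated collection object:
# package;class;method;line;api_type
LOG = """\
org.example.editor;Buffer;load;42;ArrayList
org.example.editor;Buffer;load;57;HashMap
org.example.ui;Panel;init;12;ArrayList
org.example.ui;Panel;init;19;Vector
"""

def count_api_usage(log_text):
    """Tally instantiations per collection API type."""
    counts = Counter()
    for line in log_text.strip().splitlines():
        *_, api_type = line.split(";")
        counts[api_type] += 1
    return counts

print(count_api_usage(LOG).most_common())
```

These per-type counts are exactly the quantities the treemap module turns into rectangle sizes.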
RQ-8: Which packages/classes/methods of the target program are
responsible for instantiating collection API objects?
Another correlated and interesting scenario, building on the
solution of RQ-7, is identifying the sections of a program
responsible for collection APIs usage in terms of the
number of objects instantiated. The answer to this question
is also important for program comprehension and
maintenance. The instrumentation tool discussed in
Chapter 5 is used to extract various types of information
from a program during its execution, including information
about where collection API objects are instantiated. The
extracted information includes the package, class, and
method name, along with the line number in the source
code where a particular collection API is used. The utility
module was responsible for logging a single record per
newly instantiated object of each collection API type. The
visualization module was then utilized to depict this
hierarchical information using the treemap-based
visualization technique. A detailed analysis of the
respective visualizations of different applications showed
that some packages or classes were responsible for a large
number of collection API objects. Moreover, large
applications, e.g., JEdit and Eclipse, used a large number of
packages and classes and heavily used collection APIs.
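The hierarchical aggregation that feeds the treemap can be sketched as follows; the record tuples and package names are hypothetical, and only the package/class/method fields of a record are shown.

```python
# Illustrative sketch: folding per-object trace records into the
# package -> class -> method hierarchy that a treemap layout consumes,
# with node sizes equal to instantiation counts. Records are invented.
RECORDS = [
    ("org.example.editor", "Buffer", "load"),
    ("org.example.editor", "Buffer", "load"),
    ("org.example.editor", "Buffer", "save"),
    ("org.example.ui", "Panel", "init"),
]

def build_hierarchy(records):
    """Nested dicts: package -> class -> method -> object count."""
    tree = {}
    for package, cls, method in records:
        methods = tree.setdefault(package, {}).setdefault(cls, {})
        methods[method] = methods.get(method, 0) + 1
    return tree

def subtree_size(node):
    """Total object count below a node (the treemap area of that node)."""
    if isinstance(node, int):
        return node
    return sum(subtree_size(child) for child in node.values())

tree = build_hierarchy(RECORDS)
print(subtree_size(tree["org.example.editor"]))
```

A package whose subtree size dominates the root total is exactly the kind of "hot" package the visualizations in Chapter 5 make visible at a glance.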
RQ-9: How is dynamic bytecode instrumentation used to extract
API object traces with minimal runtime overhead?
This question addresses the issues related to the
development and working of the bytecode instrumentation-
based dynamic analysis tool. The instrumentation tool
developed in Chapter 5 incurs minimal runtime overhead
to avoid performance degradation in the target application.
Initially, in the development phase, the proposed tool was
kept simple, and only actual application classes were
selected for instrumentation. Furthermore, selective
instrumentation was adopted, in which only the creation
locations of collection API objects were tracked. The
proposed instrumentation tool was evaluated for
performance using real-world applications and standard
benchmark tools. The detailed experiments showed that the
slowdown factor and runtime overhead of the proposed
tool were not high and avoided any degradation in the
target application's performance. However, for collection-
API-intensive applications, where large numbers of objects
were instantiated, the degradation was comparatively
higher.
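As a rough Python analogue of the selective-probing idea (the actual tool instruments Java bytecode, which has no direct Python equivalent), the sketch below wraps only chosen classes' constructors with a probe that records the caller's method and line number, leaving all other classes untouched. Every name here is invented for illustration.

```python
# Hypothetical Python analogue of selective instrumentation: only the
# classes we choose get a constructor probe; everything else runs
# unmodified, which is how runtime overhead is kept low.
import functools
import inspect

TRACE_LOG = []  # one record per instrumented instantiation

def instrument(cls):
    """Attach a probe to cls.__init__; all other classes stay unmodified."""
    original = cls.__init__

    @functools.wraps(original)
    def probe(self, *args, **kwargs):
        caller = inspect.stack()[1]  # who instantiated the object, and where
        TRACE_LOG.append((cls.__name__, caller.function, caller.lineno))
        original(self, *args, **kwargs)

    cls.__init__ = probe
    return cls

@instrument
class TrackedList:              # stands in for a collection API class
    def __init__(self):
        self.items = []

def build():
    return TrackedList()        # this call site is what the probe records

build()
build()
print(len(TRACE_LOG), TRACE_LOG[0][0], TRACE_LOG[0][1])
```

The probe's cost is paid only on instantiation of tracked classes, which mirrors why collection-API-intensive programs see comparatively higher overhead.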
RQ-10: Can treemap-based visualization be utilized for the
analysis of collection API objects of a particular Java program?
Another main objective was to utilize and evaluate
treemap-based visualization for understanding collection
API trace data. The bytecode instrumentation extracted
runtime information from a program and stored it in a log
file. Gaining insight into such a large text file is not trivial.
The information in the log file is of a hierarchical nature,
i.e., package, class, and method; at the same time, treemap
visualization is a space-filling hierarchical technique that
depicts large hierarchical information on a single screen.
The visualization tool described in Chapter 5 was based on
treemap visualization, through which the user could see
information about the collection APIs on a single screen.
The tool provided an interactive facility to switch between
package-, class-, and method-wise hierarchies. The
effectiveness of the proposed tool was evaluated using a
controlled experiment, exhaustively elaborated in Chapter
5. The results of the statistical analysis of the controlled
experiment showed that the proposed visualization tool
was better with respect to time and usability for collection
APIs trace comprehension.
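The space-filling principle behind treemaps can be illustrated with a basic slice-and-dice layout. This is a simplification for illustration, not the actual algorithm of the Chapter 5 tool; the input tree is invented.

```python
# Illustrative sketch of the treemap idea: a slice-and-dice layout where
# rectangle areas are proportional to node weights, alternating the
# split direction at each level of the hierarchy.
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """node: weight (number) or dict of children. Returns (x, y, w, h) rects."""
    if out is None:
        out = []
    if isinstance(node, (int, float)):
        out.append((x, y, w, h))
        return out
    total = sum_weights(node)
    offset = 0.0
    for child in node.values():
        frac = sum_weights(child) / total
        if depth % 2 == 0:   # split horizontally at even depths
            slice_and_dice(child, x + offset * w, y, w * frac, h, depth + 1, out)
        else:                # split vertically at odd depths
            slice_and_dice(child, x, y + offset * h, w, h * frac, depth + 1, out)
        offset += frac
    return out

def sum_weights(node):
    if isinstance(node, (int, float)):
        return node
    return sum(sum_weights(c) for c in node.values())

tree = {"pkg.a": {"ClassA": 3, "ClassB": 1}, "pkg.b": {"ClassC": 4}}
rects = slice_and_dice(tree, 0.0, 0.0, 100.0, 100.0)
print(rects)
```

Every leaf ends up on screen at once, with its area proportional to its weight, which is what makes the technique suitable for large hierarchical trace logs.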
6.2. Summary of the findings
The dissertation explored the proposed work from three aspects related to
information visualization: automatic visualization selection, visualization
optimization based on quantifiable metrics, and the utilization of
visualization and bytecode instrumentation for code comprehension. This
section summarizes the major findings in the context of these three aspects.
Selecting an appropriate visualization for a specific dataset is
indispensable for the user. The work presented proposes automatic
visualization selection based on the metadata of the dataset to be visualized
and its relevant tasks. The study identified important metadata attributes
and tasks after a careful investigation of the existing literature. A novel
dataset was built comprising metadata and tasks already used in the
information visualization community. Additionally, knowledge about
visualization techniques and their supported tasks was utilized to enhance
the newly created dataset. Furthermore, an ANN-based model was deployed
with a fixed number of neurons in the input and output layers. The generic
ANN model accommodated eight visualization techniques (histogram, line
chart, pie chart, scatter plot, parallel coordinates, map, treemap, and linked
graph) to properly train and test the proposed model. In addition, the
proposed model was evaluated against five well-known classifiers and state-
of-the-art automatic visualization selection systems. The exhaustive
comparison showed that the proposed ANN-based model could be utilized
for the problem with high accuracy and fewer computational resources. To
the best of our knowledge, the work brings a new perspective to the field of
visualization, where new visualizations may be added to the dataset in
order to build a comprehensive database. The dataset, therefore, provides a
foundation for an expert system to create a knowledge base.
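The overall shape of such a model can be sketched as a small feed-forward network with a fixed input layer and one output neuron per technique. The input dimensionality, hidden-layer size, random weights, and feature encoding below are assumptions made for illustration, not the trained model from the dissertation.

```python
# Illustrative sketch (dimensions and encoding assumed): a feed-forward
# network with a fixed-size input layer for encoded dataset metadata and
# task features, and an 8-neuron output layer, one per technique.
import math
import random

TECHNIQUES = ["histogram", "line chart", "pie chart", "scatter plot",
              "parallel coordinates", "map", "treemap", "linked graph"]
N_INPUT, N_HIDDEN, N_OUTPUT = 10, 16, len(TECHNIQUES)

rng = random.Random(0)  # untrained, random weights for demonstration
W1 = [[rng.uniform(-1, 1) for _ in range(N_INPUT)] for _ in range(N_HIDDEN)]
W2 = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_OUTPUT)]

def forward(features):
    """One forward pass: tanh hidden layer, softmax over the 8 techniques."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in W1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

features = [rng.random() for _ in range(N_INPUT)]  # encoded metadata + task
probs = forward(features)
print(TECHNIQUES[probs.index(max(probs))])
```

After training, the output neuron with the highest probability names the recommended visualization technique for the given dataset and task.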
From the perspective of the second aspect, the dissertation contributed to
visualization optimization. The first aspect, automatic visualization
selection, concentrates on predicting an appropriate visualization for a
specific dataset; the supplementary step is the optimization of the selected
visualization based on aesthetic and perceptual theories. The proposed
framework is fed with visualization design parameters and metrics. The
design parameters were extracted from the contemporary knowledge in the
literature through careful analysis, and the range of values for each
parameter was investigated empirically. These design parameters are
important in setting the aesthetic and perceptual properties of any
visualization. Yet another main objective of this work was to formulate
visualization metrics that can be computationally measured and compared.
Therein, four visualization metrics (effectiveness, expressiveness,
readability, and interactivity) were defined in terms of visualization design
parameters. The visualization optimization process was formulated through
the development of an evolutionary algorithm-based framework. The
proposed framework and visualization metrics were evaluated using several
experiments and case studies. The analysis showed that the new
formulation of the visualization metrics and the evolutionary algorithm-
based optimization technique provided better visualizations with respect to
aesthetic and perceptual qualities. Furthermore, the proposed approach is
yet another step towards the objective, automatic evaluation of
visualization.
The last part of the dissertation focused on using runtime information for
program comprehension and understanding, concentrating on collection
APIs usage in Java programs during execution. The proposed work in this
regard is twofold: the extraction of relevant information from a program,
and its visual representation to the user. To perform dynamic analysis of a
program, a bytecode instrumentation tool was developed with the objective
of keeping the runtime overhead minimal. The tool was then utilized for
selective instrumentation of the target program to capture traces of
collection API object instantiation. The tool was tested on ten real-world
applications and twenty benchmark tools to evaluate the runtime overhead
and slowdown impact. The traces collected during the instrumentation step
were fed into the visualization module as input. The proposed visualization
tool was based on treemap visualization to depict the hierarchical
information on a single screen. The visualization module provided
interactive and search facilities to evaluate the collection APIs from various
aspects, i.e., type-, package-, or class-wise. Furthermore, the visualization
part was evaluated using a controlled experiment with 24 participants. The
statistical analysis of the controlled experiment showed the tool's
effectiveness and usability in the context of understanding collection APIs
usage through visualization.
6.3. Limitations
All research suffers from some limitations, and the same is the case with
the work presented in this dissertation. This study has some limitations
regarding the dataset. Firstly, the current dataset consists of only 400
instances for training and testing the classifier. Secondly, only 8
information visualization techniques were considered while building the
dataset. However, the dataset may be extended and more visualization
techniques may be incorporated, in which case the ANN model shall be
retrained with the enhanced dataset. The visualization metric optimization
experiment is limited to the knowledge and theories already taken and
established as true. Furthermore, in the case of the instrumentation tool,
subject applications must be Java-based, with the bytecode available in a
jar file. The subject application and the instrumentation tool run on the
same system, and there are no classloader issues. The instrumentation
process is also limited to only those classes that were loaded during the
execution.
6.4. Future work
In the previous sections of this chapter, the answers to the posed research
questions were formulated and discussed. Nevertheless, the work presented
in this dissertation may be extended in several directions for future
research. This section discusses future work in the context of the three
main aspects covered throughout the dissertation.
The automatic visualization selection aspect may be extended in several
directions. As eight visualization techniques were used, one perspective is
to explore more visualization techniques for integration into the current
dataset. Visualization weights may be added to increase the selection
probability of a particular visualization technique for a specific dataset.
Similarly, four tasks were used in this dissertation, with only one task
considered for the selection of a visualization; the work could be extended
to incorporate more than one task with the dataset. Another interesting
future aspect would be developing a library based on this work and
integrating it into various development environments, e.g., data mining
packages, electronic worksheets, and online services. A user study, such as
a controlled experiment, may be performed to evaluate the output of the
system, and user feedback on the selected visualization could be added to
the system for customization.
The visualization optimization aspect also leads to various directions in
which the research may be extended. One obvious future direction of the
proposed work is its application to other visualization techniques, e.g.,
graphs, progress bars, and parallel coordinates. Similarly, yet another
perspective is to find more generic visualization-specific parameters and to
evolve them for visualization optimization. Further research may include,
in the optimization, the tasks that the user needs to accomplish with a
particular visualization. Still another extension of the work is to build a
sophisticated user interface to evaluate the results. Finally, a web-based
optimization environment may be built to facilitate online visualization.
The treemap visualization and instrumentation work also has its future
aspects. The study may be extended to examine the cognitive effects on
programmers of such visualizations of the packages and classes responsible
for creating and accessing runtime collections of Java objects. It will also
be interesting to apply data mining techniques, such as clustering, temporal
analysis, and association rule mining, to the data collected via
instrumentation, in order to discover useful patterns. Information such as
which method has been invoked the most in a given package, which objects
are surprisingly the most frequently used together, clusters of related
objects, outlier detection, and many such implicit patterns are expected to
be extracted from the recorded data using data mining techniques. From
the visualization perspective, there is an opportunity to optimize the
visualization of such large datasets in order to make it more effective. For
this purpose, optimization techniques from computational intelligence may
be utilized.
References
200
References
[1] M. Tsytsarau and T. Palpanas, "Survey on mining subjective data on the
web," Data Mining and Knowledge Discovery, vol. 24, pp. 478-514, 2012.
[2] D. Westerman, P. R. Spence, and B. Van Der Heide, "Social Media as
Information Source: Recency of Updates and Credibility of Information," Journal of Computer-Mediated Communication, vol. 19, pp. 171-183, 2014.
[3] C. W. Y. Wong, K.-h. Lai, T. C. E. Cheng, and Y. H. V. Lun, "The role of
IT-enabled collaborative decision making in inter-organizational information integration to improve customer service performance," International Journal of
Production Economics, vol. 159, pp. 56-65, 1// 2015.
[4] C. S. Parr, R. Guralnick, N. Cellinese, and R. D. Page, "Evolutionary
informatics: unifying knowledge about the diversity of life," Trends in ecology
& evolution, vol. 27, pp. 94-103, 2012.
[5] L. Gwanhoo, J. A. Espinosa, and W. H. DeLone, "Task Environment
Complexity, Global Team Dispersion, Process Capabilities, and Coordination in Software Development," Software Engineering, IEEE
Transactions on, vol. 39, pp. 1753-1771, 2013.
[6] G. J. Myatt, Making sense of data: A practical guide to exploratory data analysis
and data mining: John Wiley & Sons, 2007.
[7] F. Gorunescu, "Exploratory Data Analysis," in Data Mining, 2011, pp. 57-
157. [8] D. Pineo and C. Ware, "Data visualization optimization via computational
modeling of perception," Visualization and Computer Graphics, IEEE
Transactions on, vol. 18, pp. 309-320, 2012.
[9] M. Gleicher, D. Albers, R. Walker, I. Jusufi, C. D. Hansen, and J. C.
Roberts, "Visual comparison for information visualization," Information
Visualization, vol. 10, pp. 289-309, 2011.
[10] L. Grammel, M. Tory, and M. Storey, "How Information Visualization
Novices Construct Visualizations," Visualization and Computer Graphics, IEEE
Transactions on, vol. 16, pp. 943-952, 2010.
[11] S. Lallé, D. Toker, C. Conati, and G. Carenini, "Prediction of Users'
Learning Curves for Adaptation while Using an Information Visualization," presented at the Proceedings of the 20th International Conference on Intelligent User Interfaces, Atlanta, Georgia, USA, 2015.
References
201
[12] W. Huang, P. Eades, and S.-H. Hong, "Measuring effectiveness of graph visualizations: A cognitive load perspective," Information Visualization, vol. 8,
pp. 139-152, 2009. [13] B. Cornelissen, A. Zaidman, A. Van Deursen, L. Moonen, and R. Koschke,
"A systematic survey of program comprehension through dynamic analysis," Software Engineering, IEEE Transactions on, vol. 35, pp. 684-702, 2009.
[14] C. Ware, Information visualization: perception for design: Elsevier, 2012.
[15] S. Liu, W. Cui, Y. Wu, and M. Liu, "A survey on information visualization:
recent advances and challenges," The Visual Computer, vol. 30, pp. 1373-1393,
2014. [16] E. Bertini, A. Tatu, and D. Keim, "Quality metrics in high-dimensional data
visualization: an overview and systematization," Visualization and Computer
Graphics, IEEE Transactions on, vol. 17, pp. 2203-2212, 2011.
[17] I. Sommerville, D. Cliff, R. Calinescu, J. Keen, T. Kelly, M. Kwiatkowska, et
al., "Large-scale complex IT systems," Communications of the ACM, vol. 55,
pp. 71-77, 2012. [18] J. Wu, X.-x. Jia, Y.-p. Liu, and G.-h. Li, "Java object behavior modeling and
visualization," in Software Engineering Advances, International Conference on,
2006, pp. 60-60. [19] B. Mao, Y. Ban, and L. Harrie, "A multiple representation data structure for
dynamic visualisation of generalised 3D city models," ISPRS Journal of
Photogrammetry and Remote Sensing, vol. 66, pp. 198-208, 2011.
[20] M. A. Khan, S. Muhammad, and T. Muhammad, "Identifying performance
issues based on method invocation patterns of an API," in Proceedings of the 18th International Conference on Evaluation and Assessment in Software
Engineering, 2014, p. 51.
[21] P. Caserta and O. Zendra, "JBInsTrace: A tracer of Java and JRE classes at
basic-block granularity by dynamically instrumenting bytecode," Science of
Computer Programming, vol. 79, pp. 116-125, 2014.
[22] D. A. Keim, M. C. Hao, U. Dayal, and M. Hsu, "Pixel bar charts: a
visualization technique for very large multi-attribute data sets," Information
Visualization, vol. 1, pp. 20-34, 2002.
[23] T. O. Aydin, A. Smolic, and M. Gross, "Automated Aesthetic Analysis of
Photographic Images," IEEE Transactions on Visualization and Computer
Graphics, vol. 21, pp. 31-42, 2015.
[24] H. Lam, E. Bertini, P. Isenberg, C. Plaisant, and S. Carpendale, "Empirical
studies in information visualization: Seven scenarios," Visualization and
Computer Graphics, IEEE Transactions on, vol. 18, pp. 1520-1536, 2012.
References
202
[25] J. Bertin, "Semiology of graphics: diagrams, networks, maps," 1983. [26] E. R. Tufte and P. Graves-Morris, The visual display of quantitative information
vol. 2: Graphics press Cheshire, CT, 1983. [27] W. De Pauw, E. Jensen, N. Mitchell, G. Sevitsky, J. Vlissides, and J. Yang,
"Visualizing the execution of Java programs," in Software Visualization, ed:
Springer, 2002, pp. 151-162. [28] M. Shahin, P. Liang, and M. A. Babar, "A systematic review of software
architecture visualization techniques," Journal of Systems and Software, vol. 94,
pp. 161-185, 8// 2014. [29] B. Cornelissen, A. Zaidman, D. Holten, L. Moonen, A. van Deursen, and J.
J. van Wijk, "Execution trace analysis through massive sequence and circular bundle views," Journal of Systems and Software, vol. 81, pp. 2252-2268, 2008.
[30] P. Caserta and O. Zendra, "Visualization of the static aspects of software: a
survey," IEEE transactions on visualization and computer graphics, vol. 17, pp.
913-933, 2011. [31] K. Jezek, J. Dietrich, and P. Brada, "How Java APIs break – An empirical
study," Information and Software Technology.
[32] J. Singer and C. Kirkham, "Dynamic analysis of Java program concepts for
visualization and profiling," Science of Computer Programming, vol. 70, pp. 111-
126, 2008. [33] P. Lengauer, V. Bitto, and H. Mössenböck, "Accurate and Efficient Object
Tracing for Java Applications," pp. 51-62, 2015. [34] L. Marek, Y. Zheng, D. Ansaloni, L. Bulej, A. Sarimbekov, W. Binder, et al.,
"Introduction to dynamic program analysis with DiSL," Science of Computer
Programming, vol. 98, pp. 100-115, 2015.
[35] S. Diehl, Software visualization: visualizing the structure, behaviour, and evolution
of software: Springer Science & Business Media, 2007.
[36] R. Koschke, "Software visualization in software maintenance, reverse
engineering, and re-engineering: a research survey," Journal of Software
Maintenance and Evolution: Research and Practice, vol. 15, pp. 87-109, 2003.
[37] J. A. Jones, A. Orso, and M. J. Harrold, "Gammatella: Visualizing program-
execution data for deployed software," Information Visualization, vol. 3, pp.
173-188, 2004. [38] S. P. Reiss, "Visual representations of executing programs," Journal of Visual
Languages & Computing, vol. 18, pp. 126-148, 2007.
References
203
[39] F. Duseau, B. Dufour, and H. Sahraoui, "Vasco: A visual approach to explore object churn in framework-intensive applications," in Software
Maintenance (ICSM), 2012 28th IEEE International Conference on, 2012, pp. 15-
24. [40] S. Kelley, E. Aftandilian, C. Gramazio, N. Ricci, S. L. Su, and S. Z. Guyer,
"Heapviz: Interactive heap visualization for program understanding and debugging," Information Visualization, vol. 12, pp. 163-177, 2013.
[41] J. H. Cross II, T. D. Hendrix, J. Jain, and L. A. Barowski, "Dynamic object
viewers for data structures," ACM SIGCSE Bulletin, vol. 39, pp. 4-8, 2007.
[42] J. Ali, "Object visualization support for learning data structures," Information
Technology Journal, vol. 10, pp. 485-498, 2011.
[43] B. Johnson and B. Shneiderman, "Tree-maps: A space-filling approach to the
visualization of hierarchical information structures," in Visualization, 1991.
Visualization'91, Proceedings., IEEE Conference on, 1991, pp. 284-291.
[44] B. Shneiderman, "Discovering business intelligence using treemap visualizations," B-EYE-Network-Boulder, CO, USA, 2006.
[45] R. Vliegen, J. J. van Wijk, and E.-J. Van der Linden, "Visualizing business
data with generalized treemaps," Visualization and Computer Graphics, IEEE
Transactions on, vol. 12, pp. 789-796, 2006.
[46] J. Guerra-Gomez, M. L. Pack, C. Plaisant, and B. Shneiderman,
"Visualizing change over time using dynamic hierarchies: TreeVersity2 and the StemView," Visualization and Computer Graphics, IEEE Transactions on, vol.
19, pp. 2566-2575, 2013. [47] A. Fiore and M. Smith, "Treemap visualizations of Newsgroups," Technical
Report, Microsoft Research, Microsoft Corporation: Redmond, WA, 2001.
[48] M. Balzer, O. Deussen, and C. Lewerentz, "Voronoi treemaps for the
visualization of software metrics," in Proceedings of the 2005 ACM symposium on
Software visualization, 2005, pp. 165-172.
[49] A. L. Hugine, S. A. Guerlain, and F. E. Turrentine, "Visualizing surgical
quality data with treemaps," J Surg Res, vol. 191, pp. 74-83, Sep 2014.
[50] J. J. Van Wijk and H. Van de Wetering, "Cushion treemaps: Visualization of
hierarchical information," in Information Visualization, 1999.(Info Vis' 99)
Proceedings. 1999 IEEE Symposium on, 1999, pp. 73-78, 147.
[51] M. Bruls, K. Huizing, and J. J. Van Wijk, Squarified treemaps: Springer, 2000.
[52] R. Blanch and E. Lecolinet, "Browsing zoomable treemaps: Structure-aware
multi-scale navigation techniques," Visualization and Computer Graphics, IEEE
Transactions on, vol. 13, pp. 1248-1253, 2007.
References
204
[53] M. Rios-Berrios, P. Sharma, T. Y. Lee, R. Schwartz, and B. Shneiderman, "TreeCovery: Coordinated dual treemap visualization for exploring the Recovery Act," Government Information Quarterly, vol. 29, pp. 212-222, 2012.
[54] W. Collins, Data structures and the Java collections framework: Wiley Publishing,
2011. [55] D. Kawrykow and M. P. Robillard, "Improving api usage through automatic
detection of redundant code," in Automated Software Engineering, 2009. ASE'09.
24th IEEE/ACM International Conference on, 2009, pp. 111-122.
[56] R. Lämmel, E. Pek, and J. Starek, "Large-scale, AST-based API-usage
analysis of open-source Java projects," in Proceedings of the 2011 ACM
Symposium on Applied Computing, 2011, pp. 1317-1324.
[57] A. Shatnawi, A. Seriai, H. Sahraoui, and Z. Al-Shara, "Mining Software
Components from Object-Oriented APIs," in Software Reuse for Dynamic
Systems in the Cloud and Beyond. vol. 8919, I. Schaefer and I. Stamelos, Eds.,
ed: Springer International Publishing, 2014, pp. 330-347. [58] M. P. Robillard, E. Bodden, D. Kawrykow, M. Mezini, and T. Ratchford,
"Automated API property inference techniques," Software Engineering, IEEE
Transactions on, vol. 39, pp. 613-637, 2013.
[59] A. Zaidman and S. Demeyer, "Automatic identification of key classes in a
software system using webmining techniques," Journal of Software Maintenance
and Evolution: Research and Practice, vol. 20, pp. 387-417, 2008.
[60] C. R. de Souza and D. L. M. Bentolila, "Automatic evaluation of API
usability using complexity metrics and visualizations," in Software Engineering-Companion Volume, 2009. ICSE-Companion 2009. 31st International
Conference on, 2009, pp. 299-302.
[61] Y. M. Mileva, V. Dallmeier, and A. Zeller, "Mining API popularity," in
Testing–Practice and Research Techniques, ed: Springer, 2010, pp. 173-180.
[62] V. Bauer and L. Heinemann, "Understanding API Usage to Support
Informed Decision Making in Software Maintenance," in Software
Maintenance and Reengineering (CSMR), 2012 16th European Conference on, 2012,
pp. 435-440. [63] E. Moritz, M. Linares-Vásquez, D. Poshyvanyk, M. Grechanik, C.
McMillan, and M. Gethers, "Export: Detecting and visualizing api usages in large source code repositories," in Automated Software Engineering (ASE), 2013
IEEE/ACM 28th International Conference on, 2013, pp. 646-651.
[64] M. A. Saied, O. Benomar, H. Abdeen, and H. Sahraoui, "Mining Multi-level
API Usage Patterns," in Software Analysis, Evolution and Reengineering
(SANER), 2015 IEEE 22nd International Conference on, 2015, pp. 23-32.
References
205
[65] J. Yin, C. Ma, and S.-M. Hu, "PAST: accurate instrumentation on fully optimized program," Software: Practice and Experience, pp. n/a-n/a, 2015.
[66] H. Mitasova, R. S. Harmon, K. J. Weaver, N. J. Lyons, and M. F. Overton,
"Scientific visualization of landscapes and landforms," Geomorphology, vol.
137, pp. 122-137, 2012. [67] W. Merzkirch, Flow visualization: Elsevier, 2012.
[68] D. Patel, S. Bruckner, I. Viola, and E. Groller, "Seismic volume visualization
for horizon extraction," in Pacific Visualization Symposium (PacificVis), 2010
IEEE, 2010, pp. 73-80.
[69] A. Kuhn, D. Erni, P. Loretan, and O. Nierstrasz, "Software cartography:
Thematic software visualization with consistent layout," Journal of Software
Maintenance and Evolution: Research and Practice, vol. 22, pp. 191-210, 2010.
[70] R. Marty, Applied security visualization: Addison-Wesley Upper Saddle River,
2009. [71] B. Shneiderman and A. Aris, "Network visualization by semantic substrates,"
Visualization and Computer Graphics, IEEE Transactions on, vol. 12, pp. 733-740,
2006. [72] B. Shneiderman, "The eyes have it: A task by data type taxonomy for
information visualizations," in Visual Languages, 1996. Proceedings., IEEE
Symposium on, 1996, pp. 336-343.
[73] E. H. Chi, "A taxonomy of visualization techniques using the data state reference model," in Information Visualization, 2000. InfoVis 2000. IEEE
Symposium on, 2000, pp. 69-75.
[74] D. A. Keim, "Information visualization and visual data mining," Visualization
and Computer Graphics, IEEE Transactions on, vol. 8, pp. 1-8, 2002.
[75] S. Lange, H. Schumann, W. Müller, and D. Krömker, "Problem-oriented
visualisation of multi-dimensional data sets," in Proceedings of the International
Symposium and Scientific Visualization, 1995, pp. 1-15.
[76] G. G. Grinstein, P. Hoffman, R. M. Pickett, and S. J. Laskowski,
"Benchmark development for the evaluation of visualization for data mining," Information visualization in data mining and knowledge discovery, pp.
129-176, 2002.
[77] A. E.-T. Guettala, F. Bouali, C. Guinot, and G. Venturini, "A user assistant
for the selection and parameterization of the visualizations in visual data mining," in Information Visualisation (IV), 2012 16th International Conference on,
2012, pp. 252-257.
[78] W. T. Laaser, N. P. Dearden, and T. G. MacNary, "Automatic rules driven
data visualization selection," ed: Google Patents, 2009.
[79] B. L. Chronister, D. P. Cory, and D. B. Lee, "Ranking visualization types
based upon fitness for visualizing a data set," ed: Google Patents, 2014.
[80] H.-J. Schulz, T. Nocke, M. Heitzler, and H. Schumann, "A design space of
visualization tasks," Visualization and Computer Graphics, IEEE Transactions on,
vol. 19, pp. 2366-2375, 2013.
[81] Y. Tanahashi and K.-L. Ma, "Design Considerations for Optimizing
Storyline Visualizations," Visualization and Computer Graphics, IEEE
Transactions on, vol. 18, pp. 2679-2688, 2012.
[82] D. House, A. Bair, and C. Ware, "On the optimization of visualizations of
complex phenomena," in Visualization, 2005. VIS 05. IEEE, 2005, pp. 87-94.
[83] C. G. Healey, R. S. Amant, and M. S. Elhaddad, "Via: A perceptual
visualization assistant," in 28th AIPR Workshop: 3D Visualization for Data
Exploration and Decision Making, 2000, pp. 2-11.
[84] N. Marrero, "Visualization metrics: An overview," 2007.
[85] J. Rigau, M. Feixas, and M. Sbert, "Informational aesthetics measures,"
IEEE Computer Graphics and Applications, pp. 24-34, 2008.
[86] C. E. Shannon, "A mathematical theory of communication," ACM
SIGMOBILE Mobile Computing and Communications Review, vol. 5, pp. 3-55,
2001.
[87] M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its
applications: Springer Science & Business Media, 2013.
[88] C. Li and T. Chen, "Aesthetic visual quality assessment of paintings," Selected
Topics in Signal Processing, IEEE Journal of, vol. 3, pp. 236-252, 2009.
[89] V. Matvienko and J. Kruger, "A metric for the evaluation of dense vector
field visualizations," Visualization and Computer Graphics, IEEE Transactions on,
vol. 19, pp. 1122-1132, 2013.
[90] D. J. Lehmann, S. Hundt, and H. Theisel, "A Study on Quality Metrics vs.
Human Perception: Can Visual Measures Help us to Filter Visualizations of Interest?," it–Information Technology, vol. 57, p. 1, 2015.
[91] T. O. Aydin, A. Smolic, and M. Gross, "Automated Aesthetic Analysis of
Photographic Images," Visualization and Computer Graphics, IEEE Transactions
on, vol. 21, pp. 31-42, 2015.
[92] A. Vande Moere, M. Tomitsch, C. Wimmer, B. Christoph, and T.
Grechenig, "Evaluating the effect of style in information visualization," Visualization and Computer Graphics, IEEE Transactions on, vol. 18, pp. 2739-
2748, 2012.
[93] C. Demiralp, M. Bernstein, and J. Heer, "Learning perceptual kernels for visualization design," 2014.
[94] K. Hartmann, T. Götzelmann, K. Ali, and T. Strothotte, "Metrics for
functional and aesthetic label layouts," in Smart Graphics, 2005, pp. 115-126.
[95] A. Dasgupta and R. Kosara, "Pargnostics: Screen-space metrics for parallel
coordinates," Visualization and Computer Graphics, IEEE Transactions on, vol.
16, pp. 1017-1026, 2010.
[96] G. Albuquerque, M. Eisemann, and M. Magnor, "Perception-based visual
quality measures," in Visual Analytics Science and Technology (VAST), 2011
IEEE Conference on, 2011, pp. 13-20.
[97] N. Kong, J. Heer, and M. Agrawala, "Perceptual guidelines for creating
rectangular treemaps," Visualization and Computer Graphics, IEEE Transactions
on, vol. 16, pp. 990-998, 2010.
[98] W. Lin and C. C. Jay Kuo, "Perceptual visual quality metrics: A survey,"
Journal of Visual Communication and Image Representation, vol. 22, pp. 297-312,
2011.
[99] T. Isenberg, P. Isenberg, J. Chen, M. Sedlmair, and T. Moller, "A systematic
review on the practice of evaluating visualization," Visualization and Computer
Graphics, IEEE Transactions on, vol. 19, pp. 2818-2827, 2013.
[100] L. Harrison, Y. Fumeng, S. Franconeri, and R. Chang, "Ranking
Visualizations of Correlation Using Weber's Law," Visualization and
Computer Graphics, IEEE Transactions on, vol. 20, pp. 1943-1952, 2014.
[101] N. Cawthon and A. V. Moere, "The effect of aesthetic on the usability of
data visualization," in Information Visualization, 2007. IV'07. 11th International
Conference, 2007, pp. 637-648.
[102] J. S. Yi, Y.-a. Kang, J. T. Stasko, and J. A. Jacko, "Understanding and
characterizing insights: how do people gain insights using information visualization?," in Proceedings of the 2008 Workshop on BEyond time and errors:
novel evaLuation methods for Information Visualization, 2008, p. 4.
[103] C. North, "Toward measuring visualization insight," Computer Graphics and
Applications, IEEE, vol. 26, pp. 6-9, 2006.
[104] M. A. Borkin, A. A. Vo, Z. Bylinskii, P. Isola, S. Sunkavalli, A. Oliva, et al.,
"What makes a visualization memorable?," Visualization and Computer
Graphics, IEEE Transactions on, vol. 19, pp. 2306-2315, 2013.
[105] J. Schneidewind, M. Sips, and D. A. Keim, "An automated approach for the
optimization of pixel-based visualizations," Information Visualization, vol. 6,
pp. 75-88, 2007.
[106] R. Fuchs, J. Waser, and M. E. Groller, "Visual human+ machine learning," Visualization and Computer Graphics, IEEE Transactions on, vol. 15, pp. 1327-
1334, 2009.
[107] N. Elmqvist, P. Dragicevic, and J. D. Fekete, "Color Lens: Adaptive Color
Scale Optimization for Visual Exploration," IEEE Trans Vis Comput Graph,
Jun 11 2010.
[108] S. Lee, M. Sips, and H.-P. Seidel, "Perceptually driven visibility optimization
for categorical data visualization," Visualization and Computer Graphics, IEEE
Transactions on, vol. 19, pp. 1746-1757, 2013.
[109] S. J. Mason, S. B. Cleveland, P. Llovet, C. Izurieta, and G. C. Poole, "A
centralized tool for managing, archiving, and serving point-in-time data in ecological research laboratories," Environmental Modelling & Software, vol. 51,
pp. 59-69, 2014.
[110] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data mining with
big data," Knowledge and Data Engineering, IEEE Transactions on, vol. 26, pp.
97-107, 2014.
[111] H. Li, D. Liang, L. Xie, G. Zhang, and K. Ramamritham, "Flash-Optimized
Temporal Indexing for Time-Series Data Storage on Sensor Platforms," ACM
Transactions on Sensor Networks (TOSN), vol. 10, p. 62, 2014.
[112] S. Bajaj and R. Sion, "TrustedDB: A trusted hardware-based database with
privacy and data confidentiality," Knowledge and Data Engineering, IEEE
Transactions on, vol. 26, pp. 752-765, 2014.
[113] J. M. Banda, M. A. Schuh, R. A. Angryk, K. G. Pillai, and P. McInerney,
"Big data new frontiers: Mining, search and management of massive repositories of solar image data and solar events," in New Trends in Databases
and Information Systems, ed: Springer, 2014, pp. 151-158.
[114] Z. Halim, A. R. Baig, and K. Zafar, "Evolutionary Search in the Space of
Rules for Creation of New Two-Player Board Games," International Journal
on Artificial Intelligence Tools, vol. 23, p. 1350028, 2014.
[115] D. A. Keim, J. Kohlhammer, G. Ellis, and F. Mansmann, Mastering the
information age-solving problems with visual analytics: Florian Mansmann, 2010.
[116] M. X. Zhou and S. K. Feiner, "Data characterization for automatically
visualizing heterogeneous information," in Information Visualization'96,
Proceedings IEEE Symposium on, 1996, pp. 13-20.
[117] D. Keim, "Designing pixel-oriented visualization techniques: Theory and
applications," Visualization and Computer Graphics, IEEE Transactions on, vol.
6, pp. 59-78, 2000.
[118] Y.-a. Kang and J. Stasko, "Characterizing the intelligence analysis process: Informing visual analytics design through a longitudinal field study," in Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, 2011,
pp. 21-30.
[119] (2015, July 2015). Dataset. Available: http://ming.org.pk/datasets.htm
[120] S. Lindholm, M. Falk, E. Sundén, A. Bock, A. Ynnerman, and T. Ropinski,
"Hybrid data visualization based on depth complexity histogram analysis," in Computer Graphics Forum, 2015, pp. 74-85.
[121] D. W. Scott, Multivariate density estimation: theory, practice, and visualization:
John Wiley & Sons, 2015.
[122] M. Zhou, L. O. Hall, D. B. Goldgof, R. J. Gillies, and R. A. Gatenby,
"Decoding brain cancer dynamics: a quantitative histogram-based approach using temporal MRI," in SPIE Medical Imaging, 2015, pp. 94142H-94142H-5.
[123] M. Kotera, Y. Moriya, T. Tokimatsu, M. Kanehisa, and S. Goto, "KEGG
and GenomeNet, new developments, metagenomic analysis," in Encyclopedia
of Metagenomics, ed: Springer, 2015, pp. 329-339.
[124] W. R. Stauffer, A. Lak, P. Bossaerts, and W. Schultz, "Economic choices
reveal probability distortion in macaque monkeys," The Journal of
Neuroscience, vol. 35, pp. 3146-3154, 2015.
[125] O. Špakov and D. Miniotas, "Visualization of eye gaze data using heat
maps," Elektronika ir Elektrotechnika, vol. 74, pp. 55-58, 2015.
[126] J. A. Guerra-Gómez, M. L. Pack, C. Plaisant, and B. Shneiderman,
"Discovering temporal changes in hierarchical transportation data: Visual analytics & text reporting tools," Transportation Research Part C: Emerging
Technologies, vol. 51, pp. 167-179, 2015.
[127] J. Heinrich, J. Stasko, and D. Weiskopf, "The parallel coordinates matrix,"
EuroVis–Short Papers, pp. 37-41, 2012.
[128] A. Inselberg, Parallel coordinates: Springer, 2009.
[129] C. Shenghui, J. Zhifang, Q. Qi, S. Li, and X. Meng, "The Polar Parallel
Coordinates Method for Time-Series Data Visualization," in Computational
and Information Sciences (ICCIS), 2012 Fourth International Conference on, 2012,
pp. 179-182.
[130] X. Yuan, P. Guo, H. Xiao, H. Zhou, and H. Qu, "Scattering points in
parallel coordinates," Visualization and Computer Graphics, IEEE Transactions
on, vol. 15, pp. 1001-1008, 2009.
[131] M. Kanazaki, T. Matsuno, K. Maeda, and H. Kawazoe, "Wind Tunnel
Evaluation Based Design of Lift Creating Cylinder Using Plasma Actuators,"
in Proceedings of the 18th Asia Pacific Symposium on Intelligent and Evolutionary
Systems, Volume 1, 2015, pp. 663-677.
[132] J. A. Schwabish, "An Economist's Guide to Visualizing Data," The Journal of
Economic Perspectives, vol. 28, pp. 209-233, 2014.
[133] (2015, June 2015). Line Chart online. Available:
https://developers.google.com/chart/interactive/docs/gallery/linechart
[134] C. L. Paul, "Analyzing Card-Sorting Data Using Graph Visualization,"
Journal of Usability Studies, vol. 9, pp. 87-104, 2014.
[135] M. Friendly, "A brief history of data visualization," in Handbook of data
visualization, ed: Springer, 2008, pp. 15-56.
[136] S. Jamil, A. Khan, and Z. Halim, "Weighted MUSE for frequent sub-graph
pattern finding in uncertain DBLP data," in Internet Technology and
Applications (iTAP), 2011 International Conference on, 2011, pp. 1-6.
[137] Z. Halim, M. M. Gul, N. Ul Hassan, R. Baig, S. U. Rehman, and F. Naz,
"Malicious users' circle detection in social network based on spatio-temporal co-occurrence," in Computer Networks and Information Technology (ICCNIT),
2011 International Conference on, 2011, pp. 35-39.
[138] I. Herman, G. Melançon, and M. S. Marshall, "Graph visualization and
navigation in information visualization: A survey," Visualization and Computer
Graphics, IEEE Transactions on, vol. 6, pp. 24-43, 2000.
[139] (2015, 13 June 2015). HCIL Treemap. Available:
http://www.cs.umd.edu/hcil/ [140] F. B. Viegas, M. Wattenberg, F. Van Ham, J. Kriss, and M. McKeon,
"Manyeyes: a site for visualization at internet scale," Visualization and
Computer Graphics, IEEE Transactions on, vol. 13, pp. 1121-1128, 2007.
[141] A. Shukla, R. Tiwari, and R. Kala, Real life applications of soft computing: CRC
Press, 2010.
[142] M. Sivanandam, Introduction to artificial neural networks: Vikas Publishing House Pvt Ltd, 2009.
[143] D. Floreano and C. Mattiussi, Bio-inspired artificial intelligence: theories,
methods, and technologies: MIT press, 2008.
[144] H. Demuth, M. Beale, and M. Hagan, "Neural Network Toolbox 6 User's Guide," The MathWorks, 2009.
[145] F. Gao and G. Ge, "Optimal ternary constant-composition codes of weight
four and distance five," Information Theory, IEEE Transactions on, vol. 57, pp.
3742-3757, 2011.
[146] L. Xiong, L. Jiao, S. Mao, and L. Zhang, "Active learning based on coupled KNN pseudo pruning," Neural Computing and Applications, vol. 21, pp. 1669-
1686, 2012.
[147] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and
techniques: Morgan Kaufmann, 2005.
[148] L. Breiman, "Random forests," Machine learning, vol. 45, pp. 5-32, 2001.
[149] K.-B. Duan and S. S. Keerthi, "Which is the best multiclass SVM method?
An empirical study," in Multiple Classifier Systems, ed: Springer, 2005, pp. 278-
285.
[150] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "An
overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes," Pattern
Recognition, vol. 44, pp. 1761-1776, 2011.
[151] A. Bahrammirzaee, "A comparative survey of artificial intelligence
applications in finance: artificial neural networks, expert system and hybrid intelligent systems," Neural Computing and Applications, vol. 19, pp. 1165-1195,
2010.
[152] S.-H. Huang and Y.-C. Pan, "Automated visual inspection in the
semiconductor industry: A survey," Computers in Industry, vol. 66, pp. 1-10,
2015.
[153] D. Thom and T. Ertl, "TreeQueST: A Treemap-Based Query Sandbox for
Microdocument Retrieval," in System Sciences (HICSS), 2015 48th Hawaii
International Conference on, 2015, pp. 1714-1723.
[154] M. L. Huang, T.-H. Huang, and X. Zhang, "A novel virtual node approach for interactive visual analytics of big datasets in parallel coordinates," Future
Generation Computer Systems, 2015.
[155] W. Huang, M. L. Huang, and C.-C. Lin, "Evaluating Overall Quality of Graph Visualizations Based on Aesthetics Aggregation," Information Sciences,
2015.
[156] C. Plaisant, "The challenge of information visualization evaluation," in
Proceedings of the working conference on Advanced visual interfaces, 2004, pp. 109-
116.
[157] E. R. Tufte and E. Weise Moeller, Visual explanations: images and quantities,
evidence and narrative vol. 36: Graphics Press Cheshire, CT, 1997.
[158] J. Mackinlay, "Automating the design of graphical presentations of relational
information," Acm Transactions On Graphics (Tog), vol. 5, pp. 110-141, 1986.
[159] W. Huang, P. Eades, S.-H. Hong, and C.-C. Lin, "Improving multiple
aesthetics produces better graph drawings," Journal of Visual Languages &
Computing, vol. 24, pp. 262-272, 2013.
[160] C. Bennett, J. Ryall, L. Spalteholz, and A. Gooch, "The Aesthetics of Graph
Visualization," in Computational Aesthetics, 2007, pp. 57-64.
[161] S. Tak and A. Cockburn, "Enhanced Spatial Stability with Hilbert and
Moore Treemaps," IEEE Trans Vis Comput Graph, Apr 10 2012.
[162] Y. Tu and H.-W. Shen, "Visualizing changes of hierarchical data using
treemaps," Visualization and Computer Graphics, IEEE Transactions on, vol. 13,
pp. 1286-1293, 2007.
[163] A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M.
Magnor, et al., "Combining automated analysis and visualization techniques
for effective exploration of high-dimensional data," in Visual Analytics Science
and Technology, 2009. VAST 2009. IEEE Symposium on, 2009, pp. 59-66.
[164] M. Borkin, A. Vo, Z. Bylinskii, P. Isola, S. Sunkavalli, A. Oliva, et al., "What
makes a visualization memorable?," Visualization and Computer Graphics, IEEE
Transactions on, vol. 19, pp. 2306-2315, 2013.
[165] D. Ren, T. Hollerer, and X. Yuan, "iVisDesigner: Expressive Interactive
Design of Information Visualizations," IEEE Transactions on Visualization and
Computer Graphics, vol. 20, pp. 2092-2101, 2014.
[166] A. Lau and A. Vande Moere, "Towards a model of information aesthetics in
information visualization," in Information Visualization, 2007. IV'07. 11th
International Conference, 2007, pp. 87-92.
[167] D. Kawrykow and M. P. Robillard, "Detecting inefficient API usage," in
Software Engineering-Companion Volume, 2009. ICSE-Companion 2009. 31st
International Conference on, 2009, pp. 183-186.
[168] G. d. F. Carneiro, R. C. Magnavita, E. Spinola, F. Spinola, and M.
Mendonça, "Evaluating the usefulness of software visualization in supporting software comprehension activities," in Proceedings of the Second ACM-IEEE
international symposium on Empirical software engineering and measurement, 2008,
pp. 276-278.
[169] W. De Pauw, D. Kimelman, and J. Vlissides, "Modeling object-oriented
program execution," in Object-Oriented Programming, ed: Springer, 1994, pp.
163-182.
[170] (2014, 13 May 2014). Available: http://commons.apache.org/bcel/
[171] E. Bruneton, R. Lenglet, and T. Coupaye, "ASM: a code manipulation tool
to implement adaptable systems," Adaptable and extensible component systems,
vol. 30, 2002.
[172] J. Heer, S. K. Card, and J. A. Landay, "Prefuse: a toolkit for interactive information visualization," in Proceedings of the SIGCHI conference on Human
factors in computing systems, 2005, pp. 421-430.
[173] (20 January 201). JEdit. Available: http://www.jedit.org/
[174] V. Setlur and M. C. Stone, "A Linguistic Approach to Categorical Color Assignment for Data Visualization," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 698-707, 2016.
[175] C. North, P. Saraiya, and K. Duca, "A comparison of benchmark task and insight evaluation methods for information visualization," Information Visualization, 2011.
[176] P. Saraiya, C. North, and K. Duca, "An insight-based methodology for evaluating bioinformatics visualizations," IEEE Transactions on Visualization and Computer Graphics, vol. 11, pp. 443-456, 2005.
Appendix A
214
Automatic Visualization Selection
Table A.1 Single hidden layer NN structures: training time and MSE (train/test/validation split)
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-01-8 46 01 0.0818 0.0852 0.0799
05-02-8 45 02 0.0794 0.0817 0.0789
05-03-8 38 02 0.0462 0.0456 0.0526
05-04-8 34 01 0.0301 0.0492 0.0345
05-05-8 30 01 0.022 0.023 0.0225
05-06-8 25 02 0.0145 0.0182 0.02
05-07-8 21 01 0.0195 0.0308 0.0266
05-08-8 16 01 0.0098 0.0156 0.0086
05-09-8 14 01 0.01 0.0359 0.0314
05-10-8 24 01 0.0118 0.0252 0.0218
05-11-8 11 01 0.0106 0.013 0.0232
05-12-8 10 01 0.0103 0.0253 0.0138
05-13-8 18 01 0.0103 0.0108 0.0148
05-14-8 18 01 0.0097 0.0145 0.0077
05-15-8 13 01 0.01 0.0231 0.0153
05-16-8 12 01 0.0101 0.0154 0.0136
05-17-8 13 01 0.0101 0.03 0.0293
05-18-8 24 02 0.0125 0.0196 0.0202
05-19-8 14 01 0.0145 0.0156 0.0263
05-20-8 18 01 0.0139 0.0121 0.0218
05-21-8 10 02 0.0713 0.0698 0.0567
05-22-8 12 01 0.0108 0.0391 0.0179
05-23-8 06 01 0.0098 0.0221 0.0245
05-24-8 09 01 0.0158 0.0207 0.0299
05-25-8 16 02 0.0115 0.0161 0.0176
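In these tables a structure label of the form "A-B-8" denotes A input features, B hidden neurons, and 8 output units, one per visualization technique (so "05-08-8" is a 5-8-8 network). As a rough illustration only (the thesis experiments used the MATLAB Neural Network Toolbox [144]; the numpy sketch below uses synthetic stand-in data), the MSE quantity tabulated in the Train/Test/Validation columns can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_in=5, n_hidden=8, n_out=8):
    """Weights for one '05-08-8' network: 5 inputs, 8 hidden, 8 outputs."""
    return {
        "W1": rng.standard_normal((n_in, n_hidden)) * 0.1,
        "b1": np.zeros(n_hidden),
        "W2": rng.standard_normal((n_hidden, n_out)) * 0.1,
        "b2": np.zeros(n_out),
    }

def forward(net, X):
    # tanh hidden layer with a linear output layer, a common Toolbox default
    H = np.tanh(X @ net["W1"] + net["b1"])
    return H @ net["W2"] + net["b2"]

def mse(net, X, T):
    """Mean squared error over a data split, as reported in the tables."""
    return float(np.mean((forward(net, X) - T) ** 2))

# Synthetic stand-in data: 40 samples, 5 features, 8 one-hot target classes.
X = rng.standard_normal((40, 5))
T = np.eye(8)[rng.integers(0, 8, size=40)]
print(mse(init_mlp(), X, T))
```

Training (backpropagation) is omitted here; the sketch only fixes the structure notation and the error measure being reported.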
Table A.2 Two hidden layer NN structures: training time and MSE
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-05-02-8 23 01 0.0589 0.0547 0.0711
05-05-05-8 29 01 0.02 0.0259 0.0317
05-07-05-8 31 02 0.0135 0.0305 0.0291
05-08-05-8 46 03 0.0189 0.0283 0.0235
05-08-07-8 14 01 0.0092 0.0115 0.0142
05-12-11-8 16 02 0.0094 0.0128 0.0066
05-15-10-8 10 02 0.0108 0.032 0.0112
05-20-14-8 09 03 0.0097 0.0202 0.0096
05-24-16-8 10 02 0.009 0.0109 0.0112
05-30-20-8 11 03 0.0089 0.0172 0.0096
Table A.3 Networks trained with training and test data only
Network Structure    Iterations    Time    Train-MSE    Test-MSE
05-01-8 200 08 0.0791 0.0831
05-02-8 200 08 0.0566 0.0741
05-03-8 200 09 0.0553 0.0589
05-04-8 200 09 0.0464 0.0559
05-05-8 200 10 0.0179 0.0270
05-06-8 200 11 0.0138 0.0113
05-07-8 200 12 0.0215 0.0192
05-08-8 200 13 0.0167 0.0261
05-09-8 200 13 0.0108 0.0480
05-10-8 200 15 0.0101 0.0115
05-11-8 200 15 0.0050 0.0160
05-12-8 200 16 0.0055 0.0056
05-13-8 200 17 0.0026 4.8874
05-14-8 200 18 0.0040 0.0291
05-15-8 200 19 0.0035 0.1509
05-16-8 200 21 0.0045 0.2878
05-17-8 200 22 0.0044 0.7050
05-18-8 200 23 0.0029 6.5193
05-19-8 200 25 0.0025 0.0208
05-20-8 200 26 0.0039 7.1763
05-21-8 200 28 0.0037 0.7860
05-22-8 200 29 0.0054 0.0328
05-23-8 200 31 0.0024 2.8556
05-24-8 200 33 0.0033 0.0053
05-25-8 200 34 0.0023 9.0936
Table A.4 NN with validation check (early stopping)
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-01-8 27 01 0.0811 0.0766 0.0837
05-02-8 26 01 0.0595 0.0609 0.0646
05-03-8 48 02 0.038 0.0622 0.0347
05-04-8 21 01 0.0314 0.0389 0.0316
05-05-8 18 01 0.0383 0.0423 0.0455
05-06-8 22 01 0.0507 0.0571 0.0508
05-07-8 35 02 0.017 0.0166 0.0205
05-08-8 22 01 0.0121 0.0182 0.0151
05-09-8 29 02 0.0132 0.0229 0.0324
05-10-8 23 01 0.0145 0.0191 0.0275
05-11-8 110 07 0.0057 0.0071 0.0067
05-12-8 27 02 0.0152 0.027 0.0246
05-13-8 22 02 0.0134 0.0234 0.0136
05-14-8 44 04 0.0066 0.0119 0.0125
05-15-8 108 10 0.0062 0.01 0.0108
05-16-8 58 05 0.007 0.0291 0.0068
05-17-8 29 03 0.0057 0.016 0.0147
05-18-8 58 06 0.0049 0.0166 0.0207
05-19-8 19 02 0.0095 0.0095 0.0215
05-20-8 28 03 0.0076 0.0155 0.0102
05-21-8 36 04 0.007 0.012 0.011
05-22-8 18 02 0.0095 0.0166 0.0105
05-23-8 19 02 0.0079 0.05 0.014
05-24-8 15 02 0.0131 0.0251 0.0224
05-25-8 22 03 0.0096 0.0236 0.0134
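The validation-check runs in Table A.4 halt training once the validation MSE stops improving. A minimal sketch of that stopping rule, assuming a MATLAB-style `max_fail` threshold of 6 consecutive failed validation checks (the exact threshold is not stated in the text):

```python
def train_with_early_stop(val_errors, max_fail=6):
    """Return the iteration at which training halts, given the per-iteration
    validation MSE trace. Stops after `max_fail` consecutive checks without
    a new best validation error; otherwise runs to the end of the trace."""
    best, fails = float("inf"), 0
    for i, err in enumerate(val_errors):
        if err < best:
            best, fails = err, 0  # improvement: reset the failure counter
        else:
            fails += 1
            if fails >= max_fail:
                return i  # early stop
    return len(val_errors) - 1

# Validation error improves until iteration 4, then rises; training halts at
# iteration 10, the sixth consecutive failed check.
trace = [0.08, 0.05, 0.03, 0.02, 0.018, 0.02, 0.021,
         0.022, 0.023, 0.024, 0.025, 0.026]
print(train_with_early_stop(trace))  # 10
```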
Table A.5 NN with goal-based stop / validation stop
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-01-8 46 01 0.0818 0.0852 0.0799
05-02-8 45 02 0.0794 0.0817 0.0789
05-03-8 38 02 0.0462 0.0456 0.0526
05-04-8 34 01 0.0301 0.0492 0.0345
05-05-8 30 01 0.022 0.023 0.0225
05-06-8 25 02 0.0145 0.0182 0.02
05-07-8 21 01 0.0195 0.0308 0.0266
05-08-8 16 01 0.0098 0.0156 0.0086
05-09-8 14 01 0.01 0.0359 0.0314
05-10-8 24 01 0.0118 0.0252 0.0218
05-11-8 11 01 0.0106 0.013 0.0232
05-12-8 10 01 0.0103 0.0253 0.0138
05-13-8 18 01 0.0103 0.0108 0.0148
05-14-8 18 01 0.0097 0.0145 0.0077
05-15-8 13 01 0.01 0.0231 0.0153
05-16-8 12 01 0.0101 0.0154 0.0136
05-17-8 13 01 0.0101 0.03 0.0293
05-18-8 24 02 0.0125 0.0196 0.0202
05-19-8 14 01 0.0145 0.0156 0.0263
05-20-8 18 01 0.0139 0.0121 0.0218
05-21-8 10 02 0.0713 0.0698 0.0567
05-22-8 12 01 0.0108 0.0391 0.0179
05-23-8 06 01 0.0098 0.0221 0.0245
05-24-8 09 01 0.0158 0.0207 0.0299
05-25-8 16 02 0.0115 0.0161 0.0176
Table A.6 Single hidden layer NN structures and MSEs (10-fold CV)
Network Structure Train-MSE Test-MSE Validation-MSE
05-01-8 0.0812 0.0829 0.0817
05-02-8 0.0641 0.0701 0.0670
05-03-8 0.0544 0.0566 0.0600
05-04-8 0.0424 0.0490 0.0434
05-05-8 0.0347 0.0404 0.0403
05-06-8 0.0299 0.0367 0.0374
05-07-8 0.0200 0.0268 0.0261
05-08-8 0.0174 0.024 0.0202
05-09-8 0.014 0.025 0.0236
05-10-8 0.0144 0.0205 0.0231
05-11-8 0.0121 0.0206 0.0192
05-12-8 0.0108 0.0243 0.0212
05-13-8 0.0114 0.0197 0.0181
05-14-8 0.0101 0.0254 0.0213
05-15-8 0.0124 0.0240 0.0226
05-16-8 0.0090 0.0201 0.0257
05-17-8 0.0088 0.0196 0.0192
05-18-8 0.0119 0.0244 0.0200
05-19-8 0.0099 0.0261 0.0215
05-20-8 0.0089 0.0212 0.0190
05-21-8 0.0091 0.0226 0.0197
05-22-8 0.0090 0.0194 0.0163
05-23-8 0.0072 0.0213 0.0158
05-24-8 0.0098 0.0228 0.0210
05-25-8 0.0067 0.0275 0.0191
Table A.7 Comparison of classification and prediction accuracy (10-fold CV)
Network Structure    Training accuracy (%)    Testing accuracy (%)
05-01-8 47.77 52.50
05-02-8 60.00 72.50
05-03-8 75.27 72.50
05-04-8 79.44 82.50
05-05-8 84.16 82.50
05-06-8 89.72 87.50
05-07-8 92.22 90.00
05-08-8 93.88 95.00
05-09-8 94.16 92.50
05-10-8 95.27 95.00
05-11-8 96.11 95.00
05-12-8 96.38 97.50
05-13-8 96.11 97.50
05-14-8 95.00 95.00
05-15-8 95.00 95.00
05-16-8 96.94 97.50
05-17-8 96.38 95.00
05-18-8 96.66 97.50
05-19-8 96.83 95.00
05-20-8 98.33 95.00
05-21-8 96.38 95.00
05-22-8 98.61 95.00
05-23-8 96.66 95.00
05-24-8 96.38 95.55
05-25-8 96.38 95.00
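The accuracies in Tables A.6 and A.7 are averages over 10 cross-validation folds. A minimal sketch of that protocol, with a toy majority-vote predictor standing in for the trained network (the data and predictor here are illustrative, not the thesis dataset or model):

```python
import numpy as np

def ten_fold_indices(n, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validated_accuracy(predict_fn, X, y, k=10):
    """Average held-out accuracy over k folds, as reported in Table A.7."""
    folds = ten_fold_indices(len(X), k)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = predict_fn(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(y_pred == y[test_idx]))
    return float(np.mean(accs))

# Toy stand-in classifier: always predicts the majority training label.
def majority_vote(X_tr, y_tr, X_te):
    values, counts = np.unique(y_tr, return_counts=True)
    return np.full(len(X_te), values[np.argmax(counts)])

X = np.arange(100).reshape(100, 1)
y = np.array([0] * 70 + [1] * 30)
print(cross_validated_accuracy(majority_vote, X, y))
```

Replacing `majority_vote` with a trained network's predict function reproduces the evaluation loop behind the tabulated numbers.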
Table A.8 Information visualization results for the Iris dataset
[Figures: the Iris dataset rendered as a histogram, pie chart, line chart, parallel coordinates, scatter plot, linked graph, and treemap.]
Appendix B
223
Treemap Visualization
[Figures: treemap visualizations of collection API usage for JEdit, JHotDraw, Prefuse, jImage, Browser, JMoney, Eclipse, Fire, FreeMind, and M3D.]
Appendix C
226
Visualization Techniques
[Figures: example outputs of the Jinsight, Gammatella, LBM, JIVE, Vasco, Heapviz, Treemap, TreeCovery, Object-Visualization, and Export techniques.]
Appendix D
228
User Forms, Handouts
D-1 Subject's Evaluation Questionnaire Form
Instructions:
Kindly fill in this questionnaire carefully with correct information.
All collected information will be used solely for this research and will remain anonymous.
Participation in this study is voluntary.
1. Name:
2. Age(in years):
3. Field of study:
4. Qualification:
5. Experience with computer (in years):
6. Experience with Java (in years):
7. How do you rate your knowledge of Java:
8. Do you have experience with visualization tools:
Thank You
General Information
Experience Information
Basic / Intermediate / Advanced
Yes / No
Treemap visualization and collection APIs usage: Handout
This experiment evaluates collection API usage information using the treemap visualization tool. The participants of the study are divided into two groups: the first group is provided with the proposed treemap visualization tool, while the second group is provided only a log file and may use Excel to find the required information. Subjects of both groups perform tasks based on the material provided to them, recording the start and end time of each task. The subjects also provide a usability score for each task carried out with their tool, on a scale of 0-5, where 0 means not usable and 5 means most usable.
Task details
Task 1: Identify the collection APIs used in a program
The basic aim of this task is to find the collection APIs used in a particular Java program. The log file of a specific Java program will be provided, and the subjects need to find and list the names of the collection APIs that appear in the log file.
Task 2: Identify the packages and classes of collection APIs in a program
The subjects are provided with a visualization of a particular Java program and need to identify the packages and their respective classes. List the names of the packages and classes of the target program responsible for the creation of collection API objects.
Task 3: List 3 classes responsible for creation of most objects of a particular API
The task is to find the three classes of the particular Java program that create the largest number of collection API objects. Review the information presented and list the three classes responsible for creating the most collection API objects.
Task 4: Identify methods that create the maximum number of collection APIs objects
During this task the subjects are required to identify and list the names of the methods responsible for creating collection API objects. List the method names in descending order of the number of objects created.
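The control group answers these tasks from a raw log file. The trace format produced by the instrumentation is not specified here, so the line format, class names, and method names below are hypothetical; the sketch only shows how the per-API, per-class, and per-method tallies the tasks ask for could be computed:

```python
from collections import Counter

# Hypothetical log format: each line records
# "creating_class creating_method api_class" for one object creation.
SAMPLE_LOG = """\
org.demo.Cache put java.util.ArrayList
org.demo.Cache put java.util.HashMap
org.demo.Index build java.util.ArrayList
org.demo.Index build java.util.ArrayList
org.demo.Parser scan java.util.HashSet
"""

def count_api_objects(log_text):
    """Tally collection API object creations per API, class, and method."""
    per_api, per_class, per_method = Counter(), Counter(), Counter()
    for line in log_text.strip().splitlines():
        cls, method, api = line.split()
        per_api[api] += 1
        per_class[cls] += 1
        per_method[f"{cls}.{method}"] += 1
    return per_api, per_class, per_method

apis, classes, methods = count_api_objects(SAMPLE_LOG)
print(sorted(apis))            # Task 1: collection APIs used
print(classes.most_common(3))  # Task 3: top object-creating classes
print(methods.most_common())   # Task 4: methods, descending by count
```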
Usability score scale
On the scale of 0-5, the following coding is used to report usability according to the participant's opinion for each task.
0--- Absolutely not usable
1--- Little usable
2--- Usable
3--- Fairly usable
4--- Very usable
5--- Most usable
Visualization optimization experiment: Handout
This experiment evaluates the visualizations optimized using the proposed approach. Each participant of the study is presented with six types of visualizations, one at a time, and performs five benchmark tasks. These tasks are carried out for all visualizations. Each participant provides a score for the parameters mental effort, visualization, and time taken. The accuracy of each task is checked in post-experiment analysis. For each task the participant is provided with a hard copy of multiple-choice questions.
Task details
Task 1: Which Java program has the maximum number of objects?
A single visualization is presented to each participant showing collection API usage in several programs. By overviewing the visualization, the participants need to find the Java program that has the maximum number of objects.
Task 2: Which collection API is used most across the programs?
For each presented visualization, find the APIs that have been used by most of the Java programs. The visualization presents several Java programs with their collection API usage; pick only the names of the APIs used by most programs.
Task 3: How many Java programs used more than 3 collection APIs?
As the visualization presents several Java programs in one view, the participants need to find the programs that used more than 3 collection APIs and list their names from the visualization.
Task 4: Which Java program used a large number of ArrayList objects?
This task evaluates the participant's ability to find and identify the program that used a comparatively large number of ArrayList objects. ArrayList may be shown in several programs; identify only the program that has the largest number of objects for this particular API.
Task 5: How many different APIs are used by the Java programs?
Each presented visualization shows different Java programs and the collection APIs they used during program execution. The participants need to find how many different types of APIs are used collectively by all programs shown in the visualization.
Mental effort score scale
On the scale of 1-5, mental effort is represented by the following codes.
1--- Minimal effort
2--- Little effort
3--- Fair effort
4--- Substantial effort
5--- Maximum effort
Visualization score scale
On the scale of 1-5, visualizations are scored according to the following scale.
1--- Poor
2--- Fair
3--- Good
4--- Excellent
5--- Outstanding
Accuracy score scale
On the scale of 1-5, each participant's response is scored on the following scale.
1--- Absolutely wrong
2--- Partially wrong
3--- Fair
4--- Partially correct
5--- Absolutely correct
D-2 Subject's Evaluation Questionnaire Form (Filled Example)
General Information
9. Name: Ahmed Ali Khan
10. Age (in years): 23
11. Field of study: Computer Science
12. Qualification: BS
Experience Information
13. Experience with computer (in years): 05
14. Experience with Java (in years): 02
15. How do you rate your knowledge of Java: (Basic / Intermediate / Advanced)
16. Do you have experience with visualization tools: (Yes / No)
Treemap visualization and collection APIs usage
Group (Control / Experimental):        Participant #:
Task #:        Date:        Start time:        End time:
Task Title
Please answer the following questions (refer to the manual/handouts for help)
1. Provide your finding/answer for the task in this box
2. Provide a usability score of your tool for this particular task using the following scale (tick one option)
Absolutely not usable / Little usable / Usable / Fairly usable / Very usable / Most usable
3. Other comments
Visualization Optimization
Participant #:        Visualization type:
Task #:        Date:        Start time:        End time:
Task Title
Please answer the following questions (refer to the manual/handouts for help)
1. Provide your finding/answer for the task in this box
2. How much mental effort is required for this task using this visualization? Use the following scale (tick one option)
Minimal effort / Little effort / Fair effort / Substantial effort / Maximum effort
3. Provide a score for this visualization for this particular task using the following scale (tick one option)
Poor / Fair / Good / Excellent / Outstanding
4. Other comments
Date
Appendix E
237
Evolved Visualizations
[Figures: visualizations evolved with the combined, effectiveness, expressiveness, readability, and interactivity fitness measures, alongside randomly generated and state-of-the-art (SoTA) visualizations.]
Some facts
Completed research June 2014
Completed writing thesis December 2014
Internal evaluation completed July 2015
First IF paper accepted October 2015
Second IF paper accepted August 2016
Third IF paper accepted December 2016
Foreign evaluation completed June/July 2016
Examination committee approved November 2016
Number of words 49054
Number of figures 85
Number of tables 50
Number of pages 260
Thesis open defense 30 November 2016