Selecting, Quantifying, Optimizing, and
Understanding Visualization Techniques: A
Computational Intelligence-Based Approach
By
Tufail Muhammad
Supervised by
Dr. Zahid Halim
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
in Computer System Engineering
Faculty of Computer Science and Engineering
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology
Topi, Khyber Pakhtunkhwa, Pakistan
Fall, 2016
Dissertation examination committee:
Dr. Zahid Halim Advisor, Faculty of Computer Science and
Engineering, Ghulam Ishaq Khan Institute
of Engineering Sciences and Technology,
Topi, PAKISTAN.
Prof. Dr. Keith C.C. Chan Foreign Evaluator, Big Data Lab,
Department of Computing, The Hong
Kong Polytechnic University, Hung Hom,
Kowloon, HONG KONG.
QS World University Rank=116
Prof. Dr. Ivan Viola Foreign Evaluator, The Institute of
Computer Graphics and Algorithms,
Vienna University of Technology,
AUSTRIA.
QS World University Rank=197
Prof. Dr. Michael John Watts Foreign Evaluator, Academic Head of
Programme, Information Technology,
Auckland Institute of Studies, Auckland,
NEW ZEALAND.
Dr. Muhammad Tanvir Afzal External Examiner, Department of
Computer Science, Capital University of
Science & Technology Islamabad
Expressway, Kahuta Road, Zone-V,
Islamabad, PAKISTAN.
Dr. Muhammad Zohaib Zafar Iqbal External Examiner, Department of
Computer Science, National University of
Computer and Emerging Sciences, A.K
Brohi Road, Sector H-11/4, Islamabad,
PAKISTAN.
Dr. Ahmar Rashid Internal Examiner, Faculty of Computer
Science and Engineering, Ghulam Ishaq
Khan Institute of Engineering Sciences
and Technology, Topi, PAKISTAN.
Dr. Ghulam Abbas Internal Evaluator (Pre-screening), Faculty
of Computer Science and Engineering,
Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology,
Topi, PAKISTAN.
Dr. Rashad M Jillani Internal Evaluator (Pre-screening), Faculty
of Computer Science and Engineering,
Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology,
Topi, PAKISTAN.
The work in this dissertation has been carried out at the Faculty of Computer
Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences
and Technology (GIKI), Topi, Pakistan. The research was supported by the Higher
Education Commission of Pakistan (HEC) under the Indigenous Ph.D. Fellowship
Program.
Scholar PIN: PS3-254
Copyright©2015 by Tufail Muhammad
Declaration of authorship
The work presented in this dissertation, entitled “Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A Computational Intelligence-Based Approach”, has been undertaken in the Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, under the supervision of Dr. Zahid Halim. I herewith declare that the material presented in this dissertation has not previously been submitted, in whole or in part, for any kind of academic award elsewhere. Moreover, the results are produced from original research, except where otherwise properly acknowledged and referenced in the text.
Tufail Muhammad
Acknowledgements
Foremost, I thank the Almighty God, the Most Gracious, and the Most
Merciful. When I look back over the past few years, thinking of certain people and memorable events shared with them, I would like to thank them all for their support and encouragement throughout my doctoral studies.
I would like to express my utmost gratitude to my supervisor, Dr. Zahid Halim, whose invaluable support, help, patience, and encouragement enabled me to complete this dissertation. He deserves appreciation that cannot be expressed in words. Every meeting with him was a source of inspiration and great motivation for a PhD student with little hope. His guidance not only
helped me in building my technical knowledge about the research field, but
also in my non-academic matters. Surely, without his supervision, I would be
lost. I would like to convey special thanks to Prof. Dr. Khalid J. Siddiqui for
his proofreading/editing, which has improved the composition of this thesis.
I want to express my heartfelt gratitude to my colleagues at GIK Institute,
Dr. Fazal Wahab, Dr. Adam Khan, Dr. Rahim Khan, Mr. Ihsan Ali, and Mr.
Mehran Bashir for their constant support and motivation, which brought this dissertation into existence. I would like to thank Mr. Ali Abass, Mr.
Muhammad Sohaib, Mr. Muhammad Riaz, and Mr. Hilal Khan for their
good wishes and “khidmat” (services) during the last two years.
I would like to express my deepest gratitude to my family and my parents, especially my late father, though he is no longer here to see my success. I am grateful to my brothers and my wife, whose unconditional support and encouragement helped me throughout my Ph.D. study. I am indebted to my sons, Iqrash Ahmad Khan and Zawar Ahmad Khan, who always missed me.
I would like to convey special thanks to the Higher Education Commission
(HEC) of Pakistan for financially supporting my PhD study through the
indigenous PhD program. I am also thankful to GIK Institute for providing research facilities and an excellent environment. My special thanks go to all the faculty members and staff at the Faculty of Computer Science and Engineering for their support and cooperation.
List of Publications Extracted from
This Work
Chapter 3 is published in the Applied Soft Computing Journal
T. Muhammad and Z. Halim, "Employing Artificial
Neural Networks for Constructing Metadata-Based
Model to Automatically Select an Appropriate Data
Visualization Technique," Applied Soft Computing, Vol. 49, 2016, pp. 365-384. [ISSN: 1568-4946, Thomson Reuters JCR 2016, Impact factor 2.857, Elsevier]
Chapter 4 is published in the Information Sciences Journal
Z. Halim and T. Muhammad, "Quantifying and
Optimizing Visualization: An Evolutionary Computing-
Based Approach," Information Sciences, Vol. 385, 2017, pp. 284-313. [ISSN: 0020-0255, Thomson Reuters JCR 2016, Impact factor 3.364, Elsevier]
Chapter 5 is accepted in the Journal of Visual Languages and Computing
T. Muhammad, Z. Halim, and M. A. Khan,
“Visualizing Trace of Java Collection APIs by Dynamic Bytecode Instrumentation,” Journal of Visual Languages and Computing, Vol. --, 20--, in press. [ISSN: 1045-926X, Thomson Reuters JCR 2016, Impact factor 0.634, Elsevier]
Dedication
To, my father (late)
Abstract
Information visualization is a prominent technique for effectively exploring and analyzing large volumes of data visually. A visualization must be aesthetically appealing and perceptually pleasing to human cognition. This necessitates a framework to predict a visualization technique based on two aspects: the underlying dataset and the task to be performed on it.
Additionally, the resultant visualization must be optimal in the context of
aesthetics and human perception. This dissertation contributes from three perspectives that subsume information visualization aspects: automatic selection of a visualization technique, quantifying and optimizing the visualization layout, and visualizing software traces. The study provides a computational intelligence (CI) model to predict a visualization technique based on the metadata of the original dataset and the relevant tasks. Similarly, visualization metrics are formulated to objectively measure visualization quality. Based on these metrics, an evolutionary algorithm optimizes the visualization layout. Finally, a hierarchical visualization technique is used to study the usage of application programming interface (API) objects in program traces. The traces are collected using bytecode instrumentation.
This dissertation has three parts. The first part aims to predict an appropriate visualization technique for a specific dataset. A custom dataset is built using the knowledge available in the contemporary literature on various visualization techniques. The dataset comprises four metadata attributes, the relevant task, and the visualization techniques. The study develops an artificial neural network (ANN) to predict a visualization technique using five input and eight output neurons. The optimal neural network architecture is obtained by evaluating various structures with different network configurations. Several well-known performance metrics, i.e., the confusion matrix, accuracy, precision, and sensitivity of the classification, are used to compare the various neural network architectures. Additionally, the best ANN
model is compared with five other well-known classifiers: k-nearest neighbor
(k-NN), naïve Bayes (NB), decision tree (DT), random forest (RF), and
support vector machine (SVM).
The second part provides the design of an optimal visualization using visualization quality metrics. Initially, the study focuses on the design parameters that contribute to the quality of a visualization technique. Visualization metrics are proposed to measure the aesthetic and perceptual characteristics of a visualization: effectiveness, expressiveness, readability, and interactivity. An evolutionary algorithm (EA)-based framework to optimize the layout of a visualization technique is also proposed. The treemap visualization technique is used for layout optimization with the EA. The results are evaluated using controlled experiments and benchmark tasks.
The last part uses treemap-based visualization to analyze the API objects used in software, particularly to understand API objects during the runtime of Java programs. The work consists of two aspects: the extraction of API information using bytecode instrumentation, and the development of a visualization tool to analyse the traces using treemaps. Initially, a bytecode instrumentation tool is developed to probe and collect runtime information. The extracted information is logged into an extensible markup language (XML) file. The log file is then visualized as a treemap. The instrumentation part is evaluated using twenty benchmark and ten real-world applications. The results show that the instrumentation tool causes minimal runtime overhead.
Table of Contents
Declaration of authorship .....................................................................IV
Acknowledgements .............................................................................. V
Abstract ........................................................................................... IX
List of Figures .................................................................................. XV
List of Acronyms ........................................................................... XVIII
CHAPTER 1 : INTRODUCTION ....................................................................... 1
1.1 Motivation ..................................................................................... 3
1.2 Problem statement ............................................................................ 5
1.3 Primary research questions .................................................................. 8
1.4 Research Hypothesis ......................................................................... 9
1.5 Aims and objectives ......................................................................... 10
1.6 Research Methodology ..................................................................... 12
1.7 Assumptions .................................................................................. 15
1.8 Dissertation Outline ......................................................................... 15
1.9 Chapter Summary ........................................................................... 17
CHAPTER 2 : LITERATURE REVIEW ............................................................... 18
2.1 Dynamic code analysis and software visualization .................................... 19
2.2 Automatic visualization prediction ....................................................... 29
2.3 Visualization optimization ................................................................. 34
2.4 Chapter Summary ........................................................................... 43
CHAPTER 3 : ON SELECTING A DATA VISUALIZATION TECHNIQUE ............. 46
3.1 Proposed system, dataset, and visualization techniques .............................. 49
3.1.1. BUILDING THE DATASET ....................................................... 51
3.1.2. VISUALIZATION TECHNIQUES ................................................. 55
3.2 Artificial neural network preliminaries .................................................. 58
3.3 Experiments and results .................................................................... 60
3.3.1. ANN EXPERIMENTS ............................................................ 63
3.3.2. THE N-FOLD CROSS-VALIDATION ............................................. 71
3.3.3. PERFORMANCE ANALYSIS ..................................................... 76
3.4 Sensitivity analysis ........................................................................... 80
3.5 Comparison with other classifiers ......................................................... 81
3.6 Ranking three best visualizations ......................................................... 87
3.7 Comparison with state-of-the-art .......................................................... 93
3.8 Discussion ..................................................................................... 96
3.9 Chapter summary ............................................................................ 98
CHAPTER 4 : QUANTIFYING AND OPTIMIZING VISUALIZATION .......................... 100
4.1 The information visualization metrics ................................................. 102
4.1.1. EFFECTIVENESS ............................................................... 104
4.1.2. EXPRESSIVENESS .............................................................. 104
4.1.3. READABILITY .................................................................. 105
4.1.4. INTERACTIVITY ................................................................ 106
4.1.5. THE COMBINED FITNESS FUNCTION ........................................ 107
4.2 Proposed solution .......................................................................... 107
4.2.1 PROBLEM FORMULATION ..................................................... 109
4.2.2 CHROMOSOME ENCODING .................................................... 110
4.2.3 REPRODUCTION OPERATORS ................................................. 114
4.3 Experiments and Results ................................................................. 116
4.3.1 TREEMAP ........................................................................ 116
4.3.2. EA RESULTS.................................................................... 118
4.3.3. EVALUATION .................................................................. 122
4.3.3.1. USER STUDY ......................................................... 124
4.3.3.2. ANALYSIS OF VARIANCE AND POST HOC ANALYSIS ........... 136
4.3.4. DIRECT METHOD .............................................................. 141
4.4 Discussion ................................................................................... 143
4.5 Chapter Summary ......................................................................... 149
CHAPTER 5 : VISUALIZING TRACE OF JAVA COLLECTION APIS .......................... 152
5.1. Proposed System for Java Tree Visualization ....................................... 155
5.1.1 JAVA TRACES VISUALIZATION SYSTEM OVERVIEW ........................ 156
5.1.2 INSTRUMENTATION ........................................................... 158
5.1.3 DATA COLLECTION ............................................................ 159
5.1.4 VISUALIZATION AND USER INTERACTION .................................. 160
5.2. Case study .................................................................................. 163
5.3. Performance evaluation and comparison ............................................. 167
5.3.1 EXPERIMENT DESIGN .......................................................... 171
5.4. Performance evaluation .................................................................. 180
5.5. Chapter Summary ........................................................................ 184
CHAPTER 6 : CONCLUSIONS AND FUTURE WORK ........................................... 186
6.1. Primary research questions .............................................................. 187
6.2. Summary of the findings ................................................................. 197
6.3. Limitations ................................................................................. 199
6.4. Future work ................................................................................ 200
APPENDIX A ........................................................................................ 207
APPENDIX B ........................................................................................ 207
APPENDIX C ........................................................................................ 207
APPENDIX D ........................................................................................ 207
APPENDIX E ........................................................................................ 204
List of Figures
Figure 1.1 Block diagram listing system components ...................................................... 10
Figure 1.2 A visual roadmap of the dissertation .............................................................. 14
Figure 2.1 Related work taxonomy ................................................................................ 19
Figure 2.2 A tree with its corresponding treemap ............................................................. 23
Figure 2.3 Typical compositions of a Java program ......................................................... 24
Figure 2.4 Information visualization techniques classification .......................................... 30
Figure 3.1 System working for the visualization prediction .............................................. 50
Figure 3.2 Visualization, tasks and metadata mapping ..................................................... 53
Figure 3.3 The eight visualizations used as class label ...................................................... 56
Figure 3.4 Neural Network ............................................................................................. 58
Figure 3.5 Dataset larger values vs. smaller values ........................................................... 63
Figure 3.6 Single hidden layer network, 2-hidden layered network ................................... 67
Figure 3.7 no. of nodes vs. MSE ..................................................................................... 69
Figure 3.8 2-Hidden nodes .............................................................................................. 70
Figure 3.9 Two hidden layered structure analysis ............................................................ 74
Figure 3.10 Accuracy for different number of nodes in hidden layer ................................ 75
Figure 3.11 Hidden nodes vs. MSEs ............................................................................... 75
Figure 3.12 Hidden Nodes vs. MSE for 1 hidden layered ANN....................................... 77
Figure 3.13 Confusion matrix of the best ANN architecture ............................................ 79
Figure 4.4 Crossover and mutation operations. ............................................................... 115
Figure 4.5 A tree with its corresponding treemap ........................................................... 117
Figure 5.1 System overview of the system for Java traces visualization ............................ 158
Figure 5.2 Segment of log file ......................................................................................... 161
Figure 5.3 Visualization main view ................................................................................ 165
Figure 5.4 Visualization Package-wise view.................................................................... 166
Figure 5.5 Mutator methods view .................................................................................. 166
Figure 5.6 Search result for HashTable .......................................................................... 166
List of Tables
Table 3.1 Dataset description .......................................................................................... 54
Table 3.2 The eight visualization techniques used in recent literature ............................... 55
Table 3.3 Network structure and initial parameters .......................................................... 64
Table 3.4 NN performance ............................................................................................. 78
Table 3.5 Best ANN performance ................................................................................... 79
Table 3.6 Impact of various learning approaches on the ANN ......................................... 80
Table 3.7 Sensitivity analysis results ................................................................................ 81
Table 3.8 Overall accuracy for random forest .................................................................. 83
Table 3.9 SVM prediction accuracy ................................................................................ 84
Table 3.10 Per class accuracy of different classifiers ......................................................... 85
Table 3.11 Average accuracy and CPU time of classifiers ................................................ 86
Table 3.12 Accuracy comparison using Friedman test ..................................................... 89
Table 3.13 Sensitivity analysis using classifiers- error rate (%) .......................................... 89
Table 3.14 Dataset description ........................................................................................ 90
Table 3.15 Dataset with three best visualizations ............................................................. 91
Table 3.16 Benchmark datasets and the predicted visualization based on task ................... 92
Table 3.17 Comparison between the proposed system and state-of-the-art ........................ 95
Table 4.1 Aspects mentioned in literature for better visualization .................................... 104
Table 4.3 Description of the genes in a chromosome....................................................... 112
Table 4.4 EA parameter settings ..................................................................................... 116
Table 4.5 Various combinations of the objective function ................................................ 123
Table 4.6 The five benchmark tasks ................................................................................ 124
Table 4.7 Summary of user study scores ......................................................................... 130
Table 5.1 An example of collection API objects analysis using clustering ........................ 153
Table 5.2 Log File Detail .............................................................................................. 164
Table 5.3 Collection APIs per program .......................................................................... 168
Table 5.4 Null hypotheses with their alternatives ............................................................ 171
Table 5.6 Task description ............................................................................................. 173
Table 5.7 Experimental group statistics for time and usability score (0-5) ......................... 175
Table 5.8 Control group statistics for time and usability score (0-5) ................................. 176
Table 5.9 Per task comparison ....................................................................................... 178
Table 5.10 Results statistics ............................................................................................ 179
Table 5.11 Software time taken while loading ................................................................. 181
Table A.1 Single hidden layer NN structure time and MSE (split) ................................... 207
Table A.2 Two-Hidden layers NN structure time and MSE ............................................ 207
Table A.3 Network with Training and Test data ............................................................. 207
Table A.4 NN with validation check- Early stop ............................................................ 207
Table A.5 NN with stop with goal / Validation stop ..................................................... 207
Table A.6 Single hidden layer NN structure and MSEs (10Folds-CV) ............................. 207
Table A.7 Comparison of classification and prediction accuracy (10Folds-CV) ................ 207
Table A.8 Information visualization results for iris dataset .............................................. 207
List of Acronyms
1D One dimension
2D Two dimensions
ANN Artificial neural network
ANA Analytical style demonstrator
API Application programming interface
ART Artistic style demonstrator
AST Abstract syntax tree
AUC Area under curve
AWT Abstract window toolkit
BCEL Byte code engineering library
CI Computational intelligence
DT Decision tree
DVF Dense vector field
EA Evolutionary algorithm
EC Evolutionary computation
FFNN Feed-forward neural network
GA Genetic algorithm
HCI Human computer interaction
IDE Integrated development environment
JCF Java collection framework
JRE Java runtime environment
JVM Java virtual machine
JVMPI Java virtual machine profiler interface
JVMTI Java virtual machine tools interface
k-NN k-nearest neighbor
LBM Lifetime behavior model
LM Levenberg-Marquardt
MAG Magazine style demonstrator
MBs Megabytes
MLP Multilayer perceptron
MSE Mean square error
NB Naïve Bayes
nD n dimensions
RF Random forest
Rprop Resilient backpropagation
SVM Support vector machine
TTT Task by data type taxonomy
XML Extensible markup language
Introduction
“Imagination rules the world”
Napoleon Bonaparte
Information visualization is now ubiquitous in almost every discipline as a means to visually analyse large volumes of data effectively. Nevertheless, the selection of an appropriate visualization technique for a particular problem domain or dataset is still a non-trivial undertaking. Moreover, the prospective visualization needs to be perceptually appealing and aesthetically alluring to the human cognitive system. The work presented in this dissertation addresses
these problems using software instrumentation and
computational intelligence-based approaches. Using software
instrumentation, a visualization tool is presented to
comprehend the collection APIs usage in Java-based
applications through dynamic analysis. Computational
intelligence is used for predicting appropriate visualization
techniques for a particular dataset using the metadata.
Additionally, a set of visualization metrics is proposed to
quantify the given visualization technique. Later, the
proposed visualization metrics are used to optimize
visualization layout using evolutionary algorithms. This
chapter provides a synopsis for the research done in the
dissertation. The motivation for the work carried out in the
subsequent chapters is also covered. Various research
questions are formulated, the answers to which are sought
in this work. The research methodology, along with its major limitations and assumptions, is also discussed.
The advances in computing and related technologies have given birth to new avenues of information handling and exploration. These advances have generated large volumes of data. These volumes of data and information are produced and manipulated daily through various media, i.e., social media [1, 2], business and financial transactions [3], ever-growing large databases (also known as big data) [4], and large, intricate software systems running round the clock [5]. This has paved the way for approaches to exploit the hidden information underneath these piles of data. The three popular domains used to explore huge amounts of data include:
data mining, exploratory data analysis, and visualization. Each of these
can further be classified into various domains. For example, stream mining, uncertain data mining, and graph mining are a few of the subcategories of the data mining domain. Similarly, visualization has the subdomains of information visualization and software visualization. Since the work presented in this dissertation addresses the problem at hand using visualization, and applies computational intelligence with visualization as a test bed, the rest of this chapter focuses on these topics. For a further study of data mining and exploratory data analysis, the reader is referred to [6, 7].
Information visualization is a powerful method to visually explore large, complicated data and gain a thorough insight [8]. Information visualization uses visual computing to amplify human cognition with abstract information. It promises to enable expeditious understanding and action in a world of increasing data volumes. There has been a consistent demand for sophisticated tools and techniques with which visualization systems can explore and analyse information efficiently [9]. Nonetheless, for naïve business users, the appropriate
tool/technique selection with respect to the data at hand remains a non-trivial issue [10, 11]. Moreover, to explore and present the underlying information effectively and efficiently, visualizations suited to the specific dataset are needed [12]. The visualization must not be merely a pretty image; it should be perceptually and aesthetically appealing to the human cognitive system as well. This is also true for complex entities, such as software, which produce a vast amount of information, especially during execution [13].
Based on the discussion above, this dissertation aims to address the problem of gathering data from software-based systems for visual inspection. Computational intelligence methods are also developed to extract useful information, to predict an appropriate visualization technique for a given dataset, and to optimize the layout of a visualization technique.
1.1 Motivation
Effective information handling and manipulation is an important factor influencing strategic decision making, from primordial times to the present day. The current era, however, is marked as the information age, which requires processing bulk data rapidly to gain insight [6]. In addition to the various experiments, processes, and events that generate data, software systems, whether system software or application software, also generate data. This is especially relevant for software systems built using the object-oriented paradigm, owing to its inheritance, polymorphism, and overriding features. Collecting and understanding such data can give insight into the internal working of a software system, which can later be optimized for performance. However, the data generated by these software-based systems can be huge, ranging from megabytes (MBs) to gigabytes (GBs), depending on the size of the particular software. Additionally, object-oriented software makes vast use of objects of different data structures, such as the Java collection application programming interfaces (APIs), which makes the code very difficult to understand. Programmers are always
interested in optimizing their code to make it more efficient. However, when the code base is huge and objects from multiple other classes are instantiated, understanding the code becomes a difficult task.
Information visualization, on the other hand, is a powerful method for
visual representation of large datasets and gives instant stimulus to the
human cognitive system [14]. Advances in computing and information processing facilities have given rise to new tools and visualization techniques [15]. Selecting a particular visualization technique for a given problem requires domain expertise. Nevertheless, business users are always keen to build the most suitable visualization of the data at hand. Most of the time, these users are naïve, with little or no knowledge of the underlying dataset and the intrinsic relationships among its items. Although the field of information visualization has come a long way, a quantitative measure to evaluate how well a particular visualization represents the data is yet to be found. Such a quantitative measure would not only help in choosing an appropriate visualization, but could also be used to optimize the layout of a particular visualization technique.
Computational intelligence (CI), a set of nature-inspired computational
methodologies that can address complex real-world problems for which
traditional approaches are ineffective, can be used to optimize the layout of
a visualization technique. The core components of CI include evolutionary computation (EC), artificial neural networks (ANNs), and fuzzy logic, to name a few. However, to optimize a visualization layout, a CI-based solution requires an objective function that is to be either maximized or minimized.
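The role of such an objective function can be illustrated with a minimal sketch. The Python snippet below (illustrative only; the two design parameters and the readability score are invented stand-ins, not the metrics proposed in this dissertation) uses a simple hill climber to maximize a hypothetical layout-quality objective:

```python
import random

def readability_score(params):
    """Hypothetical objective: rewards spacing near 0.4 and contrast
    near 0.7. A real visualization metric would replace this stand-in."""
    spacing, contrast = params
    return -(spacing - 0.4) ** 2 - (contrast - 0.7) ** 2

def hill_climb(score, start, step=0.05, iters=500, seed=0):
    """Greedy local search: keep a random perturbation only if it
    improves the objective."""
    rng = random.Random(seed)
    best, best_score = list(start), score(start)
    for _ in range(iters):
        candidate = [p + rng.uniform(-step, step) for p in best]
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

params, s = hill_climb(readability_score, [0.0, 0.0])
```

Any CI method, from a simple local search like this to the evolutionary algorithms used later in this work, needs exactly such a function to distinguish better layouts from worse ones.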
Based on the aforementioned details, the work in this dissertation is motivated to use the data generated by software-based systems (although data from any other system can also be used) and software instrumentation, coupled with visualization, to gain insight into complex software-based systems. It also aims to exploit the search capabilities of CI for predicting and optimizing a visualization technique.
1.2 Problem statement
Based on the motivation, the problem is twofold: firstly, there is a need for an appropriate visualization selection mechanism that effectively presents the data and supports the tasks to be accomplished with that particular dataset. Secondly, the selected visualization should fulfil the aesthetic and perceptual requirements of the users [12]. Conversely, an inappropriate visualization will lead to inadequate decision making based on incorrectly visualized information. In addition, the visualization must not overwhelm the user, whilst conveying the intended information effectively and efficiently [16]. Remedying these situations raises many challenges: What type of information is needed from the dataset to predict an appropriate visualization? How are tasks and visualizations related? Is a benchmark dataset available? Which perceptual design parameters characterize a visualization, which metrics evaluate it, and how can these metrics be computed? Several areas, such as human-computer interaction (HCI), interface design, perceptual studies, and cognitive science, already offer abundant foundational work addressing these imperfections. Similarly, advances in computational intelligence techniques, such as ANNs and genetic algorithms (GAs), bring powerful methods for classification, prediction, and optimization.
Large software systems are an example of the complex entities that human beings develop [17]. The behaviour of software systems during execution is always a subject of interest to developers with respect to maintenance and application optimization [18, 19]. This holds particularly for Java-based applications, where programs use collection APIs, e.g., Hashtable and ArrayList, to store runtime data. Program performance may degrade with inefficient usage of these APIs [20]. For program comprehension in general, and maintenance in particular, developers need to understand where the large API objects are created. The runtime information about API usage may be recorded through binary instrumentation, although this may be subject to runtime overheads [21]. Thus, there is a need for an effective method to analyse the large amount of information collected during software execution.
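The core idea of recording where collection objects are created can be sketched in Python, even though this dissertation targets Java bytecode; the class and function names below are purely illustrative and not part of the proposed tool. A wrapped constructor logs the call site of every instantiation, which is, in spirit, what an instrumentation probe does:

```python
import collections
import inspect

# Maps (filename, function) -> number of objects created there.
allocation_log = collections.Counter()

def traced(cls):
    """Wrap a class so every instantiation records its call site,
    mimicking what a bytecode instrumentation probe would log for
    Java collection APIs."""
    class Traced(cls):
        def __init__(self, *args, **kwargs):
            caller = inspect.stack()[1]  # the frame that invoked the constructor
            allocation_log[(caller.filename, caller.function)] += 1
            super().__init__(*args, **kwargs)
    Traced.__name__ = f"Traced{cls.__name__}"
    return Traced

TracedDict = traced(dict)  # stand-in for an instrumented collection class

def build_cache():
    return TracedDict(a=1)

for _ in range(3):
    build_cache()
```

A real tool would write such records to a log file for post-mortem analysis, as the instrumentation module described later in this dissertation does.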
Visualization is a powerful method for exploring and confirming underlying hypotheses. Modern visualization tools and techniques are developed in abundance, both in research and in industry. This situation demands a method for selecting a visualization technique for given data. Nevertheless, the underlying data may have complex intrinsic relationships and characteristics that influence the visualization selection process [22], which must cope with the complexity of the underlying data or system. Business users with insufficient skills and knowledge about the data and/or visualization are an example of such situations. Knowledge of both the data and the visualization technique is highly desirable to build a suitable visualization [6]. This perspective necessitates an automatic visualization framework that predicts a visualization technique for specific data with high accuracy. The potential system must have knowledge not only of the data but also of the visualization tasks. This will make the tasks to be executed by the user easier and enhance productivity and decision making.
As an intimately connected problem, the selected visualization must also be perceptually pleasing and aesthetically appealing to human cognition. However, creating such a visualization is not a trivial task. Generally, these quality aspects of visualization are inherently subjective and vary with context [12, 23]. In the visualization community, controlled experiments and user studies remain the main evaluation methods for comparing different visualization types and tools [24]. In the literature, various theoretical visualization metrics have been discussed and proposed to evaluate visualization techniques [25, 26]. Such situations need an automated framework to evaluate and create an optimal visualization based on the existing knowledge. The computational model must be established on the basis of existing theories from the empirical research conducted over the years. Advancements in several divergent yet related fields may be utilized, e.g., HCI, cognitive science, interface design, computer graphics, and operations research. The intended system will then be able to display better visualizations created without the intervention of human beings.
To explore and comprehend a complex entity such as software, visualization is the more appealing option [13, 27-29]. Information visualization techniques are already used to explore different aspects of software from a static perspective [30] and to analyse the runtime behaviour of software [13]. Most object-oriented applications are built from reusable components known as libraries and APIs. Java-based application developers use APIs to store program data and variables during execution [31]. Since the efficient usage of these APIs has an impact on the performance of a program, developers need to understand the runtime behaviour of their programs. Usage patterns and source-code location information, e.g., the packages or classes responsible for creating a large number of objects, help in program comprehension and maintenance [32, 33]. Dynamic bytecode instrumentation is a method to extract information about the state of a program during execution [34]. However, this method has its own limitations, since performance degrades due to the instrumentation code [35]. Additionally, the runtime information needs to be properly analysed for insightful patterns. The situation demands a twofold solution: first, a lightweight instrumentation to extract runtime information about the APIs used by a Java application; second, a suitable visualization that assists developers in effectively analysing the large amount of traced information with minimum effort.
Hence, this work provides solutions covering the above-mentioned aspects of the problem statement. For appropriate visualization selection, a metadata-based ANN approach is provided. The proposed approach relieves the user of the complexities of the dataset and the visualization selection process. The model is trained and tested for the accuracy of visualization selection using a dataset and the tasks to be performed. This makes the users' tasks easier and enhances productivity by augmenting decision making. Furthermore, the proposed visualization metrics and computational intelligence-based framework provide better visualizations. The proposed visualization metrics are based on existing theories and knowledge, which provide a strong basis for the computational model. The proposed solution can be utilized to evaluate visualizations computationally and obtain a better visualization for human decision making. Moreover, the selective instrumentation and visualization-based tool enables developers to extract Java collection API information and provides a visualization to gain insight. The visualization tool helps developers analyse a large amount of data effectively.
1.3 Primary research questions
The research presented in this dissertation has three aspects. The first is the automatic prediction of a visualization for a particular dataset based on metadata and the tasks that the user needs to perform. The second is to build perceptually better visualizations using optimal design attributes. The third is to use information visualization techniques and bytecode instrumentation to investigate Java API usage during program execution. Based on this
background and the motivation, the following research questions are
formulated to carry out this work:
RQ-1: What are the important characteristics of a dataset that influence the selection of a visualization technique?
RQ-2: How can metadata and a particular task related to the data be used to predict a visualization for a dataset?
RQ-3: What is the best CI model to predict a visualization based on metadata?
RQ-4: Which aesthetic and perceptual design parameters are important for a specific visualization?
RQ-5: How do visualization features and design parameters map to the visualization metrics?
RQ-6: How can the visualization metrics be computationally evolved to optimize the layout of a visualization technique?
RQ-7: Which types of Collection APIs are frequently used by a given/target Java program during execution?
RQ-8: Which packages/classes/methods of the target program are responsible for instantiating Collection API objects?
RQ-9: How can dynamic bytecode instrumentation be used to extract API object traces with minimal runtime overheads?
RQ-10: Can treemap-based visualization be utilized for the analysis of the Collection API objects of a particular Java program?
1.4 Research Hypotheses
Research hypotheses are high-level general statements about the research. In this dissertation, the following research hypotheses are formulated:
H1: A computational intelligence-based model can be used to automatically predict a visualization for a specific dataset with a relatively high degree of accuracy.
Figure 1.1 Block diagram depicting the system components
H2: An evolutionary computation-based optimization framework can provide perceptually and aesthetically better visualization design parameters.
H3: Selective bytecode instrumentation with treemap visualization can be used for understanding Java API usage.
1.5 Aims and objectives
This dissertation aims to contribute towards information visualization
techniques using computational intelligence (CI) and software
instrumentation from two perspectives. The first is the automatic prediction of a visualization for a particular dataset by utilizing its metadata and the tasks to be performed on the data. This is followed by the optimization of the visualization, to build a perceptually appealing and aesthetically pleasing result. This part of the work is accomplished using computational intelligence techniques, specifically artificial neural networks (ANNs) and evolutionary algorithms (EAs). An ANN is tuned and tested for the prediction of a visualization based on metadata and the visualization tasks, and an EA-based framework is developed to determine the optimal set of design parameter values to build comparatively better visualizations. The second perspective uses a visualization technique along with bytecode instrumentation to analyse and comprehend API usage in Java applications. A lightweight bytecode instrumentation is developed to extract the runtime usage of API objects in Java programs with minimal overheads. A treemap visualization is then devised to display the vast amount of information on a single screen. Figure 1.1 shows the working and interaction of the various components of this work using a block diagram. The following objectives are formulated to achieve the research aims:
- Investigate the existing literature to find the intrinsic properties of a dataset that may be used to select an appropriate visualization method.
- Review information visualization techniques for specific types of tasks on a dataset.
- Develop a novel dataset based on metadata, tasks, and visualization techniques.
- Develop a CI-based model to classify and predict a visualization for a dataset and intended task with a high degree of accuracy.
- Test the CI model against other well-known classifiers and state-of-the-art approaches.
- Develop and formulate visualization metrics from the existing knowledge.
- Investigate the perceptual and aesthetic design parameters for a particular visualization.
- Design an EA-based framework to determine the design parameters of a visualization technique.
- Develop a bytecode instrumentation tool to extract Java API object usage information during the execution of a program.
- Develop a treemap-based visualization interface to analyse Java API objects on the basis of their types, packages, classes, and methods.
- Design a controlled experiment and case studies with benchmark tasks to evaluate the instrumentation and visualization results.
1.6 Research Methodology
A stepwise methodology is adopted to carry out this research and to answer the formulated questions. Several areas of research are investigated to establish a strong foundation for the proposed techniques; the broader areas include information visualization, automated visualization selection, intrinsic properties of datasets, and computational intelligence-based classifiers. This investigation informed the experiments for research questions RQ-1 to RQ-3. Theories on visualization metrics, on the perceptual and aesthetic design aspects of visualization and computer graphics, and on soft-computing techniques are carefully investigated to form a foundation for research questions RQ-4 to RQ-6. Further, to address RQ-7 to RQ-10, an exhaustive review of research on software visualization tools, dynamic analysis, Java API usage analysis, and evaluation methods is carried out.
For the visualization prediction module, a novel dataset is built based on the knowledge presented in the literature, as previously no benchmark dataset was available for visualization classification. However, already established visualization techniques, e.g., line charts, parallel coordinates, and treemaps, are known to be more suitable for certain tasks. Commonly used tasks are extracted with their intended visualization techniques and combined with the metadata of the datasets to be visualized. The metadata consists of intrinsic properties, i.e., dimensions, number of instances, number of attributes, and data types. The newly created dataset is then classified using an ANN-based classifier. Several ANN models and training methods are tested. The ANN model is compared with well-known classifiers, including support vector machines (SVM), random forests (RF), and decision trees (DT), using benchmark performance metrics. The proposed system is also compared with state-of-the-art systems.
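To make the metadata-to-visualization mapping concrete, the following Python sketch classifies invented metadata records with a simple nearest-neighbour rule; the features, tasks, and labels are illustrative stand-ins for the actual dataset and the ANN classifier described above:

```python
import math

# Illustrative records: (rows, attributes, is_hierarchical, task) -> label.
# These examples are invented, not the dissertation's benchmark dataset.
TRAIN = [
    ((100, 2, 0, "trend"), "line chart"),
    ((5000, 8, 0, "correlation"), "parallel coordinates"),
    ((1200, 3, 1, "part-to-whole"), "treemap"),
    ((300, 2, 0, "trend"), "line chart"),
]

TASKS = {"trend": 0, "correlation": 1, "part-to-whole": 2}

def encode(meta):
    """Turn a metadata tuple into a numeric feature vector; categorical
    features are scaled up so they dominate the distance."""
    rows, attrs, hier, task = meta
    return [math.log10(rows), attrs, hier * 10, TASKS[task] * 10]

def predict(meta):
    """1-nearest-neighbour over encoded metadata: a stand-in for the
    trained ANN classifier."""
    x = encode(meta)
    nearest = min(TRAIN, key=lambda item: sum(
        (a - b) ** 2 for a, b in zip(x, encode(item[0]))))
    return nearest[1]
```

The real classifier is an ANN trained on the constructed dataset; the point here is only that a metadata vector plus an encoded task can index into a set of visualization labels.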
The next step involves the exploration of metrics for the optimization of a visualization layout. Various theories on quantifying visualization are presented in the information visualization literature; however, the subjective nature of these theories makes it challenging to devise a set of metrics. The perceptual and aesthetic design parameters of a particular visualization are mapped to the proposed visualization metrics. The mapping process is based upon the theories and knowledge presented over the years within the domains of information visualization, human-computer interaction (HCI), interface design, and psychology. An EA-based technique is developed to derive optimal design parameter values. The EA is fed with a random initial population, and the fitness function and genetic operators are used to search for the best solution. In addition, the outcome needs to be compared with contemporary research; for this purpose, the internal metrics are combined with external evaluation criteria. The evaluation process is followed by user studies and statistical analysis of the results collected during this process.
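The EA loop just described (random initial population, fitness evaluation, selection, crossover, mutation) can be sketched as follows; the two design parameters and the fitness function are invented placeholders, not the metrics proposed in this dissertation:

```python
import random

rng = random.Random(42)

def fitness(genome):
    """Stand-in aesthetic metric: prefers a font size near 12 and a
    node spacing near 0.5. A real system would plug in the proposed
    visualization metrics here."""
    font, spacing = genome
    return -(font - 12.0) ** 2 - 50 * (spacing - 0.5) ** 2

def evolve(pop_size=30, gens=60):
    # Random initial population of (font size, spacing) genomes.
    pop = [[rng.uniform(6, 24), rng.uniform(0, 1)] for _ in range(pop_size)]
    for _ in range(gens):
        def pick():
            # Tournament selection: best of three random individuals.
            return max(rng.sample(pop, 3), key=fitness)
        nxt = []
        while len(nxt) < pop_size:
            a, b = pick(), pick()
            child = [(x + y) / 2 for x, y in zip(a, b)]  # arithmetic crossover
            if rng.random() < 0.3:  # Gaussian mutation on one gene
                i = rng.randrange(len(child))
                child[i] += rng.gauss(0, 0.5)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve()
```

The same skeleton applies regardless of how many design parameters are evolved; only the genome length and the fitness function change.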
Figure 1.2 A visual roadmap of the dissertation
The final step in this methodology is the development of a visualization system and a bytecode instrumentation tool to extract and analyse the Java API objects instantiated during program execution. A modular strategy is adopted for this purpose; the bytecode instrumentation tool is developed with the aim of avoiding runtime overheads, which would otherwise degrade the performance of the targeted applications. The instrumentation module adds probes to the Java application, and API object information is stored in a log file. The log file is then used for post-mortem analysis of the APIs based on a treemap visualization. The treemap visualization module takes an extensible markup language (XML)-based tree representation of the log file as input. The visualization interface provides several sub-views to explore the log file for API package, class, method, and data-type information. The evaluation of the tool is also twofold: the instrumentation module is evaluated through case studies of large Java applications as well as a benchmark suite, while the visualization method is evaluated using a controlled experiment. Figure 1.2 shows a visual roadmap of this dissertation.
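The way a hierarchical API log maps onto screen space can be illustrated with the classic slice-and-dice treemap layout, sketched below in Python; the nested dictionary imitates a log (package -> class -> object count) with invented numbers and is not the tool's actual XML format:

```python
def weight(node):
    """Total object count beneath a node (leaf counts are numbers)."""
    return node if isinstance(node, (int, float)) else sum(
        weight(v) for v in node.values())

def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Minimal slice-and-dice treemap: children split the parent
    rectangle along alternating axes, proportionally to their weight."""
    if out is None:
        out = []
    if isinstance(node, (int, float)):
        return out
    total = weight(node)
    offset = 0.0
    for name, child in node.items():
        frac = weight(child) / total
        if depth % 2 == 0:  # split horizontally at even depths
            rect = (x + offset * w, y, frac * w, h)
        else:               # split vertically at odd depths
            rect = (x, y + offset * h, w, frac * h)
        out.append((name, rect))
        slice_and_dice(child, *rect, depth + 1, out)
        offset += frac
    return out

# Hypothetical log: two packages, three classes, 250 objects in total.
log = {"java.util": {"ArrayList": 120, "HashMap": 80}, "app": {"Cache": 50}}
rects = slice_and_dice(log, 0.0, 0.0, 1.0, 1.0)
```

Each rectangle's area is proportional to the number of objects beneath it, which is what lets a treemap show an entire log on a single screen; the actual tool builds on the same principle.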
1.7 Assumptions
Every research effort has certain limitations, and the work is usually carried out under several logical assumptions. The work presented in this dissertation is likewise subject to certain assumptions and a few limitations. The controlled experiment and user study are performed under the assumption that all participants are volunteers and honest in their judgment about the tasks. Additionally, it is assumed that the participants have the required experience and are truthful in providing the required information. For the development of the dataset and the visualization metrics, and for exploring visualization design parameters for optimality, it is assumed that the existing knowledge and experiments were produced even-handedly. The development of optimal visualizations is carried out from a general perspective, not subject to individual preferences.
1.8 Dissertation Outline
The remainder of the dissertation consists of five more chapters. Chapter 2 reviews the contemporary work related to this dissertation from several aspects, i.e., software visualization using dynamic analysis, tool design, visualization evaluation methods, automatic visualization selection, visualization optimization, visualization metrics, and computational intelligence approaches to automatic prediction, dataset classification, and design optimization.
Chapter 3 deals with the concept of automatic prediction of visualization techniques using computational intelligence. The chapter presents the methodology for the creation of the prospective dataset used for training and testing an ANN model. The training and testing of several ANN models are elaborated in detail, including different network parameter settings. The chapter also covers an extensive evaluation of several classifiers in comparison with the ANN model. The classifiers are evaluated using well-known performance metrics, i.e., accuracy, F-measure, and the confusion matrix. The discussion and comparison with the state of the art elucidate the advantages of the proposed approach.
Chapter 4 describes a major contribution of this dissertation: information visualization optimization using bio-inspired evolutionary algorithms. The theories and existing literature on information visualization are formalized to explore and devise metrics that quantify visualization quality. The proposed visualization metrics and their mapping to a specific visualization technique are presented. The chapter also includes an extensive study of EAs and their main components. A detailed experimental case study and comparison show the effectiveness of the proposed solution.
Chapter 5 presents details about the first contribution of this work: the visual analysis and comprehension of API usage in large Java-based applications. A comprehensive study on the subject is presented. It covers the basic methodology to build the instrumentation and visualization tool, its key components, and its modules. A case study on large Java applications is devised to verify the tool's results on real applications. This is followed by an empirical evaluation of the visualization part through a controlled experiment. Several statistical tests are applied to validate the results of the experiment and confirm the hypotheses. The effectiveness of the instrumentation tool regarding runtime overhead is checked on standard benchmark suites. The results are tested and compared against state-of-the-art approaches.
Chapter 6 draws conclusions from this work and articulates the major contributions and limitations of the study. The chapter revisits the research questions and summarizes to what extent the dissertation succeeds in answering them. Promising future directions are also elucidated to extend this work along several lines.
1.9 Chapter Summary
This chapter has laid the foundation for the rest of the dissertation. The
motivation and problem statement highlighted the context in which the
actual problem is being solved using the proposed work. Following these, hypotheses and research questions were formulated to fill the knowledge gap through their answers. Moreover, the aims and objectives set for the dissertation, and the methodology to achieve them, were elaborated. The basic assumptions made while carrying out this research were also described. The last section outlined the dissertation chapter-wise, with a short description of each chapter. The next
chapter presents a comprehensive review of contemporary theories and
past research closely relevant to this dissertation.
Chapter 2
Literature Review
“There is always a better way.”
Thomas Edison
The work in this dissertation focuses on three aspects of visualization: automatic prediction, optimization, and the use of visualization for the comprehension of dynamically collected data about Java application programming interfaces (APIs). This chapter reviews the literature pertaining mainly to these three areas of research. An exhaustive investigation of the relevant work is presented, with a focus on contributions, opportunities, and the identification of knowledge gaps. Further, the major contributions of this dissertation are discussed, with their key distinctions from the contemporary work.
Information visualization is a powerful technique for the visual representation of data with the help of computer-based systems. The field, which combines several related areas, is rich with new tools and well-established theories. Visualization tools are built keeping various aspects in view, including the ability to cope with large volumes of data, effective data handling, support for various data-related tasks, and an easy-to-use interface. Likewise, theories are developed to provide a basis for new techniques and algorithms consistent with the existing knowledge. Advances in information visualization have given birth to new aspects and research avenues, such as visual analytics, visual data mining, and visual data
classification.

Figure 2.1 Related work taxonomy: dynamic instrumentation and trace visualization (bytecode instrumentation, software visualization, API analysis); automatic visualization selection (theoretical visualization taxonomies, automated visualization systems); and visualization optimization (soft computing-based optimization, visualization metrics, treemaps).

The work presented in this dissertation is concerned mainly
with three areas of information visualization, i.e., automatic visualization prediction, visualization optimization, and software visualization with dynamic bytecode instrumentation. Additionally, the dissertation covers related work in program component analysis, API usage analysis, computational intelligence techniques, dataset creation, visualization metrics, and optimization using evolutionary algorithms. A comprehensive analysis of the relevant theories and contemporary approaches is presented, along with their limitations.
The rest of the chapter is organized as follows: Section 2.1 covers related work on dynamic analysis and software visualization. Automatic visualization selection is discussed in Section 2.2, while the literature on visualization optimization and visualization metrics is presented in Section 2.3.
2.1 Dynamic code analysis and software visualization
This section covers the literature on dynamic code analysis and software visualization from the perspective of this dissertation. The literature that combines aspects of dynamic analysis and visualization is the most important for building the basis of this work. Additionally, within the domain of dynamic code analysis, only bytecode instrumentation is discussed, since this work performs only bytecode instrumentation to extract the runtime information of a Java program. Work relevant to API usage analysis is also analysed and reviewed.
Software visualization is a technique to visually depict software at different levels for effective understanding and to manage the code's complexity. It is an important activity for the maintenance, reverse engineering, and re-engineering of software [36]. Software has different aspects, i.e., the development lifecycle activities, the source code view, the software architecture, and the runtime behaviour. In the literature, a considerably large body of work has been proposed to visualize these different aspects of software. Cornelissen et al. [13] analyse the importance of dynamic analysis in program comprehension and discuss the growing importance of visualization for understanding the runtime behaviour of software. Caserta et al. [4] present a comprehensive survey on visualization related to the static aspects of software and its evolution. More recently, Shahin et al. [28] conducted a comprehensive review of software architecture visualization based on articles published between 1999 and 2011.
Pauw et al. propose Jinsight [27], a software visualization tool to explore the runtime behaviour of Java programs. The tool provides two visualization views, i.e., a histogram view and an execution view, to gain insight into a Java program. Jinsight captures several program events, including object construction/destruction and method invocation, using a Java virtual machine (JVM) instrumentation technique. However, Jinsight [27] is directed more towards performance analysis than program comprehension, and as the number of traces increases, the visualization becomes difficult for the user to understand and patterns become hard to comprehend.
Gammatella is a visualization tool presented by Jones et al. [37] to explore program data visually. The tool synthesizes program trace data for visualization, based on the treemap technique, where program execution data is shown at several granularity levels, i.e., the system level and the file level. The tool provides an interaction facility through which a user may visualize regions of the program's trace data. However, Gammatella has a limitation when it comes to visually showing objects per package or per class, since large applications require an information hierarchy to visualize. The tool also needs to be evaluated for effectiveness and usability; at present, only a case study is presented in its support. Wu et al. propose a novel
visualization based on their object lifetime behaviour model for Java program objects [18]. The lifetime behaviour model (LBM) is presented to capture and model the events of a specific Java object in a temporal manner. The authors present a prototype tool that gathers Java program traces at the object level using the Java virtual machine profiling interface (JVMPI) and then uses several visual views to gain insight. The visualization is based on a tree-type structure and consists of three views: a thread-oriented view, a method-oriented view, and an interaction-oriented view. The authors also propose an object performance measurement model and a visualization based on the several states a program object passes through during execution. Nonetheless, the visualization technique needs to be evaluated to determine its effectiveness and usability, and the approach does not provide a global view of a Java program, which could help users view complete information on a single screen. Moreover, the object performance model in [18] needs a complete demonstration, and its visualization may be further examined. Reiss et
demonstration and its under visualization may further be examined. Reiss et
al. [38] present an online visualization system for an executing program
based on various program components. Their visualization scheme uses
different abstraction levels to depict program behavior while it is
executing. The technique consists of two visualization views, JIVE and JOVE,
used for program comprehension, debugging, and performance analysis. The
JIVE visualization presents program behavior in terms of classes, packages,
and threads in a single view. The JOVE part focuses on program code and
how it executes at runtime. The major limitation of this approach is the
difficulty of visualizing the entire course of execution. However, the
visualization is effective in showing the program component hierarchy. At
the same time, it has not been evaluated empirically. Vasco, an
interactive visualization tool, has been presented by Duseau et al. [39].
The tool assists developers in discovering program behavior and in
understanding temporary-object issues, known as object churn. The tool
provides a flexible and scalable approach that enables developers to quickly
identify the source of object churn in framework-intensive applications.
Their visualization is based on Sunburst, a popular technique using a
tree-like structure. The developer is provided with different views of the
trace data with little cognitive strain. They also report the tool's
application to three framework-intensive software applications. However,
their work does not utilize a formal approach to evaluate the visualization,
and the hierarchical information about the source of object churn is not
clear in the depicted visualization. Heapviz [40] is another offline
approach that uses dynamic
analysis to extract program data structures and interactive visualization to
gain insight effectively. The technique captures snapshots of the program
heap and uses graph-based visualization to present a global view. The
technique is evaluated using case studies and benchmark tools. The tool
summarizes the objects created and uses this information for program
comprehension and debugging. However, graph navigation is not trivial,
particularly when the graph becomes too large. Moreover, some works, such as
[41] and [42], present visualization tools for pedagogical purposes, to
teach how program data structures are used at execution time. However, such
tools cannot be applied to large software applications.
Treemap [43] is a popular space-filling visualization technique for
hierarchical information. Since its inception in 1991 by Shneiderman [43],
the treemap has been used to visualize hierarchical information in a variety
of domains, including business [44] [45] [46], news media [47], software
[48], and medicine [49]. Originally, the treemap was proposed for hard disk
content visualization, to quickly analyze the disk space occupied by various
users. Several variations of the treemap's original slice-and-dice layout
algorithm have been proposed over the years [50] [51] [48] [52]. Treemap
visualization divides the entire screen into nested rectangles, which may be
rendered as squares. Each rectangle corresponds to a node in the data
hierarchy. The innermost rectangles represent leaf nodes, while the other
nodes are represented by enclosing rectangles, as shown in Figure 2.2.
Several dimensions of the original data are conveyed through the color and
size of the rectangular regions.

Figure 2.2 A tree with its corresponding treemap
In the literature, several treemap-based applications for user information
have been presented over the last two decades. In [24], the authors propose
a treemap-based visualization and user interface, called ResultMaps, for
digital library repository search. The technique has been tested in two lab
experiments, both of which showed improved repository-search visualization.
TreeCovery, another treemap-based visualization interface, has been proposed
by Rios et al. [53]. The tool helps government agencies to effectively
monitor the distribution of funds among various departments. The authors
have further enhanced the treemap with zooming, feature highlighting, and
item filtering.
Figure 2.1 Typical compositions of a Java program
The Java collections framework (JCF) was introduced with Java 2 to provide
an efficient mechanism for handling program data structures such as
Hashtable, List, Map, and ArrayList [54]. The JCF is part of the core Java
API package java.util and consists of a group of classes and interfaces
that standardize the handling of data as a single unit, the collection.
Most of the collections in Java are derived from the Collection interface
provided in the java.util package. Java developers heavily use these
collection APIs, since they provide a convenient way to handle different
data structures without the burden of going into the details. All the
operations a programmer typically performs on data, such as searching,
sorting, and insertion, are available through the JCF.
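The operations just listed can each be performed through the JCF in a line or two. The following minimal sketch uses only java.util types; the class and method names are our own, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Minimal sketch of insertion, sorting, and searching through the JCF.
public class JcfDemo {
    // Insertion and sorting: copy the elements into an ArrayList
    // (insertion via the Collection-copy constructor), then sort the copy
    // with the Collections utility class.
    public static List<Integer> sortedCopy(Collection<Integer> data) {
        List<Integer> list = new ArrayList<>(data);
        Collections.sort(list);
        return list;
    }

    // Searching: a Map lookup by key, the JCF's associative search.
    public static boolean hasKey(Map<String, Integer> index, String key) {
        return index.containsKey(key);
    }
}
```

The point of the sketch is the one made above: the programmer works against the Collection and Map interfaces and never touches the underlying data-structure implementation.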
Another aspect of the proposed work is identifying program components and
their usage during execution, at different granularity levels. In the
contemporary literature, a large volume of work has been proposed to
extract and analyze runtime program components [55] [56] [57]. A
comprehensive survey has been conducted by Robillard et al. [58] to
systematically categorize API usage and analysis techniques. The
authors include over sixty techniques in their study and organize them into
five categories based on API usage, migration, and behavior specification.
Zaidman et al. [59] propose a mining technique for the automatic
identification of a program's key classes using dynamic coupling and web
mining approaches. Case studies show that the technique is able to identify
most of the key classes of a program. The basic objective of the work is
code comprehension, so that developers can easily get a snapshot of the
complete software. However, the technique is limited to key classes only,
and it is not supported by a proper visualization through which the
information could be gained effectively. Kawrykow et al. present a novel
technique [55] to automatically detect API imitation using static code
analysis. The technique analyzes the target program to identify code
segments that imitate library methods. Moreover, the imitations are grouped
to identify potential API usage patterns. They use this approach to
discover API usage that needs improvement and to provide recommendations. A
key limitation of this technique is that the user cannot detect all APIs;
only those that need improvement are reported.
Souza et al. [60] adopt a different approach to evaluate API usage in a
program. The tool presented in their work is twofold: it assesses API
usability using complexity metrics and presents the information to the user
through a visualization module. They use software metrics instead of
usability methods to compute API complexity. The visualization module is
based on the treemap technique, where each API's complexity is shown in a
different color, while the hierarchical information depicts the method,
class, and package of an API. Additionally, they use a star plot to show
the classes in a package; each star depicts several metrics, and a longer
arm of a star indicates that the corresponding class is more complex.
However, the tool still needs proper evaluation via testing with standard
benchmark tools.
Mileva et al. [61] utilize an approach based on the "wisdom of crowds" to
analyze the popularity and usage patterns of APIs. They carry out a static
analysis of over 200 open source projects from SourceForge and Apache to observe the
frequency with which an API is used across projects. They developed a
prototype tool to analyze the information and plot the results. This type
of information would assist a developer in discovering whether a particular
API is used by other developers or whether its usage has declined due to
other issues. Nevertheless, their approach is based on static analysis; the
API's behavior and performance during program execution are not considered,
so the developer cannot learn the actual reason for an API's abandonment.
Lämmel et al. [56] propose a scalable technique for API usage analysis of
open source Java projects with a large code corpus. Based on the abstract
syntax tree (AST), they present an automatic approach whose steps include
checkout, tagging with metadata, analysis, and synthesis. The authors are
more motivated towards API migration with a high degree of relevance and
applicability. However, the presented technique is yet to be tested on
large commercial programs. Yet another static approach has been presented by
Bauer et al. [62] to analyze the dependencies of a project on externally
used third-party APIs. The technique extracts information from the source
code and then uses visualization to help the user gain insight. The project
searches for API usage through an AST-based approach using the Eclipse Java
compiler, where each class of a project is inspected for a number of API
calls. Moreover, a visualization module is used to support the user's
decision making about API dependencies during software maintenance. The
visualization shows the total number of calls from a project's source code
to various external APIs, with the packages depicted in decreasing order of
API calls. The tool is evaluated using a case study of three software
applications with around one thousand lines of source code. The approach is
effective in quantifying the APIs of a software system; however, the actual
behavior of APIs is visible only from execution-time snapshots.
Additionally, the visualization part needs to be evaluated through user
studies.
Khan et al. propose an object invocation-based model and a clustering
technique to find inefficient API usage in Java programs [20]. Their
approach extracts runtime information from a Java program using
instrumentation. Moreover, hierarchical agglomerative clustering is
utilized to classify the trace data and identify the code locations
responsible for inefficient API usage. The approach is evaluated with a
single case study only. A different approach to this topic has been adopted
by Moritz et al. [63] to automate the process of API usage identification
and visualization. The major theme of their work is the representation of
software as a relational topic model; moreover, a document network is used
to depict the API calls. ExPort is an interactive tool based on this model
that lets the developer search the code visually according to their needs.
A call-graph view of the visualization assists the developer in searching
APIs effectively and efficiently, and the technique enables the developer
to find API usage across several functions of the software. The authors
claim that a database was created using over 13000 open source projects,
while the prototype tool was demonstrated on two software systems
consisting of over 3700 methods. However, their research has some
limitations, as the tool was not evaluated for usability or effectiveness.
The tool also requires an understanding of the software's architecture and
structure to formulate queries. Furthermore, the tool is based on static
analysis and does not reveal the runtime behavior of these APIs. Recently,
Said et al. [64] proposed a generalized technique for mining multi-level
API usage patterns based on source code analysis. The objective of this
technique is to analyze the different methods of a particular API that a
client program collectively uses. A clustering algorithm is used to form a
hierarchical structure for the API's documentation. The technique is
evaluated with four APIs, using 22 client programs for each API. However,
the technique is limited by its static approach.
Recently, Caserta et al. [21] proposed JBInsTrace, a tool for Java program
profiling and analysis based on bytecode instrumentation. They use a
fine-grained technique to trace the classes of a Java program, including
the Java runtime environment (JRE) classes. The tool traces the program at
the basic-block level and extracts runtime information. Furthermore, this
runtime information is then combined with statically collected data to perform
a detailed analysis of a program. The authors claim that the tool runs with
reasonable runtime overhead. Nevertheless, the tool has some limitations:
including the JRE classes in trace extraction makes the data huge, which
must be handled for full program analysis. Moreover, the authors did not
focus on how to effectively analyze the information to perform detailed
analysis. The tool is demonstrated on only five Java programs to evaluate
its performance, and the actual trace information is not reported. A more
recent work in this area has been presented by Lengauer et al. [33]. They
present a detailed report on the design and implementation of an
instrumentation technique to extract runtime information about Java
objects. The authors propose a lightweight memory monitor to trace object
allocations and de-allocations; to this end, the Java virtual machine
itself is monitored. A novel technique keeps the runtime overhead minimal
through a compact binary trace. Moreover, they also propose a method to
reconstruct the omitted runtime information for a specific trace. An
offline analysis of the collected trace builds the layout of the Java heap
at different timestamps. The performance is evaluated using over 30
benchmark tools. However, the tool is not evaluated for usefulness.
Additionally, visualization support for effectively gaining insight into
the large volume of information is also missing. Yin et al. [65] present an
instrumentation tool called PAST. They formulate an approach based on
probes at the abstract syntax tree (AST) level to identify the locations
that need to be tracked in the fully optimized target program. Their
technique works in contrast to instruction-level instrumentation, where
debugging information is used to trace the program. A prototype tool
implements their technique using both offline and online instrumentation.
The work presented in this dissertation focuses on capturing data related
to object instantiation during program execution, and it also provides a
visualization of the runtime data for a better understanding of the
software's code. The previous work discussed in this section has two major
limitations: first, most of it was not based on dynamic analysis, and second, no
suitable visualization was provided. In contrast, this work aims to capture
the dynamic data of programs and to provide a visualization that assists
the programmer more effectively. The core difference from the prior work is
that the object-creation hierarchy of APIs was previously overlooked. The
specific goal of this work is to help programmers visually understand which
packages and classes of a Java program are responsible for creating
collection-type objects at runtime, and to further help them in program
understanding and performance analysis. Instrumentation is used to collect
the runtime information of the software, without degrading the execution
time of the software being traced.
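The kind of per-creator counts such instrumentation yields can be illustrated with a plain wrapper. This is only a stand-in sketch, with all class and method names invented for the example; the actual approach uses bytecode-level instrumentation and does not require routing allocations through a factory like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for an instrumentation probe: it records which
// caller created how many collection objects, the raw material from which
// a per-package/per-class creation hierarchy can be visualized.
public class AllocTracker {
    private static final Map<String, Integer> counts = new ConcurrentHashMap<>();

    // Record which class requested the collection, then hand one back.
    public static <T> List<T> newList(String callerClass) {
        counts.merge(callerClass + "->ArrayList", 1, Integer::sum);
        return new ArrayList<>();
    }

    // Sorted view of the counts, e.g., for building a treemap hierarchy.
    public static Map<String, Integer> snapshot() {
        return new TreeMap<>(counts);
    }
}
```

After a traced run, the snapshot holds entries such as a hypothetical "com.example.Foo->ArrayList" mapped to its creation count, which is exactly the shape of data a treemap over packages and classes needs.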
2.2 Automatic visualization prediction
Visualization techniques refer to the creation of visual images and
animations to communicate information effectively and efficiently.
Visualization techniques can be divided into two types: scientific
visualization and information visualization. Scientific visualization
depicts some physical phenomenon, e.g., surface visualization [66], flow
visualization [67], or volume visualization [68]. Information visualization
techniques focus on abstract data, e.g., software visualization [35] [69],
security visualization [70], and network visualization [71]. In the
literature, several information visualization techniques have been
presented, including parallel coordinates, treemaps, and maps. The focus of
this research is on how to select an appropriate visualization technique
for a specific dataset. This section provides a comprehensive overview of
the related literature on information visualization, its classification,
and automatic prediction. Past research in the information visualization
domain shows several categorizations and taxonomies established on various
criteria. Shneiderman et al. [72] formulate a task by data type (TTT)
taxonomy of information visualization techniques. The taxonomy is based on
the data types of the dataset to be visualized, along with the tasks that may be
performed on the data using that particular visualization technique. The
author incorporates seven data types for the classification of
visualizations: 1-, 2-, and 3-dimensional data, temporal and
multidimensional data, and data having network or tree relationships.
Furthermore, seven tasks to which the data types are mapped have been used,
i.e., overview, zoom, filter, details-on-demand, relate, history, and
extract, known as the information seeking mantra. These tasks are defined
at an abstract level, and there may be additional tasks based on these
seven. However, their work is more theoretical and has not been used to
classify visualization techniques.

Figure 2.2 Information visualization techniques classification
Chi et al. [73] take a different approach to the classification of
visualization techniques by using the data state reference model. The data
state reference model consists of four data stages and three transformation
operators. This taxonomy is process-centric and operates along the
visualization pipeline, covering all the operators needed for a design. The
author shows the transformations involved in each state along the
visualization pipeline, from data values to visualization design, and
argues that the taxonomy is helpful in understanding the design space as
well as the application of visualization techniques in a broader sense across the
pipeline. Chi's taxonomy and reference model are more general and can be
utilized in both the scientific and information visualization domains. A
large number of examples from several visualization domains are provided to
explain the proposed taxonomy. Nevertheless, the taxonomy needs to be
updated as new visualization techniques are introduced. Keim et al. [74]
use other guidelines to classify information visualization and visual data
mining techniques. They suggest three criteria for classification, using
related examples: the data to be visualized, the visualization technique,
and the interaction and distortion techniques. The underlying dataset to be
visualized has a significant impact on the visualization to be used.
Consequently, the author categorizes datasets into six types:
one-dimensional, two-dimensional, multidimensional, text/web, hierarchical,
and software data. The visual display techniques, in turn, are divided into
five types: standard 2D/3D, geometrically transformed, dense pixel,
icon-based, and stacked displays. The user interaction techniques are also
utilized to classify visual techniques into six categories: standard,
dynamic projection, interactive filtering, interactive zooming, interactive
distortion, and interactive linking and brushing. However, despite the
clear explanation, the proposed rationale is more theoretical and needs to
be implemented in a real working system to automatically suggest
visualizations for future use. Marty [70] uses the relevant task, along
with the dataset and visualization technique, to identify an appropriate
visualization in a given context. The author argues that selecting a
visualization for a dataset with some intended task is not a trivial
undertaking. The selection depends on several factors, including the number
of instances in the dataset, the total number of attributes, and the data
types of the variables. Furthermore, the author identifies four types of
tasks (relationships, distribution, trends, and comparison) that also
contribute to the selection of a suitable visual technique. Marty's
contribution is directed more toward network security related data and
visualization tasks. However, the idea can be extended to other domains and
visualization techniques through rigorous study.
Furthermore, several visualization selection techniques have been suggested
in the literature. Lange et al. propose a prototype tool, Vis-Wizz [75], to
assist the user in the visualization selection process. Although the tool
targets scientific visualization of multi-dimensional datasets, the idea is
equally applicable to information visualization. The visualization
recommendation is based on data characteristics combined with the
visualization goals a user needs to accomplish. Furthermore, the system
also provides an evaluation of the resultant visualization against the
user's intended aims. The limitations of this system include the difficulty
of its user interface and the fact that the user is overwhelmed by the
mapping of data to visual attributes. Grinstein et al. [76] present a basis
for the benchmarking and evaluation of visualization techniques for
knowledge discovery and data mining. They provide general rules for further
advancement in the area of automatic visualization selection for a specific
dataset. The authors empirically evaluate five visualization techniques on
nine datasets to formulate general criteria for standard evaluation in data
mining using visual representations. Moreover, the nine datasets are
evaluated against different intrinsic features, considered as tasks
existing in each dataset, i.e., outliers, clusters, class clusters,
important features, possible rules, and exact rules. Similarly, the
selection of datasets for the study is based on complexity, attributed to
factors such as dimension, number of records, cardinality of dimensions,
and number of independent variables.
Nevertheless, the proposed benchmark rules need further improvement to
incorporate new visualizations, tasks, and datasets. Guettala et al. [77]
present an automatic process for visualization selection and optimization
in visual data mining. The selection of a suitable visualization is based
on a model of the underlying dataset combined with the user's objectives. A
prototype assistant tool provides several mappings from data attributes to
potential visualizations. The interface allows users to interactively
select a visualization based on simple heuristics. Once the visualization
is selected, the system uses an interactive genetic algorithm to optimize the visualization for
effectiveness. This allows the user to customize the selected visualization
and improve the mapping through visually supported interactive interfaces.
However, the system is non-trivial for a user with limited knowledge of
visual mapping.
Laaser et al. [78] propose a rule-based system for automatic visualization
selection based on data and corresponding metadata. The system uses
predefined rules to select the most suitable visualization, and its
properties, for a specific dataset. The representation of rules in the
system is flexible and perspective-dependent; consequently, the
visualization selected for the same dataset may differ. Several properties
are incorporated as rules in the system, i.e., the number of columns, the
total number of values to be visualized, and the data types in the dataset.
Furthermore, some other characteristics of the dataset, i.e., data
properties, dimensions, and relationships among the data values, are also
utilized. The system allows the user to override all rules or to select any
other visualization technique. However, the system requires domain
knowledge to design the rules, and designing and incorporating rules is not
trivial. Additionally, the system is provided with business-related
visualizations only; the addition of a new visualization will necessitate a
change in the mapping mechanism between rules and visualization techniques.
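To make the mechanism concrete, a toy rule set in the spirit of such a system might look as follows. The rules and technique names below are invented for illustration and are far simpler than the metadata-driven rules of [78].

```java
// Illustrative rule-based chooser: predefined if-then rules map dataset
// metadata (column count, hierarchy, temporality) to a candidate technique.
// The rules and technique names are toy assumptions for this example.
public class RuleBasedSelector {
    public static String select(int numColumns, boolean hierarchical, boolean temporal) {
        if (hierarchical) return "treemap";              // hierarchical data fits space-filling nesting
        if (temporal) return "line chart";               // time-indexed values fit a trend line
        if (numColumns > 4) return "parallel coordinates"; // many attributes fit parallel axes
        return "scatter plot";                           // default for few-attribute tabular data
    }
}
```

A user override, as described above, would simply replace the returned choice with the user's preferred technique; the fragility the text notes is visible even here, since adding a new technique means editing the rule chain itself.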
Recently, a visualization ranking-based approach has been presented by
Chronister et al. [79]. The basic idea is to put a human in the loop: users
select among a set of visualizations that the system provides. The set of
visualizations is ranked according to a fitness score, a measure of
suitability for the present context. The user then selects the most
appropriate visualization technique to apply to the dataset. The fitness
score is computed for a visualization using the metadata of both the
visualization and the dataset. The visualization metadata may include
properties such as the supported data types, while the dataset's metadata
includes characteristics such as the data types, the number of attributes,
and the number of instances.
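A metadata-matching score of this kind can be sketched in a few lines. The fields, the scoring rule, and the penalty below are invented assumptions for illustration; the actual scoring in [79] differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative metadata-matching ranker: each visualization advertises the
// data types it supports and a rough attribute limit; the fitness counts
// matches against the dataset's column types. All fields are assumptions.
public class VisRanker {
    public static class VisMeta {
        public final String name;
        public final Set<String> supportedTypes;
        public final int maxAttributes;
        public VisMeta(String name, Set<String> types, int maxAttributes) {
            this.name = name; this.supportedTypes = types; this.maxAttributes = maxAttributes;
        }
    }

    // +1 per dataset column type the visualization supports; a penalty if
    // the dataset has more attributes than the technique comfortably handles.
    public static int fitness(VisMeta v, List<String> columnTypes) {
        int s = 0;
        for (String t : columnTypes) if (v.supportedTypes.contains(t)) s++;
        if (columnTypes.size() > v.maxAttributes) s -= 2;
        return s;
    }

    // Rank candidates by descending fitness, as the system would present them.
    public static List<String> rank(List<VisMeta> candidates, List<String> columnTypes) {
        List<VisMeta> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> fitness(b, columnTypes) - fitness(a, columnTypes));
        List<String> names = new ArrayList<>();
        for (VisMeta v : sorted) names.add(v.name);
        return names;
    }
}
```

The human-in-the-loop step then happens on top of this ranking: the system proposes the ordered list, and the user picks from it.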
As evident from the above survey, and to the best of our knowledge, neural
networks have not yet been used for the classification and prediction of
visualization techniques. The most closely related work on finding an
appropriate visualization is available in [78, 79], where rule-based
systems are used. It has also been noted that selecting an appropriate
visualization technique beforehand is a great help in understanding the
data being visualized [80]. Based on this, the system proposed in this work
utilizes an ANN to select a particular visualization technique for a given
dataset.
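The core idea of feeding dataset metadata to a network and reading off a technique can be sketched as a single linear layer with an argmax readout. The feature choice, class labels, and hand-set weights below are toy assumptions standing in for a trained network; this is not the ANN developed in this dissertation.

```java
// Toy sketch: dataset metadata features go through one linear layer and the
// highest-scoring class is returned. The class labels are illustrative, and
// the weights would come from training in a real ANN, not be hand-set.
public class VisPredictor {
    static final String[] CLASSES = {"treemap", "parallel coordinates", "scatter plot"};

    // One linear layer (scores = W * features + b) followed by argmax.
    public static String predict(double[] features, double[][] w, double[] b) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < w.length; k++) {
            double s = b[k];
            for (int j = 0; j < features.length; j++) s += w[k][j] * features[j];
            if (s > bestScore) { bestScore = s; best = k; }
        }
        return CLASSES[best];
    }
}
```

With two metadata features, say a has-hierarchy flag and an attribute count, a weight row that responds to the hierarchy flag makes the hierarchical class win whenever that flag is set; training replaces such hand-tuning with weights learned from labeled dataset-visualization pairs.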
2.3 Visualization optimization
The study of information visualization is a major research area and has
been applied to an assortment of applications, ranging from biomedical data
to physics experiments. Quantifying a visualization and optimizing its
layout is an emerging subfield of information visualization. Since the
quantification of a visualization is subjective, the problem is complex.
This section reviews recent work on quantifying visualizations, information
visualization using treemaps, and techniques for optimizing visualizations.
Tanahashi et al. [81] propose an optimization technique to produce
aesthetically pleasing and legible storyline visualizations. Their
technique consists of two main parts: an algorithmic part, which performs
the layout design for the visualization, and a rule part, which improves
the aesthetics by adjusting the line geometry using commonly agreed rules.
To optimize the layout of a storyline visualization, the authors use three
types of quality metrics, i.e., line wiggles, line crossovers, and
white-space gaps between lines. Additionally, the visual quality metrics
are combined with the general design principles of storyline
visualization. A genetic algorithm-based
computation approach is implemented to find the optimal layout, using the
quality metrics as the fitness function. The output of this process is a
legible storyline visualization. Two properties of a storyline
visualization, the flow of the lines and their clarity, influence how
aesthetic and legible it is perceived to be; the techniques are therefore
applied to improve aesthetics and legibility by enhancing visual flow and
clarity. However, the technique is focused on the optimization of storyline
visualizations only.
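A fitness function of this kind can be written as a weighted penalty sum that the genetic algorithm minimizes. The weights and the way defects are counted here are illustrative assumptions, not the actual metrics of [81].

```java
// Illustrative weighted-penalty fitness for a storyline layout: a GA would
// minimize this over candidate layouts. Weights and defect counts are toy
// assumptions standing in for the paper's quality metrics.
public class StorylineFitness {
    // Lower is better: each term penalizes one kind of layout defect
    // (line wiggles, line crossovers, white-space gaps between lines).
    public static double fitness(int wiggles, int crossovers, double whitespaceGap,
                                 double wWiggle, double wCross, double wGap) {
        return wWiggle * wiggles + wCross * crossovers + wGap * whitespaceGap;
    }

    // Selection step of the GA prefers the layout with the lower penalty.
    public static boolean better(double fitnessA, double fitnessB) {
        return fitnessA < fitnessB;
    }
}
```

In a GA loop, candidate layouts would be mutated and recombined, each scored with this function, and the lowest-penalty layouts kept for the next generation.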
House et al. [82] propose another approach for optimizing complex
visualizations for perceptual and aesthetic properties. The method has two
stages. The first stage utilizes a genetic algorithm with a
human-in-the-loop strategy to explore a large visualization parameter
space; this stage ends with a large database of visualizations rated
according to an objective-function score. The second stage is devised to
extract optimal visualization parameters from the database built during the
first stage. Principal component analysis, neural networks, and clustering
techniques are used at this stage for data mining purposes. The second
stage results in guidelines for producing visualizations and strategies for
making them more aesthetic.
A semi-automated assistant tool for creating perceptually appealing
visualizations of complex multidimensional datasets is presented by Healey
et al. [83]. The tool, called VIA, consists of two engines: a search engine
and an evaluation engine. The search engine is based on a real-time search
algorithm that collects basic information about a dataset from the user and
utilizes it to generate a suitable mapping between data values and visual
attributes. A prospective user may guide the search engine to evolve the
current result toward a more expedient mapping. Furthermore, the evaluation
engine, actually a collection of several engines, is used to build a more
appropriate visualization for a dataset and its related tasks.
The evaluation engine uses knowledge from studies of human vision to weight
candidate mappings and find the optimal result. The main disadvantage of
this work is its limited set of visual features. Similarly, domain
knowledge is required to get started with the search engine, so the process
may be limited by the user's capability and knowledge of the problem.
The work presented by Marrero et al. [84] systematically reviews and
classifies several visualization metrics. The classification is based on
the dataset structure, the purpose of the visualization, and the user's
expertise and related tasks. Three types of visualization metrics are
discussed in their work: mathematical, user-centric, and visual-efficiency
metrics. Each of these metrics is based on features related to various
aspects of visualization. Such quantification criteria help in building
better visualizations, both subjectively and objectively.
Rigau et al. [85] proposed an aesthetic measure based on Shannon's
information theory [86] and Kolmogorov complexity [87]. The basic
concept is to relate the quality of art to its level of complexity. The work
revisits Birkhoff's aesthetic measure from an information-theoretic
perspective. Three measures from Bense's concept are taken into
consideration: the initial repertoire (the basic states), the palette used, and
the range of colors selected by the artist. This work also focuses on
quantifying aesthetic quality in paintings [85]. An analogous work by Li
et al. [88] evaluates the visual quality of paintings. Their work approaches
the problem with machine learning and a data-driven approach. The
technique starts by extracting features describing the visual quality of the
artistic work. An experiment is conducted to compare human survey
results with computational models that classify aesthetic paintings from
non-aesthetic ones. The authors argue that there is a relationship between
subjective human perception and the computational model of the problem.
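Birkhoff's classical measure relates aesthetic value to the ratio of order to complexity, and Rigau et al. estimate such terms with information-theoretic quantities. The sketch below is illustrative only, not their implementation: it uses the Shannon entropy of a discrete color palette as a simple complexity proxy. The `palette_entropy` helper and the toy color lists are assumptions.

```python
import math
from collections import Counter

def palette_entropy(pixels):
    """Shannon entropy (in bits) of a discrete color palette.

    Higher entropy indicates a richer repertoire of colors, which an
    information-theoretic aesthetic measure can treat as complexity.
    """
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A flat, single-color image has zero entropy; a varied palette has more.
print(palette_entropy(["red"] * 8))                     # 0.0
print(palette_entropy(["red", "green", "blue", "red"]))  # 1.5
```

A full aesthetic measure would combine such a complexity term with an order term; this fragment shows only how an entropy-based ingredient can be computed.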
An interesting work is presented by Huang et al. [12] to measure the
effectiveness of graph visualization in the context of human cognitive load
and visual efficiency. The main objective is to overcome the limitations of
performance-based measures in graph visualization. The model presents
two new measures of effectiveness: mental effort and visualization
efficiency. The model reveals that, in a visualization environment, there is
a strong relationship among the mental effort required for a task, its
memory requirements, and overall task performance. User studies have
been conducted to evaluate the model's cognitive measures on tasks of
various complexities.
Matvienko et al. [89] propose an evaluation method for dense vector field
(DVF) visualization based on an image visual quality metric. The
technique rests on the average similarity between the DVF visualization
image and the underlying dataset, i.e., the vector field. For a given vector
field, images are evaluated automatically using different parameter sets, a
variety of DVF methods, and quality-improvement strategies. The local
image gradient is taken as the visual quality measure to compare two
images. A user survey with 53 subjects has been conducted to evaluate
the effectiveness of the quality metric. Nevertheless, the technique focuses
on scientific visualization.
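As an illustration of a gradient-based quality measure, the sketch below averages local gradient magnitudes of a grayscale image using forward differences. This is an assumption for illustration, not Matvienko et al.'s actual metric; the `mean_gradient_magnitude` helper and the toy images are hypothetical.

```python
def mean_gradient_magnitude(img):
    """Average local gradient magnitude of a 2-D grayscale image
    (forward differences), used here as a rough stand-in for a
    gradient-based visual quality measure."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x + 1] - img[y][x]  # horizontal difference
            gy = img[y + 1][x] - img[y][x]  # vertical difference
            total += (gx * gx + gy * gy) ** 0.5
            count += 1
    return total / count

flat = [[0.5] * 4 for _ in range(4)]                     # featureless image
stripes = [[x % 2 for x in range(4)] for _ in range(4)]  # high-contrast image
print(mean_gradient_magnitude(flat))     # 0.0
print(mean_gradient_magnitude(stripes))  # 1.0
```

A featureless image scores zero, while a high-contrast image scores higher, matching the intuition that gradient strength correlates with visible structure.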
More recently, a study has been conducted by Lehmann et al. [90] to
examine the relationship between human perception and quality metrics.
The study applies seven quality metrics to various high-dimensional
datasets. More than 100 subjects performed the experiment, carrying out
tasks on three types of visualization: parallel coordinates, scatterplots,
and radial visualization.
Aydin et al. [91] propose a framework for the automatic rating and
evaluation of photographic images. The work provides an objective
assessment of images by considering meaningful aesthetic attributes.
Five aesthetic attributes are utilized, based on general selection criteria:
sharpness, depth, clarity, colorfulness, and tones. Computationally
measured metrics are then defined on the basis of these aesthetic
attributes; together they form an objective analysis collectively depicted
by an image signature. The system is evaluated by comparing the
subjective and objective prediction results. However, the work is limited
to the expressiveness aspect of images only.
Moere et al. [92] carried out another study, using an online survey, to
evaluate how visualization style impacts the insight gained from an
information visualization. Three demonstrators are used in their
experiment: an analytical style demonstrator (ANA), a magazine style
demonstrator (MAG), and an artistic style demonstrator (ART). The study
shows that, despite clear differences in usability, usefulness, and
enjoyability, no variation was found in insight in terms of self-reported
depth, expert-rated depth, and resulting difficulty. The work presented by
Demiralp et al. [93] introduces distance matrices, called perceptual
kernels, for the evaluation of visualization design. The kernels are used in
automatic visualization design that is perceptually pleasing. The
perceptual kernels are derived from visual parameters, i.e., shape, size,
and color, combined with their aggregate perceptual effect, and are
compared with existing perceptual models.
Earlier, a metric for aesthetic label layouts had been proposed by
Hartmann et al. in [94]. Their approach considers both internal and
external labels for optimization using aesthetic attributes. The automated
system finds all labels in a visual item and classifies them as internal or
external. The classification scheme and layout algorithm mitigate the side
effects of conflicting requirements and produce a more readable and
aesthetic visual layout. The system focuses on pedagogy.
Pargnostics, a model presented in [95], quantifies the visual structure of
the parallel coordinates visualization technique. The model intends to fill
the gap between the visual representation and the user's tasks. Pargnostics
focuses on screen-space metrics, i.e., pixels, rather than on design issues.
The system provides users with ranked parallel-coordinates views from
which to choose an appropriate visualization; the selected visualization is
then optimized using the metrics. Another work on perception-based
quality assessment, using scatterplot visualization, is presented in [96]. A
perceptual model is constructed for various projections in a perceptual
space. The approach draws on studies in psychophysics and
multidimensional scaling. The projections are then ranked using an
estimated suitability value for the specific user task on the dataset.
Moreover, distances in the ranking space are optimized against a
scatterplot quality metric to make it comparable with the perceptual
model. However, the technique is human-centric and needs proper
training, as well as further evaluation to examine its applicability across
visualization applications.
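Line crossings between adjacent axes are one screen-space quantity a pargnostics-style model can measure: two polylines cross exactly when the order of their values is inverted between the two axes. A minimal sketch (the `line_crossings` helper is a hypothetical illustration, not the authors' code):

```python
def line_crossings(left, right):
    """Count crossings between polylines spanning two adjacent
    parallel-coordinates axes. Items i and j cross iff their order on
    the left axis is inverted on the right axis."""
    n = len(left)
    crossings = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (left[i] - left[j]) * (right[i] - right[j]) < 0:
                crossings += 1
    return crossings

# Perfectly correlated axes produce no crossings; fully inverted order
# produces the maximum n*(n-1)/2.
print(line_crossings([1, 2, 3], [10, 20, 30]))  # 0
print(line_crossings([1, 2, 3], [30, 20, 10]))  # 3
```

Such counts can then drive axis reordering, since fewer crossings generally mean a less cluttered plot.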
Kong et al. [97] present a set of perceptual guidelines for creating
treemap visualizations. The study conducts a series of controlled
experiments to produce guidelines for designing perceptually better
treemap-based visualizations. The experiments were deployed via
crowdsourcing through Amazon's Mechanical Turk (MTurk) and examine
the impact of perceptual parameters, i.e., aspect ratio, luminance, and data
density, on value-estimation tasks. The study empirically demonstrates
that the aspect ratio is correlated with area judgment, although this is not
the case for luminance. Moreover, treemap-based visualizations have a
higher data density than hierarchical bar charts. The authors conclude
with guidelines on aspect ratio, density, and luminance for creating
perceptually better treemaps.
In the literature, several comprehensive surveys of visualization quality
metrics for high-dimensional data are presented, such as [98] and [16].
Both articles use a systematic approach to categorize the quality metrics
and related concepts. However, Lin et al.'s work [98] focuses more on the
prediction of perceptual quality in photographs, while Bertini et al.'s work
[16] reviews data visualization research.
Lam et al. [24] derive seven guiding scenarios for evaluating information
visualization from a comprehensive literature analysis. Their work
incorporates an exhaustive study of over 800 articles published in various
venues to find the common linkage between evaluation goals and adopted
approaches. This work was subsequently extended by Isenberg et al. [99]
to general visualization scenarios. Their systematic review of literature
spanning over a decade shows that trends in evaluation techniques change
over the years.
A different methodology is presented by Harrison et al. [100] to model
the perception of correlation in nine common visualizations using
Weber's law. The experiment was carried out online using crowdsourcing
techniques. The study shows that models based on Weber's law provide a
concise technique for quantifying, ranking, and comparing the perceptual
effectiveness of visualizations. The models also expose symmetries and
asymmetries arising from performance differences during the evaluation
process; these characteristics relate to the visual features of the
visualizations. Furthermore, the models lay a foundation for developing
benchmarks to explore how perceptual laws impact the design elements of
visualization.
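Weber's law states that the just-noticeable difference (JND) in a stimulus grows in proportion to the stimulus magnitude. A minimal illustration, where the Weber fraction `k` is a hypothetical value:

```python
def weber_jnd(stimulus, k):
    """Weber's law: the just-noticeable difference is proportional
    to the stimulus magnitude, JND = k * S."""
    return k * stimulus

# With a Weber fraction of 0.1, a stimulus of 50 units needs a change
# of about 5 units before the difference is perceived; a stimulus of
# 200 units needs about 20.
print(weber_jnd(50, 0.1))   # 5.0
print(weber_jnd(200, 0.1))  # 20.0
```

Fitting such a proportional model to discrimination data is what allows a single fitted constant to summarize, and therefore rank, how precisely each visualization conveys a quantity.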
Cawthon et al. [101] investigate the relationship between a visualization's
aesthetics and its usability through empirical analysis. The study presents
two metrics, task abandonment and erroneous response time, to capture
aesthetics. An online survey with 285 participants and eleven
visualization techniques was performed to measure the perceived aesthetic
quality of the presented visualizations. The participants were asked to
label each visualization with one of three ranks: ugly, neutral, or
beautiful. The survey also asked the users about the dataset and the
visualization's effectiveness for a particular task.
In contrast to aesthetics and effectiveness, another important aspect is
insight, a literal purpose of visualization [102]. North et al. [103] focus on
the insight provided by a visualization and how to evaluate it
quantitatively. The limitations of task-centric evaluation based on
controlled experiments are elaborated. The authors state that the insight
gained from a visualization depends highly on the user's motivation,
domain knowledge, and interest in the underlying data. Consequently,
insight is difficult to measure and needs a direct method for its evaluation.
Moreover, the article lists some general characteristics of insight, e.g.,
complex, deep, qualitative, unexpected, and relevant.
Recently, Borkin et al. [104] presented another metric for information
utility and effective visualization design: the memorability of a
visualization. A systematic, large-scale online study was conducted over
hundreds of visualizations, which were categorized by type and
investigated for the factors that make a visualization memorable. The
results reveal that design attributes such as color, the inclusion of human-
recognizable visual shapes, a low data-to-ink ratio, and a high visual
density enhance the memorability of a visualization. Moreover, unique
and unexpected visualizations are more memorable than common
visualization types. The findings of the study are a first step toward
effective visualization designs that provide users with greater information
utility.
Modern datasets are large and complex; therefore, the visual exploration
of such massive data is tedious. Schneidewind et al. [105] present an
automated approach to optimize pixel-based visualizations. Cluster
analysis is used to partition the data and find important parameters in the
dataset, followed by image-analysis techniques and a ranking process.
The system determines the optimal parameter settings for visual data
exploration.
Fuchs et al. [106] propose interactive visual analysis combined with
machine learning techniques to support insight from the user's
perspective. The main objective is to automatically generate and verify
hypotheses for visual exploration of massive datasets. A set of fitness
criteria is used with a genetic algorithm (GA) to find the best hypothesis
in a large search space. These visual hypotheses are formalized using
fuzzy logic and an evolutionary algorithm to investigate and explain
features of the data. The fitness function used with the GA considers the
influence of feature similarity, individuality, and complexity. Random
selection is used to add individuals to the GA population.
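A minimal sketch of such a GA fitness evaluation follows, assuming hypothetical hypothesis features and weights; the real criteria and weights in [106] are not reproduced here.

```python
import random

def fitness(hypothesis, w_sim=0.5, w_ind=0.3, w_cmp=0.2):
    """Illustrative fitness combining feature similarity and
    individuality (rewarded) with complexity (penalized).
    All weights here are assumed values."""
    return (w_sim * hypothesis["similarity"]
            + w_ind * hypothesis["individuality"]
            - w_cmp * hypothesis["complexity"])

def random_hypothesis():
    # Randomly generated individual added to the GA population,
    # mirroring the random selection described above.
    return {k: random.random()
            for k in ("similarity", "individuality", "complexity")}

population = [random_hypothesis() for _ in range(20)]
best = max(population, key=fitness)  # selection step of one generation
print(round(fitness(best), 3))
```

A complete GA would additionally apply crossover and mutation over many generations; this fragment shows only how a weighted multi-criteria fitness ranks candidate hypotheses.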
A computational model of human vision for visualization evaluation and
optimization is presented by Pineo et al. [8]. A quality metric called
effectiveness is simulated based on a model of the perceptual processes of
the human retina and primary visual cortex. The effectiveness metric is
then used as the fitness function in a hill-climbing optimization. The
model is evaluated using two different flow visualizations. Its utility is
twofold: first, the model bridges the gulf between perceptual theories and
visualization design guidelines; second, an automated assistance tool for
visualization production may be built on these recommendations.
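Hill climbing with an effectiveness metric as the objective can be sketched generically as follows. This is a toy illustration, not Pineo et al.'s implementation; the effectiveness function and neighborhood are hypothetical.

```python
import random

def hill_climb(layout, effectiveness, neighbors, steps=1000):
    """Generic hill climbing: repeatedly try a random neighbor and
    move to it whenever it scores higher on the effectiveness metric."""
    best, best_score = layout, effectiveness(layout)
    for _ in range(steps):
        cand = random.choice(neighbors(best))
        score = effectiveness(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Toy example: a 1-D "effectiveness" surface peaking at x = 7.
eff = lambda x: -(x - 7) ** 2
nbrs = lambda x: [x - 1, x + 1]
print(hill_climb(0, eff, nbrs))  # converges toward (7, 0)
```

The same loop applies to visualization layouts once `effectiveness` is replaced by a perceptually grounded metric and `neighbors` by small layout perturbations; like any hill climber, it can stall in a local optimum.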
Elmqvist et al. [107] discuss visualization optimization from the
perspective of color schemes for perceptual issues. The technique
dynamically optimizes color schemes based on a set of sampling lenses
over a user-specified region. It also provides a visual search facility for
the ubiquitous low-level user task of effectively gaining insight. The
technique is implemented in two prototypes using OpenGL and
programmable graphics processing units, and several case studies have
been conducted for both information visualization and image inspection
applications. More recently, Lee et al. [108] present the concept of class
visibility to measure the utility of color optimization for data
visualization. The algorithm quantitatively enhances the composition of a
color palette and comparatively reduces the visual weight of large groups.
Experiments with the technique on two real-world datasets confirm the
effectiveness of the visibility metric for visual representations.
The work in [81] focuses on storyline visualization; however, its
approach is generic and can be applied to any visualization technique.
The work in [81] uses classical GAs, which risk getting stuck in a local
optimum. Similarly, the work in [85] focuses on quantifying the aesthetic
quality of paintings. The approach presented in this work quantifies a
visualization technique and uses an EA to find an optimal layout of a
particular visualization technique. The metric in [89] applies only to
images generated via DVF methods. From the aforementioned literature
survey, and to the best of our knowledge, there is no work on a set of
metrics that can measure the quality of multiple visualization techniques.
The major aim of this work is to introduce a generic measure to quantify a
visualization technique and later to use the same measure to optimize the
layout of a visualization technique. Figure 2.5 summarizes this chapter in
tabulated form.
2.4 Chapter Summary
In this chapter, the major related work from the literature was discussed.
The work was categorized into three aspects: dynamic code analysis and
software visualization, automatic visualization, and visualization
optimization. The first part of the chapter focused on several techniques
and tools from the area of dynamic code analysis through bytecode
instrumentation and software visualization. Furthermore, the work on
program runtime analysis and API analysis was reviewed and discussed
along with its limitations. This was followed by a review of past work on
visualization classification and the automatic prediction of suitable
visualizations. The last part of the chapter comprehensively analyzed
previous work on visualization optimization, visualization metrics, and
the aesthetic and perceptual properties of visualization. The following
chapter presents the first part of the proposed work: a solution to
automatically select an appropriate visualization technique based on
given metadata about the data and the task that a user is required to
perform. The appropriate visualization is predicted by an artificial neural
network (ANN)-based model which classifies the input data into one of
eight predefined classes.
Figure 2.5 Overview of the literature review
Chapter 3 Data visualization technique selection
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 46
On selecting a data visualization
technique
“The best way to predict the future is to study the past, or
prognosticate.” Robert Kiyosaki
Advances in computing technology have been instrumental in
creating an assortment of powerful information visualization
techniques. However, the selection of a suitable and effective
visualization technique for a specific dataset and a data
mining task is not trivial. This work automatically selects an
appropriate visualization technique based on the given
metadata and the task that a user intends to perform. The
appropriate visualization is predicted based on an artificial
neural network (ANN)-based model which classifies the input
data into one of the eight predefined classes. A purpose-built
dataset extracted from the existing knowledge in the discipline
is utilized to train the neural network. The dataset covers
eight visualization techniques, including: histogram, line
chart, pie chart, scatter plot, parallel coordinates, map,
treemap, and linked graph. Various architectures using
different numbers of hidden units, hidden layers, and input
and output data formats have been evaluated to find the
optimal neural network architecture. The performance of
neural networks is measured using: confusion matrix,
accuracy, precision, and sensitivity of the classification.
Optimal neural network architecture is determined by
convergence time and number of iterations. The results
obtained from the best ANN architecture are compared with
five other classifiers, k-nearest neighbor, naïve Bayes, decision
tree, random forest, and support vector machine. The
proposed system outperforms four classifiers in terms of
accuracy and all five classifiers based on execution time. The
trained neural network is also tested on twenty real-world
benchmark datasets, where the proposed approach also
provides two alternate visualizations, in addition to the most
suitable one, for a particular dataset. A qualitative
comparison with the state-of-the-art approaches is also
presented. The results show that the proposed technique
assists in selecting an appropriate visualization technique for
a given dataset with high accuracy.
Information visualization is a common computer-based interactive
technique to graphically represent large volumes of data efficiently to
reinforce human cognition. The data is constantly generated in a diverse set of
fields ranging from satellite systems and physics experiments to software tools
running over an operating system. This produces billions of terabytes being
stored daily, which requires approaches to handle this flood of information
effectively [109, 110]. The problem of managing large sets of data can be
addressed from different perspectives depending on its type at hand. This may
include optimizing memory usage [111], designing hardware capable of
storing large amounts of data [112], extracting valuable information from data
[113], and visualizing the data [112]. Each of the aforementioned domains
can further be classified into many open problems. The work in this chapter
deals with the software visualization domain. Software visualization may be
a static or dynamic representation of software, based on its size,
structure, history, and behavior. With advances in the speed of computer
processing and graphics tools, new and powerful visualization techniques are
being developed. These contemporary techniques may be applied in data
mining, knowledge discovery applications, and other tasks [75, 114]. Selecting
a particular visualization technique for a pictorial representation of data has
always been subjective. An inappropriate visual representation may lead to
inadequate decision making [76, 115]. It is therefore very useful to know
which visualization technique is most appropriate for the particular task at
hand.
Deciding on a particular visualization technique is mainly guided by the
problem statement, the dataset and the tasks that are to be accomplished. A
dataset may be characterized by its properties referred to as the metadata.
Some of these characteristics include data dimensions, data types, number of
attributes, multivariate attributes, etc. Similarly, various types of tasks, e.g.,
relationship, distribution, and trends need to be accomplished on the data [76,
116]. Frequently, users are not interested in such system peculiarities; they
simply want visualizations built automatically from the data characteristics
and the desired tasks. However, no explicit mapping between datasets and
particular visualization techniques exists. The literature does suggest that
some visualization techniques are more suitable than others for a specific
data type, data dimension, or task [115, 117, 118]. Hence, visualization
techniques may be classified on the basis of the data they visualize in
conjunction with the tasks to be performed. This classification provides the
basic knowledge with which future data can be mapped to the best-suited
visualization technique. The remedy is thus twofold: a metadata-driven
classification of existing visualization techniques, and an accurate
prediction system that assigns new data to the appropriate classes.
An approach to select an appropriate visualization technique based on the
metadata and the task that a user intends to perform is presented. The
appropriate visualization is selected through an artificial neural network
(ANN)-based model which classifies the input data into one of the given
eight classes. The
major bottleneck in this problem is the unavailability of such datasets with
sufficient training samples. With a careful investigation of the literature a
dataset consisting of the metadata and corresponding visualization technique
was custom-built. A single record of the dataset has the attributes: data
dimensions, number of attributes, type of primary attribute to be visualized,
and the task. After pre-processing of the dataset, different neural network
architectures are trained and tested with supervised learning techniques. The
results are validated using neural network performance evaluation metrics.
Further, the performance of ANN is compared with k-nearest neighbor (k-
NN), naïve Bayes, decision tree, random forest, and support vector machine
(SVM). The results are also compared with the current state-of-the-art
approaches [78, 79]. Along with the single best choice, the proposed approach is
extended to provide more flexibility. The system automatically selects the three
best visualizations based on the ranking of the ANN output layer neurons’
values. This aspect of the proposed system is experimentally checked with
twenty real-world benchmark datasets. The proposed system can be utilized
as an intelligent assistant in current word/data-processing software packages
to help the user select an appropriate visualization technique. At present,
such packages simply provide a list of visualization techniques to the user
without taking into consideration the actual data and its metadata. The
proposed solution, in contrast, considers both the actual data and its
metadata before recommending a visualization technique, which helps in
taking an informed decision.
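The three-best selection from the ANN output layer can be sketched as follows, assuming hypothetical output activations and a softmax ranking; the class order and values are illustrative only.

```python
import math

CLASSES = ["histogram", "pie chart", "map", "treemap",
           "parallel coordinates", "scatter plot", "linked graph", "line chart"]

def top_k_visualizations(outputs, k=3):
    """Rank the ANN output-layer values with a softmax and return the
    k most suitable visualization techniques."""
    m = max(outputs)
    exps = [math.exp(o - m) for o in outputs]  # subtract max for stability
    total = sum(exps)
    probs = [(e / total, c) for e, c in zip(exps, CLASSES)]
    return [c for _, c in sorted(probs, reverse=True)[:k]]

# Hypothetical output activations for one metadata record.
print(top_k_visualizations([2.1, 0.3, -1.0, 0.8, 1.5, -0.2, 0.1, 0.4]))
# → ['histogram', 'parallel coordinates', 'treemap']
```

Returning a ranked short list instead of a single class is what gives the user the two alternate visualizations alongside the most suitable one.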
3.1 Proposed system, dataset, and visualization techniques
To classify and predict a visualization technique, the proposed approach
consists of a series of steps. Initially, a dataset is created that later trains the
ANN. A wide range of literature from an assortment of disciplines is included
to create the dataset. Using this corpus, a dataset comprising six columns
and 400 rows is created. The last column of the dataset is the class label,
i.e., the visualization technique. The data is classified into one of eight
classes, each representing a particular visualization technique (details on
the dataset are covered in Section 3.1.1). The dataset is pre-processed to
remove irregularities and to provide uniformly scaled input variables to the
network. The ANN is trained using a supervised learning method, with the
input presented as a vector of elements yielding one of the eight
visualization techniques explained in Section 3.1.2. The classifier's results
are validated using the test data. Several network architectures, varying in
the number of layers and hidden nodes, are evaluated on network
performance to determine an optimal classifier structure. The system's
components and structure are shown in Fig. 3.1.

Figure 3.1 System working for the visualization prediction
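The evaluation measures used to validate the classifiers (confusion matrix, accuracy, precision, and sensitivity) can be computed as sketched below. The 3-class matrix here is hypothetical; the real system has eight classes.

```python
def metrics_from_confusion(cm, cls):
    """Accuracy, precision, and sensitivity (recall) for one class of
    a multi-class confusion matrix cm[actual][predicted]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))       # diagonal hits
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(n)) - tp     # column minus diagonal
    fn = sum(cm[cls][c] for c in range(n)) - tp     # row minus diagonal
    accuracy = correct / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, sensitivity

# Hypothetical 3-class confusion matrix (rows: actual, cols: predicted).
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 1, 9]]
print(metrics_from_confusion(cm, 0))  # → (0.8, 0.8, 0.8)
```

Accuracy is shared across classes, while precision and sensitivity are computed per class from the matrix column and row, respectively.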
The rest of this section discusses the dataset built for the proposed system
and reviews the eight visualization techniques, i.e., line chart, pie chart,
parallel coordinates, scatter plot, histogram, linked graph, map, and
treemap. These techniques were chosen for their widespread use.
3.1.1. Building the dataset
Literature on information visualization shows that there is no standard
dataset consisting of metadata and corresponding data. The only
information available is how various datasets are visualized, or which
visualization techniques are suitable for a given data type [76]. A dataset
is therefore formed on the basis of the knowledge available about data
characteristics and the underlying visualization techniques. The dataset
comprises the metadata utilized in recently published work and the
corresponding visualization techniques. Knowledge is also gathered about
the tasks relevant to specific data [70, 76, 78]. The metadata consists of:
data dimensions, primary variable type, number of attributes in the
dataset, number of records/rows to be visualized, and the relevant task to
be carried out. This chapter focuses on four tasks, namely relationship,
trends, distribution, and comparison. The dataset consists of 400 items,
each having five metadata attributes and a sixth attribute referring to the
visualization technique suitable for that record; the sixth attribute
becomes the class label for the classifiers. Eight visualization techniques
are considered in constructing the dataset, and these become the eight
classes considered by the classifier. The classes are labelled histogram,
pie chart, map, treemap, parallel coordinates, scatter plot, linked graph,
and line chart, having 105, 55, 29, 81, 56, 23, 25, and 26 samples,
respectively. The numbers of samples for a particular visualization
technique differ because they depend on its prevalence in contemporary
work. The dataset1 is mathematically shown below. Fig. 3.2
shows the mapping between the metadata and the tasks.

Let D = {d_1, d_2, …, d_n} be the set of metadata instances, T = {t_1, t_2, …, t_m} the set of tasks, and V = {v_1, v_2, …, v_k} the set of visualization techniques. The mapping is

f : D × T → V, with each record expressed as the tuple <d, t, v>.  (3.1)

Each instance d_i ∈ D is represented by its metadata features, each t_j ∈ T is one of the tasks, and each v_l ∈ V is one of the visualization techniques. This makes n × m × k possible combinations of D, T, and V:

D × T × V = S_1 ∪ S_2, where S_1 and S_2 are two disjoint sets,
S_1 = {<d, t, v> : f(d, t) = v}, i.e., the valid mappings, and
S_2 = {<d, t, v> : f(d, t) ≠ v}, i.e., the invalid ones,
where d = d_1, d_2, …, t = t_1, t_2, …, and v = v_1, v_2, …

1 http://ming.org.pk/datasets.htm
Figure 3.2 Visualization, tasks and metadata mapping
D represents the set of metadata, where each d_i has its own domain, T is
the set of tasks, and V is the set of visualization techniques. A relation is
then found between D and T using a suitable mapping into set V, given by
Eq. 3.1. A simple instance of the dataset is: dimension = 1, number of
attributes = 1, up to 100 instances, and a continuous primary variable;
given the distribution task, the target visualization is a histogram.
Standard pre-processing techniques (listed in Section 5) are applied to the
dataset before it is submitted to the neural network. Table 3.1 shows the
characteristics of the dataset. Further detail on the attributes and tasks can
be found in [6, 70, 116].
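For illustration, the example instance above can be encoded as a single labeled record. The field names and values here are assumptions about the encoding, not the published dataset format.

```python
# One record of the purpose-built dataset: five metadata attributes
# plus the class label (the suitable visualization technique).
record = {
    "dimension": "one dimension",
    "num_attributes": 1,
    "num_instances": 100,
    "primary_variable": "continuous",
    "task": "distribution",
    "visualization": "histogram",   # class label, the sixth attribute
}

# The mapping f : D x T -> V reads: metadata and task in, technique out.
features = tuple(v for k, v in record.items() if k != "visualization")
label = record["visualization"]
print(features, "->", label)
```

Four hundred such records, one per reviewed article, form the training corpus for the classifiers.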
Table 3.1 Dataset description

Attribute name        | Values/Range
--------------------- | ------------------------------------------------------
Dimension             | One dimension, Two dimensions, Three or more dimensions, Hierarchical
Primary variable      | Ordinal, Continuous, Categorical, Geographical
Tasks                 | Relationship, Trends, Distribution, Comparison
No. of attributes     | Numerical value
No. of instances      | Numeric value
Target/Visualization  | Histogram, Pie Chart, Line Chart, Parallel Coordinates, Scatter Plot, Linked graph, Map, Treemap
To create the test data, published articles from diverse fields utilizing the
eight visualization techniques considered in this work were reviewed.
Pertinent information, such as the dimensions of the data, primary
variables, dataset size, attributes used for visualization, and the task to be
accomplished through visualization, was extracted to construct the
dataset. This required an extensive search over a large number of articles,
selected on the basis of domain diversity, relevance to the subject, and
citation count. Table 3.2 summarizes the key articles considered for
constructing the dataset, along with their domains and cumulative
citations.
Table 3.2 Key articles used in dataset construction

Visualization        | Key articles | Domains                                  | Cumulative citations (as of July 2016)
-------------------- | ------------ | ---------------------------------------- | --------------------------------------
Histogram            | [67-69]      | Swarm Intelligence, Soft Computing       | 524
Pie Chart            | [70-72]      | Soft Computing, Genomics                 | 4908
Map                  | [63, 73, 74] | Scientific visualization, Bioinformatics | 828
Treemap              | [75-77]      | Software Engineering                     | 816
Parallel Coordinates | [39, 78, 79] | Databases, Clustering                    | 693
Scatter Plot         | [80-82]      | Life Sciences, Behaviour Analysis        | 11949
Linked graph         | [83-84]      | Physics, Mathematics                     | 205
Line Chart           | [85-87]      | Chemistry, Biology, Medicine             | 826
3.1.2. Visualization techniques
Several visualization techniques have been presented over the years. As
technology advances, older techniques are replaced by newer and more
sophisticated approaches. However, the baseline remains the same: the use of
a particular visualization technique is subject to the task at hand. Therefore,
none of the techniques is universal [6, 14]. In this work, eight visualization
techniques are selected to perform classification. All of these techniques are
well known and characterized to visualize a particular type of data. The
selection of these eight visualization techniques is based on their wide use in
information visualization and other multidisciplinary domains. Table 3.3 lists
various applications of these techniques for information visualization and
interdisciplinary usage.
Table 3.3 The eight visualization techniques used in recent literature
Visualization Type Reference
Histogram [120-122]
Pie Chart [121, 123, 124]
Map [121, 125]
Treemap [43, 126]
Parallel Coordinates [127-130]
Scatter plot [121, 131]
Linked graph [121, 131, 132]
Line Chart [132, 133]
More specific visualization techniques also exist in contemporary work
[115]. Due to their specialized nature, these visualizations are not considered
here. This section discusses the eight selected visualization techniques.
Histogram: A histogram, shown in Figure 3.3a, is normally used with
continuous variables and shows their distribution. The histogram is used in
many statistical analyses and other applications [134]. In the current
perspective, if the intended task is distribution and the primary variable is
continuous, the target visualization should be a histogram. It is a common
visualization found in many statistical applications and in software packages
supporting programming languages.

Figure 3.1 The eight visualizations used as class labels
Pie chart: The pie chart, developed in the early 19th century [135], is another
common technique to visualize the distribution of quantities as parts of an
overall share (Figure 3.3b). Pie charts are well suited to categorical variables
and support the distribution task in many statistical and drawing
applications. A major limitation of pie charts is that they can visualize data in
only one or two dimensions.
Line chart: To find trends in data, the line chart (Figure 3.3c) is one of the
best options available and is suitable for ordinal variables. It also behaves
well for small continuous and discrete datasets. Line charts are commonly
used for business data mining tasks [136, 137], and the chart is also available
in Google's online visualization library [133].
Parallel Coordinates: Parallel coordinates is a visualization technique [127,
128] used for multidimensional and multivariate data, where the data may
comprise hundreds of attributes, as shown in Figure 3.3d. Parallel
coordinates are useful for comparison and for showing relationships across
dimensions. The technique also shows the distribution of different variables
along several dimensions. The original technique is enhanced in [129, 130].
Scatter Plot: The scatter plot is a visualization technique built on the
Cartesian plane. A point on the plane shows the values of two variables, and
typically both variables are continuous or ordinal [6]. If the data has more
than two dimensions, additional dimensions can be shown using a different
color or size of the point in the 2D plane (Figure 3.3e).
Linked graph: The linked graph, shown in Figure 3.3f, is a widely used
visualization for data having hierarchical or network relationships. One
advantage of the linked graph is that it can visualize many types of data. The
main difficulty is laying out large graphs on a single screen. The graph
consists of a set of nodes and a set of edges, where edges link the
corresponding nodes. Several dimensions of the data may be shown using the
position, size, and color of a node [6, 138].
Map: The map is a special type of visualization technique used to display
spatial or physical distributions in two-dimensional space (Figure 3.3g).
Mostly, the data has only one dimension and can be used with different
variables. We choose the geographical type of data to be classified with the
class label map. Color and size of the distribution area are used
simultaneously to encode different information [70].
Treemap: The treemap is a popular space-filling approach used to visualize
hierarchical data on a single screen [43]. Multidimensional data of any type
can be visualized with treemaps to show hierarchical relationships in the
data. Treemaps cover the entire screen with nested rectangles to show the
hierarchical structure (Figure 3.3h). Several applications and online facilities
are available for treemap visualization [139, 140].

Figure 3.2 (a) Neural Network (b) Neuron
3.2 Artificial neural network preliminaries
The artificial neural network (ANN) is a soft computing technique inspired
by the biological neural system and a model of human brain cognition. The
ANN is an adaptive system. Its functionality differs from that of traditional
digital computers, and it works in parallel, as human neurons do [141, 142].
ANNs have applications in diverse areas such as engineering, medicine,
computer games [114], and banking, for classification and control [30]. A
neural network identifies and learns from the patterns presented to the
network in the form of inputs and corresponding target outputs. Typically,
ANNs consist of one input layer and one output layer, and may have one or
more hidden layers depending on the network architecture [143]. The number
of neurons in each layer may differ and depends on the problem, and is often
determined empirically by trial and error. One of the many types of neural
networks is the feed-forward neural network (FFNN). Data in an FFNN
travels only in the forward direction from one layer to the next, in contrast to
feedback or recurrent networks, where a signal may travel backward or
between neurons of the same layer. Multilayer neural networks are those
consisting of one or more hidden layers, known as the multilayer
perceptron (MLP) [144, 145].
The basic structure of a neural network is shown in Figure 3.4a, and the
basic structure of a neuron is shown in Figure 3.4b. As shown in Figure 3.4a,
each neuron receives a weighted input from the predecessor layer and gives
an output to the neurons in the successor layer. The weights of each layer are
adjusted to achieve the network goal. Each processing neuron consists of an
input function and an activation function, as shown in Figure 3.4b. The input
function sums the weighted inputs (other input functions are possible), and
the activation function introduces non-linearity and sends the result as an
output to the next layer. If a neuron has an input vector X and a weight
vector W,

X = [x1, x2, x3, …],  W = [w1, w2, w3, …],

then the neuron output O can be calculated as in Eq. (3.5), where tanh is the
activation function:

O = tanh( Σi wi xi ).  (3.5)
Several activation functions can be used with a neural network, including
linear, sigmoid, and hyperbolic tangent, where sigmoid and tanh are used for
non-linear neurons. The mathematical form of these functions is listed in
Eq. (3.6) - Eq. (3.8), where x is the net input and f(x) is the output:

f(x) = x  (3.6)

f(x) = sigmoid(x) = 1 / (1 + e^(−x))  (3.7)

f(x) = tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x))  (3.8)
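As an illustrative sketch (not part of the MATLAB implementation used in this work), the three activation functions of Eq. (3.6) - Eq. (3.8) can be written in Python; the function names are ours:

```python
import math

# Activation functions from Eq. (3.6)-(3.8).
def linear(x):
    return x                                  # Eq. (3.6), identity

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))         # Eq. (3.7), output in (0, 1)

def tanh_act(x):
    # Eq. (3.8); algebraically equivalent to math.tanh(x), output in (-1, 1)
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

print(linear(0.5), sigmoid(0.0), tanh_act(1.0))
```

The sigmoid and tanh forms are the standard closed-form expressions; tanh_act agrees with the library tanh to machine precision.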
Broadly, there are two learning methods, i.e., supervised learning and
unsupervised learning. The supervised learning method provides an input
pattern as well as the target output to the network. A common supervised
learning algorithm is backpropagation, where the network error propagates in
the backward direction to adjust the weights and minimize the error. Another
modern algorithm is Levenberg-Marquardt (LM), which tries to minimize the
mean square error. LM is the fastest training algorithm for a regular-sized
network, although it consumes more memory than other methods. The basic
function of the algorithm is to minimize the error E defined in Eq. (3.9),
where ti is the target output and oi is the network output:

E = Σi (ti − oi)².  (3.9)
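A minimal sketch of the error of Eq. (3.9); the function and variable names are illustrative:

```python
# Sum-of-squared-errors E from Eq. (3.9), computed over paired target and
# output values. Names (targets, outputs) are ours, not from the text.
def sse(targets, outputs):
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

print(sse([1.0, 0.0], [0.9, 0.2]))  # small residual error for a near-correct output
```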
3.3 Experiments and results
This section describes the experiments conducted and the results obtained.
The data was pre-processed first. Later, different architectures of the ANN are
tested for better accuracy. Additionally, to compare the ANN accuracy with
other classifiers, we use five other classifiers: k-nearest neighbor (k-NN),
naïve Bayes, decision tree, random forest, and SVM. We have also compared
our results with two state-of-the-art approaches in [78, 79]. The class labels
available in the dataset are used to measure the classifiers' accuracy.
Measuring accuracy against the dataset's class labels helps evaluate each
classifier's results against the earlier, published usage of the predicted
visualization. This assures that the choice of a predicted visualization is not
random. However, to avoid any bias due to the custom-built dataset,
Section 5.6 shows the classifiers' performance on 20 benchmark datasets.
As mentioned earlier, pre-processing is required before the data is used for
training the ANN. The pre-processing step is important to clean the data of
noise and irregularities before it is presented to the network. Pre-processing
includes input/output coding, normalization, and scaling. Some data are in a
form not suitable for the neural network to operate on, since a neural
network only accepts data in numerical form and produces output in a
specific range depending on the activation function. The dataset used in this
chapter consists of six attributes and 400 instances. Each instance of the
dataset represents the metadata of a dataset used for visualization. The first
attribute shows the number of dimensions using the alphanumeric values 1D,
2D, n-D, or hierarchical. The second attribute is numerical, showing the
number of attributes. The third attribute is also numeric, indicating the
number of items in the dataset. The fourth attribute is alphabetic, indicating
the primary variable type; possible types include ordinal, categorical,
continuous, and geographical. The fifth attribute is also alphabetic, showing
the task to be accomplished through visualization; its possible values are
distribution, relationship, comparison, and trends. Finally, the sixth attribute
shows one of the possible visualization techniques: histogram, pie chart,
map, treemap, parallel coordinates, scatter plot, linked graph, or line chart.
This leaves only two of the five attributes (excluding the last attribute, i.e.,
the class label) with numeric values. Thus, the non-numeric values must be
encoded as numeric so that they can be presented to the ANN. Additionally,
once the alphanumeric attributes are
converted to their numeric equivalents, the range of values may differ across
the five attributes. Figure 3.5 shows that some variables in the original
dataset have much larger values than others, indicating that the input values
are not on the same scale. To handle this, scaling is required. We scale all the
attributes using Eq. (3.10).
x_s,i = (x_i − x_min) / (x_max − x_min)  (3.10)

where x_i, x_s,i, x_max, and x_min denote the original value, the scaled
value, the maximum, and the minimum, respectively. The attributes are first
scaled using Eq. (3.10) and later the data is normalized. Normalization
confines all the input values to a range suitable for the activation function.
The formula used for normalization is listed in Eq. (3.11), where [a, b] is the
target range:

x_n,i = (b − a) × (x_i − x_min) / (x_max − x_min) + a.  (3.11)
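Eq. (3.10) and the reconstructed Eq. (3.11) can be sketched as below; defaulting the target range [a, b] to [−1, 1] to match the tan-sigmoid activation is our assumption, not a detail stated in the text:

```python
# Min-max scaling (Eq. 3.10) and range normalization (Eq. 3.11).
# Note that Eq. (3.11) reduces to Eq. (3.10) when a = 0 and b = 1.
def scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def normalize(xs, a=-1.0, b=1.0):
    # Assumed default range [-1, 1], suitable for a tanh-style activation.
    lo, hi = min(xs), max(xs)
    return [(b - a) * (x - lo) / (hi - lo) + a for x in xs]

print(scale([1, 2, 3]))      # [0.0, 0.5, 1.0]
print(normalize([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```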
A further requirement concerns nominal input variables, since NNs favour
numerical or binary values [45, 46]. To transform nominal values into
numeric ones, we first used simple integer coding, e.g., tasks are coded as
{distribution = 1, relationship = 2, trend = 3, comparison = 4}.
Figure 3.5 Dataset larger values vs. smaller values
The results of experiments with the aforementioned encoding scheme were
not promising, so we later used another encoding scheme known as 1-of-n
encoding. To illustrate, a simple encoding for the variable task is
{distribution = 0001, relationship = 0010, trend = 0100, comparison = 1000}.
Other nominal input values are processed similarly before they are fed into
the ANN.
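The 1-of-n scheme can be sketched as follows; the category order produced here is illustrative and need not match the bit strings listed above:

```python
# 1-of-n (one-hot) encoding for the nominal "task" attribute, following the
# scheme described in the text. Category order is illustrative.
TASKS = ["distribution", "relationship", "trend", "comparison"]

def one_hot(value, categories=TASKS):
    # Exactly one position is set to 1, the one matching the nominal value.
    return [1 if value == c else 0 for c in categories]

print(one_hot("relationship"))  # one bit set per category
```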
3.3.1. ANN Experiments
At this stage, the training dataset is ready to be classified through the
neural network. For the experiments we used MATLAB [144]. There are five
neurons in the input layer, each corresponding to one attribute of the dataset,
and eight neurons in the output layer, one for each of the eight class labels
(the visualization techniques). The activation function is tan-sigmoid and the
connection weights are in the range −2 to 2. In the experimental phase,
several ANN structures with one and two hidden layers are first tried.
Experiments using five other classifiers are also performed, and the results
obtained with these classifiers and the best ANN are compared. This section
also provides a qualitative comparison of the state-of-the-art approaches
presented in [78, 79] and the best ANN.
The initial parameters selected for the various ANN structures, including the
learning rate, activation function, performance function, and goal, are listed
in Table 3.5. The dataset consists of 400 items, of which 70% is used for
training and 30% for testing. However, in cases where a validation set is
used, the dataset is randomly divided into 60% for training, 20% for testing,
and 20% for validation. The dataset is shuffled before training. For training,
input patterns and the desired outputs are presented to the network, whereas
for testing, only input patterns are presented. All experiments are carried out
on an Intel Core i5 machine with a 2.5 GHz processor and 4 GB RAM.
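One forward pass through such a 5-14-8 structure (tan-sigmoid hidden layer, linear output layer) can be sketched as below; the random stand-in weights are ours for illustration, not the trained MATLAB network:

```python
import math
import random

# Sketch of a forward pass through a 5-14-8 feed-forward network.
random.seed(0)
N_IN, N_HID, N_OUT = 5, 14, 8
# Connection weights drawn from the range [-2, 2] mentioned in the text.
w_hid = [[random.uniform(-2, 2) for _ in range(N_IN)] for _ in range(N_HID)]
w_out = [[random.uniform(-2, 2) for _ in range(N_HID)] for _ in range(N_OUT)]

def forward(x):
    # tan-sigmoid hidden layer followed by a linear output layer.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

scores = forward([0.2, -1.0, 0.5, 1.0, -0.3])      # one pre-processed instance
predicted_class = scores.index(max(scores))        # winning visualization index
print(len(scores), predicted_class)
```

The highest-scoring of the eight output neurons is taken as the predicted visualization technique.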
Initially, only a single hidden layer structure is used (Figure 3.6a), with one
to 25 hidden layer neurons and the input and output layer neurons fixed at
five and eight, respectively. At the same time, we perform the experiments with and
Table 3.5 Network structure and initial parameters

1 hidden layer
  Hidden neurons: 1-25
  Activation function, hidden layer: tan-sigmoid
  Activation function, output layer: linear
  Performance function: MSE
  Learning rate: 0.01
  Train/validation/test ratio: 60% / 20% / 20% (random division)
  Maximum epochs: 500
  Performance goal: 0.01

2 hidden layers
  Structures: 05-05-02-8, 05-05-05-8, 05-07-05-8, 05-08-05-8, 05-08-07-8,
  05-12-11-8, 05-15-10-8, 05-20-14-8, 05-24-16-8, 05-30-20-8
  Activation function, hidden layers 1 and 2: tan-sigmoid
  Activation function, output layer: linear
  Performance function: MSE
  Learning rate: 0.01
  Train/validation/test ratio: 60% / 20% / 20% (random division)
  Maximum epochs: 500
  Performance goal: 0.01
without the validation checks, with the same parameters as shown in Table
3.5. In the first case, we trained the network for 200 epochs by providing the
training and test sets of the data. As the number of hidden neurons increases
from 1 to 25, the network takes a different amount of time to complete 200
epochs. The training and test mean squared errors (MSE) are reported with
the time taken for each network; see the Appendix. It is observed that
increasing the number of neurons gives a better MSE. However, beyond 16
neurons, further additions have a negative effect on the network, as shown in
Figure 3.7d. Figures 3.7a and 3.7b show the training and testing results of
two network structures with 10 and 23 hidden neurons, respectively. The
solid line shows the training MSE and the dashed line the test MSE. With 10
neurons, the network is well trained and both curves remain close. Although
the second network's training MSE shows improvement, its testing results
are poor, due to overfitting of the network with a large number of neurons.
Complete experimental results for the various structures are presented in the
Appendix. Figure 3.7c shows the best case with 12 neurons, where the test
MSE drops below the training MSE, with an overall accuracy of 98%.
To validate the network, another experiment is performed with the data
randomly divided into three sets: training, validation, and test, with 60%,
20%, and 20%. The goal (MSE) threshold is set to 0.01, and all other
parameters, including the ANN structure, remain the same. The validation
check forces training to stop early, before reaching the maximum number of
epochs, and thus prevents the situation discussed in the previous case. With
too few neurons (only two, Figure 3.8a), the MSEs for training, validation,
and testing are very close and remain constant after the first epoch. At epoch
45, the validation error crosses the threshold (6 errors per epoch), and hence
the training stops. Since the MSE in this case is 0.0794, higher than the set
goal, the network is not adopted. In the other case, as the network grows
larger (as shown in Figure 3.8c), training stops after 18 epochs. At this stage
the training MSE is 0.0097 and the validation MSE is 0.0077. This implies
that the network is
properly trained at this point. As the number of hidden layer neurons
increases to 25, the gap between training and validation increases, as shown
in Figure 3.8b. Training in this case stops at epoch 9, as the validation error
increases and the testing curve starts deviating from the training curve.
Figure 3.6 (a) Single hidden layer network (b) 2-hidden layered network
Figure 3.7 (a) 10 neurons (b) 23 neurons (c) best case with 12 neurons (d) no. of nodes vs. MSE
Figure 3.8 (a) 2 hidden nodes (b) 25 hidden nodes (c) 14 hidden nodes
Chapter 3 Data visualization technique selection
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 71
Two hidden layered models of the ANN are also implemented. For most
applications, networks with one hidden layer are sufficient. However, to find
the best model, we experimented with two hidden layers as well. The two
hidden layer network architecture is shown in Figure 3.6b, and the detailed
results of the experiment are given in the Appendix. We maintained the
initial configuration described in Table 3.5. Validation errors are higher for
fewer neurons in the hidden layers, as Figure 3.9a shows for the MSE of the
05-08-07-8 NN. Figure 3.9b shows the MSE for the 05-24-16-8 NN
structure; in this case, the three curves are adjacent to each other. A
comparison of the different two hidden layered ANN structures is shown in
Figure 3.9c. As the number of neurons in the hidden layers increases, the
validation error declines. However, neural networks with a very large
number of neurons take more computation time while showing no significant
improvement in MSE. The NN structure with 24 neurons in the first hidden
layer and 16 in the second shows the best average MSE. As the number of
neurons increases beyond the 05-30-20-8 structure, the network computation
time increases with no significant improvement in MSE.
3.3.2. The n-fold cross-validation
n-fold cross-validation is another popular technique for training a
classifier. In this experiment, n-fold cross-validation is used with n set to 10.
Cross-validation is considered more robust and reliable and is well suited to
generalized error estimation. The n-fold cross-validation technique is also
more effective for a small number of training samples. The basic property of
this method is the randomness added to the training samples, as the method
does not use fixed partitions of the training and testing sets. In contrast to the
fixed split strategy, the whole dataset is divided into n disjoint sets known as
folds; each time, n−1 folds are used for training and the remaining fold is
used for testing the model. The model's accuracy is computed by averaging
over all runs. The 10-fold cross-validation method on several architectures
with one and two hidden layers
having various numbers of neurons is evaluated. The training, testing, and
validation MSEs are reported in the Appendix, together with the training
and testing accuracies for the one hidden layer structures.
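The fold construction described above can be sketched as follows for the 400-item dataset; the shuffling seed is arbitrary:

```python
import random

# Sketch of n-fold cross-validation splitting (n = 10): shuffle the indices,
# cut them into n disjoint folds, then train on n-1 folds and test on the rest.
def kfold_indices(n_items, n_folds=10, seed=0):
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)                    # randomness per run
    folds = [idx[i::n_folds] for i in range(n_folds)]   # n disjoint folds
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

splits = list(kfold_indices(400))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 rounds, 360 train, 40 test
```

The model's accuracy would then be averaged over the ten rounds, as described above.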
Figure 3.9 Two hidden layered structure analysis
Figure 3.10 Accuracy for different number of nodes in hidden layer
Figure 3.11 Hidden nodes vs. MSEs
Figure 3.10 compares the training and testing accuracies of the one hidden
layer NN structures with one to 25 hidden nodes using the 10-fold cross-
validation method. The maximum training accuracy achieved is 98.61%,
while on the test data the highest accuracy of 97.50% is achieved for several
hidden layer sizes. Figure 3.11 compares the training, testing, and validation
MSEs of the model across several numbers of hidden neurons. As the
number of neurons in the hidden layer increases, the error rate decreases to a
level beyond which the change is insignificant.
3.3.3. Performance analysis
Experiments with the various one hidden layered ANN structures produce
different results. The network MSE declines as more neurons are added to
the hidden layer in the one hidden layered NN architecture. However,
adding too many neurons has the opposite effect, causing the gap between
the training and test MSE to increase due to growing validation errors. As
shown in Figure 3.12, the training MSE quickly reduces with the addition of
hidden nodes. The validation curve shows spikes due to the varying number
of validation errors of the network. The network with 14 hidden nodes
shows better training, validation, and test MSEs than the others. To evaluate
the classification performance, evaluation metrics based on the confusion
matrix are used. The confusion matrix defines the parameters on the basis of
correctly classified and misclassified items: true positives (TP), true negatives
(TN), false positives (FP), and false negatives (FN). From these, accuracy
(Eq. 3.12), sensitivity (Eq. 3.13), precision (Eq. 3.14), and the correlation
coefficient (R²) are used to measure the performance of the NN model.
Figure 3.12 Hidden Nodes vs. MSE for 1 hidden layered ANN
Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100  (3.12)

Sensitivity = TP / (TP + FN)  (3.13)

Precision = TP / (TP + FP)  (3.14)
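Eq. (3.12) - Eq. (3.14) can be computed directly from the confusion-matrix counts; the example counts below are made up for illustration:

```python
# Evaluation metrics from Eq. (3.12)-(3.14), computed from TP/TN/FP/FN counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn) * 100   # percentage, Eq. (3.12)

def sensitivity(tp, fn):
    return tp / (tp + fn)                          # Eq. (3.13)

def precision(tp, fp):
    return tp / (tp + fp)                          # Eq. (3.14)

# Illustrative counts only, not the dissertation's actual confusion matrix.
tp, tn, fp, fn = 39, 355, 2, 4
print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), precision(tp, fp))
```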
Since the NN outputs eight classes, the average performance over all classes
is taken for each metric. Table 3.4 presents the performance evaluation of
the different one hidden layered neural networks having an MSE of less than
0.01. The one hidden layered NN structure consisting of 14 hidden neurons
has the best performance, with 97.50% accuracy.
Table 3.4 NN performance
Network Structure Accuracy % Sensitivity % Precision % R2
05-12-8 96 95 91 0.967
05-13-8 96 96 90 0.956
05-14-8 97.50 98 92 0.978
05-15-8 95 97 92 0.953
05-16-8 95 95 90 0.931
05-08-07-8 96 97 91 0.95
05-12-11-8 94 95 88 0.957
05-15-10-8 94 95 89 0.948
05-20-14-8 96 97 92 0.943
05-24-16-8 95 97 93 0.967
05-30-20-8 97 98 92 0.968
The confusion matrix is also used for the two hidden layered structures to
extract the network evaluation parameters, i.e., accuracy, sensitivity,
precision, and correlation coefficient. The results for these parameters are
also shown in Table 3.4. Based on these experiments, the one hidden layered
ANN structure consisting of 14 hidden neurons is the best-performing
architecture on the given dataset. Figure 3.13 shows the confusion matrix of
the best NN architecture.
Output Class vs. Target Class (columns in the same order as the rows):

Output Class      Histogram  Pie Chart  Map    Treemap  Par. Coord.  Scatter  Linked  Line   Precision
Histogram         100%       20%        0      0        0            0        0       0      93%
Pie Chart         0          80%        0      0        0            0        0       0      100%
Map               0          0          100%   0        0            0        0       0      100%
Treemap           0          0          0      100%     0            0        0       0      100%
Parallel Coord.   0          0          0      0        100%         0        0       0      100%
Scatter Plot      0          0          0      0        0            100%     0       0      100%
Linked Graph      0          0          0      0        0            0        100%    0      100%
Line Chart        0          0          0      0        0            0        0       100%   100%
Per-class         100%       80%        100%   100%     100%         100%     100%    100%   97.50%
Figure 3.13 Confusion matrix of the best ANN architecture
The values of the performance metrics true positive rate (TP-rate), true
negative rate (TN-rate), false positive rate (FP-rate), and false negative rate
(FN-rate) for the best ANN model are listed in Table 3.6.
Table 3.6 Best ANN performance
Metric Value
TP-Rate 0.9700
TN-Rate 0.9956
FP-Rate 0.0043
FN-Rate 0.0300
To study the impact of various learning approaches on the ANN,
experiments are conducted with different training algorithms: Levenberg-
Marquardt, Rprop (resilient backpropagation), BFGS quasi-Newton,
GD-adaptive learning, scaled conjugate gradient, conjugate gradient with
Powell/Beale restarts, and Polak-Ribiére conjugate gradient. The comparison
is made based on average accuracy, training CPU time, and training MSE.
Table 3.7 shows that the Levenberg-Marquardt algorithm performs best in
terms of accuracy.
Table 3.7 Impact of various learning approaches on the ANN

Learning algorithm                              Average accuracy (%)   Training MSE   Training CPU time
Levenberg-Marquardt                             97.5                   0.008          2.3
Rprop                                           95                     0.021          2.7
BFGS quasi-Newton                               89                     0.023          4.5
GD-adaptive learning                            78                     0.053          2.2
Scaled conjugate gradient                       83                     0.034          1.9
Conjugate gradient with Powell/Beale restarts   96                     0.024          2.4
Polak-Ribiére conjugate gradient                95.5                   0.025          2.5
3.4 Sensitivity analysis
Sensitivity analysis is used to explore the relative importance of the model's
inputs and to check how they impact the output. In this study, sensitivity
analysis is performed using the stepwise method: the trained network is
presented with the input parameter set while omitting one parameter at a
time. The resulting behavior of the model is taken as the output, and network
measures such as the MSE, error rate, correlation coefficient (R²), and
accuracy are recorded. The combined effect of various parameters is also
reported. As shown in Table 3.8, the first two input parameters, i.e., the
dimension and the primary variable of the dataset, are the most influential in
selecting a visualization technique. After removing the
dimension parameter, the error rate increases to 21% and the MSE to 0.055.
For the primary attribute, the error rate reaches 29.50% and the MSE 0.058.
The other three input parameters have lower error rates when omitted from
the input and have almost the same influence on the output. Along with their
individual importance, the combined influence of the most important
parameters, i.e., dimension and primary attribute, is also evaluated in
comparison with the less important parameters. When the dimension and
primary attribute inputs are both left out, the error rate rises to 41%, with
only 68% correlation between the output and target values.
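The omit-one-input procedure can be sketched generically; `evaluate` here is a toy stand-in for the full train-and-test cycle, not the actual network code:

```python
# Sketch of stepwise sensitivity analysis: re-evaluate the model with one
# input feature left out and record the increase in error.
def sensitivity_analysis(features, evaluate):
    baseline = evaluate(features)
    impact = {}
    for f in features:
        reduced = [g for g in features if g != f]
        impact[f] = evaluate(reduced) - baseline   # error increase when f is omitted
    return impact

# Toy evaluate: error grows as informative features are dropped; the weights
# below are illustrative stand-ins, not the dissertation's measured values.
weights = {"dimension": 0.21, "primary": 0.29, "attrs": 0.12, "items": 0.12, "task": 0.12}
evaluate = lambda fs: 1.0 - sum(weights[f] for f in fs)
impact = sensitivity_analysis(list(weights), evaluate)
print(max(impact, key=impact.get))  # the most influential input
```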
Table 3.8 Sensitivity analysis results

Input parameter                               Error rate %   MSE     R²
Dimension                                     21.00          0.055   0.82
No. of attributes                             12.75          0.023   0.90
No. of instances                              12.75          0.021   0.93
Primary attribute                             29.50          0.058   0.73
Task                                          12.75          0.023   0.90
Dimension + Primary attribute                 41.00          0.073   0.68
No. of attributes + No. of instances + Task   14.25          0.024   0.89
3.5 Comparison with other classifiers
The best ANN architecture’s results are compared with k-nearest neighbor
(k-NN), naïve Bayes, decision tree, random forest, and SVM.
The k-NN is a non-parametric supervised learning algorithm commonly used
for classification problems [146]. In contrast to other classifiers, k-NN does
not build a generalized model from the training set; instead, the stored
training instances serve as prototypes against which new instances are
classified. The training of k-NN is fast, since the training set is only mapped
into a feature space with different regions. The k-NN classifier generally uses
the Euclidean distance to classify new instances. We used k-NN to classify
the dataset, and the overall accuracy turns out to be 84%. The individual
class performance of k-NN is shown in Table 3.11.
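A minimal k-NN with Euclidean distance, in the spirit of the description above; the toy training points and class labels are illustrative, not the 400-instance dataset:

```python
import math
from collections import Counter

# Minimal k-nearest-neighbor classifier with Euclidean distance.
def knn_predict(train, query, k=3):
    # train is a list of (feature_vector, label) pairs.
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]              # majority vote among k neighbors

train = [([0, 0], "histogram"), ([0, 1], "histogram"), ([1, 0], "histogram"),
         ([5, 5], "treemap"), ([5, 6], "treemap")]
print(knn_predict(train, [0.5, 0.5]))
```

Note the lazy-learning property: all computation happens at query time, which is why training itself is fast.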
Naïve Bayes is a probabilistic classifier based on Bayes' theorem and is used
in many complex real-world problems. It uses maximum likelihood to assign
an instance to a class, where the probability of each feature is taken
independently. However, the classifier can have low classification
performance, particularly for features with high interdependence. We used a
kernel-based naïve Bayes classifier, achieving a maximum accuracy of 81%.
The naïve Bayes classifier's individual class performance is given in
Table 3.11.
Decision tree is a hierarchical classifier. A decision tree consists of three different types of nodes, i.e., root, internal, and leaf nodes [147]. The leaf nodes are assigned class labels, while the root and internal nodes encode the decision rules used for classification. A decision tree-based classifier is applied to the dataset; Table 3.11 lists its performance.
Random forest is an ensemble classifier consisting of a large number of tree-based classifiers [148]. Each tree is built using an independent random vector of features selected from the training dataset. To classify unknown data with a random forest, every tree in the forest casts a unit vote, and the most popular class is chosen. We used five features in the input dataset and eight output classes. The random forest is trained and tested using the 10-fold cross-validation method described earlier. Forests with various numbers of trees are evaluated, and their respective overall accuracies are shown in Table 3.9. The class-wise performance of the classifier is also reported in Table 3.11.
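The tree-count experiment of Table 3.9, growing forests of different sizes and scoring each by 10-fold cross-validation, can be sketched as follows. The synthetic data and scikit-learn are assumptions for illustration, and the tree counts are a subset of those in the table.

```python
# Illustrative sketch: random forests with increasing numbers of trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)

results = {}
for n_trees in (5, 10, 50, 100):            # subset of the counts in Table 3.9
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    results[n_trees] = cross_val_score(rf, X, y, cv=10).mean()

for n_trees, acc in results.items():
    print(f"{n_trees:4d} trees: mean accuracy {acc:.3f}")
```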
Table 3.9 Overall accuracy for random forest

No. of trees   Accuracy (%)   CPU time (seconds)
5              92.50          0.302801 (average)
10             92.50
50             95.00
100            95.00
200            95.00
500            95.50
1000           95.00
SVM is a classifier originally developed to solve binary classification problems. The technique was later adapted to handle multiclass problems using various methods, i.e., one-against-all and one-against-one [149,150]. SVM is used with four different kernels: linear, polynomial, RBF, and sigmoid. The experiment is carried out with different ratios of training and testing data, as given in Table 3.10. The table shows the prediction accuracy of SVM for the four kernels with respect to the different training and test splits. Table 3.11 shows the class-wise performance of SVM.
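A minimal sketch of the kernel comparison in Table 3.10, assuming scikit-learn's SVC (whose multiclass handling is one-against-one by default) and synthetic stand-in data; the kernels and train/test ratios follow the table, everything else is illustrative.

```python
# Illustrative sketch: multiclass SVM with the four kernels of Table 3.10.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)

accuracies = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):   # poly = polynomial
    for test_size in (0.30, 0.20, 0.10):              # splits from Table 3.10
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=0, stratify=y)
        clf = SVC(kernel=kernel).fit(X_tr, y_tr)
        accuracies[(kernel, test_size)] = clf.score(X_te, y_te)

for (kernel, test_size), acc in sorted(accuracies.items()):
    print(f"{kernel:8s} test={test_size:.0%}: accuracy {acc:.3f}")
```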
Table 3.10 SVM prediction accuracy

SVM kernel             Accuracy          Accuracy          Accuracy          CPU time (seconds)
                       (70% training,    (80% training,    (90% training,
                       30% testing)      20% testing)      10% testing)
Linear (207 SVs)       85.83%            77.50%            87.50%            0.0312002
Polynomial (272 SVs)   41.67%            42.50%            40.00%
RBF (245 SVs)          76.67%            77.50%            82.50%
Sigmoid (253 SVs)      53.33%            55.00%            57.50%
Table 3.11 Per-class accuracy of different classifiers

Class label            ANN     SVM     RF      DT      NB      k-NN
Histogram              100     92.96   95.67   98      85.83   73
Pie Chart              92.3    82.41   96.5    99      80.83   69
Map                    100     100     99      99      93.33   94.75
Treemap                100     98.99   98      98      92.5    96
Parallel Coordinates   100     92      100     98      93.33   92.5
Scatter Plot           100     94.47   98.16   98      93.33   75.57
Linked graph           100     93.97   100     97      92.33   68
Line Chart             100     95      100     100     94.83   87
Average accuracy       99.03   93.73   98.79   98.37   90.79   81.98
Table 3.12 Average accuracy and CPU time of classifiers

Performance          ANN      SVM      RF         DT       NB       k-NN
Accuracy (%)         97.5     92       98         95       81       84
F-Measure            0.9634   0.9085   0.9798     0.9156   0.7867   0.8067
AUC                  0.9820   0.9619   0.9923     0.9590   0.8996   0.8698
CPU time (seconds)   0.0243   0.0312   0.302801   0.0468   0.0468   0.073601
Table 3.12 lists the overall accuracy of each classifier, along with the CPU time taken, the F-measure, and the AUC (area under the curve).
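The F-measure and AUC reported alongside accuracy could be computed as below. This is a hedged sketch: macro averaging and one-vs-rest AUC are assumptions, since the averaging scheme is not stated, and the data is synthetic.

```python
# Hedged sketch: macro F-measure and one-vs-rest AUC for a multiclass model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te), average="macro")          # assumption
auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr")
print(f"F-measure: {f1:.3f}  AUC: {auc:.3f}")
```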
The Friedman test, a non-parametric statistical test, is applied to determine whether the differences among the classifiers are significant. The significance threshold for the Friedman test is set to 0.01; p-values below this threshold indicate a statistically significant difference in favour of the ANN. Table 3.13 lists the results of the Friedman test. All comparisons are made at the 99% confidence level. As shown in Table 3.13, the ANN's accuracy is significantly better than that of the other classifiers except RF; in the case of RF, the difference in accuracy is not significant.
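A small illustration of applying the Friedman test to per-fold accuracies. SciPy's friedmanchisquare needs at least three groups, so three classifiers are compared at once here (the dissertation reports pairwise comparisons), and the fold scores are made up for illustration.

```python
# Illustrative Friedman test over hypothetical per-fold accuracies.
from scipy.stats import friedmanchisquare

# Made-up accuracies of three classifiers on the same 10 folds.
ann = [0.98, 0.97, 0.99, 0.96, 0.98, 0.97, 0.98, 0.99, 0.96, 0.97]
svm = [0.92, 0.90, 0.93, 0.91, 0.92, 0.89, 0.93, 0.92, 0.90, 0.91]
knn = [0.85, 0.83, 0.86, 0.82, 0.84, 0.83, 0.85, 0.86, 0.81, 0.84]

stat, p_value = friedmanchisquare(ann, svm, knn)
# p-values below the 0.01 threshold indicate a significant difference
# among the classifiers at the 99% confidence level.
print(f"p = {p_value:.5f}, significant: {p_value < 0.01}")
```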
The sensitivity analysis performed for the ANN in Section 3.4 is repeated for the five additional classifiers. Table 3.14 shows the results of the sensitivity analysis for each classifier. The models are more sensitive to some variables than to others: every classifier's error rate increases when the primary attribute is removed from the input data, and dropping both the primary attribute and the dimension parameter decreases the accuracy of all classifiers. This has the most pronounced effect on NB, which reaches an error rate of almost 50%. The tree-based classifier DT is comparatively less sensitive to the removal of these input parameters.
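The leave-parameters-out sensitivity analysis can be sketched as retraining with selected feature columns removed and comparing error rates. The feature names, the choice of a decision tree, and the synthetic data are illustrative assumptions.

```python
# Illustrative sensitivity analysis: drop one input parameter at a time.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["dimension", "n_attributes", "n_instances",
            "primary_attribute", "task"]   # names follow the text

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)

def error_rate(columns):
    """Mean 10-fold cross-validation error using only the given columns."""
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, columns], y, cv=10).mean()
    return 1.0 - acc

baseline = error_rate(list(range(len(FEATURES))))
results = {}
for i, name in enumerate(FEATURES):
    kept = [j for j in range(len(FEATURES)) if j != i]   # drop one parameter
    results[name] = error_rate(kept)
    print(f"without {name:17s}: error {results[name]:.3f} "
          f"(baseline {baseline:.3f})")
```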
3.6 Ranking the three best visualizations
The visualization selection methodology has been discussed throughout this chapter. However, a single recommended visualization does not always satisfy the user, especially when no option has an obvious advantage. To provide a more flexible approach, the system is enhanced to automatically select the three best visualizations out of eight. The three selected visualizations are ranked based on the neural network output; the single best visualization corresponds to the output-layer neuron with the maximum value. The modified version of the proposed approach is tested on twenty real-world benchmark datasets, selected for their diversity in terms of domain, number of attributes, number of instances, and primary variable type. These datasets are summarized in Table 3.15.
The results of this experiment are listed in Table 3.17. The trained neural network is simulated with the metadata of the 20 datasets mentioned in Table 3.15. The system provides the three best visualizations, where the first visualization in each case is the most suitable one and the other two are ranked 2nd and 3rd, respectively.
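Ranking the three best visualizations from the output layer reduces to taking the three largest activations. A minimal sketch, with class names matching the eight techniques and made-up activations for one record:

```python
# Sketch: rank the three best visualizations from output-layer activations.
import numpy as np

CLASSES = ["Histogram", "Pie Chart", "Map", "Treemap",
           "Parallel Coordinates", "Scatter Plot", "Linked graph",
           "Line Chart"]

# Hypothetical output-layer activations for one input record.
output = np.array([0.05, 0.12, 0.02, 0.08, 0.03, 0.04, 0.01, 0.65])

# Indices of the three largest activations, best first.
top3 = [CLASSES[i] for i in np.argsort(output)[::-1][:3]]
print(top3)  # → ['Line Chart', 'Pie Chart', 'Treemap']
```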
To further verify the performance of the best ANN in an unknown environment, we create a new dataset using the twenty benchmark datasets mentioned in Table 3.15. Metadata of these twenty benchmark datasets are used to predict the appropriate visualization. The attributes and their types for this newly created dataset are the same as in the original dataset explained in Section 3.1.
The dataset is then used as input to the already trained ANN model. The ANN predicts the correct output with 95.4% accuracy. Table 3.18 shows the visualizations predicted by the ANN, where each visualization is predicted for the particular task specified against the case.
Table 3.13 Accuracy comparison using Friedman test

Comparison   p-value   Confidence level (%)
ANN-SVM      0.001     99
ANN-RF       0.011     99
ANN-DT       0.001     99
ANN-NB       0.002     99
ANN-KNN      0.005     99
Table 3.14 Sensitivity analysis using classifiers: error rate (%)

Input parameter                               ANN     SVM    RF     DT     NB     k-NN
Dimension                                     21      23.7   17.4   11     30     30
No. of attributes                             12.75   18     12.3   14.5   37.5   10.5
No. of instances                              12.75   20     11.5   10.4   27.4   9
Primary attribute                             29.5    28.5   28     27     38     38
Task                                          12.75   11.4   14.2   11     21     13.5
Dimension + Primary attribute                 41      45     42.3   25     50     48.4
No. of attributes + No. of instances + Task   14.25   24     11.8   30     32     28
Table 3.15 Dataset description

Dataset                            No. of attributes   No. of instances   Variable type
Iris [2]                           4                   150                Real
Web Page Ranking [3]               5                   332                Categorical
Chess [4]                          6                   28056              Categorical
Wine [5]                           13                  178                Integer
Breast Cancer [6]                  32                  569                Real
Planning Relax [7]                 13                  182                Real
Online social media keywords [8]   35                  51                 Integer
Abalone [9]                        8                   4177               Categorical
Heart Disease [10]                 75                  303                Categorical
Cvd3 [11]                          6                   117                Categorical
Car [12]                           19                  428                Real
Flags [13]                         30                  194                Categorical
Ozone Level Detection [14]         73                  2536               Real
Bird [15]                          26                  12                 Categorical
World oil production [16]          27                  220                Categorical
P2P [17]                           6                   187                Categorical
Doctorate student by state [18]    21                  53                 Categorical
Journal articles [19]              2                   1256               Real
Internet usage [20]                72                  10104              Categorical
Image segmentation [21]            19                  2310               Integer

Dataset sources:
2  https://archive.ics.uci.edu/ml/datasets/Iris
3  http://archive.ics.uci.edu/ml/datasets/Syskill+and+Webert+Web+Page+Ratings
4  http://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29
5  http://archive.ics.uci.edu/ml/datasets/Wine
6  http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
7  http://archive.ics.uci.edu/ml/datasets/Planning+Relax
8  http://archive.ics.uci.edu/ml/datasets/Predict+keywords+activities+in+a+online+social+media
9  http://archive.ics.uci.edu/ml/datasets/Abalone
10 http://archive.ics.uci.edu/ml/datasets/Heart+Disease
11 https://www.quandl.com/data/DMDRN/
12 http://www.amstat.org/publications/jse/jse_data_archive.htm
13 http://archive.ics.uci.edu/ml/datasets/Flags
14 http://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection
15 http://ec.europa.eu/eurostat/product?code=tsien170
16 https://datamarket.com/data/set/17tl/total-world-oil-production-barrels#!ds=17tl!kqb=6&display=line
17 https://snap.stanford.edu/data/
18 trends.collegeboard.org
19 https://www.oclc.org/data/data-sets-services.en.html
20 http://archive.ics.uci.edu/ml/datasets/Internet+Usage+Data
21 http://archive.ics.uci.edu/ml/datasets/Statlog+%28Image+Segmentation%29
Table 3.17 Dataset with three best visualizations

Dataset                        Best              2nd best          3rd best
Iris                           Line Chart        Pie Chart         Histogram
Web Page Ranking               Histogram         Line Chart        TreeMap
Chess                          Linked graph      Map               TreeMap
Wine                           Line Chart        Histogram         TreeMap
Breast Cancer                  Line Chart        Histogram         Linked graph
Planning Relax                 Parallel Coord.   Map               TreeMap
Online social media keywords   Histogram         Line Chart        TreeMap
Abalone                        Parallel Coord.   Pie Chart         Map
Heart Disease                  TreeMap           Histogram         Line Chart
Cvd3                           TreeMap           Pie Chart         Scatter Plot
Car                            Line Chart        Histogram         Parallel Coord.
Flags                          Parallel Coord.   Linked graph      Pie Chart
Ozone Level Detection          Histogram         Line Chart        TreeMap
Bird                           TreeMap           Histogram         Line Chart
World oil production           Parallel Coord.   Scatter Plot      Pie Chart
P2P                            TreeMap           Histogram         Linked graph
Doctorate student by state     TreeMap           Scatter Plot      Pie Chart
Journal articles               Line Chart        Histogram         Map
Internet usage                 Pie Chart         Parallel Coord.   Map
Image segmentation             Parallel Coord.   Linked graph      Line Chart
Table 3.18 Benchmark datasets and the predicted visualization based on task

Dataset                        Task           Visualization
Iris                           Relationship   Scatterplot
Webpage Ranking                Comparison     Line Chart
Chess                          Comparison     Parallel Coordinates
Wine                           Distribution   Histogram
Breast Cancer                  Relationship   Map
Planning Relax                 Distribution   Map
Online social media keywords   Comparison     Histogram
Abalone                        Distribution   Parallel Coordinates
Heart Disease                  Relationship   Histogram
Cvd3                           Distribution   Line Chart
Car                            Comparison     Histogram
Flags                          Distribution   Histogram
Ozone Level Detection          Distribution   Histogram
Bird                           Relationship   Histogram
World oil production           Distribution   Histogram
P2P                            Relationship   Parallel Coordinates
Doctorate student by state     Trend          Parallel Coordinates
Journal articles               Distribution   Map
Internet usage                 Trend          Line Chart
Image segmentation             Distribution   Line Chart
We demonstrate the information visualization results for the iris dataset. Table A.8 in Appendix-A shows seven of the eight visualizations using the sample dataset, i.e., iris. The map visualization is not displayed for the iris dataset, since it requires spatial coordinates; we refer the reader to Figure 3.3(g) for the map visualization.
Table 3.18 lists the performance of the ANN and the five other classifiers in the unknown environment created using the 20 benchmark datasets.
3.7 Comparison with state-of-the-art
As discussed in Section 3.2, the work in [78] presents a visualization selection system based on fixed rules. The system automatically selects the visualization type and its properties from the data and the corresponding metadata, which are checked against the fixed rules. It supports the basic charts commonly used in business. In contrast, the proposed system uses a neural network to select a visualization based on the metadata. There are some fundamental differences between the neural network-based and rule-based systems. A rule-based system works on rules built by humans, whereas a neural network learns from the training data and is more efficient. A comparative study of various intelligent and expert systems is reported in [151]. The proposed approach uses metadata to select a visualization method, and the task the user needs to perform on the data is added to this metadata as well. Another limitation of a rule-based system is the domain knowledge required to build the rules, which a neural network does not need. The rules in [78] are fixed, and managing or adding new rules is itself an issue. The proposed system is more general: only training is required to predict a specific visualization. Eight visualization techniques are used in the current work, and more can be added to the system. Another advantage is that, once trained, a neural network works faster than a rule-based system.
Another closely related work on automatic visualization selection is presented in [79]. That system uses fitness scores to select the visualization type for a dataset. The fitness value is computed from the metadata of the original data as well as the metadata of each visualization type. A rule-based engine determines the fitness value for each visualization type available in the system, and the user is then offered more than one visualization option, ordered by fitness value. The implementation supports only chart visualizations, which is a limitation of the system. The system's complexity depends on the rules engine, which has to maintain both the data-mapping rules and the chart-selection rules. At the same time, the system needs to store the metadata for both the dataset and the visualization techniques. Since a quantitative comparison between the proposed system and the state-of-the-art in [78, 79] is not possible, Table 3.20 qualitatively compares the three.
3.8 Dataset extensibility
To demonstrate the extensibility of the dataset, the best performing ANN is customized and additional articles are added to the dataset. A plain data table and infographics are added as new classes. In addition, the total number of visualizations in the document is added to the dataset as an attribute. With this new attribute, all dataset instances were revised, and the records for which the total number of visualizations in the document was not available were removed to avoid noise. The representative articles used for the plain data table and infographics entries include [88], [89], and [90]. The revised dataset contains 80 records. Two neurons are added to the output layer of the best performing ANN to accommodate the newly added classes, and the customized ANN is trained on the revised dataset. This experiment resulted in an average ANN accuracy of 84%. Table 3.19 shows the average accuracy of all the classifiers on the revised dataset.
Table 3.19 Classifiers performance using revised datasets
Performance ANN SVM RF DT NB k-NN
Accuracy (%) 84 72 70 80 65 70
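One way to picture the output-layer extension described above: append weights for the two new classes to the trained output matrix, then retrain on the revised dataset. The weight shapes below are illustrative assumptions (the text does not specify the implementation), using the adopted 14-neuron hidden layer.

```python
# Conceptual sketch: extend a trained output layer from 8 to 10 classes.
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 14                            # best architecture: 14 hidden neurons
W_out = rng.normal(size=(8, n_hidden))   # stand-in for trained weights
b_out = rng.normal(size=8)

# Append randomly initialized weights for the 2 new classes (plain data
# table, infographics); the network is then retrained on the revised data.
W_out = np.vstack([W_out, rng.normal(scale=0.1, size=(2, n_hidden))])
b_out = np.concatenate([b_out, np.zeros(2)])

print(W_out.shape, b_out.shape)   # now 10 output neurons
```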
Table 3.20 Comparison between the proposed system and state-of-the-art

Feature                    Proposed system                  Rule-based system in [78]    Rank-based system in [79]
Basic working              Learning                         Human/business rules         Rules engine
Visualization selection    Metadata, tasks on data          Data, metadata               Metadata of visualization and data
Domain knowledge           Not required                     Required                     Required
Visualization types        8 different types                Common business types        Charts, maybe others
Rules management           General/learning on input data   Fixed                        Fixed
Adding new visualization   Easy/only training required      Relatively difficult         Relatively difficult
Complexity                 Number of neurons/layers         Rules/number of statements   Complexity of rules engine
3.9 Adding new visualization techniques
The dataset is also extended by adding a few modern visualization techniques, including cone tree, sunburst, storyline, bubble chart, radial chart, and area graph. The ANN-based model is then evaluated using this enhanced dataset.
3.10 Discussion
The selection of an appropriate visualization method for a particular dataset is not trivial. Normally, visualization selection proceeds either through trial and error or from one's experience. There is a complex relation between a visualization technique and the dataset to be visualized: the suitable technique may change with the attribute types, the number of attributes, and the primary task. This work has investigated an ANN-based model for the automatic selection of a visualization technique based on the metadata of the dataset. Various ANN architectures were tested with different configuration-parameter values, as summarized in Table 3.4. ANN models with one and two hidden layers and different numbers of neurons were designed and tested using training, testing, and validation sets. The ANN model takes the metadata of the dataset as input and gives a visualization technique as output. Although the one-hidden-layer model gives an accuracy of 97.50%, we also tested two-layer models to check whether accuracy improves. The experiment shows that adding a second hidden layer keeps the accuracy around 97% while increasing the computational complexity of the model. The ANN model with one hidden layer of 14 neurons is adopted, as it gave the highest accuracy with the fewest neurons. The performance results are compared using different metrics, as given in Table 3.5. The one-hidden-layer model with 14 neurons performs better, since it has higher accuracy, sensitivity, precision, and R². The results show that the ANN model captures the ambiguous relation between a dataset and its visualization technique with high accuracy.
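For concreteness, the adopted architecture (one hidden layer, 14 neurons) can be expressed with scikit-learn's MLPClassifier. This is a sketch under stated assumptions: the solver, activation, iteration budget, and synthetic data are illustrative choices, not the original training setup.

```python
# Hedged sketch of the adopted one-hidden-layer, 14-neuron architecture.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

ann = MLPClassifier(hidden_layer_sizes=(14,), max_iter=2000, random_state=0)
ann.fit(X_tr, y_tr)
acc = ann.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```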
Training with the 10-fold cross-validation method is also performed to check the reliability of the model. The accuracy did not improve much, but 10-fold cross-validation gave a more robust model. The results of the ANN are compared with five other classifiers: k-NN, naïve Bayes, decision tree, random forest, and SVM. All classifiers were implemented with their optimum structure and executed on the same machine. Each classifier was tested for different parameter configurations, e.g., the kernel function for SVM, the number of trees for RF, and the number of neighbours for k-NN. The accuracy for ANN, SVM, RF, DT, NB, and k-NN was 97.5%, 92%, 98%, 95%, 81%, and 84%, respectively, as shown in Table 3.12. The ANN model outperformed all the classifiers except RF, where both classifiers took almost the same CPU time, while other classifiers such as k-NN took more time to complete. NB comparatively shows the worst performance, with an accuracy of only 81% while taking almost 100% more time than the ANN. The individual class comparison shows that ANN and RF have close results and are better than the other classifiers (see Table 3.11). RF gives 100% accuracy for three classes in the dataset, while ANN gives 100% accuracy for seven classes.
The discussion above shows that only RF's performance was close to that of the ANN. The ANN model's basic configuration and complexity depend on the number of neurons, while the RF model's complexity depends on the number of trees in the forest. The RF's accuracy also depends on how uncorrelated the trees in the forest are, whereas the ANN model depends on the learning algorithm and the activation function of its neurons. Looking closely at the range of accuracies for both models, the ANN's lowest accuracy is 96% for the one-layer architecture with 12 neurons, while the RF's lowest accuracy is 92% with a five-tree forest. The performance of RF is better only when there are more than 500 trees in the forest, which adds to its complexity.
The proposed approach was also compared with the related state-of-the-art approaches, which were two rule-based systems. The study shows that the proposed approach has several advantages over the existing ones. Flexibility is an appealing aspect of the proposed system: the ANN-based system does not need direct domain knowledge and can handle complex relations in the data, whereas building rules for a rule-based system requires domain expertise.
3.11 Chapter summary
Selecting an appropriate visualization technique for specific data is essential. This study presented the automatic selection of an information visualization technique based on the metadata of a dataset. The proposed solution was based on an artificial neural network (ANN), whose performance was compared with five other classifiers and the two most closely related works. The ANN-based prediction model for automated visualization selection outperformed the k-nearest neighbour, naïve Bayes, decision tree, random forest, and support vector machine classifiers in terms of accuracy and/or time consumed. The dataset used consisted of eight classes, and the proposed ANN-based model provides a generic framework that can accommodate more than eight classes: appropriate data may be added to the dataset, along with neurons in the input/output layers. In contrast to the current state-of-the-art approaches, which are rule-based, the proposed solution is generic and can accommodate new input patterns. The work brings a new perspective to the field of visualization, where new visualizations may be added to the dataset in order to build a comprehensive database. The dataset will then provide a foundation for an expert system with a knowledge base.
This work can be extended in several directions. We used eight visualization techniques; more may be added to the current dataset. Along with this, visualization weights may be added to increase the selection probability of a particular visualization technique. We used four tasks in our dataset, and one task is considered for the selection of a visualization; the work can be extended to incorporate more tasks, particularly user-specific ones. Another interesting future aspect would be developing a library based on this work and integrating it with various development environments, e.g., data mining packages, electronic worksheets, and online services. A further direction is the optimization or configuration of the selected visualization according to user requirements. A user study, such as a controlled experiment, may be performed to evaluate the output of the system, and user feedback on the selected visualization could be added to the system in the future.
Chapter 4 Visualization Optimization: An EC-Based Approach
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 100
Quantifying and Optimizing
Visualization
“Measure what is measurable, and make measurable what is not so”
Galileo Galilei
Advances in computing technology and computer graphics, coupled with huge collections of data, have introduced new visualization techniques. This gives users many choices of
visualization techniques to gain an insight about the dataset
at hand. However, selecting the most suitable visualization
for a given dataset and the task to be performed on the data is
subjective. The work presented here introduces a set of
visualization metrics to quantify visualization techniques.
Based on a comprehensive literature survey, we propose
effectiveness, expressiveness, readability, and interactivity as
the visualization metrics. Using these metrics, a framework
for optimizing the layout of a visualization technique is also
presented. The framework is based on an evolutionary
algorithm (EA) which uses treemaps as a case study. The EA
starts with a randomly initialized population, where each
chromosome of the population represents one complete
treemap. Using the genetic operators and the proposed
visualization metrics as an objective function, the EA finds
the optimum visualization layout. The visualizations that
evolved are compared with the state-of-the-art treemap
visualization tool through a user study. The user study
utilizes benchmark tasks for the evaluation. A comparison is
also performed using direct assessment, where internal and
external visualization metrics are used. Results are further
verified using analysis of variance (ANOVA) test. The results
suggest better performance of the proposed metrics and the
EA-based framework for optimizing visualization layout. The
proposed methodology can also be extended to other
visualization techniques.
The increased use of computing devices in almost every field of life is
generating data at a much higher pace than ever before. This is mostly due to
the advances in data processing speed, communication technology, and
storage capacity. Such burst of emerging data paved the need for techniques
that could extract useful information hidden in the data. The techniques such
as predictive analysis, descriptive analysis, or visual inspection can address
such issues. Like predictive and descriptive analysis, multiple options are
available for the visual inspection of data [152]. Over the years, various
information visualization techniques have been proposed ranging from a
simple pie chart and line graph to more sophisticated techniques such as
treemaps, boxplot, and parallel coordinates. Selection of a particular
visualization technique depends on the type of data, its size, the task to be
performed, and the problem domain. Visualization provides humans with an
ability to explore and analyse large volumes of data visually using cognition
and perception. Therefore, establishes the augments for the decision making
capabilities of humans. However, due to its interdisciplinary nature, often
interpreting visualization can be tedious for the domain experts. A domain
expert needs answers to complex questions by visualizing their data. During
this process, one basic question a domain expert may ask, “is this a better
visualization?” or “can the visualization quality be further enhanced with respect to
aesthetic and perception aspects?” At the same time there is a need for
aesthetically better and perceptually pleasing visualizations [91] which are, at
the same time, helpful in extracting useful patterns in the data. These issues
can be addressed if there were metrics that could quantify a particular visualization technique or its layout. Initial work on the quantification of visualization can be seen in [81, 82, 85, 155]. However, these works either give only initial definitions [81, 82] or are limited to a few aspects of a specific visualization technique [85, 155]. Further, to automate the optimization of a visual display, techniques are needed that can autonomously present an optimal layout of a visualization from the viewer's perspective.
This work presents an approach for the automated optimization of visualization layout. For this purpose, a set of visualization metrics consisting of four components, namely effectiveness, expressiveness, readability, and interactivity, is proposed. These metrics are selected and formulated based on a literature survey. To create visualizations and optimize their layout, an evolutionary algorithm (EA) is used, with the proposed set of visualization metrics as its objective function. Treemaps are adopted as the visualization method in the experiments. The effectiveness of the proposed approach is evaluated using a controlled experiment based on benchmark tasks, and it is also compared with a state-of-the-art treemap visualization tool using both internal and external evaluation metrics.
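The EA loop described above can be sketched at a high level: chromosomes encode candidate treemap layouts, and a weighted combination of the proposed metrics serves as the fitness. Everything below is a placeholder sketch; the layout encoding, metric functions, and genetic operators are illustrative stand-ins, not the dissertation's actual formulation.

```python
# High-level EA skeleton for layout optimization (placeholder metrics).
import random

random.seed(0)
N_ITEMS, POP_SIZE, GENERATIONS = 12, 20, 30

def random_layout():
    # A chromosome: one candidate treemap, here reduced to an item ordering.
    layout = list(range(N_ITEMS))
    random.shuffle(layout)
    return layout

def fitness(layout):
    # Placeholder objective: stand-ins for effectiveness and readability;
    # the real metrics (effectiveness, expressiveness, readability,
    # interactivity) would replace these stubs.
    effectiveness = 1.0 / (1 + sum(abs(a - b)
                                   for a, b in zip(layout, sorted(layout))))
    readability = 1.0 / (1 + layout[0])
    return 0.5 * effectiveness + 0.5 * readability   # weighted combination

def mutate(layout):
    child = layout[:]
    i, j = random.sample(range(N_ITEMS), 2)
    child[i], child[j] = child[j], child[i]          # swap mutation
    return child

population = [random_layout() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]           # truncation selection
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POP_SIZE - len(survivors))]

best = max(population, key=fitness)
print(f"best fitness: {fitness(best):.3f}")
```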
4.1 The information visualization metrics
The work in [156] suggests moving beyond user studies and controlled experiments to find newer evaluation methods for information visualization. Over the years, different metrics have been proposed to measure the quality of visual information based on various aspects. Most of the work in this area is inspired by the original work of Tufte [157] and Mackinlay [158]. Huang et al. [159] and Bennett et al. [160] discuss several metrics related to graph visualization. A perceptual metric for scatter plots is presented by Albuquerque et al. [96], while treemap-related metrics are proposed in [161] and [162]. Aritra et al. [95] suggest screen-space metrics for the effective use of parallel coordinates. This section presents the proposed metrics for quantifying information visualization techniques. They are formulated based on a comprehensive literature survey covering previously published work and theories on the various aspects of an appealing visualization technique.
Visualization represents data using shapes, colors, and their patterns so that the information hidden in the data becomes obvious to the viewer. It is an alternative way of presenting raw data to communicate the information contained therein. This makes effectiveness an
important aspect of any visualization technique. Visualization techniques
(or layouts of a particular technique) that fail to adequately
highlight and communicate the hidden information in the data are less
effective than those that do so more precisely. Effectiveness is especially
important when visualizing high-dimensional data. The effectiveness of a
visualization can be measured by its ability to communicate the intended
message [164]. The visualization technique should take into account all the
components of the data. Depending on the particular visualization technique,
the data is either presented directly or shown indirectly using aggregated
quantities. Visualizations that consider all data during the rendering
process are more expressive and thus show a better picture of the data.
To communicate the desired information using visualization, the depicted
visual patterns should be readable. Distorted, blurred, or overlapping
regions in the visualization make it less readable and reduce its
effectiveness and expressiveness [95]. A suitable choice of colors can
further increase the readability of a visualization technique. In addition
to coloring, the choice of shapes, their layout, size, and the position of
visualization components also influence readability. Readability can be
further enhanced using labels and tooltips. Readability is often influenced
by the size of the data: some visualization techniques, such as parallel
coordinates, become less readable on huge datasets, while others, for
instance treemaps, are more appropriate for visualizing large datasets. To
allow various operations on a visualization, it needs to be interactive. Not
all visualizations are interactive; however, interactivity adds value to a
visualization technique by allowing the information to be displayed at
various levels using options such as filtering, zooming, and querying.
Based on the discussion above, four metrics to quantify a visualization
technique are identified, namely, effectiveness, expressiveness, readability, and
interactivity. Table 4.1 lists the works that mention these aspects.
Table 4.1 Aspects mentioned in literature for better visualization
Aspects Previous works
Effectiveness [45], [51], [13], [42], [52], [53], [54], [55], [38], [56], [57], [4], [21],
[25], [58], [45], [59], [60], [46], [24], [26], [61], [21]
Expressiveness [45], [51], [56], [54], [50], [4], [58], [60], [62], [61]
Readability [48], [54], [32], [56], [63], [64], [65]
Interactivity [50], [54], [66], [67], [68], [69], [62], [61]
4.1.1. Effectiveness
For a given visualization and a dataset (that is to be visualized) X = (X0, X1,
…, Xn-1) with n attributes, the mathematical formulation of the effectiveness is
shown as:
effectiveness(X, Ww, VWb) = (∑_{i=0}^{n-1} Ww_i · VWb_i) / (∑_{i=0}^{n-1} Ww_i) (4.1)

effectiveness = { 0, if ∑_{i=0}^{n-1} Ww_i = 0
               { (∑_{i=0}^{n-1} Ww_i · VWb_i) / (∑_{i=0}^{n-1} Ww_i), otherwise (4.2)

where Ww = [Ww_0, Ww_1, …, Ww_{n-1}] is a weight matrix containing a weight
in the range [0, 1] for each item in X, and VWb = [VWb_0, VWb_1, …, VWb_{n-1}]
is a visualization bit matrix corresponding to each item in X. An item in VWb
is assigned a value of 1 if its corresponding item in X is being visualized;
otherwise it is assigned 0. The minimum value of effectiveness is 0, and a
higher value indicates a better visualization.
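Computationally, Eq. (4.1)–(4.2) amount to a weighted coverage ratio over the data items. A minimal Python sketch (the function name and the plain-list encoding of Ww and VWb are illustrative assumptions, not part of the proposed system):

```python
def effectiveness(ww, vwb):
    """Weighted fraction of the data items shown by the visualization.

    ww:  per-item importance weights in [0, 1] (the matrix Ww)
    vwb: per-item visualization bits, 1 = item is visualized (the matrix VWb)
    """
    total = sum(ww)
    if total == 0:  # degenerate case: effectiveness is defined as 0
        return 0.0
    return sum(w * b for w, b in zip(ww, vwb)) / total

# Showing only the two most important of four items:
# (0.9 + 0.8) / (0.9 + 0.8 + 0.1 + 0.2) = 0.85
print(effectiveness([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```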
4.1.2. Expressiveness
For a given visualization and a dataset (that is to be visualized) X = (X0, X1,
…, Xn-1) with n attributes, the mathematical formulation of the expressiveness is:
expressiveness(X, Wb, VWw) = (∑_{i=0}^{n-1} Wb_i · VWw_i) / (∑_{i=0}^{n-1} Wb_i) (4.3)

expressiveness = { 0, if ∑_{i=0}^{n-1} Wb_i = 0
                { (∑_{i=0}^{n-1} Wb_i · VWw_i) / (∑_{i=0}^{n-1} Wb_i), otherwise (4.4)

where Wb = [Wb_0, Wb_1, …, Wb_{n-1}] is a bit matrix that contains a value of
1 if the corresponding element in X is being visualized and 0 otherwise, and
VWw = [VWw_0, VWw_1, …, VWw_{n-1}] is a visualization weight matrix
corresponding to each item in X, with weights in the range [0, 1]. The
minimum value of expressiveness is 0, and a higher value indicates a better
visualization.
4.1.3. Readability
For a given visualization and a dataset X = (X0, X1, …, Xn-1) with n
attributes, the mathematical formulation of the readability is:
readability(X, Ww, VWb, RWw) = ∑_{i=0}^{n-1} Ww_i · VWb_i · RWw_i + σ(RWw) (4.5)

where Ww = [Ww_0, Ww_1, …, Ww_{n-1}] is a weight matrix that contains a
weight (0 or 1) for each item in X, VWb = [VWb_0, VWb_1, …, VWb_{n-1}] is a
visualization bit matrix corresponding to each item in X (an item in VWb is
assigned 1 if its corresponding item in X is being visualized and 0
otherwise), and RWw = [RWw_0, RWw_1, …, RWw_{n-1}] is a readability weight
matrix corresponding to each item in X, based on the color assigned during
the visualization process. Each value of RWw lies in [0.2, 1]; it can be 0.2,
0.4, 0.6, 0.8, or 1. This gives the flexibility to assign weights to colors
based on a range of RGB combinations. The value of readability is greater
than 0, where a lower value indicates a less readable visualization. Table
4.2 lists the possible ranges of weights for RWw. The value σ(RWw) is the
standard deviation of the values in RWw.
Table 4.2 Value range for RWw
R or G or B R or G or B R or G or B RWw
0-50 25% less than first column’s value 25% less than first column’s value 0.2
51-100 25% less than first column’s value 25% less than first column’s value 0.4
101-150 25% less than first column’s value 25% less than first column’s value 0.6
151-200 25% less than first column’s value 25% less than first column’s value 0.8
201-255 25% less than first column’s value 25% less than first column’s value 1
1 1 1 1
0 0 0 1
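Table 4.2 can be read as a lookup from an RGB triple to its readability weight RWw. The following Python sketch is one interpretation: it assumes the first column refers to the dominant (largest) channel, treats the last two rows as pure white and pure black, and does not enforce the "25% less" constraint on the other two channels:

```python
def color_weight(r, g, b):
    """Readability weight RWw for an RGB color, per one reading of Table 4.2.

    Interpretation (assumption): the table's first column is the dominant
    (largest) channel; the rows for pure black and pure white both map to
    the maximum weight of 1.
    """
    if (r, g, b) in ((0, 0, 0), (255, 255, 255)):
        return 1.0
    dominant = max(r, g, b)
    if dominant <= 50:
        return 0.2
    if dominant <= 100:
        return 0.4
    if dominant <= 150:
        return 0.6
    if dominant <= 200:
        return 0.8
    return 1.0
```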
4.1.4. Interactivity
For a given visualization and a dataset X = (X0, X1, …, Xn-1) with n
attributes, the mathematical formulation of the interactivity is:

interactivity(X, VWb, IWb) = ∑_{i=0}^{n-1} VWb_i · IWb_i (4.6)

where VWb = [VWb_0, VWb_1, …, VWb_{n-1}] is a visualization bit matrix
corresponding to each item in X (an item in VWb is assigned 1 if its
corresponding item in X is being visualized and 0 otherwise), and
IWb = [IWb_0, IWb_1, …, IWb_{n-1}] is an interactivity bit matrix
corresponding to each item in X. An element of IWb is set to 1 if the
visualized component is interactive; otherwise it is 0. A higher value of
interactivity indicates a more interactive visualization.
4.1.5. The combined fitness function
The combined fitness function linearly adds four metrics. An overall fitness
function can, therefore, be represented as a sum of effectiveness, expressiveness,
readability, and interactivity values. To influence the weight of each value, a
constant multiplier is used. The combined fitness function (CFF) would then
be:
CFF = α·effectiveness(X, Ww, VWb) + β·expressiveness(X, Wb, VWw)
    + γ·readability(X, Ww, VWb, RWw) + δ·interactivity(X, VWb, IWb) (4.7)

The values of α, β, γ, and δ range between [0, 1]. In the experiments these
weights are set to 1. However, Section 5.2.1 empirically shows their effect
on the visualizations evolved.
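Since the combined fitness function is a plain weighted linear sum, it can be sketched in a few lines of Python (the helper name and the tuple encodings are illustrative assumptions):

```python
def combined_fitness(metrics, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted linear sum of the four visualization metrics.

    metrics: (effectiveness, expressiveness, readability, interactivity)
    weights: (alpha, beta, gamma, delta), each in [0, 1]
    """
    return sum(w * m for w, m in zip(weights, metrics))

# All weights set to 1, as in the experiments:
print(combined_fitness((0.85, 0.60, 2.40, 3.0)))
```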
4.2 Proposed solution
The problem at hand is to quantify how well a visualization technique
displays the data and then, based on these quantifiable aspects, optimize
the particular visualization technique for a better visual representation.
The proposed solution uses an EA-based
approach to optimize the layout of a visualization technique. To quantify
visualization, four metrics: effectiveness, expressiveness, readability, and
interactivity are used. Figure 4.1 shows the complete structure of the proposed
system. It starts with an initial population of candidate solutions (also called
chromosomes or individuals), where each individual represents a complete
visualization. As a case study, the treemap22 visualization technique is
used. A treemap is a space-constrained visualization of hierarchical
structures. A
mapping function converts the given chromosome into a complete
visualization. In order to evaluate the fitness of the chromosome a combined
fitness function (also called the objective function) is used. The objective
function consists of the proposed visualization metrics presented in Section
4.1 (Eq. (4.7)).

22 http://www.cs.umd.edu/hcil/treemap/

Figure 4.1 Proposed system working

Once the fitness of each chromosome is evaluated, those with the
worst objective function values are discarded. The remaining chromosomes are
used to fill-in the population using genetic reproduction operators of
crossover and mutation. Various combinations of crossover and mutation are
used. This procedure is repeated for a fixed number of iterations or until
the solution converges.
In experiments we used various combinations of the objective functions,
mutation rate, and crossover. The EA converges to the optimum visualization
based on the combined fitness function. In addition, the EA also finds the
optimum visualization using the individual visualization metric of
effectiveness, expressiveness, readability, and interactivity. This results in
five visualizations, which are evaluated and compared with a
state-of-the-art treemap visualization tool using a user study and direct
assessment. For the user study, visualization benchmarks are used. The direct
assessment is performed using internal and external visualization metrics.
4.2.1. Problem formulation
The visualization layout optimization problem is formulated using three
sets: the visualization metrics, the design parameters, and the
visualization methods. A particular visualization is expressed in terms of
the design parameters to optimize its layout and aesthetic properties. These
parameters are then mapped to the visualization metrics. Let M be a set
consisting of the four visualization metrics.
M = {m1, m2, m3, m4} (4.8)
Let D be the set of n design parameters that serve as attributes to a
visualization technique, each having a specific range of values from the
domain of real numbers R.
D = {p1, p2, …, pn} (4.9)
Every visualization technique has design parameters from set D and can be
measured by M, where each member of M is described by the set of elements
in D. Similarly, a set V represents the visualization techniques. The mapping
of these three sets is shown in Figure 4.2. The top layer represents
visualization type (i.e., space) where different visualization kinds are
instantiated (e.g., v1, …, vn). The middle layer depicts the design
variables (p1, …, pn), where each design variable may be used with any of
the visualization types. The bottom layer represents the four visualization
metrics (m1, m2, m3, m4), where each metric comprises one or more design
variables. The
objective is to find the optimal design parameter set D′ ⊆ D for a
particular visualization vi that satisfies the metric mj.

D′ = {p′1, p′2, …, p′k} (4.10)
Figure 4.2 Mapping of visualization parameters and metrics
M ← V × D (4.11)
A quality visualization is one having optimal design parameter values for
all its metrics. This quality function can be expressed in terms of the
visualization metrics as:

Q = m1 + m2 + m3 + m4 (4.12)

where each mi consists of the design parameters required for the particular
metric.
4.2.2. Chromosome encoding
The chromosome used in this work is a one-dimensional array having N
cells, where N is the number of features available in a visualization
technique. The chromosome is of fixed size; however, the chromosome length
may vary for different visualization techniques. In the case of treemaps,
each chromosome has twelve genes. These genes represent the following
attributes of the treemap visualization: layout, border, border size,
border color, label, label size, label
location, color, hierarchy, leaf node, size, and interactive. The RGB color
scheme is used for its wide application [174] in the literature related to the
current proposal. However, the use of HSV space may also result in
perceptually more attractive visualizations. Table 4.3 lists the twelve
genes, their possible values (alleles), and their descriptions.
The structure of a sample chromosome and its corresponding treemap is
illustrated in Figure 4.3. Each cell of the chromosome represents one
component of the visualization technique. The cells can have an integer
value within the range mentioned in Table 4.3. The first cell of the
chromosome contains 2, indicating the squarified layout of the treemap, and
the second cell indicates that no border is applied. Although the third cell
of the chromosome contains a border size of 3, it is ignored since the
border option is disabled for this treemap. Similarly, the color scheme for
the border is also ignored. Labels for the sample treemap in Figure 4.3 are
enabled, with size 8 and center alignment. The RGB color scheme is used for
the treemap, with hierarchy and leaf nodes enabled and interactivity
switched off. A mapping function is used to convert the given chromosome
into a complete treemap.
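The encoding described above can be sketched as follows; the per-gene value ranges follow Table 4.3, while the gene names, the palette-index simplification for the two color genes, and the binary "size" gene are illustrative assumptions (hypothetical Python):

```python
import random

# Permissible values (alleles) per gene, following Table 4.3. The gene
# names, the palette-index simplification for the color genes, and the
# binary "size" gene are assumptions for illustration.
ALLELES = {
    "layout":         range(3),      # 0-slice&dice, 1-strip, 2-squarified
    "border":         range(2),      # 0-no, 1-yes
    "border_size":    range(1, 51),  # 1-50
    "border_color":   range(256),    # index into an RGB palette
    "label":          range(2),
    "label_size":     range(6, 21),  # 6-20
    "label_location": range(3),      # 0-right, 1-center, 2-left
    "color":          range(256),
    "hierarchy":      range(2),
    "leaf_node":      range(2),
    "size":           range(2),      # relative to display area / full screen
    "interactive":    range(2),
}

def random_chromosome():
    """One individual of the initial EA population: a fixed-length
    integer array with one gene per treemap attribute."""
    return [random.choice(list(r)) for r in ALLELES.values()]
```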
Table 4.3 Description of the genes in a chromosome
Genes Alleles Description
Layout 0-Slice dice, 1-Strip, 2-Squarified Three layout algorithms are considered for Treemap visualization; each layout
has its own pros and cons
Border 0-No, 1-Yes Treemap may have a border or the border may not be applied
Border size 1-50 If a border is applied to a treemap, sizes from 1 to 50 can be used
Border color RGB color/256 Each color is represented by the RGB color palette
Label 0-No, 1-Yes The treemap may use labels
Label size 6-20 The labels may vary in size
Label Location 0-Right, 1-Center, 2-Left The label has three location options
Color RGB color/256 Label color is selected from the 256 RGB colors
Hierarchy 0-No, 1-Yes The hierarchy information between the item may or may not be shown
Leaf Node 0-No, 1-Yes Leaf node in the treemap may or may not be shown
Size Relative to display area/full screen Treemap size is relative to the screen
Interactive 0-No, 1-Yes Treemap interactivity can be enabled or disabled
Figure 4.3 A sample chromosome and its corresponding treemap
4.2.3. Reproduction operators
The reproduction operators used in this work are the crossover and
mutation operations. In our experiments, we tested their different variations
on EA. The experiments are conducted both with and without crossover.
However, mutation is used in both experiments. A single point crossover
mechanism is used and six mutation rates are tested to obtain the optimum
one. Once the fitness of all chromosomes in the population is evaluated, the
crossover and mutation operations are performed. For crossover, the process
is designed as follows:
1. Randomly select two chromosomes from the existing population.
2. Generate a random number, Cp, between 0 and N, where N is the length of
the chromosome. Cp becomes the crossover point and the data is swapped to
generate one child chromosome.
3. The above two steps are repeated until the number of required individuals
in the population is complete.
Only one child is produced from each selected parent pair.
For mutation, the process is designed as follows:
1. Once all child chromosomes are generated through crossover, a random
number mpi, called the mutation probability, is generated between 0 and 100
for each child chromosome.
2. For each child, if the value of mpi is less than the mutation rate, a
randomly selected cell's value is changed to some other permissible one.
The crossover and mutation operations are illustrated in Figure 4.4 for a
chromosome of length twelve.
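The two operators described above can be sketched in Python as follows (hypothetical helpers; drawing the crossover point from [1, N-1], so that both parents contribute at least one gene, is an assumption):

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover producing one child.

    A crossover point Cp is drawn in [1, N-1] (assumption: both parents
    contribute); the child takes the first Cp genes from parent_a and
    the remaining genes from parent_b.
    """
    cp = random.randint(1, len(parent_a) - 1)
    return parent_a[:cp] + parent_b[cp:]

def mutate(child, mutation_rate, alleles):
    """With probability mutation_rate (in percent), replace one randomly
    chosen gene with another permissible value.

    alleles: per-gene sequences of permissible values (cf. Table 4.3).
    """
    if random.uniform(0, 100) < mutation_rate:
        i = random.randrange(len(child))
        child = child[:i] + [random.choice(alleles[i])] + child[i + 1:]
    return child
```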
Figure 4.4 Crossover and mutation operations.
After the reproduction operators are applied, the new population is
generated and tested for fitness using the objective function. EAs offer
many options for selecting the population for the next generation. The first
selection method is random selection, where no reference is made to the
fitness of the chromosome: each chromosome, regardless of its fitness, has
an equal chance of selection into the next generation. The second method is
proportional selection, in which chromosomes are selected based on their
fitness relative to the fitness of all other chromosomes. Another popular
selection method is rank-based selection, which uses the rank ordering of
the fitness values instead of the actual fitness.
The proportional selection method represented in Eq. (4.13) is used. In Eq.
(4.13), P(C′_i) is the probability that the individual C′_i will be selected
and F_EA(C′_i) is the fitness value of the individual as computed in Eq.
(4.7). Later, in Section 5.2.5, experiments are also conducted using
proportional selection. Out of m chromosomes, m/2 are selected as parents
based on their fitness value, and the rest are generated by applying the
reproduction operators to the top m/2 chromosomes in the population.

P(C′_i) = F_EA(C′_i) / ∑_{j=1}^{m} F_EA(C′_j) (4.13)
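Proportional selection as in Eq. (4.13) corresponds to classic roulette-wheel selection, which can be sketched as (hypothetical Python helper):

```python
import random

def proportional_select(population, fitness):
    """Select one chromosome with probability proportional to its
    fitness (roulette-wheel selection, cf. Eq. (4.13))."""
    total = sum(fitness)
    pick = random.uniform(0, total)
    acc = 0.0
    for chromosome, f in zip(population, fitness):
        acc += f
        if pick <= acc:
            return chromosome
    return population[-1]  # guard against floating-point round-off
```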
4.3 Experiments and results
The experiments conducted and the results obtained are described in this
section. The treemap visualization, which is used as a case study, is already
explained in Section 2.4. The simulations are run using the combined fitness
function, and the EA is also applied to each visualization metric
individually. This produces five sets of results, which are compared using
the internal metrics. The results are also compared with a randomly created
visualization and a visualization created using the state-of-the-art tool
for treemaps. The EA is run with various mutation rates to select the most
suitable one. A controlled
experiment using twenty participants is performed for the visualizations
evolved. The benchmark tasks are used for the controlled experiment and the
results are verified using an analysis of variance (ANOVA) test. All
experiments were carried out on an Intel Core i5 machine with a 2.5 GHz
processor and 4 GB RAM. Table 4.4 summarizes the EA parameter settings for
various
simulations.
Table 4.4 EA parameter settings
Parameter                Value(s)
Number of populations    1
Initial population size  500
Reproduction operators   Crossover and mutation; mutation only
Crossover type           Single point
Mutation rate            0.05%, 0.5%, 1%, 5%, 10%, 15%, 20%, and 25%, with 25% and 50% elitism
Crossover                Single-point crossover, where each chromosome contributes 50% of its attributes to the new child
Stopping criterion       1000 iterations / convergence
4.3.1. Treemap
Treemap is a visualization technique used for scientific visualization as
well as for information visualization, with many variations. Treemaps are
used to visualize hierarchical data in many domains, including business,
news, and software visualization.

Figure 4.5 A tree with its corresponding treemap

Since its inception, different variations of the original
treemap have been proposed. Initially, the treemap was proposed for
visualizing hard disk contents comprising thousands of files. Since then, a
variety of treemap layouts have been proposed to visualize large volumes of
data on a single screen. Treemaps are based on a tree-like hierarchical
structure, where each attribute corresponds to a level in the tree. While
building a treemap, the screen is divided into nested rectangles. Each
rectangle corresponds to a node in the tree. The smallest and innermost
rectangles represent leaf nodes, while the enclosing rectangles represent
the non-leaf nodes. The color and size of each rectangle show different
dimensions or attributes of the data, as shown in Figure 4.5. The treemap
has many advantages, which makes it the primary choice in our approach as a
case study. When size, color, and dimensions are associated with a tree-like
structure, hidden patterns become prominent. The EA searches for the optimum
combination of size, color, and dimensions for a treemap using the proposed
visualization metrics. Another advantage of using treemaps is the optimum
utilization of space, visualizing thousands of items on the screen
simultaneously.
Figure 4.6 Convergence of the EA under six mutation rates (x-axis: iterations; y-axis: normalized fitness)
4.3.2. EA results
The EA is used to evolve the population using the combined fitness function.
The best chromosomes based on the combined fitness function, effectiveness,
expressiveness, readability, and interactivity are archived. This generates
five chromosomes to be saved after each iteration. During the evolution
process six mutation rates are tested, namely, 0.1%, 0.5%, 1%, 5%, 10%, and
15%. The mutation rate of 10% appears to converge fastest. Figure 4.6 shows
the convergence speed for the six mutation rates. These results represent
the average values obtained from ten EA runs. Each run of the EA evolved the
population for 1000 iterations.
Figure 4.7 shows the combined fitness function-based convergence graph of
EA using 10% mutation for 1000 iterations. It also shows the convergence of
the population using the four metrics of visualization, i.e., effectiveness,
expressiveness, readability, and interactivity. The x-axis in Figure 4.7 shows the
number of iterations and y-axis represents the normalized fitness values of
four individual fitness criteria and the combined fitness function. Since all
fitness metrics for each chromosome will have varying range of values, we
Chapter 4 Visualization Optimization: An EC-Based Approach
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 119
normalized those to the same scale to compare as shown in Figure 4.7. Each
line in Figure 4.7 shows the best fitness of its relevant criterion achieved till
the particular iteration. For the experiment in Figure 4.7 the population is
evolved using the combined fitness function. This is the reason to the
inconsistent peaks and dips for the individual metrics, except for the
combined fitness function.
In addition to the evaluation of EA population using the combined fitness
function, four other populations are evolved using the effectiveness,
expressiveness, readability, and interactivity as an objective function, respectively.
During this experiment the population was evolved using one of the four
individual metrics. However, the best individuals based on other three criteria
were also saved for further analysis. Figure 4.8 to Figure 4.11 show the
convergence of the EA using effectiveness, expressiveness, readability, and
interactivity as an objective function, respectively. Table 4.5 shows the
normalized fitness value and the iteration number where the best solution
was found. This shows the relation between the various fitness criteria and
their effect on each other. The fitness values remain almost the same across
the various objective functions; however, the number of iterations varies as
the objective function changes.
Figure 4.7 Convergence of the EA using the combined fitness function
Figure 4.8 Convergence of the EA using effectiveness as fitness function
Figure 4.9 Convergence of the EA using expressiveness as fitness function
Figure 4.10 Convergence of the EA using interactivity as fitness function
Figure 4.11 Convergence of the EA using readability as fitness function
4.3.3. Evaluation
The experiments mentioned in Section 4.3.2 give a total of five
visualizations, evolved using the combined fitness function, effectiveness,
expressiveness, readability, and interactivity, respectively. To evaluate the
results, the empirical method (user study) is used, as it is an effective
form of quantitative evaluation for information visualization. In addition,
the evolved visualizations are assessed using the direct method, based on
internal and external metrics. The internal metrics are the four
visualization quantification measures represented in Eq. (4.1), Eq. (4.3),
Eq. (4.5), and Eq. (4.6), and the combined fitness function as presented in
Eq. (4.7). The
external evaluation metrics are listed below.
E = (Z_A − Z_T − Z_ME) / √3 (4.14)

where E is the visualization efficiency as suggested in [12], and Z_A, Z_T,
and Z_ME are the standard z-scores for accuracy, time, and mental effort,
respectively.
The second external evaluation metric is the quality of visualization (Q):

Q = (Z_A + Z_VS − Z_T − Z_ME) / √4 (4.15)

where Z_A, Z_VS, Z_T, and Z_ME are the standard z-scores of accuracy,
visualization score, time, and mental effort, respectively. Eq. (4.15)
defines the quality of visualization in terms of four dependent variables:
response time, mental effort, accuracy, and visualization score. It captures
the difference between the z-scores of accuracy and visualization score and
the z-scores of response time and mental effort.
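Both external metrics are differences of standard z-scores divided by a normalizing constant. A hedged Python sketch (the helper names, and the assumption that each divisor is the square root of the number of terms, are mine):

```python
import math
from statistics import mean, pstdev

def z_scores(values):
    """Standard z-scores of a list of raw measurements."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def efficiency(z_a, z_t, z_me):
    """Visualization efficiency E from z-scores of accuracy, response
    time, and mental effort (divisor assumed to be sqrt(3))."""
    return (z_a - z_t - z_me) / math.sqrt(3)

def quality(z_a, z_vs, z_t, z_me):
    """Visualization quality Q; z_vs is the z-score of the visualization
    score (divisor assumed to be sqrt(4))."""
    return (z_a + z_vs - z_t - z_me) / math.sqrt(4)
```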
Table 4.5 Best fitness values for various combinations of the objective
function (iteration number of the converged solution in parentheses)
Objective function | Best fitness using the objective function | Effectiveness | Expressiveness | Readability | Interactivity | Combined
Combined       | 11 (138) | 03 (50) | 02 (200) | 03 (138) | 01 (30) | 11 (138)
Effectiveness  | 03 (30)  | 03 (30) | 02 (80)  | 03 (100) | 01 (40) | 09 (650)
Expressiveness | 02 (190) | 03 (32) | 02 (190) | 03 (260) | 01 (25) | 11 (190)
Readability    | 02 (290) | 03 (38) | 02 (300) | 02 (290) | 01 (40) | 10 (100)
Interactivity  | 01 (25)  | 03 (70) | 03 (100) | 03 (150) | 01 (25) | 11 (140)
4.3.3.1. User study
The user study is carried out to establish that the evolved visualizations
are better than a randomly created visualization and one created using a
state-of-the-art treemap tool. Additionally, the user study investigates the
usefulness of the visualizations evolved using various combinations of the
proposed metrics. Twenty volunteers, graduate and postgraduate students of
the university, participated. All participants were familiar with computers
and with the basic concepts and usage of visualization techniques. All had
normal vision during the course of the experiment.
Table 4.6 The five benchmark tasks
Task no. Description
1 Which Java program has the maximum number of objects?
2 Which collection API is used the most?
3 How many Java programs use more than 3 collection APIs?
4 Which Java program uses a large number of ArrayList objects?
5 What is the total number of APIs used by each Java program?
The participants were asked to perform five benchmark tasks listed in Table
4.6. These tasks were designed specifically for treemap-based visualization
[175,176]. The tasks reflected data collected from Java programs through
dynamic analysis. The dataset contained information about the collection
API usage, i.e., type, instance name, package, class, and method name.
Moreover, the tasks were general in nature, and knowledge of collection APIs
or Java was not required to perform them. Participants were provided with
multiple-choice questions with four possible options for each task. They
were asked to perform each of these tasks using the six visualizations (one
random and five evolved). These visualizations are described in the
Appendix. The major goal of the user study was to evaluate the quality of
various visualization layouts in terms of their perceptual properties.
The five visualizations, i.e., those evolved with the combined fitness
function, effectiveness, expressiveness, readability, and interactivity,
respectively, were compared with a randomly created visualization and a
visualization created using the state-of-the-art tool.
Response time, mental effort, accuracy, and visualization score were
recorded for each task. The 5 tasks × 6 visualizations × 20 participants gave
a total of 600 responses. Response time for each task was recorded in
seconds. Mental effort was rated on a scale from 1 to 5 (1: minimum effort,
5: maximum effort). Similarly, accuracy was rated on a scale from 1 to 5 (1:
all wrong, 5: all correct). Other values were assigned according to the
subjects' responses. Participants were also asked to rank the visualizations
for each task on a scale of 1 to 5 (1: lowest rank, 5: highest rank).
Moreover, the efficiency and quality metrics were computed from the
recorded responses for all visualizations and tasks using the equations
discussed earlier, i.e., Eq. (4.14) and Eq. (4.15).
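The general recipe behind such external metrics is to standardize each dependent variable across all recorded responses and then average a combination of the z-scores within each visualization group. The sketch below illustrates this recipe only; the particular combination used (accuracy minus time minus effort, so that faster, lower-effort, more accurate responses score higher) is an assumption for illustration, and the exact aggregations are those defined by Eq. (4.14) and Eq. (4.15) in the text.

```python
from statistics import mean, pstdev

def zscores(xs):
    """Standardize values across all recorded responses."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def metric_by_visualization(labels, time, effort, accuracy):
    """Average a z-score combination within each visualization group.

    The combination below (accuracy minus time minus effort) is an
    assumed form for illustration; Eq. (4.14) defines the exact one.
    """
    combined = [za - zt - ze for za, zt, ze in
                zip(zscores(accuracy), zscores(time), zscores(effort))]
    groups = {}
    for label, value in zip(labels, combined):
        groups.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in groups.items()}
```

Because the z-scores are pooled over all responses before being averaged per visualization, each group's value is a deviation from the grand mean of the whole study, which is why the reported metric values can all lie close to or below zero.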
The user study results were analyzed formally, and the perceptual quality of
the visualization evolved using the combined fitness function was
investigated through hypothesis testing. The null and alternate hypotheses
are:
H0: Any correlation between human ratings of the evolved visualization
(using the combined fitness function) and its perceptual quality is due to
randomness.
H1: The visualization evolved using the combined fitness function is
perceptually better.
The user study was completed in different sessions using the same material
and stimuli. Initially, the participants were introduced to the tasks, the
visualizations, and the administrative procedure to be followed during the
experiment. All participants were given a questionnaire to record the
necessary data. Each participant was shown the six visualizations on a
computer screen. This procedure took 20 minutes on average per participant.
Based on the four parameters, a summary of the study is presented in Table
4.7, showing minimum, maximum, average, and standard deviation (SD)
values for each of the evolved visualizations, the random visualization, and
the visualization created using the state-of-the-art (SoTA) tool23.
As the results show, the visualization evolved using the combined fitness
function is perceptually better than the random visualization. Table 4.7
summarizes the experiment's statistics for all visualizations with respect to
the dependent variables, i.e., time, mental effort, accuracy, and
visualization score. For each dependent variable, the mean value of the
visualization evolved with the combined fitness function is the best.
Comparing numerical values, the evolved visualizations take less time
(M=16.72, SD=5.08) than the random visualization (M=19.4, SD=5.2). The
evolved visualizations also require less mental effort (M=2.4, SD=0.97)
than the random visualization. Similarly, the accuracy and score for the
evolved visualizations are higher, at (M=4.12, SD=0.77) and (M=4.1,
SD=0.71), respectively, compared with the random visualization's accuracy
(M=3.67, SD=0.99) and score (M=3.02, SD=0.88). Thus, the values of all
dependent variables are better for the visualization evolved with the
combined fitness function.
Table 4.8 lists the mean values of the dependent parameters: time, mental
effort, accuracy, and visualization score. The table also lists mean values
for the external evaluation criteria (Eq. (4.14) and Eq. (4.15)). The results
indicate that the evolved visualizations are better than the random one and
the visualization created using the SoTA tool. On average, the participants
took more time and achieved lower accuracy with the random and SoTA
visualizations than with the evolved visualizations. Furthermore,
participants exerted more mental effort in the case of the random
visualization. Figure 4.12 summarizes the results for the dependent
parameters, i.e., time, mental effort, accuracy, and score, for the six
visualizations across the five
23 http://www.cs.umd.edu/hcil/treemap/
benchmark tasks. The visualization evolved using the combined fitness
function shows better results for all four factors. The random visualization
took a larger average time and was less accurate.
Figures 4.13-4.16 plot the participants against each dependent parameter,
i.e., accuracy, effort, score, and time, for all visualizations. Each scatter
plot shows the participants on the horizontal axis and a dependent
parameter on the vertical axis, with the visualization types distinguished by
color. The relationships in these figures show the visualizations' effect on
the users' performance in accomplishing the tasks. In Figure 4.13, the
accuracy of the visualization evolved using the combined fitness function is
better for most of the participants. Moreover, participants using this
visualization require comparatively less mental effort to perform the tasks,
whereas the random visualization requires more effort (see Figure 4.14).
Furthermore, Figure 4.15 shows that the visualization evolved using the
combined fitness function and the SoTA visualization obtain higher scores
than the random visualization. The time taken by the participants to
perform their tasks with the random visualization is much larger than with
the combined visualization, as shown in Figure 4.16.
Figure 4.17 shows boxplots of the response time, effort, accuracy, and score
for the five evolved visualizations, the random visualization, and the
visualization created using the SoTA tool. As the response-time boxplots
show, the random visualization requires more time than the evolved
visualizations, and its time durations vary widely across subjects. The
random visualization also requires more mental effort to perform the tasks
than the evolved visualizations. As far as accuracy is concerned, the
visualization evolved using the combined fitness function performs best.
A comparison of the visualizations used in this study, based on the mean
values of the dependent variables, is shown in Figure 4.18, where each
coloured line represents one of the visualizations. As shown in the figure,
participants take more time to perform the tasks with the random
visualization than with the other visualizations. Moreover, the
visualizations evolved using the combined fitness function and the SoTA
visualization require less effort while achieving better accuracy and higher
scores. The random visualization needs more mental effort, and its
resulting accuracy is lower than that of the other visualizations.
To study the effect of the weights in Eq. (4.7), twenty-four visualizations
were evolved with various combinations of weights taking the values 0, 0.2,
0.5, 0.8, or 1. The results of this experiment are shown in Figure 4.19,
where the major x-axis lists the weight values for effectiveness, the minor
x-axis the weight values for expressiveness, the major y-axis the weight
values for readability, and the minor y-axis the weight values for
interactivity. The value in each cell is the participants' subjective rating, in
percent, of the visualization evolved with the weights set to the values
given by the major and minor x/y-axes; a higher percentage indicates a
better evolved visualization. The results suggest that the visualizations
evolved with all weights set to an equal value are rated best.
                 Expressiveness (minor x) / Effectiveness (major x)
Readability /      0      0.2     0.5     0.8     1
Interactivity
1                  0%     1%      1%      1%      90%
0.8                90%    3%      2%      95%     11%
0.5                20%    15%     88%     80%     13%
0.2                30%    88%     45%     90%     88%
0                  N/A    30%     40%     30%     20%
Figure 4.19 Weight analysis of the combined fitness function
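The weight analysis varies the contribution of each metric to the overall fitness. A plain weighted sum is assumed in the sketch below for illustration (Eq. (4.7) defines the exact form used during evolution), and the metric values shown are hypothetical:

```python
def combined_fitness(metrics, weights):
    """Weighted linear combination of the four visualization metrics.

    A plain weighted sum is assumed here for illustration; Eq. (4.7)
    defines the exact form used during evolution.
    """
    assert metrics.keys() == weights.keys()
    return sum(weights[name] * metrics[name] for name in metrics)

# Equal weights, as in the best-rated setting of the weight analysis;
# the metric values below are hypothetical.
score = combined_fitness(
    {"effectiveness": 0.8, "expressiveness": 0.6,
     "readability": 0.7, "interactivity": 0.5},
    {"effectiveness": 1.0, "expressiveness": 1.0,
     "readability": 1.0, "interactivity": 1.0})
```

Setting a weight to 0 removes that metric from the search entirely, which is consistent with the lower ratings observed along the zero-weight rows and columns of Figure 4.19.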
Table 4.7 Summary of user study scores
                        Visualization evolved through
Factor        Combined  Effectiveness  Expressiveness  Readability  Interactivity  Random  SoTA
Time
  Min.        8         8              8               10           10             10      10
  Max.        29        29             30              30           30             32      29
  Mean        16.72     16.84          17.57           17.78        17.31          19.4    18.17
  SD          5.08      4.83           4.82            3.6          4.85           5.2     4.46
Mental efforts
  Min.        1         1              1               1            1              1       1
  Max.        4         5              5               5            5              5       4
  Mean        2.4       2.68           2.67            2.93         2.67           3.08    2.60
  SD          0.97      0.86           0.99            0.83         0.99           0.97    0.95
Accuracy
  Min.        2         2              2               2            2              2       2
  Max.        5         5              5               5            5              5       5
  Mean        4.12      3.98           3.95            3.75         3.95           3.67    4.01
  SD          0.77      0.89           0.94            0.88         0.94           0.99    0.74
Score
  Min.        3         2              2               2            2              1       2
  Max.        5         5              5               5            5              5       5
  Mean        4.1       3.75           3.71            3.49         3.71           3.02    3.95
  SD          0.71      0.9            0.91            0.88         0.91           0.88    0.78
Table 4.8 Mean values for the dependent parameters
                        Visualization evolved through
Factor          Combined  Effectiveness  Expressiveness  Readability  Interactivity  Random  SoTA
Time            16.72     16.84          17.57           17.78        17.31          19.4    18.17
Mental efforts  2.4       2.68           2.67            2.93         2.67           3.08    2.60
Accuracy        4.12      3.98           3.95            3.75         3.95           3.67    4.01
Score           4.1       3.75           3.71            3.49         3.71           3.02    3.95
Efficiency      -0.46     -0.47          -0.85           -0.89        -0.99          -1.16   -0.51
Quality         -0.02     -0.01          0.02            0            -0.03          -0.07   -0.06
Figure 4.12 Dependent variables summaries
Figure 4.13 Scatterplot for mean accuracy
Figure 4.14 Scatterplot for mean effort
Figure 4.15 Scatterplot for mean score
Figure 4.16 Scatterplot for mean time
Figure 4.17 Box-plots of the dependent variables for each visualization
Figure 4.18 Mean values of the dependent variables
4.3.3.2. Analysis of variance and post hoc analysis
To confirm statistically significant variation between these visualizations, a
repeated-measures analysis of variance (ANOVA) is performed. The
ANOVA verifies the significant difference among the evolved, random, and
SoTA visualizations with respect to time, mental effort, accuracy, and
visualization score. The results in Table 4.9 suggest a statistically
significant difference among the visualizations regardless of other factors
(F(30, 2646) = 8.57, p < 0.0005, η² = 0.07). Similarly, when the
visualizations are combined with the tasks, there is still a significant
difference (F(120, 3253) = 5.20, p < 0.0005, η² = 0.16).
This statistical analysis shows a significant difference among the
visualizations for the dependent variable time, F(6, 665) = 13.3, p < 0.0005
(M=17.61, SD=4.82); for mental effort, F(6, 665) = 7.13, p < 0.0005
(M=2.7, SD=0.96); for accuracy, F(6, 665) = 7.15, p < 0.0005 (M=3.90,
SD=0.90); and for visualization score, F(6, 665) = 18.18, p < 0.0005
(M=3.67, SD=0.91). Taking the combination of the seven visualizations
and the five tasks, the results in Table 4.9 again suggest a significant
difference. The combined effect (interaction) of visualization and task,
shown in Table 4.9, is also significant with respect to all four dependent
variables: for response time, F(24, 665) = 5.91; for mental effort, F(24, 665)
= 3.0; for accuracy, F(24, 665) = 1.68; and for visualization score, F(24,
665) = 2.4.
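An F statistic of this kind compares between-group variance to within-group variance. The study uses a repeated-measures design; the sketch below shows the simpler between-groups one-way F from first principles, on hypothetical data, purely to illustrate how such statistics arise:

```python
from statistics import mean

def one_way_anova_f(*groups):
    """One-way ANOVA F statistic from first principles:
    F = (between-group mean square) / (within-group mean square)."""
    grand = mean(v for g in groups for v in g)
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total observations
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In practice one would use a statistics package (for example, a repeated-measures ANOVA routine) rather than computing this by hand; the sketch only makes the reported F values concrete.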
A post hoc analysis using the least significant difference (LSD) test is
performed for each dependent variable to investigate where the significant
differences between visualizations lie. The multiple-comparison test shows
a significant difference in mean time between the visualization evolved
using the combined fitness function and the random visualization
(p=0.0005). There is also a significant difference in mean time between the
visualization evolved using the combined fitness function and the SoTA
visualization (p=0.002). However, no significant difference is observed
between the visualization evolved using the combined fitness function and
those evolved using effectiveness (p=0.061), expressiveness (p=0.071), and
interactivity (p=0.15). There is no significant difference in mean time
between SoTA and the visualizations evolved using expressiveness (p=0.19),
readability (p=0.45), and interactivity (p=0.6).
Considering mental effort, the visualization evolved using the combined
fitness function differs significantly from all other visualizations at p < .05.
Moreover, the random visualization requires more mental effort to perform
the tasks than all other visualizations (p < .05), as shown in Table 4.10. The
mean mental effort exerted in the case of SoTA does not differ significantly
from that of the visualizations evolved using effectiveness (p=.41) and
expressiveness (p=.49). For accuracy, the visualization evolved using the
combined fitness function differs significantly in its mean value from the
random visualization (p=.0005). There is no significant difference in mean
accuracy between SoTA and the other evolved visualizations, except for
those evolved using the combined fitness function and readability.
The visualization evolved using the combined fitness function is also
statistically better than the random visualization based on mean efficiency
(p=.0005). However, the difference between the combined and the other
evolved visualizations is not significant (p > .05), except for the one evolved
for readability (p=.0005). With quality as a factor, the mean quality of the
visualization evolved using the combined fitness function differs
significantly from that of the random and SoTA visualizations (p=.040 and
p=.0005, respectively). These results lead us to reject the null hypothesis H0.
Table 4.11 shows the effect of visualization on the dependent variables. The
results suggest that the visualization type correlates positively with time
and effort and negatively with accuracy and visualization score. The
visualizations explain 55%, 19%, 15%, and 21% of the variance in time,
mental effort, accuracy, and score, respectively. All the aforementioned
results are significant at p < 0.0005, with the F statistics given in Table 4.11.
In addition to the F statistics, a non-parametric (distribution-free)
Kruskal-Wallis H test is used to check for statistically significant
differences in our experiment, and post hoc analysis is performed using the
Wilcoxon signed-rank test to find the statistical differences between the
visualizations in the case study. There is a statistically significant difference
in time among the visualizations, χ²(6) = 31.23, p < 0.001, with mean time
ranks of 280.34 for the visualization evolved using the combined fitness
function, 278.51 for effectiveness, 348.79 for expressiveness, 384.10 for
readability, 336.24 for interactivity, 415.19 for the random visualization,
and 376.20 for SoTA. For effort there is also a statistically significant
difference, χ²(6) = 28.68, p < 0.001, with mean effort ranks of 287.49 for
the combined fitness function, 378.19 for effectiveness, 340.60 for
expressiveness, 391.79 for readability, 340.48 for interactivity, 418.49 for
the random visualization, and 318.34 for SoTA. For accuracy the test shows
a statistically significant difference, χ²(6) = 36.46, p < 0.001, with mean
accuracy ranks of 418.50 for the combined fitness function, 365.94 for
effectiveness, 361.79 for expressiveness, 272.35 for readability, 361.79 for
interactivity, 306.47 for the random visualization, and 366.55 for SoTA.
Table 4.12 lists these results.
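The Kruskal-Wallis H statistic is computed from the ranks of the pooled responses rather than their raw values, which is what makes it distribution-free. The sketch below implements the statistic without the tie correction that a full statistics package would apply, so it is an illustration of the mechanics rather than a drop-in analysis routine:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic, without the tie correction that a
    full implementation would apply."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Assign each distinct value the mean of its (1-based) tied ranks.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        i = j
    # H = 12 / (n (n + 1)) * sum_i (R_i^2 / n_i) - 3 (n + 1)
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
```

With the study's 600 responses split over seven visualization groups, this statistic is referred to a chi-square distribution with six degrees of freedom, which is how the reported χ²(6) values and p-values are obtained.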
Table 4.9 ANOVA test result for dependent variables
                           Time             Mental efforts   Accuracy         Vis-score
Source              df     F       η²       F       η²       F       η²       F       η²
Visualization       6      13.33   0.13     7.13    0.06     7.15    0.06     18.18   0.14
Error               665    133.75           5.54             5.56             12.8
Tasks (only)        4      156.7   0.48     12.27   0.07     9.01    0.05     5.07    0.03
Error               665    1719             8.65             6.62             4.25
Visualization*Task  24     5.91    0        3       0.1      1.68    0.06     2.4     0.08
Error               665    63.03            2.35             1.3              1.68
*All results are taken at p < .001.
Table 4.10 Post hoc analysis
Factor F-statistics Visualizations* pairs with significant difference**
Time 13.3
(1,4)(.001), (1,6 )(.0005), (1,7 )(.002)
(2,3)(.0005), (2,4)(.0005), (2,5)(.002), (2,6)(.0005), (2,7)(.0005)
(3,4)(.0005), (3,6)(.0005)
(4,5)(.040),(4,6)(.014)
(5,6)(.0005)
(6,7)(.008)
Mental efforts 7.13
(1,2)(.026), (1,3 )(.032),(1,4)(.0005), (1,5)(.0032), (1,6 )(.0005),
(1,7 )(.041)
(2,4)(.045),(2,6)(.002), (2,7)(.020)
(3,4)(.035), (3,6)(.001)
(4,5)(.035),(4,6)(.044)(4,7)(.026)
(5,6)(.001)
(6,7)(.001)
Accuracy 7.15
(1,2)(.040), (1,3 )(.021), (1,4 )(.0005), (1,5)(.021), (1,6)(.0005)
(2,4)(.0005),(2,6)(.010)
(3,4)(.001), (3,6)(.021)
(4,5)(.001),(4,6)(.043)(4,7)(.0005)
(5,6)(.021)
(6,7)(.005)
Score 18.2
(1,2)(.004), (1,3 )(.001), (1,4 )(.0005),(1,5)(.001) (1,6)(.0005)
(2,4)(.033),(2,6)(.0005)
(3,6)(.0005)
(4,6)(.0005)(4,7)(.0005)
(5,6)(.0005)
(6,7)(.0005)
Efficiency 7.8
(1,4)(.0005),(1,6 )(.041), (1,7 )(.005)
(2,4)(.012),(2,7)(.0005)
(3,4)(.015), (3,7)(.0005)
(4,5)(.014),(4,6)(.001)
(5,7)(.0005)
(6,7)(.0005)
Quality 15.7
(1,4)(.0005), (1,5 ),(1,6 )(.040), (1,7 )(.0005)
(2,4)(.004), (2,7)(.0005)
(3,4)(.003), (3,7)(.0005)
(4,5)(.002),(4,6)(.0005),(4,7)(.0005)
(5,7)(.0005)
(6,7)(.0005)
* Combined (1), Effectiveness (2), Expressiveness (3), Readability (4), Interactivity (5), Random (6), SoTA (7)
** All tests are significant at alpha = 0.05
Table 4.11 Visualization’s effect on dependent variables
Predictor Dependent aspect Beta (β) R2 F-value*
Visualization Time 0.18 0.55 20.23
Mental efforts 0.2 0.19 21.04
Accuracy -0.17 0.15 18.36
Score -0.3 0.21 60.04
* F statistics significant at p < .005
Table 4.12 Non-parametric test (mean ranks)
                           Visualization type
          Chi-square* (χ²)  Combined  Effectiveness  Expressiveness  Readability  Interactivity  Random  SoTA
Time      31.23             280.34    278.51         348.79          384.1        336.24         415.19  376.2
Effort    28.68             287.49    378.19         340.6           391.79       340.48         418.49  318.3
Accuracy  36.46             418.5     365.94         361.79          271.35       361.79         306.47  366.6
Score     80.12             438.48    365.51         357.13          312.19       357.13         219.44  403.6
* For all cases p < 0.001
Table 4.13 Wilcoxon signed-rank test
                        Visualization type
Parameter  Test value  Combined  Effectiveness  Expressiveness  Readability  Interactivity  SoTA
Time       Z           -4.45     -5.98          -3.19           -1.61        -3.74          -1.8
           p           <.001     <.001          0.001           0.1          <.001          0.071
Effort     Z           -4.9      -3             -2.76           -1.6         -2.76          -3.08
           p           <.001     0.003          0.006           0.19         0.006          0.002
Accuracy   Z           -4.89     -2.51          -1.99           -1.01        -1.99          -2.87
           p           <.001     0.01           0.04            0.2          0.04           0.004
Score      Z           -6.57     -5.05          -4.83           -3.3         -4.83          -6.42
           p           <.001     <.001          <.001           0.001        <.001          <.001
4.3.4. Direct method
The evolved visualizations, the random visualization, and SoTA are also
compared using the direct method. Table 4.14 lists the values obtained for
the external evaluation criteria (Eq. (4.14) and Eq. (4.15)) for all these
visualizations. The efficiency of the visualizations is calculated using Eq.
(4.14), which takes into consideration the accuracy, time, and mental effort
required for these visualizations. A higher efficiency value indicates a more
efficient visualization. The results in Table 4.14 show that the visualization
evolved using the combined fitness function scores best on the external
criterion of visualization efficiency; the second best is the one evolved
using effectiveness, the visualization created using the SoTA tool is third
best, and the worst performing visualization on efficiency is the random
visualization. The second external metric is visualization quality, calculated
using Eq. (4.15), which takes into consideration the accuracy, visualization
score, time, and mental effort required by the visualization under
consideration. On the quality metric, the visualization evolved using the
combined fitness function performs second best, the visualization evolved
using effectiveness performs best, and the visualization evolved using
expressiveness performs worst.
Table 4.14 External metric values for the visualizations
Visualization Efficiency Quality
Combined -0.46 -0.02
Effectiveness -0.47 -0.01
Expressiveness -0.85 0.02
Readability -0.89 0
Interactivity -0.99 -0.03
Random -1.16 -0.07
SoTA -0.51 -0.06
Table 4.15 User study results for the evolved, SoTA, and seven other non-treemap visualizations
Subject ID   A      B      C      D      E      F      G      H      I
1            4      2      2      2      4      0      0      6      6
2            2      2      4      2      4      0      4      4      6
3            4      4      2      2      4      2      2      6      6
4            2      2      4      2      2      2      2      4      4
5            6      4      2      2      2      0      0      2      4
6            2      2      2      2      2      0      4      4      4
7            4      4      2      2      4      0      4      6      6
8            2      2      4      0      4      0      0      2      6
9            2      2      2      2      2      2      4      6      4
10           2      2      2      2      2      2      2      4      4
11           0      0      2      2      4      0      4      6      6
12           4      4      4      0      4      2      4      4      6
13           2      4      2      0      4      0      0      2      6
14           2      4      4      2      4      0      0      4      6
15           6      6      2      4      4      2      2      6      6
16           6      6      0      0      4      2      4      6      6
17           2      2      4      0      4      0      0      2      4
18           2      2      4      0      2      0      0      2      4
19           6      4      2      2      4      2      2      4      6
20           4      4      2      2      4      2      2      4      4
%-liked      53.33  51.67  43.33  25.00  56.67  15.00  33.33  70.00  86.67
Average      3.20   3.10   2.60   1.50   3.40   0.90   2.00   4.20   5.20
Median       2.00   3.00   2.00   2.00   4.00   0.00   2.00   4.00   6.00
Std. dev.    1.77   1.52   1.14   1.10   0.94   1.02   1.72   1.58   1.01
Visualization codes: A: Parallel coordinates, B: Sunburst, C: Circular packing, D: Line chart,
E: Pie chart, F: Scatterplot, G: Bar chart, H: Evolved, I: SoTA.
Rating scale: 6: very much useful, 4: useful, 2: neutral, 0: not useful.
4.4 Discussion
To study the usefulness of the evolved visualization against
non-treemap-based visualizations, a comparison is made with seven other
visualization techniques: parallel coordinates, sunburst, circular packing,
line chart, pie chart, scatterplot, and bar chart. The same data is used with
all the visualization techniques in this experiment. These seven
visualizations are created using a tool provided by datapine24 and are listed
in Appendix-B. A user investigation is performed to identify the most and
the least useful visualization. A total of 20 participants, chosen from target
groups, were engaged in this experiment. The participants were
graduate-level students with background knowledge of computer
programming, data visualization, and Java collection APIs. They
participated on a voluntary basis, were motivated to perform the task, and
had varying programming experience in terms of years. Each participant
was shown the seven other visualizations, the evolved visualization, and
the visualization created using the SoTA tool for an equal amount of time,
i.e., 18 minutes (2 minutes per visualization). They were then asked to rate
these on a scale of 0-6, where 0 indicates "not useful", 2 "neutral", 4
"useful", and 6 "very much useful". The visualizations were labelled with
character identifiers instead of their names. Table 4.15 lists the results of
this experiment. The results support the usefulness of the evolved
visualization compared with the other seven visualizations; the
visualization created using the SoTA tool is rated second best. A Friedman
test is performed on the user evaluation results to find a statistical
difference among the nine visualizations, and post hoc analysis is done
with the Wilcoxon signed-rank test to find the statistical differences
between the visualizations. The test shows a statistically significant
difference among the visualizations with respect to likeness, χ²(8) = 92.98,
p < 0.001, with median likeness of 2 for parallel coordinates, 3 for sunburst,
2 for circular packing, 2 for line chart, 4 for pie chart, 0 for scatter plot, 2
for bar chart, 4 for the evolved visualization, and 6 for SoTA. The
Wilcoxon signed-rank test results in Table 4.16 also suggest differences
between the evolved and the other visualizations; the table shows the z and
p values for each visualization in comparison with the evolved one. All
results are checked at a confidence level of 95% (alpha = 0.05). Figure 4.19
shows the boxplots for the user evaluation.
24 https://www.datapine.com/
Table 4.16 Wilcoxon signed-rank test results
Test value  Parallel coordinates  Sunburst  Circular packing  Line chart  Pie chart  Scatterplot  Bar chart  SoTA
Z           -1.93                 -2.23     -2.26             -3.95       -3.74      -3.99        -3.82      -2.48
p           0.043                 0.2       0.009             <.001       <.001      <.001        <.001      0.013
Figure 4.19 Boxplot for the user evaluation results
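The Friedman statistic used above ranks each subject's ratings across the nine visualizations and tests whether the rank sums differ more than chance would allow. The sketch below computes the statistic without the tie correction that a complete implementation would apply (relevant here, since the 0/2/4/6 ratings contain many ties), so it illustrates the mechanics rather than reproducing the reported χ² exactly:

```python
def friedman_chi2(rows):
    """Friedman chi-square from within-subject ranks (no tie correction).

    Each row holds one subject's ratings, one entry per visualization.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        ordered = sorted(row)
        for j, value in enumerate(row):
            first = ordered.index(value) + 1        # lowest tied rank
            last = k - ordered[::-1].index(value)   # highest tied rank
            rank_sums[j] += (first + last) / 2.0    # mean rank for ties
    # chi^2_F = 12 / (n k (k + 1)) * sum_j R_j^2 - 3 n (k + 1)
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
    return stat - 3.0 * n * (k + 1)
```

Applied to the twenty subject rows of Table 4.15 (k = 9 visualizations), the statistic is referred to a chi-square distribution with k - 1 = 8 degrees of freedom, matching the reported χ²(8).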
4.5 Variation in the EA settings
The EA can be executed with a variety of settings for each of its
components, such as the population size, mutation rate, crossover strategy,
and elitism scheme. The results in the preceding sections were obtained
with the optimal EA settings; this section covers some alternative settings
and their effect on the results. Because fitness evaluation is stochastic, each
individual is executed ten times and the average is taken as the
chromosome's fitness. Figure 4.20 shows the standard deviation (SD) of the
normalized fitness value when the fitness is averaged over six different
numbers of samples. For two, five, and eight samples the SD is higher; it
settles after ten samples, so we opt for ten samples for averaging. In the
experiments the number of
iterations for the EA is fixed to 1000. However, Figure 4.21 shows the
normalized (combined) fitness for around 1500 iterations using three
populations. The best solution is found before 1000 iterations, justifying
the stopping criterion in this case. The current solution uses an EA-based
approach; to be specific, a genetic algorithm (GA) is employed for
optimizing the visualization layout. Other optimization approaches could be
utilized for this task, Evolution Strategy (ES) being one. To demonstrate
this, Figure 4.22 shows the convergence results for the EA-GA and a
(1+1)-ES. Both approaches converge; however, the EA-GA converges more
quickly than the (1+1)-ES. The reason is the reliance of the ES on mutation
only; additionally, the ES replaces the parent chromosome only if the
mutated solution performs better. The aforementioned experiments use
random parent selection for the reproduction operation. There are other
options: a probabilistic procedure such as roulette-wheel selection can also
be utilized. Figure 4.23 shows the convergence results with random and
with fitness-proportional selection of individuals for the reproduction
operation. For proportional selection, the fitness values of all chromosomes
in the current population are summed and each chromosome is assigned a
relative fitness, calculated by dividing the chromosome's fitness by the total
fitness of the current population. A roulette wheel is then spun, giving
chromosomes with larger fitness a higher selection probability. The results
in Figure 4.23 indicate that with probabilistic selection the EA finds its best
solution quickly, i.e., within 450 iterations, as compared to random
selection. However, this decreases the selective pressure during the
remaining iterations, hindering the EA from finding even better solutions.
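The roulette-wheel procedure described above can be sketched as follows; the function and variable names are illustrative, not taken from the thesis implementation:

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Fitness-proportional (roulette-wheel) parent selection.

    Each chromosome occupies a slice of the wheel equal to its fitness
    divided by the population's total fitness; a random spin then picks
    a parent, so fitter chromosomes are chosen more often.
    """
    total = sum(fitnesses)
    spin = rng.uniform(0.0, total)
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if spin <= cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off
```

The bias toward high-fitness parents is what accelerates early convergence in Figure 4.23, and also what reduces selective pressure later on, once the population's fitness values become similar and the wheel's slices approach equal size.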
Figure 4.20 Number of samples for fitness averaging and their standard deviation
Figure 4.21 Number of iterations vs. convergence
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
2 5 8 10 15 20
Sta
nd
ard
de
via
tio
n
No. of samples for averaging
[Chart data omitted — Figure 4.22 plots normalized fitness vs. iterations for EA-GA and (1+1)-ES; Figure 4.23 plots normalized fitness vs. iterations for the EA with random and with probabilistic selection.]
Figure 4.22 Convergence using EA-GA and (1+1)-ES
Figure 4.23 EA with random and probabilistic selection
4.6 Discussion
This study used an EA for the optimization of visualization based on
quantitative assessments, i.e., the visualization metrics. The individual metrics
included effectiveness, expressiveness, interactivity, and readability. A combined
fitness function was built from the linear combination of these metrics; this
combined fitness function was then used to evolve a population of the EA. A user
study was designed to evaluate the effectiveness of the resultant visualizations
using benchmark tasks. The user study recorded several parameters, including
response time, mental effort, accuracy, and visualization score for each
participant on every task. The same experiment was performed for the five
evolved visualizations, the random visualization, and SoTA. Furthermore, two
external metrics, efficiency and quality, were computed from the standardized
z-scores of the dependent parameters using Eq. (4.14) and Eq. (4.15).
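The z-score standardization underlying these external metrics is the usual (x − mean)/σ transformation. A minimal sketch follows; the exact combination of z-scores into efficiency and quality, as defined by Eq. (4.14) and Eq. (4.15), is not reproduced here, and the class name is illustrative.

```java
// Standardized z-scores for one dependent parameter across participants:
// z_i = (x_i - mean) / standardDeviation. Assumes the values are not all
// equal (otherwise the standard deviation is zero).
public class ZScore {
    public static double[] standardize(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length); // population standard deviation
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) z[i] = (x[i] - mean) / sd;
        return z;
    }
}
```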
Analysis of the user study shows that there is a significant difference among
the visualizations, both numerically and statistically. This is depicted in Table
4.7 and in the boxplot in Figure 4.17, which shows the mean values of the
dependent variables. The visualization evolved using the combined fitness
function performs better in all aspects when compared with the random
visualization. Moreover, the visualization built with the SoTA tool also
performed well compared to the random visualization. However, this is not
the case for all the visualizations evolved using an individual metric in
isolation. Nevertheless, the time taken by the random visualization (M=19.4)
is greater than that of SoTA (M=18.17) and of the visualization evolved using
the combined fitness function (M=16.72).
The statistical tests' results also show a significant difference between the
visualizations evolved using the fitness functions proposed in this work and
the visualization created using the state-of-the-art tool. As shown by the
ANOVA test, the visualization built from the combined fitness function
performs better on the dependent variables. Its significant difference from
the random visualization is greater than that of the other evolved
visualizations. The visualizations evolved with fitness functions other than
the combined one also perform better in some aspects when compared with
the random visualization and SoTA. The results further show that differences
also exist between the evolved and random visualizations across the
benchmark tasks.
The post hoc analyses reveal that the visualization evolved using the
combined fitness function performs better than all other visualizations
considered in the experiment. It achieves higher accuracy with less mental
effort and requires less time. The visualizations evolved with fitness
functions other than the combined one perform better than the random
visualization; however, they perform comparatively worse than SoTA in
various aspects.
The proposed visualization metrics and the optimization approach can also
be utilized for visualization techniques other than the treemap. The individual
metrics (effectiveness, expressiveness, readability, and interactivity) are devised
keeping in view generic visualization aspects and can thus be adopted
without significant modification. However, for other visualization techniques,
the chromosome encoding and the mapping function require customization
depending on the particular technique's features.
4.7 Chapter Summary
The work in this chapter proposed a set of visualization metrics to evaluate
visualization techniques. The proposed metrics were based on a
comprehensive literature survey and included effectiveness, expressiveness,
readability, and interactivity. Experiments demonstrated the metrics' impact
on the aesthetic and perceptual aspects of visualization. The work employed
an evolutionary algorithm (EA) to optimize the layout of a visualization
technique. The aforementioned visualization metrics were combined to form
a fitness function for the EA. The treemap visualization was employed as a
case study for the layout optimization task. The EA evolved five
visualizations using effectiveness, expressiveness, readability, interactivity, and the
combined fitness function. These five evolved visualizations were compared
with a randomly created visualization and a visualization created using a
state-of-the-art treemap visualization tool. The comparison was made using
internal and external evaluation metrics. A user study was also conducted on
the evolved visualizations using benchmark tasks, followed by an analysis of
variance test. The results suggest the effectiveness of the proposed
visualization metrics and of the EA-based approach for optimizing treemap
layouts. The visualization evolved using the combined fitness function was
more effective than the visualizations optimized for effectiveness, expressiveness,
readability, and interactivity in isolation. All evolved visualizations performed
better than the randomly created one, since the randomly created
visualization made no reference to aesthetic and perceptual features.
Analysis of the user study showed a significant difference among the
visualizations, both numerically and statistically. The visualization evolved
using the combined fitness function also achieved higher accuracy with less
mental effort and required less time. It is observed that each individual
criterion for gauging visualization quality plays an important role; however,
when combined together they produce better visualizations. This produces a
visual layout that can be effective, expressive, readable, and interactive at the
same time. There can be situations where the problem domain may require a
visualization to be more effective than expressive, or may not require
interactivity. In such cases, the weights for effectiveness, expressiveness,
readability, and interactivity in the combined fitness function can be set to a
lower value, or a metric can be ignored altogether by assigning it a weight of
zero. This provides a general framework for quantifying and optimizing
visualization layouts. The next chapter will present the third contribution of
this dissertation by discussing and elaborating dynamic code analysis and
visualization of collection APIs for Java programs.
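The weighting scheme described above can be sketched as a simple linear combination. The weight values in the example are illustrative only, not the ones used in the experiments.

```java
// Combined fitness as a weighted linear combination of the four metrics.
// Setting a weight to zero drops that metric from consideration, as
// discussed in the summary. Illustrative sketch.
public class CombinedFitness {
    public static double combine(double effectiveness, double expressiveness,
                                 double readability, double interactivity,
                                 double w1, double w2, double w3, double w4) {
        return w1 * effectiveness + w2 * expressiveness
             + w3 * readability + w4 * interactivity;
    }
}
```

For a domain that does not require interactivity, for instance, the last weight can simply be set to zero.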
Chapter 5 Visualizing of Traces of Java Program
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 151
Visualizing Trace of Java Collection
APIs by Dynamic Bytecode
Instrumentation
“We're entering a new world in which data may be more important than
software.” — Tim O'Reilly
Object-oriented languages help software developers use
dynamic data structures that evolve during program
execution. However, program comprehension and
performance analysis necessitate an understanding of the
data structures used in a program, particularly of which
application programming interface (API) objects are used
during a program's runtime. This chapter aims to give a
concise visualization of a program's code and to provide the
user with an interactive environment to explore details. It
presents an interactive visualization tool. A given program is
tracked during execution and data is recorded into a log
file. The log file is then converted to XML format, which
proceeds to the visualization component. The visualization
provides a global view of the usage of collection API
objects at different locations during program execution. An
empirical study is conducted to evaluate the impact of the
proposed visualization on program comprehension. The
experimental group, on average, completes the tasks in
approximately 45% less time than the control group.
Results show that the proposed approach enables
programmers to comprehend more information with less effort
and time. Performance of the proposed approach is also
evaluated using twenty benchmark software tools. The
proposed approach helps the developer understand Java
collection API object usage and assists in program comprehension and maintenance.
Table 5.1 An example of collection API objects analysis using clustering

Cluster No. | Objects created | Objects destroyed | Objects invoked | Total invocations | Code locations
1 | 12 | 10 | 10 | 991 | 5
2 | 329 | 328 | 328 | 74928 | 6
3 | 2867 | 2867 | 2867 | 10001 | 3
4 | 2767 | 2767 | 2767 | 8301 | 2
5 | 2424 | 2418 | 2 | 27 | 2
6 | 24 | 16 | 16 | 3130 | 17
6 24 16 16 3130 17
Modern software tools have become more complex due to ever-increasing
functionality and the complex interaction of components. Static analysis of
software no longer presents the best picture, since an object's runtime
behavior may be substantially different. This is especially true for
object-oriented software due to its intrinsic nature, i.e., inheritance,
polymorphism, and dynamic binding. These properties make object-oriented
programs understandable; however, the behavior of the program becomes
complex [30]. The performance of objects during runtime shapes the behavior
of a program. Developers are keen to know how objects evolve during
program execution. Object-oriented software makes vast use of objects of
different data structures, such as the Java collection APIs, which makes the
code difficult to understand. Programmers are interested in optimizing their
code to make it efficient. Cases where the code size is huge and objects from
multiple other classes are instantiated make it difficult to understand the
code. Additionally, programmers want to know the number of objects created
for a particular class, and their hierarchy, while the code executes. This is
useful in optimizing the code of frequently used classes. Modern
object-oriented software is built using different components, like third-party
libraries accessible through APIs. Inefficient data structures and API usage
degrade program performance [167]. Developers must know the APIs used in
a program during execution and how their objects evolve.
Large program code normally uses collection APIs to handle various
features. Although the use of these collection APIs has the advantage of
extensibility, a large number of collection API objects may cause the code to
consume more memory and/or time, which may degrade program
performance. Therefore, for better performance, the programmer must know
the locations in the source code where objects are created, in order to
optimize API usage [20].
The work in [20] identifies performance issues of a program based on the
usage, location, and relevance of API objects. Table 5.1 (extracted from an
experiment in [20]) shows that clusters 1, 2, and 4 behave normally, based on
their object creation and usage. However, cluster 5 creates a large number of
objects with almost no methods invoked on these objects during their
lifetime. This makes cluster 5 an ideal example of object locations on which
the developer needs to focus to optimize the code.
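A check of the kind applied to Table 5.1 can be sketched as follows. The thresholds are illustrative assumptions, not criteria taken from [20].

```java
// Flags clusters like cluster 5 in Table 5.1: many objects created but
// (almost) none ever invoked, i.e. candidate locations for optimization.
// The thresholds ("large" creation count, "tiny" usage ratio) are
// illustrative only.
public class ClusterCheck {
    public static boolean suspicious(int created, int invoked) {
        return created > 100 && invoked < created / 100;
    }
}
```

Applied to Table 5.1, cluster 5 (2424 created, 2 invoked) is flagged, while cluster 2 (329 created, 328 invoked) is not.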
To support program comprehension, most of the existing approaches [18,
30] are based on static analysis and thus do not provide a complete picture.
Visualization is an effective technique used for program comprehension [168,
169]. In this context, treemap [43] is a powerful visualization method used for
hierarchical information visualization [45]. The use of visualization can
simplify the evaluation and detection of API usage. In particular, such a
visualization can give the developer the best picture for optimizing the code
when it is based on runtime information about the software. This chapter
addresses the problem of program comprehension using visualization.
The chapter presents a treemap-based visualization tool to visualize Java
collection framework objects based on dynamic data. The tool presents a
global view to the developer of the objects' locations and states. Using the
proposed approach, the developer can inspect where an object was created in
a program. The information can be viewed at different levels, i.e., package,
class, or method. The approach is evaluated using twenty benchmark software
tools. The evaluation is based on the delay caused by instrumentation and on
internal evaluation metrics, where the results show good performance.
5.1. Proposed System for Java Tree Visualization
This section elaborates the proposed system. An overview of the system is
given first; it is then elaborated in three steps. 1) Instrumentation: this step
deals with the development of instrumentation code to extract runtime
information from a Java program. 2) Data collection and analysis: this phase
records the significant data collected during program execution under the
instrumentation code. 3) Visualization: this step deals with the final data
visualization using a treemap, providing different views of the recorded data.
A mathematical formulation of the proposed solution is developed as follows.
The set of collection API objects in a particular Java application is
represented using Eq. (5.1):

$O = \{(p, c, m, t) \mid p \in Package,\ c \in Class,\ m \in Method,\ t \in Collection\}$  (5.1)

where $p$ is an item from the set of Packages, $c$ from the set of Classes, $m$
from the set of Methods, and $t$ from the set of Types. The sets Package, Class,
Method, and Collection are represented using Eq. (5.2), Eq. (5.3), Eq. (5.4), and
Eq. (5.5), respectively:

$Package = \bigcup_{i=1}^{n} p_i$, where $p_i$ is a package of the application  (5.2)

$Class = \bigcup_{j=1}^{|Package|} c_j$, where $c_j \in p_i,\ p_i \in Package$  (5.3)

$Method = \bigcup_{k=1}^{|Class|} m_k$, where $m_k \in c_j,\ c_j \in Class$  (5.4)

$Collection = \bigcup_{l=1}^{r} t_l$, where $t_l$ is a collection API type  (5.5)

Having all these sets, we can represent all the collection objects in a
particular Java application using Eq. (5.6) and Eq. (5.7):

$O_{App} = \bigcup_{p \in Package} \bigcup_{c \in Class} \bigcup_{m \in Method} O(p, c, m)$  (5.6)

$|O_{App}| = \sum_{i=1}^{|Package|} \sum_{j=1}^{|Class|} \sum_{k=1}^{|Method|} |O_{ijk}|$  (5.7)
5.1.1. Java traces visualization system overview
As stated earlier, the proposed system consists of three steps, shown in
Figure 5.1. In the instrumentation phase, we have written code to extract the
required information from a Java program during runtime. Instrumentation is
an effective technique for dynamic analysis. A selective instrumentation
approach is utilized to minimize the performance overhead: not all sections of
the program are tracked, but only those methods and lines where an object is
instantiated. The original bytecode of a targeted program remains unchanged,
and the probe is inserted only during load time of the class. This approach
minimizes the instrumentation overhead. As a code block of the target
program executes, it generates information about its runtime behavior. This
information is handled by the utility code and recorded into a log file. Various
key features are collected in the log file, including object type, method name,
package name, thread, timestamp, and line number. The log file is then
converted to an XML tree format and proceeds as input to the visualization
component. The XML tree structure consists of a dummy root node, branches,
and leaf nodes. The transformation of the text file into an XML tree starts by
building the tree with the dummy root node and then adding each branch to
the root, down to the leaf level. The visualization part is used to depict the
treemap-based visualization. The system is implemented in Java using
Eclipse IDE 4.3.

For instrumentation, we focus on the locations where an object is
instantiated, through analysis of runtime data. During program execution,
the information is generated and stored in the log file, which is later used to
build the visualization. Selective instrumentation is used to avoid any major
degradation of performance in the targeted program.
Figure 5.1 Overview of the system for Java traces visualization
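The log-to-tree transformation described above can be sketched as follows, assuming the record layout of the log excerpt shown with Figure 5.2. The TraceTree class and the nested-map representation are illustrative stand-ins for the XML tree actually produced by the system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds the hierarchy (dummy root -> package -> class -> method -> type)
// from log records. Field order follows the log excerpt of Figure 5.2:
// event,thread,timestamp,objectId,objectType,owningClass,method,line.
// Simplified sketch: a nested map stands in for the XML tree.
public class TraceTree {
    public final Map<String, TraceTree> children = new LinkedHashMap<>();

    public TraceTree child(String name) {
        return children.computeIfAbsent(name, k -> new TraceTree());
    }

    // Insert one object-creation record (event code 1) under the dummy root.
    public void addRecord(String line) {
        String[] f = line.split(",");
        if (f.length < 8 || !f[0].equals("1")) return; // creation records only
        String objectType = f[4];                      // e.g. java.util.Vector
        String fqcn = f[5];                            // e.g. org.gjt.sp.jedit.jEdit
        String method = f[6];
        int dot = fqcn.lastIndexOf('.');
        String pkg = dot < 0 ? "(default)" : fqcn.substring(0, dot);
        String cls = dot < 0 ? fqcn : fqcn.substring(dot + 1);
        child(pkg).child(cls).child(method).child(objectType);
    }
}
```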
5.1.2. Instrumentation
The proposed system first requires instrumentation of the classes. The
instrumentation code is built using the Byte Code Engineering Library (BCEL)
[170] to parse the class files. Some alternatives for this purpose are also
available, including third-party tools like ASM [171], the Java virtual machine
profiler interface (JVMPI), and the Java virtual machine tool interface (JVMTI).
BCEL provides flexible control over the instrumentation process and may also
be used with different JVM implementations. There are two key packages in
the instrumentation code built on BCEL, instrumentator and utility. These
packages work together but have different functions. The instrumentator
package contains classes used to add probes to the bytecode prior to execution
and to report when the events of interest occur. The utility package classes do
the supplementary work of monitoring the interesting events and recording
the information reported by the instrumentator classes. The utility classes are
also responsible for generating the log file that stores the collected information.

One of the problems with dynamic analysis is the large amount of data
generated during program execution. We use selective tracking to minimize
the runtime overhead and restrict the size of the log file. Our tracking classes
insert probes only at specific locations, like object creation, object destruction,
and method entry. Only two kinds of collection API functions, mutators and
accessors, are tracked. Mutator methods change object state, either by
modifying the values of private fields or, in the case of the add and remove
methods, by changing the number of elements (size) of a collection object.
Accessor methods, in contrast, read the private fields of an object when a
method is invoked on it. Since the focus is on the object instantiation
location, these two kinds of methods capture the changes of state of the
objects.
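What an inserted probe reports can be sketched as a call into a utility-style logger like the following. This is an illustrative stand-in for the utility-package code the probes call, not the actual BCEL-generated bytecode; the class name and formatting details are assumptions.

```java
import java.util.UUID;

// Illustrative stand-in for the logger called by probes inserted at
// class-load time. A probe added after each 'new' of a collection type
// would invoke something like logCreation(...) to emit one log record.
public class Probe {
    public static String logCreation(String thread, String objectType,
                                     String ownerClass, String method, int line) {
        String id = UUID.randomUUID().toString(); // unique id per object
        // Event code 1 = object creation, matching the log format of Figure 5.2
        // (a real implementation would format the timestamp as a date string).
        return "1," + thread + "," + System.currentTimeMillis() + "," + id + ","
             + objectType + "," + ownerClass + "," + method + "," + line;
    }
}
```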
5.1.3. Data Collection
Once the instrumentation classes are ready, the next step is data collection
and analysis. A target application is executed under the control of the
instrumentation code. The code adds the additional probes to the Java
bytecode and the program then executes. The interesting events are written
into the log file. A probe is inserted at each respective site; for example, after
each new operator a probe reports the object creation, and a unique hashcode
is written to the log file for the newly created object. The specific features we
extract through instrumentation are: 1) object unique identifier, 2) object
owner thread, 3) object type, 4) time of event, and 5) object status. Some static
information is also collected, which includes: 1) source code line number, 2)
package, 3) class, and 4) method name for the object. Figure 5.2 provides a
snapshot of the log file. Extracting a program's runtime information has the
advantage of precision; however, the information may not be complete and
may only show some particular aspects. The information can be extracted
only for those classes which are loaded at the time and whose code is being
executed.
5.1.4. Visualization and user interaction
The key component of the proposed system is the visualization of collection
APIs. As mentioned earlier, we utilize the treemap visualization technique.
The treemap's space-filling approach makes it possible to show a large
amount of information (millions of objects) on a single screen. The system is
implemented in Java, where the graphical user interface (GUI) is built using
the Java Swing toolkit. The treemap visualization is implemented using the
layout presented in [51] with the Prefuse toolkit [172]. Prefuse is a powerful
tool for interactive visualization, with support for different file formats.
Among the advantages of using Prefuse are its object-oriented design and its
support for various layout algorithms. The treemap is used with labels, where
each treemap node is decorated with a label at the top corner, except for leaf
nodes. The leaf nodes carry no label, which gives the treemap a clear look.
For a better look and distinction between the nodes, a border is applied to
each node. The treemap starts with a dummy root, shown at the top (root) of
the treemap. This visualization has two objectives: first, to give a compact
view of all information at a single glance, and second, to provide the user
with an interactive environment to explore further details. The color shows
the level of a node: it gets lighter from the root toward the leaf nodes. The
RGB color scheme is used for coloring. The same base color is used for all
nodes, as our aim is to reveal the hierarchical structure of object creation.
Figure 5.2 Segment of log file
Since the developer is interested in the program locations where objects are
instantiated, the treemap-based visualization presents this information
hierarchically. The hierarchy descends from package to class to method. A
dummy root node is at level 0, followed by package at level 1, class at level 2,
method at level 3, and type at level 4. Finally, the objects are at the leaf
nodes of the treemap.
5.1.4.1 Global view
The system shows a big picture of the whole information to the user at a
glance. The user may interact with this view to find details, and the data can
be filtered into different levels. The APIs can be viewed in a hierarchy from
package to method level. In this view the user can overview the entire
information on one screen. Figure 5.3 shows the global view. The root node
in Figure 5.3 represents a dummy node to start with and is the parent of all
other nodes. The leaf nodes shown in this view represent the objects of each
collection type. The objects are grouped in the view by their type.

(Figure 5.2, excerpt of the log file; each object-creation record lists event
code, thread, timestamp, object id, object type, owning class, method, and
line number:)

1,main,2011-09-13 17:48:02,ab014ef2-9672-4638-a856-80e705035f09,java.util.Vector,org.gjt.sp.jedit.jEdit,<clinit>,3091
3,main,2011-09-13 17:48:02,ab014ef2-9672-4638-a856-80e705035f09,org.gjt.sp.jedit.jEdit,main,115
1,main,2011-09-13 17:48:02,ea33cf16-4a41-4835-9017-eba563cae8ad,java.util.Vector,org.gjt.sp.jedit.io.VFSManager,<clinit>,460
1,main,2011-09-13 17:48:02,dafbc723-26a7-4cad-afdb-200da8d6982e,java.util.LinkedList,org.gjt.sp.jedit.EditBus$HandlerList,<init>,399
1,main,2011-09-13 17:48:02,b9e1c609-4caf-472b-8698-702b2f225372,java.util.LinkedList,org.gjt.sp.jedit.EditBus$HandlerList,<init>,400
1,main,2011-09-13 17:48:02,f0093e4d-d135-4d89-b461-a4f19dd6fd15,org.gjt.sp.jedit.EditBus$HandlerList,org.gjt.sp.jedit.EditBus,<clinit>,197
4,main,2011-09-13 17:48:02,f0093e4d-d135-4d89-b461-a4f19dd6fd15,org.gjt.sp.jedit.EditBus,addToBus,140
4,main,2011-09-13 17:48:02,f0093e4d-d135-4d89-b461-a4f19dd6fd15,org.gjt.sp.jedit.EditBus$HandlerList,addComponent,394
1,main,2011-09-13 17:48:02,15a96c0a-aab7-4f71-a8a4-b8f16accd1c8,java.util.LinkedList,org.gjt.sp.jedit.EditBus$HandlerList,safeGet,297
3,main,2011-09-13 17:48:02,15a96c0a-aab7-4f71-a8a4-b8f16accd1c8,org.gjt.sp.jedit.EditBus$HandlerList,addComponent,394
1,main,2011-09-13 17:48:02,586eb2e9-75e1-4a61-9d34-a7cf15f70106,java.util.Hashtable,org.gjt.sp.jedit.io.VFSManager,<clinit>,463
1,main,2011-09-13 17:48:02,76271a65-e332-4cda-95b5-2eb60a51e3d2,java.util.Hashtable,org.gjt.sp.jedit.io.VFSManager,<clinit>,464

The view also supports tooltips, which appear as the mouse moves over a
treemap node and show the important information about that node. The user
interface always shows each collection API with the number of objects
created for it. One issue with this treemap is that larger nodes cover more
area and can be seen more clearly than the smaller nodes. As shown in
Figure 5.3 (a), with an increase in the number of inner nodes the respective
rectangles get smaller and hence produce a space-efficient visualization.
Although we have inserted probes at both the object creation and object
destruction sites, the visualization shows all objects created during program
execution, because the developer is primarily interested in the objects, and
their locations, that remain in memory for a longer time.
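The per-type object counts shown next to each collection API in the global view can be sketched as a simple aggregation. The class and method names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

// Counts objects per collection type: the figure shown next to each
// API in the global view. Illustrative sketch over an array of type
// names taken from the log records.
public class TypeCounts {
    public static Map<String, Integer> count(String[] objectTypes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : objectTypes) counts.merge(t, 1, Integer::sum);
        return counts;
    }
}
```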
5.1.4.2 Interactivity and sub views
The system also supports interactivity. Interaction is available through the
provided interface to explore detailed information and gain insight. The user
can zoom into a node by clicking it. For example, to overview the different
collection API objects in a particular package, the treemap will, on a mouse
click, show only that particular node and its sub-nodes; the information of all
other nodes becomes invisible. Similarly, the user can see information about a
node by bringing the mouse over it. Figure 5.4 presents a package-wise view
of the visualization. The user can explore the information on the basis of its
type, package, class, and thread. A user exploring the hashtable collection,
for instance, is able to see information about this collection, such as the
number of objects created so far and the packages creating these objects.
Using this information, the user can deduce the frequently used collection
APIs and the classes responsible for the creation of these objects. The
developer may use this information to improve program performance and
address maintenance issues. Figure 5.5 shows another view of collection API
objects, based on the mutator methods called by different objects during
program execution. Each small inner rectangle represents the objects that call
a particular method. The user interface provides a text-based search facility to
find specific information in a particular visualization. The search facility is
available in various views of the system. Figure 5.6 shows the search result
for HashTable objects, in medium purple, which were used by different
classes of jEdit [173].
5.2. Case study
We take several open-source Java software tools, execute them under our
instrumentation program, and collect the execution traces of each. Table 5.2
lists the key features of the log file generated by each program run for 200
seconds on a 2.93 GHz Intel Core i3 system. Figure 5.7 shows a visualization
generated from all programs collectively, with the number of objects created
for each collection API. Appendix-B contains the visualizations of the ten
software tools listed in Table 5.2. Collecting events over a longer time is also
possible, provided large storage is available. Table 5.2 shows that the ten
programs, executed for the same duration, produce varying log file sizes,
object counts, and other attributes. This is because the internal
implementation of each program is different and has no direct correlation
with the running time.
Chapter 5 Visualizing of Traces of Java Program
Selecting, Quantifying, Optimizing, and Understanding Visualization Techniques: A CI-Based Approach 162
Table 5.2 Log file details

Program | Running time (Sec.) | Log file size (MBs) | Objects created (Approx.) | Accessor methods called (Approx.) | Mutator methods called (Approx.)
Prefuse | 200 | 110 | 54 | 1045600 | 16636
Browser | 200 | 63 | 230 | 523530 | 63290
Eclipse | 200 | 20 | 7765 | 78400 | 37790
JMoney | 200 | 15 | 11935 | 54200 | 31965
JEdit | 200 | 12 | 2170 | 59800 | 4500
freemind | 200 | 10 | 11900 | 55000 | 936
Fire | 200 | 10 | 4660 | 38100 | 18957
M3D | 200 | 06 | 40 | 71 | 57200
JHotdraw | 200 | 03 | 1670 | 8770 | 24058
jImage | 200 | 0.4 | 80 | 2540 | 680
Figure 5.3 Visualization main view
Figure 5.4 Visualization Package-wise view
Figure 5.5 Mutator methods view
Figure 5.6 Search result for HashTable
Using the proposed visualization, a comparison of different programs
based on their use of APIs is performed. Figure 5.8 (a) shows the visualization
for the M3D program, which uses only two types of collection objects, i.e.,
vector (19 objects) and stack (17 objects). Figure 5.8 (b) shows that the
JMoney program creates over 7000 arraylist objects and almost 3000 objects
of hash types. Other rectangles show the HashMap and HashSet collection
APIs, respectively, with over 800 leaf nodes inside each rectangle. Figure 5.9
shows the objects created in Eclipse and the mutator methods. Through the
visualization it is found that Eclipse creates over 7000 collection objects; this
program therefore makes extensive use of such collection objects. Table 5.3
summarizes the collection API objects created during the execution of the
test cases.
5.3. Performance evaluation and comparison
To evaluate how well the system supports the programmer in
understanding collection API usage, we conducted multiple experiments.
Since the empirical method is an effective form of quantitative evaluation of
information visualization tools [55], we used the same. A controlled
experiment was designed to evaluate the effectiveness of the visualization
tool in tasks related to understanding API usage in large programs. The
objective of the experiment is to measure the impact of the proposed
visualization on better software/program comprehension in less time. We
evaluated the proposed approach using twenty benchmark software tools.
The evaluation is based on the delay due to instrumentation and an internal
evaluation metric. The overhead time and the slowdown factor are measured
to investigate the performance of the instrumentation code. These
experiments are performed for both the target applications and the
benchmark tools.
Table 5.3 Collection APIs per program (number of objects created)

Program   Hashtable  ArrayList  HashMap  LinkedList  HashSet  Vector  Stack  StringBuffer  Other
JEdit     908        310        280      300         -        250     201    -             -
JHotdraw  -          1616       -        -           -        -       -      -             50
Prefuse   -          24         6        -           -        -       -      -             30
jImage    12         2          10       -           -        37      -      -             12
Browser   -          139        -        -           -        -       -      -             60
JMoney    1240       7880       890      -           980      -       -      -             2
Eclipse   -          1235       2378     -           950      -       -      -             100
Fire      -          1270       2320     -           957      -       -      -             70
freemind  -          -          -        -           -        -       -      10550         50
M3D       -          -          -        -           -        19      17     -             -
Figure 5.7 All programs visualization
Figure 5.8 Collection APIs objects usage (a) M3D (b) JMoney
Figure 5.9 Eclipse objects vs. mutators calls
5.3.1. Experiment Design
This experiment is used to quantitatively evaluate the effectiveness of the
visualization, which helps the developer identify the various collection APIs
used during program execution. At the same time, the evaluation also
examines the insight provided by the tool for finding new information. To
this end, the following research questions were formulated.
1. Does the use of the proposed visualization reduce the time needed
to find collection APIs in a program?
2. Does the use of the proposed visualization reveal the hierarchy
(package, class, method) for objects of the respective APIs?
3. Can a developer determine the APIs and the number of objects for
each API?
4. Does this information help the developer in program comprehension?
Two null hypotheses are devised. The hypotheses with their respective
alternates are given in Table 5.4.
Table 5.4 Null hypotheses with their alternatives

Null Hypothesis                                      Alternate Hypothesis
H10: The tool does not reduce the time to            H1: The tool reduces the time to understand the
understand the collection APIs usage in a program    collection APIs usage in a program
H20: The tool is not useful in comprehension tasks   H2: The tool is useful in comprehension tasks
A. Subjects
A total of 24 subjects participated in the controlled experiment; they were
chosen from the target groups. The demographics of these subjects are listed
in Table 5.5. The subjects were master- and bachelor-level students with
background knowledge of computer programming and the Java collection
APIs; however, they had no experience with the proposed visualization. The
subjects were selected on a voluntary basis and were motivated for the
tasks. They were divided into two groups, an experimental group and a
control group. The experimental group was provided with the visualization
tool, while the control group was not.
Table 5.5 Demographics of the subjects
Characteristics Control Group Experimental Group
Age (years) 21-29 21-29
Education (years) 15-18 15-18
Programming experience (years) 1-5 1-5
B. Object System and Tasks
A large number of open-source software tools is available; the ten tools
listed in Table 5.2 were selected. One of these ten tools is the popular text
editor jEdit. It consists of around 900 classes in 32 packages, with
approximately 5,000 methods. The primary rationale behind the selection of
jEdit is its popularity among Java developers as well as the availability of
its source code. In addition, nine other tools are also used in evaluating the
proposed visualization. The subject systems (mentioned in Table 5.2) are
open-source tools/applications, and we put our tracking classes in their
application path before executing them. The probe is inserted into an
application's classes at load time and does not change the application's
original bytecode. As the program executes, the probe also runs and
generates event-based information, which is stored in a log file.
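Load-time probe insertion of this kind is typically done with a `java.lang.instrument` agent registered through the `-javaagent` flag. The skeleton below is a hypothetical sketch, not the dissertation's actual instrumentation code: this transformer only records which classes of interest are loaded and returns null, which leaves the original bytecode unchanged; real probe insertion would rewrite the class bytes (e.g., with a bytecode library such as ASM or Javassist).

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Skeleton of a load-time instrumentation agent (hypothetical names).
// Launch with: java -javaagent:probe.jar -jar application.jar
public class CollectionProbeAgent {

    // Called by the JVM before the application's main method runs.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer());
    }

    public static class LoggingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // Only classes of interest (e.g., java/util collections) would be
            // rewritten; this sketch merely emits an event for them.
            if (className != null && className.startsWith("java/util/")) {
                System.out.println("event: loaded " + className);
            }
            return null; // null means the original bytecode is left unchanged
        }
    }
}
```

Returning null from `transform` is the standard way to opt out of modifying a class, which matches the text's point that untargeted classes keep their original bytecode.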
The basic concern of the tasks in our experiment is whether the subject is
able to find and identify the collection APIs used in a software tool. The
subject should also be able to understand the hierarchy of package, class,
and method for a particular collection API object. The respective tasks with
descriptions are listed in Table 5.6. The tasks are presented to both
(experimental and control) groups as open questions. These tasks are mostly
related to program understanding and analysis through collection API
objects. The tasks include: a) finding the collection APIs used in a program,
b) identifying the most used collection type, and c) identifying the
package/class in a program where a particular type of collection object was
created.
Table 5.6 Task description
Task Description
T1 Identify the collection APIs used in a program
T2 Identify the packages and classes of collection APIs in a program
T3 List the 3 classes responsible for creating the most objects of a particular API
T4 Identify the methods that create the maximum number of collection API objects
C. Experiment Procedure
The experiments were performed in different sessions with both groups,
on computer systems with similar specifications. The subjects were given
time to familiarize themselves with the task and the environment. The
experimental group was provided with the tool, while the control group
used only the IDE and the textual data to extract the information. Thus, the
visualization was available to the experimental group, while the control
group had no such facility. The subjects were asked to find the collection
API usage; with the tool, they could easily see the collections and their
usage across various packages and classes. The experiment's data were
collected and recorded for further analysis and evaluation. We performed
the experiments for both hypotheses simultaneously.
The first experiment is related to the hypothesis that "the user can
understand collection API usage easily using the proposed tool and
reduce the time required for this task." The subjects were given the task to
understand collection API usage. The independent variable in the
experiment was the visualization tool, while the dependent variables were
the time taken to complete the task and its accuracy. The experimental
group quickly identified the collection APIs used in the program through
the visualization. The control group took more time and was not able to
identify all the collection API usages. To evaluate the second hypothesis,
we recorded the score given by each subject after a particular task was
performed. The score is on a scale of 0-5, where 0 stands for "not useful"
and 5 stands for "very useful".
D. Variables Analysis
The experiment involves several independent and dependent variables.
The independent variables are the availability of the proposed visualization
tool and the size of the system under consideration. The dependent variables
are the completion time of a particular task and the per-task usefulness
score provided by the subjects. Table 5.7 shows the time taken and the
usability points for each subject of the experimental group. Table 5.8 shows
the same for each subject of the control group. The task analysis, shown in
Table 5.9, describes per-task statistics for the experiment.
Table 5.7 Experimental group statistics for time and usability score (0-5)
Subject  Task1 time (min.)  Task2 time (min.)  Task3 time (min.)  Task4 time (min.)  Total time (min.)  Task1 usability (0-5)  Task2 usability (0-5)  Task3 usability (0-5)  Task4 usability (0-5)  Total usability
Subject 1 17 13 5 11 46 5 3 4 0 12
Subject 2 15 16 6 12 49 4 2 1 3 10
Subject 3 10 9 4 11 34 5 2 4 3 14
Subject 4 9 12 5 8 34 5 1 4 4 14
Subject 5 12 8 7 8 35 3 2 5 4 14
Subject 6 8 9 4 11 32 3 4 4 1 12
Subject 7 11 8 6 6 31 4 3 2 4 13
Subject 8 8 9 7 6 30 4 3 0 3 10
Subject 9 17 14 5 8 44 3 2 4 3 12
Subject 10 16 9 4 9 38 4 2 2 1 9
Subject 11 13 11 6 8 38 5 1 3 1 10
Subject 12 11 9 7 10 37 4 1 3 0 8
Average 12.30 10.60 5.50 9.00 37.30 4.10 2.20 3.00 2.30 11.50
Median 11.50 9.00 5.50 8.50 36.00 4.00 2.00 3.50 3.00 12.00
Std. dev. 3.30 2.60 1.20 2.00 6.10 0.79 0.94 1.48 1.50 2.10
Table 5.8 Control group statistics for time and usability score (0-5)
Subject  Task1 time (min.)  Task2 time (min.)  Task3 time (min.)  Task4 time (min.)  Total time (min.)  Task1 usability (0-5)  Task2 usability (0-5)  Task3 usability (0-5)  Task4 usability (0-5)  Total usability
Subject 1 19 18 13 11 61 2 3 1 0 6
Subject 2 18 16 11 12 57 2 3 2 2 9
Subject 3 17 19 10 14 60 3 2 3 3 11
Subject 4 18 12 13 11 54 2 3 0 0 5
Subject 5 17 15 11 16 59 3 1 3 3 10
Subject 6 19 17 7 12 55 2 2 1 2 7
Subject 7 16 11 9 11 47 3 1 3 3 10
Subject 8 15 14 15 9 53 2 3 1 0 6
Subject 9 18 16 8 10 52 3 2 2 1 8
Subject 10 14 13 9 12 48 2 1 2 2 7
Subject 11 18 15 11 9 53 1 3 3 2 9
Subject 12 17 13 8 13 51 2 3 1 0 6
Average 17.2 14.9 10.4 11.7 54.2 2.30 2.30 1.80 1.50 7.80
Median 17.5 15.0 10.5 11.5 53.5 2.00 2.50 2.00 2.00 7.50
Std. dev. 1.5 2.4 2.4 2.0 4.5 0.62 0.87 1.03 1.20 1.90
Figure 5.10 Boxplot representation of the results from the control and experimental groups: (a) total time (minutes) and (b) total usability score (points)
Table 5.9 Per task comparison
E. Analysis of time completion (hypothesis H10)
The first null hypothesis states that the tool does not reduce the time
required to understand collection API usage in a program. The mean time
for the experimental group is 37.3 minutes, while for the control group it is
54.2 minutes. As the results in Table 5.10 show, the experimental group on
average completed the tasks in approximately 45% less time than the
control group. The Shapiro-Wilk test was performed to verify the normality
of the samples in both groups. The results are shown in Table 5.10; if the
value of W is greater than the critical value, the null hypothesis (of
normality) cannot be rejected and the samples are considered normal. The
Levene test gives a value greater than 0.05, indicating equal variance in
both samples. For the 12 samples in each group, the t-statistic was
calculated at the significance level α = 0.05, which gives a p-value of
0.000009. The p-value is much lower than α; thus, the null hypothesis H10
is rejected. It is concluded that there is a considerable difference between
the groups: in most cases, the tool helped the user understand API usage.
Figure 5.10 depicts the time taken by both groups as a boxplot showing the
total time taken by each subject across all tasks. As shown in the figure, the
range of time in minutes is lower for the experimental group than for the
control group.
        Time (minutes)                           Usability (points)
Task    Exp. Group  Control Group  % Diff.      Exp. Group  Control Group  % Diff.
Task1   147         206            -28.64       49          27             81.48
Task2   127         179            -29.05       26          27             3.7
Task3   66          125            -47.2        36          22             63.63
Task4   108         140            -22.85       27          18             50
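The reported t statistic can be reproduced from the per-subject totals in Tables 5.7 and 5.8. The dissertation does not name the exact test variant, so the use of Welch's two-sample t-test below is an assumption, chosen because it matches the t ≈ 7.7 and df ≈ 20 reported in Table 5.10; the class and method names are illustrative.

```java
// Welch's two-sample t-test applied to the total-time columns of
// Tables 5.7 (experimental group) and 5.8 (control group).
public class TimeTTest {

    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    static double variance(double[] xs) { // sample variance, n - 1 denominator
        double m = mean(xs), s = 0;
        for (double x : xs) s += (x - m) * (x - m);
        return s / (xs.length - 1);
    }

    // Welch's t statistic for two independent samples.
    static double welchT(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    // Welch-Satterthwaite approximation of the degrees of freedom.
    static double welchDf(double[] a, double[] b) {
        double va = variance(a) / a.length, vb = variance(b) / b.length;
        return (va + vb) * (va + vb)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
    }

    public static void main(String[] args) {
        // Total-time columns from Tables 5.8 and 5.7, respectively.
        double[] control      = {61, 57, 60, 54, 59, 55, 47, 53, 52, 48, 53, 51};
        double[] experimental = {46, 49, 34, 34, 35, 32, 31, 30, 44, 38, 38, 37};
        System.out.printf("t = %.1f, df = %.0f%n",
                welchT(control, experimental), welchDf(control, experimental));
        // Consistent with Table 5.10: t = 7.7, df = 20.
    }
}
```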
Table 5.10 Results statistics

                                                 One-tail Student t-test      Levene test  Shapiro-Wilk test
            Group       Mean  Diff.    Max  Min  SD   p-value    t    df      p-value      p     W
Time        Control     54.2  -45.30%  61   47   4.5  0.0000019  7.7  20      0.73         0.82  0.97
            Experiment  37.3           49   30   6.1                                       0.23  0.91
Usability   Control     7.3   37.17%   4    0    1.9  0.000192   4.2  21      0.28         0.16  0.90
            Experiment  11.5           5    0    2.1                                       0.23  0.912
F. Analysis of the tool's usability score (hypothesis H20)
In the case of the second hypothesis, the experimental group's score is
37% better than the control group's mean point score, as shown in Table
5.10. We also calculated the t-statistic for both groups. After verifying
normality through the Shapiro-Wilk test and variance equality through the
Levene test, the t-test was performed on the usability results. For the
significance level α = 0.05, the p-value is calculated as 0.00019, which is
much less than α. Therefore, the null hypothesis H20 is rejected.
Figure 5.10 (b) shows the boxplot of the scores, in points, for both
groups, taking the sum of the total score for each subject. The range of
points for the experimental group is higher than for the control group;
hence, the usability of the proposed tool is better.
G. Task analysis
This section presents an analysis of the tasks for both types of
measurements, i.e., time taken and usability score. Figure 5.11 (a) shows
the average time taken for each task by the experimental and control
groups. In general, the experimental group performs better on all tasks
compared to the control group. Since the control group had to examine text
files of several megabytes, they took more time. In contrast, the
experimental group visually analyzed the data and quickly performed their
tasks. Similarly, Figure 5.11 (b) compares the average usability score for
both groups. For all tasks, the proposed tool scored better than the scores
assigned by the control group.
5.4. Performance evaluation
This section reports the performance of the proposed tool over the ten
open-source Java software tools used in the case study. The evaluation of
the proposed approach using twenty benchmark software tools is also
presented.
The instrumentation process always introduces a performance overhead
for the target software. Due to selective instrumentation, the increase in
program execution time and trace-writing time is negligible. Table 5.11
lists the time taken by each of the ten software tools with and without the
instrumentation code. It also lists the size of the log file, in megabytes
(MB), generated during a session of 200 seconds. The instrumentation
overhead degrades the startup of a particular program by some factor. The
time taken by a program to write trace data to the hard disk is ignored.
During startup, Eclipse takes
Figure 5.11 Per-task analysis: (a) average time (minutes) and (b) average usability score (points) per task for the experimental and control groups
more time since it has a larger number of classes to load. In the case of
instrumentation for Eclipse, the program takes almost twice as much time to
load due to the extra code that the tool inserts at runtime. However, after
startup, Eclipse runs smoothly and the instrumentation does not slow down
the application.
Table 5.11 Software loading time with and without instrumentation

Software  Time without instrumentation (s)  Time with instrumentation (s)  Log file size (MB)
Prefuse 10 17 110
Browser 4 8 63
Eclipse 25 49 20
JMoney 3 7 15
JEdit 5 12 12
Freemind 3 5 10
Fire 5 11 10
M3D 4 9 6
JHotdraw 6 11 3
jImage 4 7 0.4
The slowdown factor, Sf, is calculated using Eq. (5.8), where logSize is
the log file size, t' is the time taken without instrumentation, and t is the
time taken with instrumentation:

Sf = (logSize / t') - (logSize / t)     (5.8)
Figure 5.12 shows the slowdown for the ten software tools used in this
chapter. It is clear from Figure 5.12 that for eight out of ten software tools
the slowdown is less than 3; for five it is even less than 1, which is negligible.
However, for two software tools (Browser and Prefuse) the slowdown is
high, i.e., 4.5 and 7.8, respectively. This is because these two software tools
instantiate fewer objects during start-up. The slowdown decreases when the
log file size is large and the software tool instantiates a large number of
objects in its start-up phase, which amortizes the instrumentation overhead.
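The printed form of Eq. (5.8) is partly garbled, so the reading Sf = logSize/t' - logSize/t used below is an interpretation; under that reading, the values in Table 5.11 reproduce the magnitudes discussed in this section. The class and method names in this sketch are illustrative.

```java
// Slowdown factor per the reconstructed Eq. (5.8):
// Sf = logSize/t' - logSize/t, where t' is the load time without
// instrumentation and t the load time with instrumentation. This is an
// interpretation of the partly garbled printed equation.
public class SlowdownFactor {

    static double slowdown(double logSizeMb, double tWithout, double tWith) {
        return logSizeMb / tWithout - logSizeMb / tWith;
    }

    public static void main(String[] args) {
        // Rows from Table 5.11: name, t without (s), t with (s), log size (MB).
        Object[][] rows = {
            {"Prefuse", 10.0, 17.0, 110.0},
            {"Browser",  4.0,  8.0,  63.0},
            {"Eclipse", 25.0, 49.0,  20.0},
            {"JMoney",   3.0,  7.0,  15.0},
            {"JEdit",    5.0, 12.0,  12.0},
        };
        for (Object[] r : rows) {
            double sf = slowdown((double) r[3], (double) r[1], (double) r[2]);
            System.out.printf("%s: Sf = %.2f%n", r[0], sf);
        }
    }
}
```

Under this reading, Eclipse's slowdown comes out below 1 (negligible) while Browser and Prefuse come out well above 3, matching the pattern described in the text.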
Figure 5.12 Slowdown for the ten software tools due to instrumentation
Figure 5.13 Runtime overhead (time in seconds, with and without instrumentation) for the benchmark tools: avrora, batik, eclipse, fop, h2, jython, luindex, lusearch, pmd, sunflow, tomcat, xalan, compiler.compiler, compiler.sunflow, derby, serial, sunflow, helloworld, xml.transform, and xml.validation
Figure 5.14 Normalized slowdown factor and log file size (MB) for the same twenty benchmark tools
To further verify the performance of the proposed approach, we
calculated the time delay caused by instrumentation using the DaCapo and
SPECjvm2008 benchmark tools. This gives a total of twenty software tools
used in verifying the performance of the proposed approach. Figure 5.13
shows the runtime overhead due to the instrumentation code. Each
benchmark application ran with and without instrumentation, and the
runtime of each application is normalized to that of the original code.
Applications with fewer classes have lower overhead considering their load
time, while others with a large number of classes show a higher impact on
runtime. As shown in Figure 5.13, serial, compress, derby, and transform
take more time, which is mainly due to their larger code base.
The slowdown factor, Sf, was also calculated using Eq. (5.8) for the
twenty benchmark applications; the results are shown in Figure 5.14. The
slowdown factor is determined by the instrumentation code and the output
file size. Each benchmark's output file size is normalized before the
slowdown factor is computed. Figure 5.14 also shows the relative
comparison of output file size and slowdown factor for the twenty
benchmark tools. The graph in Figure 5.14 shows that, as the file size
increases, the slowdown factor also increases.
5.5. Chapter Summary
This chapter presented a treemap-based tool to support the comprehension
of Java collection API usage in a software program. The proposed tool
enables the developer to comprehend API usage at different levels of
abstraction. Dynamic data are collected during program execution and
visualized using a treemap. The tool provides an interactive facility to
evaluate API usage at the method, class, and package levels while also
providing a global view. The proposed system is helpful in general program
comprehension and API usage evaluation, and it supports program
maintenance activities. The evaluation results confirm the tool's
effectiveness. The proposed tool can be used on a
software corpus to analyze the frequency of collection API usage. It has also
been evaluated using twenty benchmark software tools, and the results
showed good performance.
Chapter 6 Conclusion and Future work
Conclusions and Future Work
"Nothing ever becomes real until it is experienced." John Keats
This dissertation contributes to the domain of
information visualization from three aspects: automatic
selection of a visualization technique, quantifying visualization
and optimizing layout, and the use of visualization for
dynamically collected software data. The main objective of
this work was to build a computational intelligence-based
framework for visualization technique prediction and
optimization. The visualization technique prediction for a
particular dataset is based on the characteristics of the dataset
and the related tasks that are to be performed on the data.
Furthermore, the study analysed and formulated
visualization metrics computed using the existing knowledge
of human perceptual theories. The visualization metrics were
used to automatically build a perceptually and aesthetically
appealing visualization. Moreover, a visualization-based tool
was utilized to understand and gain insight into dynamically
collected data on collection APIs in Java programs. A
bytecode instrumentation framework was developed to collect
a trace of Java collection API objects from a program and
visualize it using a treemap-based visualization technique. This
chapter revisits the research questions posed at the beginning
of this dissertation. The chapter also lists some limitations of
this work, followed by prospective directions.
Information visualization is becoming ubiquitous in every field,
especially in domains where large volumes of data need to be analyzed for
pattern recognition, data mining, and knowledge discovery. Visualization,
being an efficient and effective approach, helps the user gain insight into
data quickly. Visual analysis expedites the decision-making process by
providing early insight into the data. However, visualization is not merely a
pretty picture: an inappropriate visualization can lead to erroneous
decisions. The work presented in this dissertation tackles information
visualization from three aspects: appropriate visualization selection,
quantifying and optimizing visualization, and using treemap-based
visualization to analyse Java collection API usage data collected through
bytecode instrumentation. A careful investigation of existing research in the
field of visualization shows that the selection of an appropriate
visualization for a particular dataset is mainly influenced by the metadata
and the related tasks. Furthermore, using the existing knowledge and
empirical research findings, the study devised computationally measurable
visualization metrics to build aesthetic visualizations. It utilized a
computational intelligence-based framework for automatic visualization
selection and used evolutionary computation for better visualization layout.
Moreover, bytecode instrumentation and treemap-based visualization were
used as an effective method to understand Java collection API usage during
the course of a program's execution. In the next section, the research
questions posed in the first chapter are revisited along with the findings.
6.1. Primary research questions
The introductory chapter of this dissertation posed a few primary
research questions, the answers to which were investigated in the rest of the
dissertation. The research questions covered the three aspects of
information visualization this research undertook: automatic visualization
selection, visualization optimization, and using treemap visualization for
the comprehension of runtime Java API usage. This section revisits those
primary research questions one by one and discusses the answers provided
in different sections of this dissertation.
RQ-1: What are the important characteristics of a dataset that
influence the selection of a visualization technique?
This research question is addressed in chapter 3 by
carefully investigating the existing literature and through
experience developing visualization tools. It was found that
there is no standard dataset or set of rules available to guide
a naïve user in selecting an appropriate visualization
technique for his/her dataset. Nevertheless, some
visualization types are more suitable in a particular context.
Each dataset has its own characteristics, known as metadata,
and this metadata plays an important role in the selection of
a visualization technique. This study identified and
formulated four main metadata attributes: data dimension,
number of instances, number of attributes, and primary
attribute type, which are important for the selection of an
appropriate visualization. The dimension of a dataset can be
1D, 2D, or 3D, and the data can be hierarchical too.
Additionally, the four primary attribute types considered
were: ordinal, continuous, categorical, and geographical.
Furthermore, the study found that the task to be
accomplished with a dataset is also a key factor in the
selection of a visualization technique. After a careful
treatment of the available literature, four tasks, i.e.,
relationship, trends, distribution, and comparison, were
taken into account. Therefore, along with the metadata
attributes, these tasks also influence the selection of an
appropriate visualization technique for a specific dataset.
A sensitivity analysis of each parameter was performed to
show its relative importance and individual influence. It was
found that the most important characteristics of the dataset
for visualization selection are the dimension, the primary
attribute, and the task. Hence, the removal of each input
parameter from the model influences the accuracy.
However, when the dimension and primary attribute are
both omitted from the input, the error rate increases by more
than 41%.
RQ-2: How can metadata and a particular task related to the data be
used to predict a visualization for a dataset?
While addressing the second research question, chapter 3
focused on a systematic way to handle metadata, relevant
tasks, and visualization. Using the contemporary knowledge
in the literature, a novel dataset was built, mapping metadata
and tasks that are to be accomplished through visualization.
The newly created dataset comprised metadata about the
original dataset, the relevant tasks that need to be performed,
and the visualization technique used for that particular
dataset. One main issue with building such a dataset is the
limited availability of constituent instances. A dataset with
almost four hundred instances, consisting of records on
eight different visualization techniques, was built. This
dataset was then utilized for training and testing an artificial
neural network (ANN)-based model to classify the current
instances and predict a visualization for unseen data
instances. The input to the ANN model consists of four
metadata attributes and the relevant task, while the output is
an appropriate visualization from a set of eight visual
techniques. An exhaustive training experiment was
performed with different ANN models in combination with
various training parameters. The empirical results showed
that the ANN-based computational intelligence model
accurately predicts a visualization by incorporating the
metadata and relevant tasks. The details of various aspects
of these experiments may be seen in chapter 3.
RQ-3: What is the best CI model to predict visualization based on
metadata?
This question has been addressed by putting emphasis on
the selection of the best possible computational
intelligence-based model. In chapter 3, the experiment and
discussion section comprehensively covers the various steps
and techniques used to achieve this objective. Initially,
several feed-forward neural network (FFNN)-based models
were deployed with various numbers of neurons in the
hidden layer, while keeping the input and output neurons
fixed. Each model was provided with five input neurons and
eight neurons in the output layer. The deployed models
were tested with different combinations of training methods
and neural network parameters to make sure all possible
scenarios were evaluated. Performance parameters, i.e.,
accuracy, sensitivity, precision, and correlation coefficient,
were used to evaluate each model. The detailed analysis
shows that the ANN model with 14 neurons in the hidden
layer achieves the highest accuracy of almost 98%, making
it the best possible model for the problem at hand.
To compare the best ANN model with five other well-
known classifiers, another exhaustive experiment was
carried out to ensure the suitability of the best model. The
relative performance of these classifiers on common
parameters showed that the ANN-based model is the most
suitable CI-based approach in the context of automatic
visualization prediction.
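Concretely, the selected architecture corresponds to a forward pass of the shape sketched below: five inputs (the four metadata attributes plus the task), 14 hidden neurons, and eight outputs, one per candidate visualization technique. The weights here are random placeholders rather than the trained model, and the class name, seed, and sample input are illustrative assumptions.

```java
import java.util.Random;

// Forward pass of a 5-14-8 feed-forward network, matching the best
// configuration reported in chapter 3. Weights are random placeholders,
// not the trained model; this sketch only illustrates the classifier's shape.
public class VisualizationFfnn {
    static final int IN = 5, HIDDEN = 14, OUT = 8;
    final double[][] w1 = new double[HIDDEN][IN + 1];  // +1 for bias
    final double[][] w2 = new double[OUT][HIDDEN + 1]; // +1 for bias

    public VisualizationFfnn(long seed) {
        Random r = new Random(seed);
        for (double[] row : w1)
            for (int i = 0; i < row.length; i++) row[i] = r.nextGaussian() * 0.1;
        for (double[] row : w2)
            for (int i = 0; i < row.length; i++) row[i] = r.nextGaussian() * 0.1;
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Returns a softmax distribution over the eight visualization techniques.
    public double[] predict(double[] input) {
        double[] hidden = new double[HIDDEN];
        for (int h = 0; h < HIDDEN; h++) {
            double s = w1[h][IN]; // bias term
            for (int i = 0; i < IN; i++) s += w1[h][i] * input[i];
            hidden[h] = sigmoid(s);
        }
        double[] out = new double[OUT];
        double sum = 0;
        for (int o = 0; o < OUT; o++) {
            double s = w2[o][HIDDEN]; // bias term
            for (int h = 0; h < HIDDEN; h++) s += w2[o][h] * hidden[h];
            out[o] = Math.exp(s);
            sum += out[o];
        }
        for (int o = 0; o < OUT; o++) out[o] /= sum; // softmax normalization
        return out;
    }
}
```

The predicted technique would be the output neuron with the highest probability; training such a network is what the exhaustive experiments in chapter 3 explored.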
RQ-4: What aesthetic and perceptual design parameters are
important for a specific visualization?
The field of information visualization is concerned with
evaluating and creating optimized visualizations that are
aesthetically better and perceptually pleasing. However,
building such visualizations is a non-trivial task,
particularly for the naïve user. The existing knowledge in
the fields of empirical visualization research, human-
computer interaction (HCI), cognitive science, and
perceptual theories can be formulated to find the basic
characteristics of a better visualization. In answering this
question, chapter 4 provides a rigorous study evaluating
and investigating all such contemporary theories and
knowledge. Each visualization technique has some
apparently unique properties, e.g., border size and colour
for treemaps, number of crossing lines for parallel
coordinates, and node colours for graphs. However, there
are also some common design attributes that contribute to
the creation of better visualizations. The detailed studies
show that these parameters are of both types: common to
all visualizations and unique to a particular visualization
technique. Moreover, optimal or sub-optimal values of
these parameters can be found by careful investigation of
the existing knowledge. Chapter 4 analyses and explores
various design parameters for the treemap visualization
technique.
RQ-5: How the visualization features and design parameters map
to the visualization metrics?
Over the years, many metrics have been presented in the
literature to evaluate and compare visualizations.
However, these visualization metrics are mostly theoretical
and, due to the subjective nature of the problem, automatic
visualization optimization is difficult. Chapter 4 proposed
four visualization metrics: effectiveness, expressiveness,
readability, and interactivity. These metrics were exploited
computationally to automatically optimize a visualization.
The basic idea is to map visualization-specific attributes to
the visualization metrics; each visualization attribute may
map to more than one metric at the same time. The
mapping process was based on the domain knowledge
available in the literature for each visualization metric.
Additionally, a combined metric assigned a weight to each
attribute, enabling the customization of a visualization with
specific attributes.
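The attribute-to-metric mapping and the weighted combined metric can be sketched as follows. This is an illustrative sketch only: the attribute names, normalized values, and weights below are assumptions, not the dissertation's actual parameters.

```python
# Illustrative sketch (not the dissertation's implementation): mapping
# visualization design attributes to the four metrics of Chapter 4 and
# combining them into a single weighted score. Attribute names,
# normalized values, and weights are assumed for demonstration.

# Each attribute may contribute to more than one metric.
ATTRIBUTE_TO_METRICS = {
    "border_size":     ["readability", "effectiveness"],
    "colour_contrast": ["effectiveness", "expressiveness"],
    "zoom_support":    ["interactivity"],
}

def metric_scores(attributes):
    """Average the normalized attribute values mapped to each metric."""
    totals, counts = {}, {}
    for attr, value in attributes.items():
        for metric in ATTRIBUTE_TO_METRICS.get(attr, []):
            totals[metric] = totals.get(metric, 0.0) + value
            counts[metric] = counts.get(metric, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

def combined_metric(attributes, weights):
    """Linear weighted sum of the visualization metrics."""
    scores = metric_scores(attributes)
    return sum(weights.get(m, 0.0) * s for m, s in scores.items())

attrs = {"border_size": 0.8, "colour_contrast": 0.6, "zoom_support": 1.0}
w = {"effectiveness": 0.4, "expressiveness": 0.2,
     "readability": 0.2, "interactivity": 0.2}
print(round(combined_metric(attrs, w), 3))
```

Changing the weight vector `w` is what allows a user to customize the optimization towards, say, readability over interactivity.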
RQ-6: How do the visualization metrics computationally evolve to
optimize the layout of a visualization technique?
The question related to the optimization and computational
evolution of the metrics is also addressed in Chapter 4. A
CI technique was used to formulate a framework that
computes and evolves a set of candidate solutions using the
visualization metrics. The visualization metrics were
combined into a common fitness function as a linear sum of
the four metrics, each weighted by a coefficient that could
vary. The problem was formulated by encoding the
visualization design parameters into fixed-length
chromosomes, and a random population of such
chromosomes was created to provide an initial seed to the
system. The framework used several evolutionary operators
and the combined fitness function to evolve the initial
population in search of the best possible solution, and was
evaluated with different configurations of operators and
related parameter values to obtain optimum results. A
treemap-based case study was presented to validate the
effectiveness of the proposed framework. The exhaustive
experimental results show that the framework provides a
computationally sound method to optimize a visualization
for better aesthetic and perceptual properties. Furthermore,
the optimized visualization was evaluated using user
surveys and statistical analysis of the results.
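The evolutionary loop described above can be sketched roughly as follows. The chromosome length, the weights, and the placeholder fitness internals are assumptions made for illustration; the actual framework's operators and encodings are detailed in Chapter 4.

```python
# Illustrative sketch: an evolutionary loop over fixed-length
# chromosomes of visualization design parameters, scored by a linear
# weighted sum of four metrics. The fitness internals are placeholders.
import random

WEIGHTS = (0.4, 0.2, 0.2, 0.2)  # effectiveness, expressiveness, readability, interactivity
CHROMOSOME_LEN = 6              # number of encoded design parameters (assumed)

def fitness(chromosome):
    """Placeholder: pretend each gene contributes equally to every metric."""
    metrics = [sum(chromosome) / len(chromosome)] * 4
    return sum(w * m for w, m in zip(WEIGHTS, metrics))

def evolve(pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(CHROMOSOME_LEN)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]               # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, CHROMOSOME_LEN)   # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:         # mutation
                child[rng.randrange(CHROMOSOME_LEN)] = rng.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(round(fitness(best), 3))
```

Because the top half of each generation is carried over, the best fitness in the population never decreases from one generation to the next.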
RQ-7: Which types of collection APIs are frequently used by a
given/target Java program during execution?
Java developers utilize different kinds of application
programming interfaces (APIs), e.g., Swing, Abstract
Window Toolkit (AWT), and util, for various
functionalities in their programs. Java collection APIs are
mostly used as program data structures to handle data and
variables. The efficient usage of these APIs is critical for
program performance, particularly in the case of large
commercial applications. Chapter 5 mainly focused on
answering this question. The solution is twofold: first, the
extraction of relevant information from a program during
execution, where no source code is available in advance;
second, the effective presentation of this information to the
developer, to clearly identify the APIs used and assist in
program comprehension. Initially, a bytecode
instrumentation tool was developed to extract a snapshot
of collection APIs usage in a program. The proposed tool
traced each class with probe code. The probe was
responsible for collecting APIs usage information, which
was handled and stored in a log file by the utility module.
The log file was then given as input to the treemap-based
visualization module to effectively gain insight. An
investigation was made by evaluating ten Java-based
applications for collection APIs usage. The detailed study
showed that these applications used more than eight
collection APIs, including Hashtable, ArrayList, LinkedList,
and Vector. Moreover, JEdit, a Java-based text editor, used
six collection APIs, while others, such as JHotDraw, used
only ArrayList. The respective visualization and collection
APIs usage of each application may be found in Chapter 5.
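The log-to-insight step can be illustrated with a small parsing sketch. The semicolon-separated record layout below is an assumed format for demonstration, not necessarily the actual layout written by the utility module.

```python
# Illustrative sketch (assumed log format): counting collection API
# instantiations per API type from a trace log of the kind the
# utility module writes. Field layout and package names are invented.
from collections import Counter

# One record per newly instantiated collection object:
# package;class;method;line;api_type
LOG = """\
org.example.editor;Buffer;load;42;ArrayList
org.example.editor;Buffer;load;57;HashMap
org.example.ui;Panel;init;12;ArrayList
org.example.ui;Panel;init;19;Vector
"""

def count_api_usage(log_text):
    """Tally instantiations per collection API type."""
    counts = Counter()
    for line in log_text.strip().splitlines():
        *_, api_type = line.split(";")
        counts[api_type] += 1
    return counts

print(count_api_usage(LOG).most_common())
```

These per-type counts are exactly the quantities the treemap module turns into rectangle sizes.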
RQ-8: Which packages/classes/methods of the target program are
responsible for instantiating collection API objects?
Another correlated and interesting scenario, building on the
solution of RQ-7, is identifying the sections of a program
responsible for collection APIs usage in terms of the
number of objects instantiated. The answer to this question
is also important for program comprehension and
maintenance. The instrumentation tool discussed in
Chapter 5 is used to extract various types of information
from a program during its execution, including information
about where collection API objects are instantiated. The
extracted information includes the package, class, and
method name, along with the line number in the source
code where a particular collection API is used. The utility
module was responsible for logging a single record per
newly instantiated object of each collection API type. The
visualization module was then utilized to depict this
hierarchical information using the treemap-based
visualization technique. A detailed analysis of the
respective visualizations of different applications showed
that some packages or classes were responsible for a large
number of collection API objects. Moreover, large
applications, e.g., JEdit and Eclipse, used a large number of
packages and classes and heavily used collection APIs.
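The hierarchical aggregation that feeds the treemap can be sketched as follows; the record tuples and package names are hypothetical, and only the package/class/method fields of a record are shown.

```python
# Illustrative sketch: folding per-object trace records into the
# package -> class -> method hierarchy that a treemap layout consumes,
# with node sizes equal to instantiation counts. Records are invented.
RECORDS = [
    ("org.example.editor", "Buffer", "load"),
    ("org.example.editor", "Buffer", "load"),
    ("org.example.editor", "Buffer", "save"),
    ("org.example.ui", "Panel", "init"),
]

def build_hierarchy(records):
    """Nested dicts: package -> class -> method -> object count."""
    tree = {}
    for package, cls, method in records:
        methods = tree.setdefault(package, {}).setdefault(cls, {})
        methods[method] = methods.get(method, 0) + 1
    return tree

def subtree_size(node):
    """Total object count below a node (the treemap area of that node)."""
    if isinstance(node, int):
        return node
    return sum(subtree_size(child) for child in node.values())

tree = build_hierarchy(RECORDS)
print(subtree_size(tree["org.example.editor"]))
```

A package whose subtree size dominates the root total is exactly the kind of "hot" package the visualizations in Chapter 5 make visible at a glance.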
RQ-9: How is dynamic bytecode instrumentation used to extract
API object traces with minimal runtime overhead?
This question addresses the issues related to the
development and working of the bytecode instrumentation-
based dynamic analysis tool. The instrumentation tool
developed in Chapter 5 incurs minimal runtime overhead
to avoid performance degradation in the target application.
Initially, in the development phase, the proposed tool was
kept simple, and only actual application classes were
selected for instrumentation. Furthermore, selective
instrumentation was adopted, in which only the creation
locations of collection API objects were tracked. The
proposed instrumentation tool was evaluated for
performance using real-world applications and standard
benchmark tools. The detailed experiments showed that the
slowdown factor and runtime overhead of the proposed
tool were not high and avoided any degradation in the
target application's performance. However, for collection-
API-intensive applications, where large numbers of objects
were instantiated, the degradation was comparatively
higher.
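As a rough Python analogue of the selective-probing idea (the actual tool instruments Java bytecode, which has no direct Python equivalent), the sketch below wraps only chosen classes' constructors with a probe that records the caller's method and line number, leaving all other classes untouched. Every name here is invented for illustration.

```python
# Hypothetical Python analogue of selective instrumentation: only the
# classes we choose get a constructor probe; everything else runs
# unmodified, which is how runtime overhead is kept low.
import functools
import inspect

TRACE_LOG = []  # one record per instrumented instantiation

def instrument(cls):
    """Attach a probe to cls.__init__; all other classes stay unmodified."""
    original = cls.__init__

    @functools.wraps(original)
    def probe(self, *args, **kwargs):
        caller = inspect.stack()[1]  # who instantiated the object, and where
        TRACE_LOG.append((cls.__name__, caller.function, caller.lineno))
        original(self, *args, **kwargs)

    cls.__init__ = probe
    return cls

@instrument
class TrackedList:              # stands in for a collection API class
    def __init__(self):
        self.items = []

def build():
    return TrackedList()        # this call site is what the probe records

build()
build()
print(len(TRACE_LOG), TRACE_LOG[0][0], TRACE_LOG[0][1])
```

The probe's cost is paid only on instantiation of tracked classes, which mirrors why collection-API-intensive programs see comparatively higher overhead.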
RQ-10: Can treemap-based visualization be utilized for the
analysis of collection API objects of a particular Java program?
Another main objective was to utilize and evaluate
treemap-based visualization for understanding collection
API trace data. The bytecode instrumentation extracted
runtime information from a program and stored it in a log
file. Gaining insight into such a large text file is not trivial.
The information in the log file is of a hierarchical nature,
i.e., package, class, and method; at the same time, treemap
visualization is a space-filling hierarchical technique that
depicts large hierarchical information on a single screen.
The visualization tool described in Chapter 5 was based on
treemap visualization, through which the user could see
information about the collection APIs on a single screen.
The tool provided an interactive facility to switch between
package-, class-, and method-wise hierarchies. The
effectiveness of the proposed tool was evaluated using a
controlled experiment, exhaustively elaborated in Chapter
5. The results of the statistical analysis of the controlled
experiment showed that the proposed visualization tool
was better with respect to time and usability for collection
APIs trace comprehension.
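The space-filling principle behind treemaps can be illustrated with a basic slice-and-dice layout. This is a simplification for illustration, not the actual algorithm of the Chapter 5 tool; the input tree is invented.

```python
# Illustrative sketch of the treemap idea: a slice-and-dice layout where
# rectangle areas are proportional to node weights, alternating the
# split direction at each level of the hierarchy.
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """node: weight (number) or dict of children. Returns (x, y, w, h) rects."""
    if out is None:
        out = []
    if isinstance(node, (int, float)):
        out.append((x, y, w, h))
        return out
    total = sum_weights(node)
    offset = 0.0
    for child in node.values():
        frac = sum_weights(child) / total
        if depth % 2 == 0:   # split horizontally at even depths
            slice_and_dice(child, x + offset * w, y, w * frac, h, depth + 1, out)
        else:                # split vertically at odd depths
            slice_and_dice(child, x, y + offset * h, w, h * frac, depth + 1, out)
        offset += frac
    return out

def sum_weights(node):
    if isinstance(node, (int, float)):
        return node
    return sum(sum_weights(c) for c in node.values())

tree = {"pkg.a": {"ClassA": 3, "ClassB": 1}, "pkg.b": {"ClassC": 4}}
rects = slice_and_dice(tree, 0.0, 0.0, 100.0, 100.0)
print(rects)
```

Every leaf ends up on screen at once, with its area proportional to its weight, which is what makes the technique suitable for large hierarchical trace logs.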
6.2. Summary of the findings
The dissertation explored the proposed work from three aspects related to
information visualization: automatic visualization selection, visualization
optimization based on quantifiable metrics, and the utilization of
visualization and bytecode instrumentation for code comprehension. This
section summarizes the major findings in the context of these three aspects.
Selecting an appropriate visualization for a specific dataset is
indispensable for the user. The work presented proposes automatic
visualization selection based on the metadata of the dataset to be visualized
and its relevant tasks. The study identified important metadata attributes
and tasks after a careful investigation of the existing literature. A novel
dataset was built comprising metadata and tasks already used in the
information visualization community. Additionally, knowledge about
visualization techniques and their supported tasks was utilized to enhance
the newly created dataset. Furthermore, an ANN-based model was deployed
with a fixed number of neurons in the input and output layers. The generic
ANN model accommodated eight visualization techniques (histogram, line
chart, pie chart, scatter plot, parallel coordinates, map, treemap, and linked
graph) to properly train and test the proposed model. In addition, the
proposed model was evaluated against five well-known classifiers and state-
of-the-art automatic visualization selection systems. The exhaustive
comparison showed that the proposed ANN-based model could be utilized
for the problem with high accuracy and fewer computational resources. To
the best of our knowledge, the work brings a new perspective to the field of
visualization, where new visualizations may be added to the dataset in
order to build a comprehensive database. The dataset, therefore, provides a
foundation for an expert system to create a knowledge base.
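The overall shape of such a model can be sketched as a small feed-forward network with a fixed input layer and one output neuron per technique. The input dimensionality, hidden-layer size, random weights, and feature encoding below are assumptions made for illustration, not the trained model from the dissertation.

```python
# Illustrative sketch (dimensions and encoding assumed): a feed-forward
# network with a fixed-size input layer for encoded dataset metadata and
# task features, and an 8-neuron output layer, one per technique.
import math
import random

TECHNIQUES = ["histogram", "line chart", "pie chart", "scatter plot",
              "parallel coordinates", "map", "treemap", "linked graph"]
N_INPUT, N_HIDDEN, N_OUTPUT = 10, 16, len(TECHNIQUES)

rng = random.Random(0)  # untrained, random weights for demonstration
W1 = [[rng.uniform(-1, 1) for _ in range(N_INPUT)] for _ in range(N_HIDDEN)]
W2 = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_OUTPUT)]

def forward(features):
    """One forward pass: tanh hidden layer, softmax over the 8 techniques."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in W1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

features = [rng.random() for _ in range(N_INPUT)]  # encoded metadata + task
probs = forward(features)
print(TECHNIQUES[probs.index(max(probs))])
```

After training, the output neuron with the highest probability names the recommended visualization technique for the given dataset and task.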
From the perspective of the second aspect, the dissertation contributed to
visualization optimization. The first aspect, automatic visualization
selection, concentrates on predicting an appropriate visualization for a
specific dataset; the supplementary step is the optimization of the selected
visualization based on aesthetic and perceptual theories. The proposed
framework is fed with visualization design parameters and metrics. The
design parameters were extracted from the contemporary knowledge in the
literature through careful analysis, and the range of values for each
parameter was investigated empirically. These design parameters are
important in setting the aesthetic and perceptual properties of any
visualization. Yet another main objective of this work was to formulate
visualization metrics that can be computationally measured and compared.
Therein, four visualization metrics (effectiveness, expressiveness,
readability, and interactivity) were defined in terms of visualization design
parameters. The visualization optimization process was formulated through
the development of an evolutionary algorithm-based framework. The
proposed framework and visualization metrics were evaluated using several
experiments and case studies. The analysis showed that the new
formulation of the visualization metrics and the evolutionary algorithm-
based optimization technique provided better visualizations with respect to
aesthetic and perceptual qualities. Furthermore, the proposed approach is
yet another step towards the objective, automatic evaluation of
visualization.
The last part of the dissertation focused on using runtime information for
program comprehension and understanding, concentrating on collection
APIs usage in Java programs during execution. The proposed work in this
regard is twofold: the extraction of relevant information from a program,
and its visual representation to the user. To perform dynamic analysis of a
program, a bytecode instrumentation tool was developed with the objective
of keeping the runtime overhead minimal. The tool was then utilized for
selective instrumentation of the target program to capture traces of
collection API object instantiation. The tool was tested on ten real-world
applications and twenty benchmark tools to evaluate the runtime overhead
and slowdown impact. The traces collected during the instrumentation step
were fed into the visualization module as input. The proposed visualization
tool was based on treemap visualization to depict the hierarchical
information on a single screen. The visualization module provided
interactive and search facilities to evaluate the collection APIs from various
aspects, i.e., type-, package-, or class-wise. Furthermore, the visualization
part was evaluated using a controlled experiment with 24 participants. The
statistical analysis of the controlled experiment showed the tool's
effectiveness and usability in the context of understanding collection APIs
usage through visualization.
6.3. Limitations
All research suffers from some limitations, and the same is the case with
the work presented in this dissertation. This study has some limitations
regarding the dataset. Firstly, the current dataset consists of only 400
instances for training and testing the classifier. Secondly, only 8
information visualization techniques were considered while building the
dataset. However, the dataset may be extended and more visualization
techniques may be incorporated, in which case the ANN model shall be
retrained with the enhanced dataset. The visualization metric optimization
experiment is limited to the knowledge and theories already taken and
established as true. Furthermore, in the case of the instrumentation tool,
subject applications must be Java-based, with the bytecode available in a
jar file. The subject application and the instrumentation tool run on the
same system, and there are no classloader issues. The instrumentation
process is also limited to only those classes that were loaded during the
execution.
6.4. Future work
In the previous sections of this chapter, the answers to the posed research
questions were formulated and discussed. Nevertheless, the work presented
in this dissertation may be extended in several directions for future
research. This section discusses future work in the context of the three
main aspects covered throughout the dissertation.
The automatic visualization selection aspect may be extended in several
directions. As eight visualization techniques were used, one perspective is
to explore more visualization techniques for integration into the current
dataset. Visualization weights may be added to increase the selection
probability of a particular visualization technique for a specific dataset.
Similarly, four tasks were used in this dissertation, with only one task
considered for the selection of a visualization; the work could be extended
to incorporate more than one task with the dataset. Another interesting
future aspect would be developing a library based on this work and
integrating it into various development environments, e.g., data mining
packages, electronic worksheets, and online services. A user study, such as
a controlled experiment, may be performed to evaluate the output of the
system, and user feedback on the selected visualization could be added to
the system for customization.
The visualization optimization aspect also leads to various directions in
which the research may be extended. One obvious future direction of the
proposed work is its application to other visualization techniques, e.g.,
graphs, progress bars, and parallel coordinates. Similarly, yet another
perspective is to find more generic visualization-specific parameters and to
evolve them for visualization optimization. Further research may include,
in the optimization, the tasks that the user needs to accomplish with a
particular visualization. Still another extension of the work is to build a
sophisticated user interface to evaluate the results. Finally, a web-based
optimization environment may be built to facilitate online visualization.
The treemap visualization and instrumentation work also has its future
aspects. The study may be extended to examine the cognitive effects on
programmers of such visualizations of the packages and classes responsible
for creating and accessing runtime collections of Java objects. It will also
be interesting to apply data mining techniques, such as clustering, temporal
analysis, and association rule mining, to the data collected via
instrumentation, in order to discover useful patterns. Information such as
which method has been invoked the most in a given package, which objects
are surprisingly the most frequently used together, clusters of related
objects, outlier detection, and many such implicit patterns are expected to
be extracted from the recorded data using data mining techniques. From
the visualization perspective, there is an opportunity to optimize the
visualization of such large datasets in order to make it more effective. For
this purpose, optimization techniques from computational intelligence may
be utilized.
References
200
References
[1] M. Tsytsarau and T. Palpanas, "Survey on mining subjective data on the
web," Data Mining and Knowledge Discovery, vol. 24, pp. 478-514, 2012.
[2] D. Westerman, P. R. Spence, and B. Van Der Heide, "Social Media as
Information Source: Recency of Updates and Credibility of Information," Journal of Computer-Mediated Communication, vol. 19, pp. 171-183, 2014.
[3] C. W. Y. Wong, K.-h. Lai, T. C. E. Cheng, and Y. H. V. Lun, "The role of
IT-enabled collaborative decision making in inter-organizational information integration to improve customer service performance," International Journal of
Production Economics, vol. 159, pp. 56-65, 1// 2015.
[4] C. S. Parr, R. Guralnick, N. Cellinese, and R. D. Page, "Evolutionary
informatics: unifying knowledge about the diversity of life," Trends in ecology
& evolution, vol. 27, pp. 94-103, 2012.
[5] L. Gwanhoo, J. A. Espinosa, and W. H. DeLone, "Task Environment
Complexity, Global Team Dispersion, Process Capabilities, and Coordination in Software Development," Software Engineering, IEEE
Transactions on, vol. 39, pp. 1753-1771, 2013.
[6] G. J. Myatt, Making sense of data: A practical guide to exploratory data analysis
and data mining: John Wiley & Sons, 2007.
[7] F. Gorunescu, "Exploratory Data Analysis," in Data Mining, 2011, pp. 57-
157. [8] D. Pineo and C. Ware, "Data visualization optimization via computational
modeling of perception," Visualization and Computer Graphics, IEEE
Transactions on, vol. 18, pp. 309-320, 2012.
[9] M. Gleicher, D. Albers, R. Walker, I. Jusufi, C. D. Hansen, and J. C.
Roberts, "Visual comparison for information visualization," Information
Visualization, vol. 10, pp. 289-309, 2011.
[10] L. Grammel, M. Tory, and M. Storey, "How Information Visualization
Novices Construct Visualizations," Visualization and Computer Graphics, IEEE
Transactions on, vol. 16, pp. 943-952, 2010.
[11] S. Lallé, D. Toker, C. Conati, and G. Carenini, "Prediction of Users'
Learning Curves for Adaptation while Using an Information Visualization," presented at the Proceedings of the 20th International Conference on Intelligent User Interfaces, Atlanta, Georgia, USA, 2015.
References
201
[12] W. Huang, P. Eades, and S.-H. Hong, "Measuring effectiveness of graph visualizations: A cognitive load perspective," Information Visualization, vol. 8,
pp. 139-152, 2009. [13] B. Cornelissen, A. Zaidman, A. Van Deursen, L. Moonen, and R. Koschke,
"A systematic survey of program comprehension through dynamic analysis," Software Engineering, IEEE Transactions on, vol. 35, pp. 684-702, 2009.
[14] C. Ware, Information visualization: perception for design: Elsevier, 2012.
[15] S. Liu, W. Cui, Y. Wu, and M. Liu, "A survey on information visualization:
recent advances and challenges," The Visual Computer, vol. 30, pp. 1373-1393,
2014. [16] E. Bertini, A. Tatu, and D. Keim, "Quality metrics in high-dimensional data
visualization: an overview and systematization," Visualization and Computer
Graphics, IEEE Transactions on, vol. 17, pp. 2203-2212, 2011.
[17] I. Sommerville, D. Cliff, R. Calinescu, J. Keen, T. Kelly, M. Kwiatkowska, et
al., "Large-scale complex IT systems," Communications of the ACM, vol. 55,
pp. 71-77, 2012. [18] J. Wu, X.-x. Jia, Y.-p. Liu, and G.-h. Li, "Java object behavior modeling and
visualization," in Software Engineering Advances, International Conference on,
2006, pp. 60-60. [19] B. Mao, Y. Ban, and L. Harrie, "A multiple representation data structure for
dynamic visualisation of generalised 3D city models," ISPRS Journal of
Photogrammetry and Remote Sensing, vol. 66, pp. 198-208, 2011.
[20] M. A. Khan, S. Muhammad, and T. Muhammad, "Identifying performance
issues based on method invocation patterns of an API," in Proceedings of the 18th International Conference on Evaluation and Assessment in Software
Engineering, 2014, p. 51.
[21] P. Caserta and O. Zendra, "JBInsTrace: A tracer of Java and JRE classes at
basic-block granularity by dynamically instrumenting bytecode," Science of
Computer Programming, vol. 79, pp. 116-125, 2014.
[22] D. A. Keim, M. C. Hao, U. Dayal, and M. Hsu, "Pixel bar charts: a
visualization technique for very large multi-attribute data sets," Information
Visualization, vol. 1, pp. 20-34, 2002.
[23] T. O. Aydin, A. Smolic, and M. Gross, "Automated Aesthetic Analysis of
Photographic Images," IEEE Transactions on Visualization and Computer
Graphics, vol. 21, pp. 31-42, 2015.
[24] H. Lam, E. Bertini, P. Isenberg, C. Plaisant, and S. Carpendale, "Empirical
studies in information visualization: Seven scenarios," Visualization and
Computer Graphics, IEEE Transactions on, vol. 18, pp. 1520-1536, 2012.
References
202
[25] J. Bertin, "Semiology of graphics: diagrams, networks, maps," 1983. [26] E. R. Tufte and P. Graves-Morris, The visual display of quantitative information
vol. 2: Graphics press Cheshire, CT, 1983. [27] W. De Pauw, E. Jensen, N. Mitchell, G. Sevitsky, J. Vlissides, and J. Yang,
"Visualizing the execution of Java programs," in Software Visualization, ed:
Springer, 2002, pp. 151-162. [28] M. Shahin, P. Liang, and M. A. Babar, "A systematic review of software
architecture visualization techniques," Journal of Systems and Software, vol. 94,
pp. 161-185, 8// 2014. [29] B. Cornelissen, A. Zaidman, D. Holten, L. Moonen, A. van Deursen, and J.
J. van Wijk, "Execution trace analysis through massive sequence and circular bundle views," Journal of Systems and Software, vol. 81, pp. 2252-2268, 2008.
[30] P. Caserta and O. Zendra, "Visualization of the static aspects of software: a
survey," IEEE transactions on visualization and computer graphics, vol. 17, pp.
913-933, 2011. [31] K. Jezek, J. Dietrich, and P. Brada, "How Java APIs break – An empirical
study," Information and Software Technology.
[32] J. Singer and C. Kirkham, "Dynamic analysis of Java program concepts for
visualization and profiling," Science of Computer Programming, vol. 70, pp. 111-
126, 2008. [33] P. Lengauer, V. Bitto, and H. Mössenböck, "Accurate and Efficient Object
Tracing for Java Applications," pp. 51-62, 2015. [34] L. Marek, Y. Zheng, D. Ansaloni, L. Bulej, A. Sarimbekov, W. Binder, et al.,
"Introduction to dynamic program analysis with DiSL," Science of Computer
Programming, vol. 98, pp. 100-115, 2015.
[35] S. Diehl, Software visualization: visualizing the structure, behaviour, and evolution
of software: Springer Science & Business Media, 2007.
[36] R. Koschke, "Software visualization in software maintenance, reverse
engineering, and re-engineering: a research survey," Journal of Software
Maintenance and Evolution: Research and Practice, vol. 15, pp. 87-109, 2003.
[37] J. A. Jones, A. Orso, and M. J. Harrold, "Gammatella: Visualizing program-
execution data for deployed software," Information Visualization, vol. 3, pp.
173-188, 2004. [38] S. P. Reiss, "Visual representations of executing programs," Journal of Visual
Languages & Computing, vol. 18, pp. 126-148, 2007.
References
203
[39] F. Duseau, B. Dufour, and H. Sahraoui, "Vasco: A visual approach to explore object churn in framework-intensive applications," in Software
Maintenance (ICSM), 2012 28th IEEE International Conference on, 2012, pp. 15-
24. [40] S. Kelley, E. Aftandilian, C. Gramazio, N. Ricci, S. L. Su, and S. Z. Guyer,
"Heapviz: Interactive heap visualization for program understanding and debugging," Information Visualization, vol. 12, pp. 163-177, 2013.
[41] J. H. Cross II, T. D. Hendrix, J. Jain, and L. A. Barowski, "Dynamic object
viewers for data structures," ACM SIGCSE Bulletin, vol. 39, pp. 4-8, 2007.
[42] J. Ali, "Object visualization support for learning data structures," Information
Technology Journal, vol. 10, pp. 485-498, 2011.
[43] B. Johnson and B. Shneiderman, "Tree-maps: A space-filling approach to the
visualization of hierarchical information structures," in Visualization, 1991.
Visualization'91, Proceedings., IEEE Conference on, 1991, pp. 284-291.
[44] B. Shneiderman, "Discovering business intelligence using treemap visualizations," B-EYE-Network-Boulder, CO, USA, 2006.
[45] R. Vliegen, J. J. van Wijk, and E.-J. Van der Linden, "Visualizing business
data with generalized treemaps," Visualization and Computer Graphics, IEEE
Transactions on, vol. 12, pp. 789-796, 2006.
[46] J. Guerra-Gomez, M. L. Pack, C. Plaisant, and B. Shneiderman,
"Visualizing change over time using dynamic hierarchies: TreeVersity2 and the StemView," Visualization and Computer Graphics, IEEE Transactions on, vol.
19, pp. 2566-2575, 2013. [47] A. Fiore and M. Smith, "Treemap visualizations of Newsgroups," Technical
Report, Microsoft Research, Microsoft Corporation: Redmond, WA, 2001.
[48] M. Balzer, O. Deussen, and C. Lewerentz, "Voronoi treemaps for the
visualization of software metrics," in Proceedings of the 2005 ACM symposium on
Software visualization, 2005, pp. 165-172.
[49] A. L. Hugine, S. A. Guerlain, and F. E. Turrentine, "Visualizing surgical
quality data with treemaps," J Surg Res, vol. 191, pp. 74-83, Sep 2014.
[50] J. J. Van Wijk and H. Van de Wetering, "Cushion treemaps: Visualization of
hierarchical information," in Information Visualization, 1999.(Info Vis' 99)
Proceedings. 1999 IEEE Symposium on, 1999, pp. 73-78, 147.
[51] M. Bruls, K. Huizing, and J. J. Van Wijk, Squarified treemaps: Springer, 2000.
[52] R. Blanch and E. Lecolinet, "Browsing zoomable treemaps: Structure-aware
multi-scale navigation techniques," Visualization and Computer Graphics, IEEE
Transactions on, vol. 13, pp. 1248-1253, 2007.
References
204
[53] M. Rios-Berrios, P. Sharma, T. Y. Lee, R. Schwartz, and B. Shneiderman, "TreeCovery: Coordinated dual treemap visualization for exploring the Recovery Act," Government Information Quarterly, vol. 29, pp. 212-222, 2012.
[54] W. Collins, Data structures and the Java collections framework: Wiley Publishing,
2011. [55] D. Kawrykow and M. P. Robillard, "Improving api usage through automatic
detection of redundant code," in Automated Software Engineering, 2009. ASE'09.
24th IEEE/ACM International Conference on, 2009, pp. 111-122.
[56] R. Lämmel, E. Pek, and J. Starek, "Large-scale, AST-based API-usage
analysis of open-source Java projects," in Proceedings of the 2011 ACM
Symposium on Applied Computing, 2011, pp. 1317-1324.
[57] A. Shatnawi, A. Seriai, H. Sahraoui, and Z. Al-Shara, "Mining Software
Components from Object-Oriented APIs," in Software Reuse for Dynamic
Systems in the Cloud and Beyond. vol. 8919, I. Schaefer and I. Stamelos, Eds.,
ed: Springer International Publishing, 2014, pp. 330-347. [58] M. P. Robillard, E. Bodden, D. Kawrykow, M. Mezini, and T. Ratchford,
"Automated API property inference techniques," Software Engineering, IEEE
Transactions on, vol. 39, pp. 613-637, 2013.
[59] A. Zaidman and S. Demeyer, "Automatic identification of key classes in a
software system using webmining techniques," Journal of Software Maintenance
and Evolution: Research and Practice, vol. 20, pp. 387-417, 2008.
[60] C. R. de Souza and D. L. M. Bentolila, "Automatic evaluation of API
usability using complexity metrics and visualizations," in Software Engineering-Companion Volume, 2009. ICSE-Companion 2009. 31st International
Conference on, 2009, pp. 299-302.
[61] Y. M. Mileva, V. Dallmeier, and A. Zeller, "Mining API popularity," in
Testing–Practice and Research Techniques, ed: Springer, 2010, pp. 173-180.
[62] V. Bauer and L. Heinemann, "Understanding API Usage to Support
Informed Decision Making in Software Maintenance," in Software
Maintenance and Reengineering (CSMR), 2012 16th European Conference on, 2012,
pp. 435-440. [63] E. Moritz, M. Linares-Vásquez, D. Poshyvanyk, M. Grechanik, C.
McMillan, and M. Gethers, "Export: Detecting and visualizing api usages in large source code repositories," in Automated Software Engineering (ASE), 2013
IEEE/ACM 28th International Conference on, 2013, pp. 646-651.
[64] M. A. Saied, O. Benomar, H. Abdeen, and H. Sahraoui, "Mining Multi-level
API Usage Patterns," in Software Analysis, Evolution and Reengineering
(SANER), 2015 IEEE 22nd International Conference on, 2015, pp. 23-32.
References
205
[65] J. Yin, C. Ma, and S.-M. Hu, "PAST: accurate instrumentation on fully optimized program," Software: Practice and Experience, pp. n/a-n/a, 2015.
[66] H. Mitasova, R. S. Harmon, K. J. Weaver, N. J. Lyons, and M. F. Overton,
"Scientific visualization of landscapes and landforms," Geomorphology, vol.
137, pp. 122-137, 2012. [67] W. Merzkirch, Flow visualization: Elsevier, 2012.
[68] D. Patel, S. Bruckner, I. Viola, and E. Groller, "Seismic volume visualization
for horizon extraction," in Pacific Visualization Symposium (PacificVis), 2010
IEEE, 2010, pp. 73-80.
[69] A. Kuhn, D. Erni, P. Loretan, and O. Nierstrasz, "Software cartography:
Thematic software visualization with consistent layout," Journal of Software
Maintenance and Evolution: Research and Practice, vol. 22, pp. 191-210, 2010.
[70] R. Marty, Applied security visualization: Addison-Wesley Upper Saddle River,
2009. [71] B. Shneiderman and A. Aris, "Network visualization by semantic substrates,"
Visualization and Computer Graphics, IEEE Transactions on, vol. 12, pp. 733-740,
2006. [72] B. Shneiderman, "The eyes have it: A task by data type taxonomy for
information visualizations," in Visual Languages, 1996. Proceedings., IEEE
Symposium on, 1996, pp. 336-343.
[73] E. H. Chi, "A taxonomy of visualization techniques using the data state reference model," in Information Visualization, 2000. InfoVis 2000. IEEE
Symposium on, 2000, pp. 69-75.
[74] D. A. Keim, "Information visualization and visual data mining," Visualization
and Computer Graphics, IEEE Transactions on, vol. 8, pp. 1-8, 2002.
[75] S. Lange, H. Schumann, W. Müller, and D. Krömker, "Problem-oriented
visualisation of multi-dimensional data sets," in Proceedings of the International
Symposium and Scientific Visualization, 1995, pp. 1-15.
[76] G. G. Grinstein, P. Hoffman, R. M. Pickett, and S. J. Laskowski,
"Benchmark development for the evaluation of visualization for data mining," Information visualization in data mining and knowledge discovery, pp.
129-176, 2002.
[77] A. E.-T. Guettala, F. Bouali, C. Guinot, and G. Venturini, "A user assistant
for the selection and parameterization of the visualizations in visual data mining," in Information Visualisation (IV), 2012 16th International Conference on,
2012, pp. 252-257.
[78] W. T. Laaser, N. P. Dearden, and T. G. MacNary, "Automatic rules driven
data visualization selection," ed: Google Patents, 2009.
[79] B. L. Chronister, D. P. Cory, and D. B. Lee, "Ranking visualization types
based upon fitness for visualizing a data set," ed: Google Patents, 2014.
[80] H.-J. Schulz, T. Nocke, M. Heitzler, and H. Schumann, "A design space of
visualization tasks," Visualization and Computer Graphics, IEEE Transactions on,
vol. 19, pp. 2366-2375, 2013.
[81] Y. Tanahashi and K.-L. Ma, "Design Considerations for Optimizing
Storyline Visualizations," Visualization and Computer Graphics, IEEE
Transactions on, vol. 18, pp. 2679-2688, 2012.
[82] D. House, A. Bair, and C. Ware, "On the optimization of visualizations of
complex phenomena," in Visualization, 2005. VIS 05. IEEE, 2005, pp. 87-94.
[83] C. G. Healey, R. S. Amant, and M. S. Elhaddad, "Via: A perceptual
visualization assistant," in 28th AIPR Workshop: 3D Visualization for Data
Exploration and Decision Making, 2000, pp. 2-11.
[84] N. Marrero, "Visualization metrics: An overview," 2007.
[85] J. Rigau, M. Feixas, and M. Sbert, "Informational aesthetics measures,"
IEEE Computer Graphics and Applications, pp. 24-34, 2008.
[86] C. E. Shannon, "A mathematical theory of communication," ACM
SIGMOBILE Mobile Computing and Communications Review, vol. 5, pp. 3-55,
2001.
[87] M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its
applications: Springer Science & Business Media, 2013.
[88] C. Li and T. Chen, "Aesthetic visual quality assessment of paintings," Selected
Topics in Signal Processing, IEEE Journal of, vol. 3, pp. 236-252, 2009.
[89] V. Matvienko and J. Kruger, "A metric for the evaluation of dense vector
field visualizations," Visualization and Computer Graphics, IEEE Transactions on,
vol. 19, pp. 1122-1132, 2013.
[90] D. J. Lehmann, S. Hundt, and H. Theisel, "A Study on Quality Metrics vs.
Human Perception: Can Visual Measures Help us to Filter Visualizations of Interest?," it–Information Technology, vol. 57, p. 1, 2015.
[91] T. O. Aydin, A. Smolic, and M. Gross, "Automated Aesthetic Analysis of
Photographic Images," Visualization and Computer Graphics, IEEE Transactions
on, vol. 21, pp. 31-42, 2015.
[92] A. Vande Moere, M. Tomitsch, C. Wimmer, B. Christoph, and T.
Grechenig, "Evaluating the effect of style in information visualization," Visualization and Computer Graphics, IEEE Transactions on, vol. 18, pp. 2739-
2748, 2012.
[93] C. Demiralp, M. Bernstein, and J. Heer, "Learning perceptual kernels for visualization design," 2014.
[94] K. Hartmann, T. Götzelmann, K. Ali, and T. Strothotte, "Metrics for
functional and aesthetic label layouts," in Smart Graphics, 2005, pp. 115-126.
[95] A. Dasgupta and R. Kosara, "Pargnostics: Screen-space metrics for parallel
coordinates," Visualization and Computer Graphics, IEEE Transactions on, vol.
16, pp. 1017-1026, 2010.
[96] G. Albuquerque, M. Eisemann, and M. Magnor, "Perception-based visual
quality measures," in Visual Analytics Science and Technology (VAST), 2011
IEEE Conference on, 2011, pp. 13-20.
[97] N. Kong, J. Heer, and M. Agrawala, "Perceptual guidelines for creating
rectangular treemaps," Visualization and Computer Graphics, IEEE Transactions
on, vol. 16, pp. 990-998, 2010.
[98] W. Lin and C. C. Jay Kuo, "Perceptual visual quality metrics: A survey,"
Journal of Visual Communication and Image Representation, vol. 22, pp. 297-312,
2011.
[99] T. Isenberg, P. Isenberg, J. Chen, M. Sedlmair, and T. Moller, "A systematic
review on the practice of evaluating visualization," Visualization and Computer
Graphics, IEEE Transactions on, vol. 19, pp. 2818-2827, 2013.
[100] L. Harrison, Y. Fumeng, S. Franconeri, and R. Chang, "Ranking
Visualizations of Correlation Using Weber's Law," Visualization and
Computer Graphics, IEEE Transactions on, vol. 20, pp. 1943-1952, 2014.
[101] N. Cawthon and A. V. Moere, "The effect of aesthetic on the usability of
data visualization," in Information Visualization, 2007. IV'07. 11th International
Conference, 2007, pp. 637-648.
[102] J. S. Yi, Y.-a. Kang, J. T. Stasko, and J. A. Jacko, "Understanding and
characterizing insights: how do people gain insights using information visualization?," in Proceedings of the 2008 Workshop on BEyond time and errors:
novel evaLuation methods for Information Visualization, 2008, p. 4.
[103] C. North, "Toward measuring visualization insight," Computer Graphics and
Applications, IEEE, vol. 26, pp. 6-9, 2006.
[104] M. A. Borkin, A. A. Vo, Z. Bylinskii, P. Isola, S. Sunkavalli, A. Oliva, et al.,
"What makes a visualization memorable?," Visualization and Computer
Graphics, IEEE Transactions on, vol. 19, pp. 2306-2315, 2013.
[105] J. Schneidewind, M. Sips, and D. A. Keim, "An automated approach for the
optimization of pixel-based visualizations," Information Visualization, vol. 6,
pp. 75-88, 2007.
[106] R. Fuchs, J. Waser, and M. E. Groller, "Visual human+ machine learning," Visualization and Computer Graphics, IEEE Transactions on, vol. 15, pp. 1327-
1334, 2009.
[107] N. Elmqvist, P. Dragicevic, and J. D. Fekete, "Color Lens: Adaptive Color
Scale Optimization for Visual Exploration," IEEE Trans Vis Comput Graph,
Jun 11 2010.
[108] S. Lee, M. Sips, and H.-P. Seidel, "Perceptually driven visibility optimization
for categorical data visualization," Visualization and Computer Graphics, IEEE
Transactions on, vol. 19, pp. 1746-1757, 2013.
[109] S. J. Mason, S. B. Cleveland, P. Llovet, C. Izurieta, and G. C. Poole, "A
centralized tool for managing, archiving, and serving point-in-time data in ecological research laboratories," Environmental Modelling & Software, vol. 51,
pp. 59-69, 2014.
[110] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data mining with
big data," Knowledge and Data Engineering, IEEE Transactions on, vol. 26, pp.
97-107, 2014.
[111] H. Li, D. Liang, L. Xie, G. Zhang, and K. Ramamritham, "Flash-Optimized
Temporal Indexing for Time-Series Data Storage on Sensor Platforms," ACM
Transactions on Sensor Networks (TOSN), vol. 10, p. 62, 2014.
[112] S. Bajaj and R. Sion, "TrustedDB: A trusted hardware-based database with
privacy and data confidentiality," Knowledge and Data Engineering, IEEE
Transactions on, vol. 26, pp. 752-765, 2014.
[113] J. M. Banda, M. A. Schuh, R. A. Angryk, K. G. Pillai, and P. McInerney,
"Big data new frontiers: Mining, search and management of massive repositories of solar image data and solar events," in New Trends in Databases
and Information Systems, ed: Springer, 2014, pp. 151-158.
[114] Z. Halim, A. R. Baig, and K. Zafar, "Evolutionary Search in the Space of
Rules for Creation of New Two-Player Board Games," International Journal
on Artificial Intelligence Tools, vol. 23, p. 1350028, 2014.
[115] D. A. Keim, J. Kohlhammer, G. Ellis, and F. Mansmann, Mastering the
information age-solving problems with visual analytics: Florian Mansmann, 2010.
[116] M. X. Zhou and S. K. Feiner, "Data characterization for automatically
visualizing heterogeneous information," in Information Visualization'96,
Proceedings IEEE Symposium on, 1996, pp. 13-20.
[117] D. Keim, "Designing pixel-oriented visualization techniques: Theory and
applications," Visualization and Computer Graphics, IEEE Transactions on, vol.
6, pp. 59-78, 2000.
[118] Y.-a. Kang and J. Stasko, "Characterizing the intelligence analysis process: Informing visual analytics design through a longitudinal field study," in Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, 2011,
pp. 21-30.
[119] (2015, July 2015). Dataset. Available: http://ming.org.pk/datasets.htm
[120] S. Lindholm, M. Falk, E. Sundén, A. Bock, A. Ynnerman, and T. Ropinski,
"Hybrid data visualization based on depth complexity histogram analysis," in Computer Graphics Forum, 2015, pp. 74-85.
[121] D. W. Scott, Multivariate density estimation: theory, practice, and visualization:
John Wiley & Sons, 2015.
[122] M. Zhou, L. O. Hall, D. B. Goldgof, R. J. Gillies, and R. A. Gatenby,
"Decoding brain cancer dynamics: a quantitative histogram-based approach using temporal MRI," in SPIE Medical Imaging, 2015, pp. 94142H-94142H-5.
[123] M. Kotera, Y. Moriya, T. Tokimatsu, M. Kanehisa, and S. Goto, "KEGG
and GenomeNet, new developments, metagenomic analysis," in Encyclopedia
of Metagenomics, ed: Springer, 2015, pp. 329-339.
[124] W. R. Stauffer, A. Lak, P. Bossaerts, and W. Schultz, "Economic choices
reveal probability distortion in macaque monkeys," The Journal of
Neuroscience, vol. 35, pp. 3146-3154, 2015.
[125] O. Špakov and D. Miniotas, "Visualization of eye gaze data using heat
maps," Elektronika ir Elektrotechnika, vol. 74, pp. 55-58, 2015.
[126] J. A. Guerra-Gómez, M. L. Pack, C. Plaisant, and B. Shneiderman,
"Discovering temporal changes in hierarchical transportation data: Visual analytics & text reporting tools," Transportation Research Part C: Emerging
Technologies, vol. 51, pp. 167-179, 2015.
[127] J. Heinrich, J. Stasko, and D. Weiskopf, "The parallel coordinates matrix,"
EuroVis–Short Papers, pp. 37-41, 2012.
[128] A. Inselberg, Parallel coordinates: Springer, 2009.
[129] C. Shenghui, J. Zhifang, Q. Qi, S. Li, and X. Meng, "The Polar Parallel
Coordinates Method for Time-Series Data Visualization," in Computational
and Information Sciences (ICCIS), 2012 Fourth International Conference on, 2012,
pp. 179-182.
[130] X. Yuan, P. Guo, H. Xiao, H. Zhou, and H. Qu, "Scattering points in
parallel coordinates," Visualization and Computer Graphics, IEEE Transactions
on, vol. 15, pp. 1001-1008, 2009.
[131] M. Kanazaki, T. Matsuno, K. Maeda, and H. Kawazoe, "Wind Tunnel
Evaluation Based Design of Lift Creating Cylinder Using Plasma Actuators,"
in Proceedings of the 18th Asia Pacific Symposium on Intelligent and Evolutionary
Systems, Volume 1, 2015, pp. 663-677.
[132] J. A. Schwabish, "An Economist's Guide to Visualizing Data," The Journal of
Economic Perspectives, vol. 28, pp. 209-233, 2014.
[133] (2015, June 2015). Line Chart online. Available:
https://developers.google.com/chart/interactive/docs/gallery/linechart
[134] C. L. Paul, "Analyzing Card-Sorting Data Using Graph Visualization,"
Journal of Usability Studies, vol. 9, pp. 87-104, 2014.
[135] M. Friendly, "A brief history of data visualization," in Handbook of data
visualization, ed: Springer, 2008, pp. 15-56.
[136] S. Jamil, A. Khan, and Z. Halim, "Weighted MUSE for frequent sub-graph
pattern finding in uncertain DBLP data," in Internet Technology and
Applications (iTAP), 2011 International Conference on, 2011, pp. 1-6.
[137] Z. Halim, M. M. Gul, N. Ul Hassan, R. Baig, S. U. Rehman, and F. Naz,
"Malicious users' circle detection in social network based on spatio-temporal co-occurrence," in Computer Networks and Information Technology (ICCNIT),
2011 International Conference on, 2011, pp. 35-39.
[138] I. Herman, G. Melançon, and M. S. Marshall, "Graph visualization and
navigation in information visualization: A survey," Visualization and Computer
Graphics, IEEE Transactions on, vol. 6, pp. 24-43, 2000.
[139] (2015, 13 June 2015). HCIL Treemap. Available:
http://www.cs.umd.edu/hcil/ [140] F. B. Viegas, M. Wattenberg, F. Van Ham, J. Kriss, and M. McKeon,
"Manyeyes: a site for visualization at internet scale," Visualization and
Computer Graphics, IEEE Transactions on, vol. 13, pp. 1121-1128, 2007.
[141] A. Shukla, R. Tiwari, and R. Kala, Real life applications of soft computing: CRC
Press, 2010.
[142] M. Sivanandam, Introduction to artificial neural networks: Vikas Publishing House Pvt Ltd, 2009.
[143] D. Floreano and C. Mattiussi, Bio-inspired artificial intelligence: theories,
methods, and technologies: MIT press, 2008.
[144] H. Demuth, M. Beale, and M. Hagan, "Neural Network Toolbox 6 User's Guide," The MathWorks, 2009.
[145] F. Gao and G. Ge, "Optimal ternary constant-composition codes of weight
four and distance five," Information Theory, IEEE Transactions on, vol. 57, pp.
3742-3757, 2011.
[146] L. Xiong, L. Jiao, S. Mao, and L. Zhang, "Active learning based on coupled KNN pseudo pruning," Neural Computing and Applications, vol. 21, pp. 1669-
1686, 2012.
[147] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and
techniques: Morgan Kaufmann, 2005.
[148] L. Breiman, "Random forests," Machine learning, vol. 45, pp. 5-32, 2001.
[149] K.-B. Duan and S. S. Keerthi, "Which is the best multiclass SVM method?
An empirical study," in Multiple Classifier Systems, ed: Springer, 2005, pp. 278-
285.
[150] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "An
overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes," Pattern
Recognition, vol. 44, pp. 1761-1776, 2011.
[151] A. Bahrammirzaee, "A comparative survey of artificial intelligence
applications in finance: artificial neural networks, expert system and hybrid intelligent systems," Neural Computing and Applications, vol. 19, pp. 1165-1195,
2010.
[152] S.-H. Huang and Y.-C. Pan, "Automated visual inspection in the
semiconductor industry: A survey," Computers in Industry, vol. 66, pp. 1-10,
2015.
[153] D. Thom and T. Ertl, "TreeQueST: A Treemap-Based Query Sandbox for
Microdocument Retrieval," in System Sciences (HICSS), 2015 48th Hawaii
International Conference on, 2015, pp. 1714-1723.
[154] M. L. Huang, T.-H. Huang, and X. Zhang, "A novel virtual node approach for interactive visual analytics of big datasets in parallel coordinates," Future
Generation Computer Systems, 2015.
[155] W. Huang, M. L. Huang, and C.-C. Lin, "Evaluating Overall Quality of Graph Visualizations Based on Aesthetics Aggregation," Information Sciences,
2015.
[156] C. Plaisant, "The challenge of information visualization evaluation," in
Proceedings of the working conference on Advanced visual interfaces, 2004, pp. 109-
116.
[157] E. R. Tufte and E. Weise Moeller, Visual explanations: images and quantities,
evidence and narrative vol. 36: Graphics Press Cheshire, CT, 1997.
[158] J. Mackinlay, "Automating the design of graphical presentations of relational
information," Acm Transactions On Graphics (Tog), vol. 5, pp. 110-141, 1986.
[159] W. Huang, P. Eades, S.-H. Hong, and C.-C. Lin, "Improving multiple
aesthetics produces better graph drawings," Journal of Visual Languages &
Computing, vol. 24, pp. 262-272, 2013.
[160] C. Bennett, J. Ryall, L. Spalteholz, and A. Gooch, "The Aesthetics of Graph
Visualization," in Computational Aesthetics, 2007, pp. 57-64.
[161] S. Tak and A. Cockburn, "Enhanced Spatial Stability with Hilbert and
Moore Treemaps," IEEE Trans Vis Comput Graph, Apr 10 2012.
[162] Y. Tu and H.-W. Shen, "Visualizing changes of hierarchical data using
treemaps," Visualization and Computer Graphics, IEEE Transactions on, vol. 13,
pp. 1286-1293, 2007.
[163] A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M.
Magnor, et al., "Combining automated analysis and visualization techniques
for effective exploration of high-dimensional data," in Visual Analytics Science
and Technology, 2009. VAST 2009. IEEE Symposium on, 2009, pp. 59-66.
[164] M. Borkin, A. Vo, Z. Bylinskii, P. Isola, S. Sunkavalli, A. Oliva, et al., "What
makes a visualization memorable?," Visualization and Computer Graphics, IEEE
Transactions on, vol. 19, pp. 2306-2315, 2013.
[165] D. Ren, T. Hollerer, and X. Yuan, "iVisDesigner: Expressive Interactive
Design of Information Visualizations," IEEE Transactions on Visualization and
Computer Graphics, vol. 20, pp. 2092-2101, 2014.
[166] A. Lau and A. Vande Moere, "Towards a model of information aesthetics in
information visualization," in Information Visualization, 2007. IV'07. 11th
International Conference, 2007, pp. 87-92.
[167] D. Kawrykow and M. P. Robillard, "Detecting inefficient API usage," in
Software Engineering-Companion Volume, 2009. ICSE-Companion 2009. 31st
International Conference on, 2009, pp. 183-186.
[168] G. d. F. Carneiro, R. C. Magnavita, E. Spinola, F. Spinola, and M.
Mendonça, "Evaluating the usefulness of software visualization in supporting software comprehension activities," in Proceedings of the Second ACM-IEEE
international symposium on Empirical software engineering and measurement, 2008,
pp. 276-278.
[169] W. De Pauw, D. Kimelman, and J. Vlissides, "Modeling object-oriented
program execution," in Object-Oriented Programming, ed: Springer, 1994, pp.
163-182.
[170] (2014, 13 May 2014). Available: http://commons.apache.org/bcel/
[171] E. Bruneton, R. Lenglet, and T. Coupaye, "ASM: a code manipulation tool
to implement adaptable systems," Adaptable and extensible component systems,
vol. 30, 2002.
[172] J. Heer, S. K. Card, and J. A. Landay, "Prefuse: a toolkit for interactive information visualization," in Proceedings of the SIGCHI conference on Human
factors in computing systems, 2005, pp. 421-430.
[173] (20 January 201). JEdit. Available: http://www.jedit.org/
[174] V. Setlur and M. C. Stone, "A Linguistic Approach to Categorical Color Assignment for Data Visualization," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 698-707, 2016.
[175] C. North, P. Saraiya, and K. Duca, "A comparison of benchmark task and insight evaluation methods for information visualization," Information Visualization, 2011.
[176] P. Saraiya, C. North, and K. Duca, "An insight-based methodology for evaluating bioinformatics visualizations," IEEE Transactions on Visualization and Computer Graphics, vol. 11, pp. 443-456, 2005.
Appendix A
214
Automatic Visualization Selection
Table A.1 Single hidden layer NN structures: training time and MSE (train/test/validation split)
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-01-8 46 01 0.0818 0.0852 0.0799
05-02-8 45 02 0.0794 0.0817 0.0789
05-03-8 38 02 0.0462 0.0456 0.0526
05-04-8 34 01 0.0301 0.0492 0.0345
05-05-8 30 01 0.022 0.023 0.0225
05-06-8 25 02 0.0145 0.0182 0.02
05-07-8 21 01 0.0195 0.0308 0.0266
05-08-8 16 01 0.0098 0.0156 0.0086
05-09-8 14 01 0.01 0.0359 0.0314
05-10-8 24 01 0.0118 0.0252 0.0218
05-11-8 11 01 0.0106 0.013 0.0232
05-12-8 10 01 0.0103 0.0253 0.0138
05-13-8 18 01 0.0103 0.0108 0.0148
05-14-8 18 01 0.0097 0.0145 0.0077
05-15-8 13 01 0.01 0.0231 0.0153
05-16-8 12 01 0.0101 0.0154 0.0136
05-17-8 13 01 0.0101 0.03 0.0293
05-18-8 24 02 0.0125 0.0196 0.0202
05-19-8 14 01 0.0145 0.0156 0.0263
05-20-8 18 01 0.0139 0.0121 0.0218
05-21-8 10 02 0.0713 0.0698 0.0567
05-22-8 12 01 0.0108 0.0391 0.0179
05-23-8 06 01 0.0098 0.0221 0.0245
05-24-8 09 01 0.0158 0.0207 0.0299
05-25-8 16 02 0.0115 0.0161 0.0176
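In these tables a structure label of the form "A-B-8" denotes A input features, B hidden neurons, and 8 output units, one per visualization technique (so "05-08-8" is a 5-8-8 network). As a rough illustration only (the thesis experiments used the MATLAB Neural Network Toolbox [144]; the numpy sketch below uses synthetic stand-in data), the MSE quantity tabulated in the Train/Test/Validation columns can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_in=5, n_hidden=8, n_out=8):
    """Weights for one '05-08-8' network: 5 inputs, 8 hidden, 8 outputs."""
    return {
        "W1": rng.standard_normal((n_in, n_hidden)) * 0.1,
        "b1": np.zeros(n_hidden),
        "W2": rng.standard_normal((n_hidden, n_out)) * 0.1,
        "b2": np.zeros(n_out),
    }

def forward(net, X):
    # tanh hidden layer with a linear output layer, a common Toolbox default
    H = np.tanh(X @ net["W1"] + net["b1"])
    return H @ net["W2"] + net["b2"]

def mse(net, X, T):
    """Mean squared error over a data split, as reported in the tables."""
    return float(np.mean((forward(net, X) - T) ** 2))

# Synthetic stand-in data: 40 samples, 5 features, 8 one-hot target classes.
X = rng.standard_normal((40, 5))
T = np.eye(8)[rng.integers(0, 8, size=40)]
print(mse(init_mlp(), X, T))
```

Training (backpropagation) is omitted here; the sketch only fixes the structure notation and the error measure being reported.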
Table A.2 Two hidden layer NN structures: training time and MSE
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-05-02-8 23 01 0.0589 0.0547 0.0711
05-05-05-8 29 01 0.02 0.0259 0.0317
05-07-05-8 31 02 0.0135 0.0305 0.0291
05-08-05-8 46 03 0.0189 0.0283 0.0235
05-08-07-8 14 01 0.0092 0.0115 0.0142
05-12-11-8 16 02 0.0094 0.0128 0.0066
05-15-10-8 10 02 0.0108 0.032 0.0112
05-20-14-8 09 03 0.0097 0.0202 0.0096
05-24-16-8 10 02 0.009 0.0109 0.0112
05-30-20-8 11 03 0.0089 0.0172 0.0096
Table A.3 Networks trained with training and test data only
Network Structure    Iterations    Time    Train-MSE    Test-MSE
05-01-8 200 08 0.0791 0.0831
05-02-8 200 08 0.0566 0.0741
05-03-8 200 09 0.0553 0.0589
05-04-8 200 09 0.0464 0.0559
05-05-8 200 10 0.0179 0.0270
05-06-8 200 11 0.0138 0.0113
05-07-8 200 12 0.0215 0.0192
05-08-8 200 13 0.0167 0.0261
05-09-8 200 13 0.0108 0.0480
05-10-8 200 15 0.0101 0.0115
05-11-8 200 15 0.0050 0.0160
05-12-8 200 16 0.0055 0.0056
05-13-8 200 17 0.0026 4.8874
05-14-8 200 18 0.0040 0.0291
05-15-8 200 19 0.0035 0.1509
05-16-8 200 21 0.0045 0.2878
05-17-8 200 22 0.0044 0.7050
05-18-8 200 23 0.0029 6.5193
05-19-8 200 25 0.0025 0.0208
05-20-8 200 26 0.0039 7.1763
05-21-8 200 28 0.0037 0.7860
05-22-8 200 29 0.0054 0.0328
05-23-8 200 31 0.0024 2.8556
05-24-8 200 33 0.0033 0.0053
05-25-8 200 34 0.0023 9.0936
Table A.4 NN with validation check (early stopping)
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-01-8 27 01 0.0811 0.0766 0.0837
05-02-8 26 01 0.0595 0.0609 0.0646
05-03-8 48 02 0.038 0.0622 0.0347
05-04-8 21 01 0.0314 0.0389 0.0316
05-05-8 18 01 0.0383 0.0423 0.0455
05-06-8 22 01 0.0507 0.0571 0.0508
05-07-8 35 02 0.017 0.0166 0.0205
05-08-8 22 01 0.0121 0.0182 0.0151
05-09-8 29 02 0.0132 0.0229 0.0324
05-10-8 23 01 0.0145 0.0191 0.0275
05-11-8 110 07 0.0057 0.0071 0.0067
05-12-8 27 02 0.0152 0.027 0.0246
05-13-8 22 02 0.0134 0.0234 0.0136
05-14-8 44 04 0.0066 0.0119 0.0125
05-15-8 108 10 0.0062 0.01 0.0108
05-16-8 58 05 0.007 0.0291 0.0068
05-17-8 29 03 0.0057 0.016 0.0147
05-18-8 58 06 0.0049 0.0166 0.0207
05-19-8 19 02 0.0095 0.0095 0.0215
05-20-8 28 03 0.0076 0.0155 0.0102
05-21-8 36 04 0.007 0.012 0.011
05-22-8 18 02 0.0095 0.0166 0.0105
05-23-8 19 02 0.0079 0.05 0.014
05-24-8 15 02 0.0131 0.0251 0.0224
05-25-8 22 03 0.0096 0.0236 0.0134
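The validation-check runs in Table A.4 halt training once the validation MSE stops improving. A minimal sketch of that stopping rule, assuming a MATLAB-style `max_fail` threshold of 6 consecutive failed validation checks (the exact threshold is not stated in the text):

```python
def train_with_early_stop(val_errors, max_fail=6):
    """Return the iteration at which training halts, given the per-iteration
    validation MSE trace. Stops after `max_fail` consecutive checks without
    a new best validation error; otherwise runs to the end of the trace."""
    best, fails = float("inf"), 0
    for i, err in enumerate(val_errors):
        if err < best:
            best, fails = err, 0  # improvement: reset the failure counter
        else:
            fails += 1
            if fails >= max_fail:
                return i  # early stop
    return len(val_errors) - 1

# Validation error improves until iteration 4, then rises; training halts at
# iteration 10, the sixth consecutive failed check.
trace = [0.08, 0.05, 0.03, 0.02, 0.018, 0.02, 0.021,
         0.022, 0.023, 0.024, 0.025, 0.026]
print(train_with_early_stop(trace))  # 10
```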
Table A.5 NN with goal-based stop / validation stop
Network Structure    Iterations    Time    Train-MSE    Test-MSE    Validation-MSE
05-01-8 46 01 0.0818 0.0852 0.0799
05-02-8 45 02 0.0794 0.0817 0.0789
05-03-8 38 02 0.0462 0.0456 0.0526
05-04-8 34 01 0.0301 0.0492 0.0345
05-05-8 30 01 0.022 0.023 0.0225
05-06-8 25 02 0.0145 0.0182 0.02
05-07-8 21 01 0.0195 0.0308 0.0266
05-08-8 16 01 0.0098 0.0156 0.0086
05-09-8 14 01 0.01 0.0359 0.0314
05-10-8 24 01 0.0118 0.0252 0.0218
05-11-8 11 01 0.0106 0.013 0.0232
05-12-8 10 01 0.0103 0.0253 0.0138
05-13-8 18 01 0.0103 0.0108 0.0148
05-14-8 18 01 0.0097 0.0145 0.0077
05-15-8 13 01 0.01 0.0231 0.0153
05-16-8 12 01 0.0101 0.0154 0.0136
05-17-8 13 01 0.0101 0.03 0.0293
05-18-8 24 02 0.0125 0.0196 0.0202
05-19-8 14 01 0.0145 0.0156 0.0263
05-20-8 18 01 0.0139 0.0121 0.0218
05-21-8 10 02 0.0713 0.0698 0.0567
05-22-8 12 01 0.0108 0.0391 0.0179
05-23-8 06 01 0.0098 0.0221 0.0245
05-24-8 09 01 0.0158 0.0207 0.0299
05-25-8 16 02 0.0115 0.0161 0.0176
Table A.6 Single hidden layer NN structures and MSEs (10-fold CV)
Network Structure Train-MSE Test-MSE Validation-MSE
05-01-8 0.0812 0.0829 0.0817
05-02-8 0.0641 0.0701 0.0670
05-03-8 0.0544 0.0566 0.0600
05-04-8 0.0424 0.0490 0.0434
05-05-8 0.0347 0.0404 0.0403
05-06-8 0.0299 0.0367 0.0374
05-07-8 0.0200 0.0268 0.0261
05-08-8 0.0174 0.024 0.0202
05-09-8 0.014 0.025 0.0236
05-10-8 0.0144 0.0205 0.0231
05-11-8 0.0121 0.0206 0.0192
05-12-8 0.0108 0.0243 0.0212
05-13-8 0.0114 0.0197 0.0181
05-14-8 0.0101 0.0254 0.0213
05-15-8 0.0124 0.0240 0.0226
05-16-8 0.0090 0.0201 0.0257
05-17-8 0.0088 0.0196 0.0192
05-18-8 0.0119 0.0244 0.0200
05-19-8 0.0099 0.0261 0.0215
05-20-8 0.0089 0.0212 0.0190
05-21-8 0.0091 0.0226 0.0197
05-22-8 0.0090 0.0194 0.0163
05-23-8 0.0072 0.0213 0.0158
05-24-8 0.0098 0.0228 0.0210
05-25-8 0.0067 0.0275 0.0191
Table A.7 Comparison of classification and prediction accuracy (10-fold CV)
Network Structure    Training accuracy (%)    Testing accuracy (%)
05-01-8 47.77 52.50
05-02-8 60.00 72.50
05-03-8 75.27 72.50
05-04-8 79.44 82.50
05-05-8 84.16 82.50
05-06-8 89.72 87.50
05-07-8 92.22 90.00
05-08-8 93.88 95.00
05-09-8 94.16 92.50
05-10-8 95.27 95.00
05-11-8 96.11 95.00
05-12-8 96.38 97.50
05-13-8 96.11 97.50
05-14-8 95.00 95.00
05-15-8 95.00 95.00
05-16-8 96.94 97.50
05-17-8 96.38 95.00
05-18-8 96.66 97.50
05-19-8 96.83 95.00
05-20-8 98.33 95.00
05-21-8 96.38 95.00
05-22-8 98.61 95.00
05-23-8 96.66 95.00
05-24-8 96.38 95.55
05-25-8 96.38 95.00
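The accuracies in Tables A.6 and A.7 are averages over 10 cross-validation folds. A minimal sketch of that protocol, with a toy majority-vote predictor standing in for the trained network (the data and predictor here are illustrative, not the thesis dataset or model):

```python
import numpy as np

def ten_fold_indices(n, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validated_accuracy(predict_fn, X, y, k=10):
    """Average held-out accuracy over k folds, as reported in Table A.7."""
    folds = ten_fold_indices(len(X), k)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = predict_fn(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(y_pred == y[test_idx]))
    return float(np.mean(accs))

# Toy stand-in classifier: always predicts the majority training label.
def majority_vote(X_tr, y_tr, X_te):
    values, counts = np.unique(y_tr, return_counts=True)
    return np.full(len(X_te), values[np.argmax(counts)])

X = np.arange(100).reshape(100, 1)
y = np.array([0] * 70 + [1] * 30)
print(cross_validated_accuracy(majority_vote, X, y))
```

Replacing `majority_vote` with a trained network's predict function reproduces the evaluation loop behind the tabulated numbers.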
Table A.8 Information visualization results for the Iris dataset
[Figures: the Iris dataset rendered as a histogram, pie chart, line chart, parallel coordinates, scatter plot, linked graph, and treemap.]
Appendix B
223
Treemap Visualization
[Figures: treemap visualizations of collection API usage for JEdit, JHotDraw, Prefuse, jImage, Browser, JMoney, Eclipse, Fire, FreeMind, and M3D.]
Appendix C
226
Visualization Techniques
[Figures: example outputs of the Jinsight, Gammatella, LBM, JIVE, Vasco, Heapviz, Treemap, TreeCovery, Object-Visualization, and Export techniques.]
Appendix D
228
User Forms, Handouts
D-1 Subject's Evaluation Questionnaire Form
Instructions:
Kindly fill in this questionnaire carefully with correct information.
All collected information will be used solely for this research and will remain anonymous.
Participation in this study is voluntary.
1. Name:
2. Age(in years):
3. Field of study:
4. Qualification:
5. Experience with computer (in years):
6. Experience with Java (in years):
7. How do you rate your knowledge of Java:
8. Do you have experience with visualization tools:
Thank You
General Information
Experience Information
Basic / Intermediate / Advanced
Yes / No
Treemap visualization and collection APIs usage: Handout
This experiment evaluates collection API usage information using the treemap visualization tool. The participants of the study are divided into two groups: the first group is provided with the proposed treemap visualization tool, while the second group is provided only a log file and may use Excel to find the required information. Subjects of both groups perform tasks based on the material provided to them, recording the start and end time of each task. The subjects also provide a usability score for each task carried out with their tool, on a scale of 0-5, where 0 means not usable and 5 means most usable.
Task details
Task 1: Identify the collection APIs used in a program
The basic aim of this task is to find the collection APIs used in a particular Java program. The log file of a specific Java program will be provided, and the subjects need to find and list the names of the collection APIs that appear in the log file.
Task 2: Identify the packages and classes of collection APIs in a program
The subjects are provided with a visualization of a particular Java program and need to identify the packages and their respective classes. List the names of the packages and classes of the target program responsible for the creation of collection API objects.
Task 3: List 3 classes responsible for creation of most objects of a particular API
The task is to find the three classes of the particular Java program that create the largest number of collection API objects. Review the information presented and list the three classes responsible for creating the most collection API objects.
Task 4: Identify methods that create the maximum number of collection APIs objects
During this task the subjects are required to identify and list the names of the methods responsible for creating collection API objects. List the method names in descending order of the number of objects created.
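The control group answers these tasks from a raw log file. The trace format produced by the instrumentation is not specified here, so the line format, class names, and method names below are hypothetical; the sketch only shows how the per-API, per-class, and per-method tallies the tasks ask for could be computed:

```python
from collections import Counter

# Hypothetical log format: each line records
# "creating_class creating_method api_class" for one object creation.
SAMPLE_LOG = """\
org.demo.Cache put java.util.ArrayList
org.demo.Cache put java.util.HashMap
org.demo.Index build java.util.ArrayList
org.demo.Index build java.util.ArrayList
org.demo.Parser scan java.util.HashSet
"""

def count_api_objects(log_text):
    """Tally collection API object creations per API, class, and method."""
    per_api, per_class, per_method = Counter(), Counter(), Counter()
    for line in log_text.strip().splitlines():
        cls, method, api = line.split()
        per_api[api] += 1
        per_class[cls] += 1
        per_method[f"{cls}.{method}"] += 1
    return per_api, per_class, per_method

apis, classes, methods = count_api_objects(SAMPLE_LOG)
print(sorted(apis))            # Task 1: collection APIs used
print(classes.most_common(3))  # Task 3: top object-creating classes
print(methods.most_common())   # Task 4: methods, descending by count
```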
Usability score scale
On the scale of 0-5, the following coding is used to report usability according to the participant's opinion for each task.
0--- Absolutely not usable
1--- Little usable
2--- Usable
3--- Fairly usable
4--- Very usable
5--- Most usable
Visualization optimization experiment: Handout
This experiment evaluates the visualizations optimized using the proposed approach. Each participant of the study is presented with six types of visualizations, one at a time, and performs five benchmark tasks. These tasks are carried out for all visualizations. Each participant provides a score for the parameters mental effort, visualization, and time taken. The accuracy of each task is checked in post-experiment analysis. For each task the participant is provided with a hard copy of multiple-choice questions.
Task details
Task 1: Which Java program has the maximum number of objects?
A single visualization is presented to each participant showing collection API usage in several programs. By overviewing the visualization, the participants need to find the Java program that has the maximum number of objects.
Task 2: Which collection API is used most across the programs?
For each presented visualization, find the APIs that have been used by most of the Java programs. The visualization presents several Java programs with their collection API usage; pick only the names of the APIs used by most programs.
Task 3: How many Java programs used more than 3 collection APIs?
As the visualization presents several Java programs in one view, the participants need to find the programs that used more than 3 collection APIs and list their names from the visualization.
Task 4: Which Java program used a large number of ArrayList objects?
This task evaluates the participant's ability to find and identify the program that used a comparatively large number of ArrayList objects. ArrayList may be shown in several programs; identify only the program that has the largest number of objects for this particular API.
Task 5: How many different APIs are used by the Java programs?
Each presented visualization shows different Java programs and the collection APIs they used during program execution. The participants need to find how many different types of APIs are used collectively by all programs shown in the visualization.
Mental effort score scale
On the scale of 1-5, mental effort is represented by the following codes.
1--- Minimal effort
2--- Little effort
3--- Fair effort
4--- Substantial effort
5--- Maximum effort
Visualization score scale
On the scale of 1-5, visualizations are scored according to the following scale.
1--- Poor
2--- Fair
3--- Good
4--- Excellent
5--- Outstanding
Accuracy score scale
On the scale of 1-5, each participant's response is scored on the following scale.
1--- Absolutely wrong
2--- Partially wrong
3--- Fair
4--- Partially correct
5--- Absolutely correct
D-2 Subject's Evaluation Questionnaire Form (Filled Example)
General Information
9. Name: Ahmed Ali Khan
10. Age (in years): 23
11. Field of study: Computer Science
12. Qualification: BS
Experience Information
13. Experience with computer (in years): 05
14. Experience with Java (in years): 02
15. How do you rate your knowledge of Java: (Basic / Intermediate / Advanced)
16. Do you have experience with visualization tools: (Yes / No)
Treemap visualization and collection APIs usage
Group (Control / Experimental):        Participant #:
Task #:        Date:        Start time:        End time:
Task Title
Please answer the following questions (refer to the manual/handouts for help)
1. Provide your finding/answer for the task in this box
2. Provide a usability score of your tool for this particular task using the following scale (tick one option)
Absolutely not usable / Little usable / Usable / Fairly usable / Very usable / Most usable
3. Other comments
Visualization Optimization
Participant #:        Visualization type:
Task #:        Date:        Start time:        End time:
Task Title
Please answer the following questions (refer to the manual/handouts for help)
1. Provide your finding/answer for the task in this box
2. How much mental effort is required for this task using this visualization? Use the following scale (tick one option)
Minimal effort / Little effort / Fair effort / Substantial effort / Maximum effort
3. Provide a score for this visualization for this particular task using the following scale (tick one option)
Poor / Fair / Good / Excellent / Outstanding
4. Other comments
Date
Appendix E
237
Evolved Visualizations
[Figures: visualizations evolved with the combined, effectiveness, expressiveness, readability, and interactivity fitness measures, alongside randomly generated and state-of-the-art (SoTA) visualizations.]
Some facts
Completed research June 2014
Completed writing thesis December 2014
Internal evaluation completed July 2015
First IF paper accepted October 2015
Second IF paper accepted August 2016
Third IF paper accepted December 2016
Foreign evaluation completed June/July 2016
Examination committee approved November 2016
Number of words 49054
Number of figures 85
Number of tables 50
Number of pages 260
Thesis open defense 30 November 2016