Niklas Elmqvist - Purdue Universityelm/projects/lic-thesis/...School of Computer Science and...


Thesis for the Degree of Licentiate of Philosophy

Visualization of Causal Relations

Niklas Elmqvist

Department of Computing Science
Chalmers University of Technology and Göteborg University
412 96 Göteborg, Sweden

Göteborg, October 2004


Visualization of Causal Relations
Niklas Elmqvist

© 2004 Niklas Elmqvist.

Technical Report no. 38L
ISSN 1651-4963
School of Computer Science and Engineering

Department of Computing Science
Chalmers University of Technology and Göteborg University
412 96 Göteborg, Sweden
Telephone +46 (0)31-772 1000

Göteborg, Sweden, 2004


Abstract

The notion of cause and effect is pervasive in human thinking and plays a significant role in our perception of time. Software systems, in particular parallel and distributed ones, are permeated by this causality, and the human mind is especially well-suited to detect instances of this concept. Unfortunately, real-world systems of causally related events are often too large and complex to be comprehended unaided. In this thesis, we explore ways of using information visualization to help humans perceive these complex systems of causal relations, not only for software systems, but also for more general application areas.

The Growing Squares visualization technique uses a combination of color, texture, and animation to present a sequence of related events in a distributed system. User studies show that this technique is significantly more effective for solving problems related to causality in distributed systems than traditional Hasse diagrams for small data sets, and more effective (though not significantly so) for large data sets.

The Growing Polygons visualization technique was designed to address some of the weaknesses of the Growing Squares technique, and presents the interacting processes in a system as color-coded polygons with sectors indicating the influences and information propagation in the system. User studies show that this technique is significantly more effective than Hasse diagrams for all data sets, regardless of size.

Finally, we have conducted a case study of causality visualization in the context of scientific citation networks, creating a bibliographic visualization tool called CiteWiz. The tool contains a modified Growing Polygons visualization, suitably adapted to citation networks with linear time windows and process hierarchies, as well as a new static timeline visualization that maps the citation count and publication date of an article or author to its size and position on the timeline, respectively.

Keywords: causality visualization, causal relations, citation network visualization, information visualization.


Preface

To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.

– William Blake

This thesis came about in what could almost be seen as an accident. It was back in 2001; I was a new Ph.D student at the department, had just finalized my choice of research area, and was casting about for a suitable problem to attack as a first project. Naturally, lacking all perspective and possessing a certain form of foolhardiness, I was eager to embark on the most ambitious of projects and to attack the most difficult of problems in my area. My supervisor, Philippas Tsigas, wisely redirected and channeled my enthusiasm to the problem of effectively visualizing causal relations in distributed systems. Both he and I thought that this would be a suitable first project: limited in size, yet with many interesting and potentially significant issues to address.

That little research project has now grown far beyond what I originally envisioned it to become; in fact, it is now large enough to base my licentiate thesis on. Our original ideas, while useful and certainly an improvement over traditional techniques, have now been superseded and improved upon by newer ideas, and we continue to see new extensions to the visualization techniques we develop. In fact, the very application area we initially targeted with our first explorations has now been widened and generalized, and we are surprised to continually discover new uses for our techniques that we did not at all anticipate when we developed them. And, to my own humble satisfaction, the information visualization community has expressed an interest in our work. Most importantly, I believe that through the course of this research project, I have gained valuable insights into the art and craft of science.

Acknowledgements

Despite the fact that you can only see my name on the cover of this thesis, this work would in truth not have been possible without the friendship and support of a large number of people. First and foremost, I want to extend my deepest gratitude to my supervisor Philippas Tsigas, for introducing me to the noble path of science in the first place, and for helping me through my first stumbling steps as a computer scientist trainee. Philippas has the rare gift of being able to teach when not appearing to be teaching at all, and to guide when not appearing to be guiding at all. Thank you.

I would also like to thank the members of my Ph.D committee: Henrik Ahlberg, Ralph Schroeder, and Björn von Sydow. Your comments and encouragement have been invaluable for me as a way to get objective feedback when I was far too mixed up in my work to get any perspective at all on it. I would like to especially thank Henrik for giving me the chance to start my work as an undergraduate student at Chalmers Medialab all those years ago, most certainly one of the first steps I took on the journey I have now embarked upon.

When I started my Ph.D studies at the department, there were no research groups specializing in information visualization, and instead the Distributed Computing and Systems group took me under its wings. I am proud to be a member of such a friendly as well as successful research group, and I want to thank its past and present members (Anders Gidenstam, Boris Koldehofe, Marina Papatriantafilou, Phuong Ha, Håkan Sundell, and Yi Zhang) for putting up with my talks on visualization and computer graphics when their research interests lie in totally different directions. Overall, I am grateful to the whole Department of Computing Science at Chalmers for providing a challenging and fertile environment for my Ph.D studies.

Special thanks to Professor Stephan Diehl of Catholic University Eichstätt for agreeing to be the discussion leader for my licentiate seminar.

There are also a few very special acknowledgements I would like to express: to my parents, Anita and Lars-Gunnar, and my brother Jonas, for always providing that haven of peace and quiet I could retreat to when I needed seclusion and relaxation; to my best friend, Robert Karlsson, who, despite us no longer living less than 100 meters from each other, continually injects much-needed friendly banter into my life; and to Johanna, for loving me and for always making me smile even in my darkest hour. Thank you all. You enrich my life.

Niklas Elmqvist
Göteborg, October 2004.


Contents

1 Introduction
  1.1 Information Visualization
    1.1.1 Visualization Model
    1.1.2 Visualization Techniques
    1.1.3 Research Outlooks
  1.2 Causality Visualization
    1.2.1 Definitions
    1.2.2 Analysis Tasks
    1.2.3 Existing Work
  1.3 Citation Network Visualization
    1.3.1 Formative Evaluation
    1.3.2 Taxonomy of Citation Database Interaction
    1.3.3 Existing Work
  1.4 The CausalViz Framework
    1.4.1 System Architecture
    1.4.2 Poset Management
    1.4.3 CiteWiz Extensions
  1.5 Contributions

2 Growing Squares
  2.1 Design
  2.2 User Study
    2.2.1 Subjects
    2.2.2 Equipment
    2.2.3 Procedure
  2.3 Results
    2.3.1 Performance
    2.3.2 Subjective Ratings
  2.4 Caveats of Growing Squares

3 Growing Polygons
  3.1 Design
  3.2 User Study
    3.2.1 Subjects
    3.2.2 Equipment
    3.2.3 Procedure
  3.3 Results
    3.3.1 Performance
    3.3.2 Correctness
    3.3.3 Subjective Ratings
  3.4 Caveats of Growing Polygons

4 CiteWiz: Citation Network Visualization
  4.1 Citations as Causal Relations
  4.2 The CiteWiz Platform
    4.2.1 Datasets and Views
    4.2.2 Implementation
  4.3 Influence Visualization
    4.3.1 Linear Time Windows
    4.3.2 Hierarchical Views
    4.3.3 Interaction Techniques
    4.3.4 Parent-Child Visualization
    4.3.5 Color Assignment
    4.3.6 Details-On-Demand
  4.4 Static Timeline Visualization
  4.5 User Study
    4.5.1 Subjects
    4.5.2 Equipment
    4.5.3 Procedure
  4.6 Results
    4.6.1 Performance
    4.6.2 Correctness
    4.6.3 Subjective Ratings
  4.7 Discussion
  4.8 Conclusions
  4.9 Future Work

5 Conclusions

6 Future Work

A Growing Squares User Study
  A.1 Script
  A.2 Scenarios
    A.2.1 Duration Comparison
    A.2.2 Influence Importance
    A.2.3 Influence Assessment
    A.2.4 Inter-Node Causal Relations
  A.3 Post-Task Questionnaire
  A.4 Post-Test Questionnaire

B Growing Polygons User Study
  B.1 Script
  B.2 Scenarios
    B.2.1 Duration Comparison
    B.2.2 Influence Importance
    B.2.3 Influence Assessment
    B.2.4 Inter-Node Causal Relations
  B.3 Post-Task Questionnaire
  B.4 Post-Test Questionnaire

C CiteWiz User Study
  C.1 Script
  C.2 Scenarios
    C.2.1 Find a Paper
    C.2.2 Find the Most Influential Paper
    C.2.3 Study Author Collaboration
  C.3 Post-Test Questionnaire


Chapter 1

Introduction

It is part of human nature to not simply accept things as they are, but to search for reasons and to try and answer the question “why?”. Thus, the concepts of cause and effect have always fascinated human beings, and also lie at the core of modern science. In order to fully understand the workings of a complex system, a scientist often tries to ascertain its underlying mechanisms by observing their visible effects. Or, as Aristotle puts it in Physics II.3 [AriBC]:

Since we believe that we know a thing only when we can say why it is as it is—which in fact means grasping its primary causes (aitia)—plainly we must try to achieve this [...] so that we may know what their principles are and may refer to these principles in order to explain everything into which we inquire.

Humans are particularly adept at inferring the cause of simple physical processes merely by tracing their effects backwards, for instance by backtracking the path of a moving billiard ball on a pool table to identify the cue ball that struck it. However, as the number of action-reaction pairs grows, the human mind reaches a point when it is no longer able to cope. Continuing with the analogy above, fully comprehending the interactions, or causal relations, of all sixteen balls moving and colliding on the billiard table is impossible to do in real-time.

One way to allay this problem is to employ some kind of graphical visualization that presents the information in a more digestible format suitable for offline study. Simple directed acyclic graphs (DAGs) or Hasse diagrams (also known as time-space diagrams) offer an intuitive view of these causal relations, but are unsuitable for studying the node dependencies and information flow in a system, especially when the number of nodes and interactions grows.

In this thesis, we present two novel visualization techniques called Growing Squares and Growing Polygons, respectively, that attack the problem of effective causality visualization through the use of animation, colors, and patterns to provide an accessible overview of a system of causal relations. We also present a real-world case study using one of these techniques, Growing Polygons, to visualize large scientific citation networks. Both techniques abandon the traditional linear timeline of previous visualizations, and instead map the time parameter onto the size of the geometrical entities representing the processes (squares versus n-sided polygons, respectively). In the Growing Squares technique, we represent each process in the system as a color-coded square, laid out in a suitable way, and then grow these squares as time progresses. Events that causally relate the process squares influence their coloring, somewhat akin to how color pools would spread out on a piece of paper (see Figure 2.1). The Growing Polygons technique, on the other hand, is based on the idea of assigning each node in a system of n processes not only a color but also a triangular sector in an n-sided polygon, and having each such process polygon grow and subsequently be filled with the colors of the processes influencing it. Since both the color and position of each process sector are invariant, distinguishing between individual processes is easier than for the Growing Squares technique and the visualization is therefore more scalable.
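The sector idea behind Growing Polygons can be sketched in a few lines of Python. This is an illustration only, not the CausalViz implementation: the function names, the representation of events as a time-ordered list of (sender, receiver) pairs, and the choice to start sector 0 at angle zero are all assumptions made for this example.

```python
import math

def sector_angles(n):
    """Return the fixed (start, end) angle in radians of each of the n sectors."""
    step = 2 * math.pi / n
    return [(i * step, (i + 1) * step) for i in range(n)]

def influences(n, messages):
    """Track which processes have influenced each process.

    messages is a time-ordered list of (sender, receiver) pairs; this
    event format is an assumption made for the sketch.
    """
    # Every process starts out influenced only by itself.
    known = [{p} for p in range(n)]
    for sender, receiver in messages:
        # A message carries everything the sender has been influenced by so far.
        known[receiver] |= known[sender]
    return known

angles = sector_angles(3)
filled = influences(3, [(0, 1), (1, 2)])
```

In this hypothetical run, process 2 ends up influenced by all three processes, so all three sectors of its polygon would be colored, while process 0 only ever shows its own sector. Because sector i always occupies the same angular range in every polygon, a given process can be recognized by position as well as by color.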

Chronologically, the Growing Squares method was devised as a first alternative to Hasse diagrams, and the Growing Polygons method was later designed to address some of the weak points of the Growing Squares. Both techniques have been implemented and tested as part of a visualization framework for causal relations we have developed, allowing us to compare the new methods with each other as well as with traditional techniques (see Section 1.4). In addition, this framework allows the user to dynamically select different visualizations for the same system of causal relations, essentially making it possible for the user to harness the strengths of each technique depending on the analysis task being performed.

Formal user studies of the visualizations were performed to ensure the validity of our findings. The results from the Growing Squares study show that the Growing Squares method is significantly faster and more efficient than Hasse diagrams for sparse data sets. However, the new method is not significantly more efficient for dense data sets. Test subjects clearly favored Growing Squares over Hasse diagrams for all analysis tasks performed. Overall, the subjective ratings of the test subjects show that the Growing Squares method is easier, feels more efficient, and is more enjoyable to use than Hasse diagrams.

While the test subjects’ opinions of the Growing Squares method were clearly favorable, the study revealed considerable room for improvement in the efficiency of the technique. Fortunately, the results from our study of the Growing Polygons method are much more positive: the improved method is significantly faster and more efficient than Hasse diagrams for both sparse and dense data sets when performing tasks related to information flow in a system (i.e. not only for sparse sets as for the Growing Squares method). In addition, subjects have a much higher correctness rate using our technique to solve tasks than when using Hasse diagrams. Furthermore, the subjective ratings of the subjects show that the new method, just as the previous Growing Squares method, is perceived as more efficient as well as easier and more enjoyable to use than Hasse diagrams.

As claimed above, causality is central to human thinking, and our visualization techniques should thus have wide applicability in many different areas of inquiry. To illustrate this, we have also conducted a real-world case study of the use of the Growing Polygons technique in visualizing large-scale citation networks in scientific literature. We argue that citations in scientific articles can be seen as causal orderings in the sense that a citation signifies that the author has read the cited article and been influenced (in some unknown way) by it. This work, implemented as part of a general-purpose citation visualization tool called CiteWiz, includes some modifications to the original Growing Polygons method to improve its scalability, such as linear time windows for handling long periods of time and hierarchical clustering for coping with large quantities of articles.

CiteWiz also includes another visualization informally known as a “Newton’s Shoulders” diagram, named after the concept voiced by Sir Isaac Newton in 1676 of every researcher standing on the shoulders of the ones who came before him. Thus, a Newton’s Shoulders diagram is a static graphical timeline of authors or articles in a citation database with their size and coloring indicating the number of citations they have and their citation density, respectively. This diagram is yet another example of a useful way to visualize causality orderings in the specific case of scientific citation networks.

This chapter serves as an introduction to the general research field of information visualization, as well as causality visualization, one of its subfields. We begin by introducing a general model for information visualization. We then go on to explore the specific problems of visualizing causality and how this relates to the visualization of scientific citation networks. We present the CausalViz and CiteWiz systems, the reference implementations of the techniques described in this thesis, and conclude with a summary of the different chapters of the thesis and the author’s contributions in each of these.

1.1 Information Visualization

The field of information visualization is concerned with the graphical representation of abstract, nonphysical data that lacks a natural mapping to visual form. The purpose of these representations is to aid a human user in creating a mental model of the data. Most research in the area focuses on developing effective representations that allow the user to discover hidden information and interpret the data in new and more efficient ways. These representations are generally called visualization techniques, and the mental process they make use of is referred to as external cognition [SR96], or what Norman [Nor88, Nor93] calls “knowledge in the world”.

We define cognition as the acquisition or use of knowledge [CMS99], i.e. the process of building a mental model by retrieving meaningful information from raw data¹. External cognition is then cognition aided by the external world, including the interaction between internal and external representations (i.e. the creation of the mental model in the user’s head), and is manifest in a wide array of human artifacts ranging from slide rules and maps to various kinds of diagrams and charts; all examples of real-world aids that amplify cognition. As Norman [Nor93] notes, “it is things that make us smart”. Imagine performing a multiplication of two three-digit numbers in your head and without the use of pen and paper, and most people will agree that this statement is true.

¹ It is important to make a clear distinction between data and information; users wish to derive information from data to gain insight into it, to be informed by the data.

However, external cognition is a very broad concept, and includes disciplines as disparate as information design and data graphics. To put the field of information visualization into perspective, we will use the following definition in this thesis [CMS99]:

Information visualization: The use of computer-supported, interactive visual representations of abstract data to amplify cognition.

The data sets involved in this task are often sufficiently huge and complex that they yield next to no information when presented in textual or tabular form. Therefore, visualization may often be the only alternative for a human user to be able to understand and think about the data effectively. In fact, as Card et al. [CMS99] remark, the ubiquity of visual metaphors in describing cognitive processes hints at a strong interrelationship between what we see and what we think: to understand something is called “seeing” it, we try to make our ideas “clear”, to bring them into “focus”.

Although many of the practices associated with information visualization have been in use for a long time, the field itself is generally recognized to have been defined as late as 1993 by Robertson et al. [RCM93]. Classic early examples of information visualization include Charles Minard’s (1781-1870) famous graph of Napoleon’s failed campaign against Russia in 1812, William Playfair’s (1759-1823) invention of the line plot, bar chart, and pie chart in 1786 to show the balance of trade between countries [Pla86], and Florence Nightingale’s (1820-1910) striking rose-like “Coxcomb” visualization from 1858 showing that far more deaths in the Crimean War were attributable to non-battle causes than battle-related causes. These and more examples of famous visualizations can be found in Edward R. Tufte’s excellent book [Tuf83].

Information visualization is a part of the more general research area known as visualization, the purpose of which is the graphical representation of any kind of data or object, but the distinguishing feature of the subfield is the emphasis on more or less abstract data with no straightforward visual mapping. Examples of this kind of data include document databases, financial data, complex hierarchies, etc. This distinction lends itself to ready comparison with the sibling research field of scientific visualization, where the purpose again is effective graphical representation, but where the data often is spatial or scalar and has a more or less natural interpretation that can be mapped directly to a visual form; examples include air flow around a car, temperature readings in the oceans at various points around the globe, or sound levels in the immediate neighborhood of an airport.

In this section, we will begin by giving an outline of a general visualization model describing the flow and transformation of data from a raw, unstructured form to a visual object suitable for a human viewer. We will describe the concept of visual structures in detail, and view transformations acting upon these. We conclude with a definition of visualization techniques and a few words on future research in the field.

[Figure 1.1 diagram: raw data → (selection) → mathematical objects → (encoding) → visual objects → (presentation) → views → viewer.]

Figure 1.1: A general model for information visualization.

1.1.1 Visualization Model

Figure 1.1 presents a general model for visualization (one that can actually be applied to any visualization discipline, including scientific visualization) showing the flow from raw data to a mental model created by the viewer and the transitions between the various intermediate stages of representation.
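To make the three transformations of the model concrete, the following Python fragment chains them as plain functions. It is a minimal sketch under stated assumptions: the comma-separated input format, the dictionary used for a mark, and the text-based bar view are all invented for this illustration and do not come from the thesis or from CausalViz.

```python
def selection(raw_lines):
    """Raw data -> mathematical object: parse lines into (name, value) tuples."""
    rows = []
    for line in raw_lines:
        name, value = line.split(",")
        rows.append((name.strip(), float(value)))
    return rows

def encoding(rows):
    """Mathematical object -> visual object: one bar-like mark per tuple."""
    return [{"mark": "area", "label": name, "length": value} for name, value in rows]

def presentation(marks, max_width=20):
    """Visual object -> view: scale the marks to fit a fixed-width text view."""
    longest = max(m["length"] for m in marks)
    return [
        f'{m["label"]:<4}' + "#" * round(m["length"] / longest * max_width)
        for m in marks
    ]

view = presentation(encoding(selection(["a, 1", "b, 2"])))
for row in view:
    print(row)
```

Each stage only depends on the output of the previous one, so a different view (for example, a narrower `max_width`) can be produced without repeating the selection or encoding steps.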

Selection

The process of selection converts raw data of some idiosyncratic format to a mathematical object suitable for encoding into a visual form in the next step of the visualization pipeline. Intrinsic to the selection process is not only how to convert the data to the desired format, but also which data should be included, and which should be omitted.

The mathematical object most commonly used for structured data in information visualization is the data table [CMS99], which consists of a set of relations (expressed as tuples) and metadata describing the relations. Each tuple records a specific case in the data set (e.g. a person, a movie, or a timestamped measurement), and each entry in a tuple represents a variable (e.g. the person’s name, the movie title, or the timestamp). The data type of a variable will have bearing on its visual encoding further on in the pipeline. There are three basic variable types [CMS99]:

• quantitative – supports arithmetic (e.g. a temperature value),

• ordinal – obeys an ordering relation (e.g. the days of the week), and

• nominal – supports only equality and non-equality (e.g. movie titles).
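The data table and its three variable types can be sketched as follows. The schema (station, weekday, temperature), the sample values, and the helper names `column` and `mean` are hypothetical and chosen only to mirror the examples in the list above; the point is that the metadata decides which operations make sense for a variable.

```python
from enum import Enum

class VarType(Enum):
    QUANTITATIVE = "quantitative"  # supports arithmetic
    ORDINAL = "ordinal"            # obeys an ordering relation
    NOMINAL = "nominal"            # supports only equality and non-equality

# Metadata: one entry per variable, in tuple order (hypothetical schema).
metadata = {
    "station": VarType.NOMINAL,
    "weekday": VarType.ORDINAL,
    "temperature": VarType.QUANTITATIVE,
}

# Relations: one tuple per case (made-up sample values).
cases = [("north", "Mon", 4.0), ("south", "Tue", 6.0)]

def column(name):
    """Return all values of one variable across the cases."""
    index = list(metadata).index(name)
    return [case[index] for case in cases]

def mean(name):
    """Average a variable, which is only meaningful for quantitative data."""
    if metadata[name] is not VarType.QUANTITATIVE:
        raise TypeError(f"{name} is {metadata[name].value}; arithmetic is not defined")
    values = column(name)
    return sum(values) / len(values)
```

Calling `mean("temperature")` averages the readings, whereas `mean("station")` raises a `TypeError`, since a nominal variable supports only equality tests.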

Encoding

• Marks: points, lines, areas, volumes, etc.

• Properties:

    Positional: 1D, 2D, 3D.

    Temporal: animation.

    Retinal: color, shape, size, saturation, texture, orientation.

• Compositions: connection, enclosure.

Figure 1.2: A simple graphical language.

The encoding transformation accepts structured data (i.e. a mathematical object) and generates a visual object representing it. Since the data is abstract and lacks a physical mapping, the choice of representation is often not straightforward, but depends on which aspects of the data the designer wants to highlight to the user; even then, many of the details of the visual representation are left in the hands of the designer. Herein lies the great challenge of information visualization: to design a suitable visual representation that clearly and concisely captures the data we want to display.

We formulate these two criteria in the concepts of expressiveness and effec-tiveness [Mac86]. An encoding is said to be expressive if all and only the data inthe mathematical object can be represented in the visual object. Furthermore,the encoding is effective if it exploits the capabilities of the output medium andthe human visual system (effectiveness is often used for comparing two differentvisualizations).

A visual object consists of a spatial substrate, marks, and the graphical properties of the latter [Ber81, Mac86, CM97, CMS99]. Marks can be combined using a simple composition algebra which includes operations like connection and enclosure. A simple visualization includes a number of marks (points, lines, or volumes), their retinal properties (color, texture, and size), and their positions on the spatial substrate (obeying the orientation and placement of the axes on the substrate). In the example of a scatterplot, the substrate is composed of two orthogonally placed axes (one for each of the variables being expressed), creating a 2D Cartesian space, and a number of point marks representing the cases.

Figure 1.2 summarizes the language of graphical encoding. Using this language, we can for instance express a tree diagram as a composition of point marks connected by line marks on a 2D spatial substrate. Having defined this language, the step to automating the graphical encoding process is not far; see for example Mackinlay’s work in this area [Mac86].
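The tree diagram example can be sketched in code as follows. This is an illustrative sketch only: the classes `Point` and `Connection`, the function `encode_tree`, and the simple even-spacing layout are assumptions made here, not part of the graphical language of [Mac86] or of any system described in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A point mark positioned on a 2D spatial substrate."""
    x: float
    y: float
    color: str = "black"  # one retinal property, unused by the layout


@dataclass(frozen=True)
class Connection:
    """A composition: a line mark connecting two point marks."""
    source: Point
    target: Point


def encode_tree(tree, depth=0, left=0.0, width=1.0):
    """Encode a nested (label, children) tree as marks.

    Returns (root_point, points, connections). Each node is centred in
    its horizontal interval, and y equals the depth of the node.
    """
    _label, children = tree
    root = Point(left + width / 2, float(depth))
    points, connections = [root], []
    for i, child in enumerate(children):
        share = width / len(children)
        child_root, child_points, child_conns = encode_tree(
            child, depth + 1, left + i * share, share)
        points += child_points
        connections += [Connection(root, child_root)] + child_conns
    return root, points, connections
```

For the tree `("root", [("a", []), ("b", [])])`, this sketch yields three point marks and two connections, with the root centred above its two children.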

Presentation

Even if we have now chosen a visual representation of our data and encoded the data into a visual object, we still need to present the visual object to the user, to create one or several views of the data. Views in information visualization are almost always interactive, allowing us to exploit the time parameter to extract more information out of the visualization than would be possible from a static diagram. Often it is not even feasible to view the entire visual object due to its complexity or size; thus, we have to employ various view transformations to allow us to see relevant details of the data as well as get an overview of the whole or parts of the data set.

According to Card et al. [CMS99], there are three common view transformations:

• Location Probes: Show details (often in a separate window) of the data set at specific points in the visual structure chosen by the user (details-on-demand).

• Viewpoint Controls: Provide controls to allow the user to zoom in and pan around a detailed view of the data set (for example, scrolling in a text document). Also provide an overview of the data set to prevent the user from getting lost [Shn96].

• Distortion: Distort the spatial substrate so that the detail view and the overview are combined in the same space, creating a so-called focus+context view [Fur86].
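As an illustration of the distortion transformation, the sketch below applies a one-dimensional fisheye mapping to positions on a normalized axis. Note that it uses the magnification function g(t) = (d + 1)t / (dt + 1) from the graphical fisheye views of Sarkar and Brown, which is swapped in here for illustration and is not the degree-of-interest formulation of [Fur86]. The function name `fisheye` and the default distortion factor are assumptions.

```python
def fisheye(x, focus, d=3.0):
    """Map a position x in [0, 1] to its distorted position.

    Positions near `focus` are spread apart (detail), while positions
    far away are compressed (context). The endpoints 0 and 1 and the
    focus itself stay fixed. `d` is the distortion factor; d = 0 gives
    the identity mapping.
    """
    if x >= focus:
        span, sign = 1.0 - focus, 1.0
    else:
        span, sign = focus, -1.0
    if span == 0.0:
        return x  # the focus lies on this boundary; nothing to distort
    t = abs(x - focus) / span           # normalized distance from the focus
    g = (d + 1.0) * t / (d * t + 1.0)   # magnified normalized distance
    return focus + sign * g * span
```

With the focus at 0.5 and d = 3, a point at 0.6 is pushed out to roughly 0.75, so the neighbourhood of the focus receives more screen space while the whole axis remains visible.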

Interaction

Finally, the main distinguishing feature of information visualization over static diagrams is the existence of human interaction feedback into the visualization pipeline. For every transformation in the model, there is a conceivable human interaction to allow the viewer to manipulate parameters in the visualization: guiding the selection process, mapping variables to visual objects in the encoding process, or controlling the views in the presentation. Discussion of these interaction techniques is outside the scope of this thesis, however.

1.1.2 Visualization Techniques

Based on the visualization model described above, we can now define the concept of a visualization technique as consisting of a visual object structure, an encoding function that accepts structured data and generates a visual object, a variable set of views, and a number of interaction techniques for manipulating the encoding function as well as the presentation (the views).

The technique may also have a metaphor associated with it, primarily for the benefit of the users; for instance, modern user environments often use the desktop metaphor (with a workspace, trash can, folders, files, etc.). The purpose of this is to give users a familiar “hand-hold” in an otherwise alien environment, and to provide the user with some free knowledge about it (for instance, that folders can be opened and that files can be deleted by putting them in the trash can).

1.1.3 Research Outlooks

Much of the research within information visualization concerns itself with the development of visualization techniques, and not necessarily with the formulation of theoretical frameworks for the field as a whole; this is, for instance, true of the work presented in this thesis. Thus it follows that in the great majority of cases, the design of a new visualization is a craft activity [Spe01], and draws upon the experience and skill of the developer in visual, cognitive, and algorithmic design. No theoretical frameworks currently exist that allow us to prove which of two visual representations is superior, or which of two view transformations gives the best presentation of a visual object; this must be shown empirically through user studies (which has also been the method employed in this thesis). The closest to such frameworks are the works of Cleveland and McGill [CM84] and Mackinlay [Mac86]. This lack is primarily due to the relative youth and complexity of the information visualization field in particular, not to mention human-computer interaction in general.

Therefore, one of the most important tasks for future research in the area lies not in the development of new and innovative visualization techniques, but rather in the formulation of a unified theoretical framework for information visualization, one that can explain and even anticipate the most novel visualization techniques beforehand.

1.2 Causality Visualization

In modern use, the notion of causality is associated with the idea of something (the cause) producing or bringing about something else (its effect). In general, the term “cause” has a broader meaning, and is used as an explanatory or reasoning tool. Identifying causal relations in a complex system can be the first step towards understanding the underlying mechanisms that determine the system’s laws. As such, causal relations cover a wide variety of scientific domains where causality is of importance.

Our interest in causality originates mainly from the viewpoint of distributed and parallel computing, where causal relations are used extensively, for example (i) in distributed database management to determine consistent recovery points; (ii) in distributed software systems for determining deadlocks; (iii) in distributed and parallel debugging for detecting global predicates and detecting synchronization errors; (iv) in monitoring and animation of distributed and parallel programs to determine the sequence in which events must be processed so that cause and effect appear in the correct order; and (v) in parallel and distributed software performance to determine the critical path abstraction: the longest sequential thread, or chain of dependencies, in the execution of a parallel or distributed program. Improving the graphical visualization of causal relations will thus benefit all these activities.

Causality is a much broader concept than this, however, and is not restricted to computer science research. In an effort to show this (and to make use of the new visualization techniques in a real-world application), we have conducted a case study of the use of our techniques for the visualization of large-scale citation networks. See Section 1.3 for more information on this.

In this section we give a brief background to the causality visualization problem, including a brief formal introduction to causal relations, a description of the various analysis tasks involved when studying causal relations, and a presentation of the existing work in the area.

1.2.1 Definitions

A causal relation is the relation that connects or relates two items, called events, one of which is a cause of the other. Obviously, for an event to cause another, it is not sufficient that the second merely happens after the first; however, it is well accepted to state that this is necessary, and temporal order can be relied on to explain the asymmetrical direction of causal relations². All events connected in the causal relation are part of a set of processes, labelled $P_1, \ldots, P_N$, each of which can be thought of as a producer of a disjoint subset of the set of all events in a system. Events performed by the same process are assumed to be sequential; if not, we can split the process into sub-processes. Thus, it is convenient to index the events of a process $P_i$ in the order in which they occur: $E_i = e^i_1, e^i_2, e^i_3, \ldots$

For our purposes, it suffices to distinguish between two types of events: external and internal events. Internal events affect only the local process state. An internal event on process $P_i$ will causally relate to the next event on the same process. External events, on the other hand, interconnect events on different processes. Each external event can be treated as a tuple of two events: a send event and a corresponding receive event. A send event reflects the fact that an event that will influence some other event in the future took place and its influence is “in transit”; a receive event denotes the receipt of an influence message together with the local state change according to the contents of that message. A send event and a receive event are said to correspond if the same message $m$ that was sent in the send event is received in the receive event.

We now formally define the binary causal relation $\to$ over all the events of the system $E$ ($\to \subseteq E \times E$) as the smallest transitive relation that satisfies the following properties [Lam78]:

1. If $e^i_k, e^i_l \in E_i$ and $k < l$, then $e^i_k \to e^i_l$.

2. If $e_i = \mathrm{send}(m)$ and $e_j = \mathrm{receive}(m)$, then $e_i \to e_j$, where $m$ is a message.

When $e \to e'$, we say $e$ causally precedes $e'$ or $e$ caused $e'$. Causal relations are irreflexive, asymmetric, and transitive.
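The definition above can be sketched as a small program that builds the relation from the two properties and then closes it under transitivity. This is a minimal illustration under stated assumptions: events are represented as hypothetical `(process, index)` pairs, the function name `causal_relation` is invented here, and the cubic closure loop is chosen for clarity rather than efficiency.

```python
def causal_relation(process_sizes, messages):
    """Return the set of pairs (e, f) such that e causally precedes f.

    process_sizes: {process id: number of events}; the events of a
        process are (pid, 1), (pid, 2), ... in order of occurrence.
    messages: iterable of (send_event, receive_event) pairs.
    """
    events = [(pid, k)
              for pid, count in process_sizes.items()
              for k in range(1, count + 1)]

    # Property 1: consecutive events on one process are ordered; the
    # closure below extends this to every k < l.
    relation = {((pid, k), (pid, k + 1))
                for pid, count in process_sizes.items()
                for k in range(1, count)}

    # Property 2: a send event precedes its corresponding receive event.
    relation |= set(messages)

    # Transitive closure (Warshall-style, intermediate event outermost).
    for mid in events:
        for a in events:
            if (a, mid) in relation:
                for b in events:
                    if (mid, b) in relation:
                        relation.add((a, b))
    return relation
```

For three processes with 2, 3, and 2 events and the messages `((0, 1), (1, 2))` and `((1, 3), (2, 2))`, the resulting relation contains the indirect pair `((0, 1), (2, 2))`, although no single message connects process 0 to process 2.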

1.2.2 Analysis Tasks

At the onset of our investigation into visualization of causal relations, we organized a formative evaluation of these concepts using a focus group consisting of researchers from our university working on distributed systems. The evaluation took the shape of a panel discussion on questions related to causal relations and their use, and six researchers from the Distributed Computing & Systems group at the Department of Computing Science at Chalmers participated in the session. These discussions allowed us to identify the typical analysis tasks a user is interested in when studying a distributed system, and were vital in tailoring our visualization to these tasks. Below follows a short overview of these analysis tasks.

² It has been argued that not even this is necessary, and that both simultaneous causation and “backwards causation” (effects preceding their causes) are at least conceptually possible. This, on the other hand, causes problems when considering the asymmetric nature of causal relations.

Lifecycle Analysis

The lifecycle of individual processes is often of great interest when analyzing a system of causal relations. This includes aspects such as the duration of a process as well as its starting and stopping times (both in isolation and in relation to other processes), aspects that are vital in understanding how a system works.

Influence Analysis

The analysis of influences and dependencies in a distributed system was found to be one of the most important analysis tasks when studying the flow of information in a system. Designing, debugging, or trying to grasp the underlying mechanisms of a distributed system or algorithm all involve this task.

Inter-Process Causal Relations

Often, a practitioner studying a system of causal relations needs to know whether two nodes, $P_i$ and $P_j$, in the system are causally related, i.e. if there exists an event $e_i \in E_i$ and an event $e_j \in E_j$ such that $e_i \to e_j$. Of course, this causal relation can go through several levels of transitive indirection, and is therefore quite difficult to spot manually or by using Hasse diagrams (as we will see).
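The question of whether two processes are causally related amounts to a reachability query, which the following sketch answers with a breadth-first search. The event representation as `(process, index)` pairs and the function name `processes_related` are assumptions made for illustration; the sketch is intended for two distinct processes.

```python
from collections import deque


def processes_related(process_sizes, messages, pi, pj):
    """Return True if some event of process pi causally precedes some
    event of process pj (intended for pi != pj).

    process_sizes: {process id: number of events}; events are (pid, k)
        with k counted from 1 in order of occurrence.
    messages: iterable of (send_event, receive_event) pairs.
    """
    # Direct successors: the next event on the same process, plus the
    # receive events of any messages sent at this event.
    successors = {}
    for pid, count in process_sizes.items():
        for k in range(1, count):
            successors.setdefault((pid, k), []).append((pid, k + 1))
    for send, receive in messages:
        successors.setdefault(send, []).append(receive)

    # Breadth-first search starting from every event of pi.
    queue = deque((pi, k) for k in range(1, process_sizes[pi] + 1))
    seen = set(queue)
    while queue:
        event = queue.popleft()
        for successor in successors.get(event, []):
            if successor[0] == pj:
                return True
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False
```

The search follows exactly the chain of transitive indirection that a user of a Hasse diagram would otherwise have to trace by hand.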

1.2.3 Existing Work

There has been surprisingly little work performed in the area of causality visualization, and the prevalent visualization method is still the traditional Hasse (also known as time-space) diagram. Figure 1.3 shows an example of a time-space diagram for a system comprising three processes, where the progress of each process is described by a directed horizontal line, the process line. Time is assumed to move from left to right. Events are symbolized by dots on the process lines, according to their relative order of occurrence. Messages are shown as arrows connecting send events with their corresponding receive events.

Figure 1.3: Hasse diagram visualization with 3 processes (process lines p0, p1, and p2 along a horizontal time axis).

Visualizations of causal relations in the form of such time-space diagrams are currently quite standard in visualization and debugging platforms for parallel and distributed systems, and the number of such platforms is too large to allow discussing them all; we will just focus on a few of the noteworthy systems. One of the first of the new generation of visualization tools to include the time-space diagram was the Voyeur [SBN89] system, which provided a framework for defining various animation views for parallel algorithms. The TOPSYS [BB93] environment includes various standard concurrency visualizations (called VISTOP) integrated with the debugging and performance analysis tools of the system, with time-space visualization being one of them. Using this process-based concurrency view, users can identify synchronization and communication bugs. Going one step further, the conceptual visualization model of the VADE [MPT98] system is based on the causal relation notion. VADE is also geared towards more general algorithm visualization, and supports not only communication events but also other algorithmic objects and events. Also of interest is LYDIAN [KPT99], an educational visualization system, which by default constructs the time-space diagram for every algorithm implemented in the system. Kraemer and Stasko [KS98] describe the essential characteristics of toolkits for visualization of concurrent executions, and introduce their own system, called Parade. Parade also includes an animation component called the Animation Choreographer that orders display events from a trace file in much the same way as the techniques described in this thesis. Also, for the purpose of our study, the Hasse visualization used in Figure 1.3 is very similar to the time-space visualization view from the ParaGraph system [Hea90, HE91] and its adaptation in the PVaniM tool [TSS98], as well as the Feynman or Lamport views from the Polka animation library [SK93].

While Hasse diagrams certainly are in widespread use, they have a number of deficiencies that lower their usefulness for realistic systems. First of all, a Hasse diagram offers only local dependency information for each process and not the transitive closure of all interactions involving it, making it difficult to gain an overview of the overall information flow in the system; in essence, the user is forced to manually backtrace every single message and process affecting a specific process to find its dependencies. Second, the fine granularity of the visualization makes Hasse diagrams difficult to use for large systems of ten or more involved nodes; the number of intersecting message arrows simply becomes too overwhelming for complex executions. And third, Hasse diagrams are intrinsically static in nature and thus make little use of the interactivity of the computer medium; animation and creative use of color are likely to be useful tools in this kind of visualization.

Ware et al. [WNB99] presented a new visualization construct called a visual causality vector (VCV) that represents the perceptual impression of a causal relation and employed animation to emphasize this relation in a directed acyclic graph. Three different VCVs were introduced based on different metaphors: the pin-ball metaphor, where the VCV is a ball that moves from the source to the destination node, striking the destination and making it oscillate; the prod metaphor, where the VCV is a rod that extends from the source to prod the destination; and finally a wave metaphor, where the VCV accordingly is an animated wave that moves towards the destination node. However, while these constructs are certainly an improvement over a simple DAG representation of causal relations, they do nothing to battle the complexity of large systems with many nodes and relations. In fact, Ware’s primary contribution is the investigation of timing concerns for the perception of causality for users, not the visualization technique per se. It might still be interesting to incorporate Ware’s VCVs into our system in some form.

1.3 Citation Network Visualization

Citation networks consist of bibliographical entries representing scientific works, each being a tuple of attributes such as title, authors, source, date, abstract, keywords, etc. In addition, each entry has a number of references to other entries representing the citations found in the article. Thus, citation networks can be seen as directed graphs where each node represents an article, out edges represent cited papers (i.e. the dependencies of the current paper), and in edges represent citing papers. A citation graph is generally not acyclic since articles may mutually cite each other; this is often the case when an author (or a team of authors) publishes two or more related articles at the same conference.
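The directed-graph view of a citation network can be sketched as follows. The dictionary representation and the function names `citing_papers` and `has_cycle` are hypothetical; the sketch only illustrates how in edges are derived from out edges and how mutual citations show up as cycles.

```python
def citing_papers(cites):
    """Invert out edges (cited papers) to obtain in edges (citing papers).

    cites: {paper: set of papers it cites}.
    """
    cited_by = {paper: set() for paper in cites}
    for paper, references in cites.items():
        for reference in references:
            cited_by.setdefault(reference, set()).add(paper)
    return cited_by


def has_cycle(cites):
    """Return True if the citation graph contains a directed cycle."""
    unvisited, active, done = 0, 1, 2
    state = {}

    def visit(paper):
        state[paper] = active
        for reference in cites.get(paper, ()):
            status = state.get(reference, unvisited)
            if status == active:
                return True  # back edge: papers cite each other
            if status == unvisited and visit(reference):
                return True
        state[paper] = done
        return False

    return any(state.get(paper, unvisited) == unvisited and visit(paper)
               for paper in cites)
```

For two hypothetical papers A and B that cite each other, `has_cycle` returns True, which is why a tool built on partially ordered sets has to decide how such mutual citations are handled.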

Traditional bibliographical databases generally provide means for searching, sorting, and filtering the citation data in various ways (examples include IEEE Xplore³, the ACM Digital Library [Den97], and CiteSeer [GBL98]). These database interfaces serve as suitable reference implementations when assessing new visualizations for citation networks.

This section describes visualization of citation networks, a core information visualization topic due to the massive scale of citation databases and the abstract and highly contextual nature of the data. We begin by describing the formative evaluation we conducted, leading to the creation of a taxonomy for citation database interaction that we will later use as a basis for our own visualization tools, and conclude with a review of existing work in the area.

1.3.1 Formative Evaluation

In order to deduce the common user tasks associated with bibliographical databases, we organized a formative user evaluation using a focus group of six active researchers from our department. Our intention with this session was to identify the high-level issues and tasks involved with the use of bibliographical data, including various situations when researchers make use of such databases.

³ http://ieeexplore.ieee.org/


The session lasted for approximately one hour, and led us to develop a taxonomy of citation database interaction based on user roles and the tasks and subtasks associated with each role. This taxonomy, presented in the following section, has proven useful when discussing bibliographic visualization and the analysis tasks involved in this activity, but may have a slight bias towards a researcher’s point of view; we plan to involve other users of citation databases (e.g. librarians) in future updates of the taxonomy.

1.3.2 Taxonomy of Citation Database Interaction

A researcher may assume any of a number of different roles when interacting with a citation database, and we have thus chosen to base our taxonomy on the concept of user roles and the goals and tasks associated with these. Clearly, a user has different goals to achieve depending on his or her current role, and these govern which tasks need to be carried out. Using this taxonomy, we can make decisions about which user roles and goals we want a tool to support, and accordingly which tasks we must implement.

In the taxonomy below, the terms group and subgroup refer to any (potentially hierarchical) clustering of articles (and subgroups) according to some criterion, such as shared keywords, author, source, etc. An event is defined as any scientific community activity, such as a journal issue, a conference, a workshop, etc. Furthermore, we have categorized the user tasks depending on where the focus of the task lies; making a distinction between (i) article-, (ii) event-, (iii) author-, and (iv) group-focused user tasks is useful when discussing the nature of a visualization tool.

Table 1.1 presents the roles we have identified, including a short description of each role. Table 1.2 gives a listing of the individual goals of each role, as well as the tasks involved with completing that particular goal. Finally, Table 1.3 shows the different tasks, including their focus category. Note that these tasks operate on the current working group and not necessarily the entire database; for instance, task T3 should be interpreted as “find the most influential paper in the current group of papers”.
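To illustrate how the taxonomy can guide tool design, the sketch below encodes the role, goal, and task data of Table 1.2 and derives the set of tasks a tool must implement for a chosen set of roles. The names `GOALS` and `tasks_for` are hypothetical and exist only for this example.

```python
# Role -> goal -> tasks, transcribed from Table 1.2.
GOALS = {
    "Novice": {
        "Orientation in a new area": ["T2", "T3", "T5", "T6"],
        "Find open problems": ["T4"],
    },
    "Expert": {
        "Verify hypotheses/intuition": ["T1"],
        "Stay updated": ["T1"],
        "Find papers quickly": ["T1"],
    },
    "Reviewer": {
        "Check originality": ["T2", "T3", "T5"],
        "Check correctness": ["T2", "T3"],
        "Check adequacy of references": ["T2", "T5"],
    },
    "Organizer": {
        "Identify hot topics": ["T4", "T5", "T6"],
        "View chronology of an event": ["T7"],
        "View collaborations between events": ["T8"],
    },
    "Evaluator": {
        "View the career of an author": ["T7"],
        "Assess the work of an author": ["T2", "T3", "T5"],
    },
}


def tasks_for(roles):
    """Return the sorted tasks needed to support every goal of the roles."""
    return sorted({task
                   for role in roles
                   for tasks in GOALS[role].values()
                   for task in tasks})
```

For example, a tool aimed at novices and experts would need to implement tasks T1 through T6, whereas supporting organizers alone requires T4 through T8.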

1.3.3 Existing Work

The common model of viewing citation networks as directed graphs (see the next section) lends itself quite naturally to visualizing bibliographical data as simple node-link diagrams. However, node-link diagrams scale poorly with network size, and furthermore only present local dependency information; it is easy to see direct citing and cited articles, but the user must traverse the graph in order to see dependencies more than one step away. CiteWiz, on the other hand, provides the surrounding context through influence mapping, and gives a more straightforward way to see the chronology of articles.

Modjeska et al. [MTFF96] propose a minimum set of functions necessary for effective bibliographic visualization: (i) display of complete bibliographic information, (ii) filtering by record fields, (iii) display of chronology and influence of articles, (iv) information views at different levels of detail, (v) multiple simultaneous views, and (vi) visualization of large search results. They also present the BIVTECI prototype system that partially implements this specification, but the visualization used in the tool is restricted to node-link diagrams with visualized attributes. CiteWiz also implements this minimum functionality, but instead employs the Growing Polygons causality visualization technique in order to handle larger search results and provide stronger chronology and influence information.

Role       Description
Novice     A researcher that is new to a specific field; can either be a new student or an experienced researcher moving to a new area.
Expert     An experienced researcher with intimate knowledge of a field.
Reviewer   A researcher tasked with peer-reviewing a new paper, potentially from a field he or she has only passing knowledge of.
Organizer  A researcher responsible for organizing, editing, and/or steering an event (such as a conference or journal).
Evaluator  A person, such as a recruiter, tasked with evaluating the work of a specific researcher.

Table 1.1: User roles in citation database usage.

The Butterfly [MRC95] system provides a 3D visualization front-end of the DIALOG science citation databases, using the notion of “organic user interfaces” to build an information landscape as the user explores the results of various queries. Individual articles are represented by an innovative butterfly-shaped 3D object with references and citers on the left and right wings, respectively, and the system provides various graphical cues to orient the user when browsing the citation network. Butterfly uses a node-link diagram for overview and context, however, and has no mechanism for showing the cumulative influences and chronology of articles.

CiteWiz and the above-mentioned systems are all article-focused tools in that they emphasize the visualization of articles and their interdependencies. A number of group-focused techniques have also been proposed, where the emphasis lies on representing the groupings and structure of a scientific domain through metrics such as relevance, bibliographic coupling [Kes63], and co-citation [Sma73]. Work in this area is extensive but peripheral to the system described in this thesis; examples include [CC92, HKW94, Che99, BW02, CM03].


Role       Goal                                 Tasks
Novice     Orientation in a new area            T2, T3, T5, T6
           Find open problems                   T4
Expert     Verify hypotheses/intuition          T1
           Stay updated                         T1
           Find papers quickly                  T1
Reviewer   Check originality                    T2, T3, T5
           Check correctness                    T2, T3
           Check adequacy of references         T2, T5
Organizer  Identify hot topics                  T4, T5, T6
           View chronology of an event          T7
           View collaborations between events   T8
Evaluator  View the career of an author         T7
           Assess the work of an author         T2, T3, T5

Table 1.2: Goals for each user role.

1.4 The CausalViz Framework

In order to test the Growing Squares and Growing Polygons techniques and to subsequently be able to perform user studies on their effectiveness, we implemented a general application framework for the visualization of causal relations called CausalViz (see Figure 1.4). The framework is implemented in C++ on the Linux platform and uses the Gtk+/Gtk-- widget toolkits for user interface components as well as OpenGL for graphical rendering.

1.4.1 System Architecture

The architecture of the CausalViz application (see Figure 1.5) is based around a single partially ordered set (poset) representing the execution data under study. A number of visualization components observe this set and present graphical representations of the data (potentially allowing for the set to change during run-time). There currently exist three different visualizations, namely traditional Hasse diagrams, the 2D Growing Squares, and the prototype 3D Growing Pyramids.

Central in the system architecture is the application manager that creates all the other components, manages the graphical user interface (GUI), and performs loading of data files into the application (stored in a general XML format for partially ordered sets). In order to allow for the animation of events in the visualizations, there also exists a general animation manager thread that the visualization components can use to smoothly interpolate values in the poset with respect to time.

Task  Description                                             Focus
T1    Find a particular paper                                 Article
T2    Find related papers                                     Article
T3    Find the most influential paper(s)                      Article
T4    Find hot topics (at a specific time)                    Group
T5    Partition an area into subareas                         Group
T6    Study the overall citation network                      Article
T7    Study the chronology of an author/event/group           Au/Ev/Gr
T8    Study the collaboration between authors/events/groups   Au/Ev/Gr

Table 1.3: Tasks for citation database interaction.

1.4.2 Poset Management

System execution traces are stored in a general XML file format for partially ordered sets. Here, a process $P_i$ is represented by the subset $E_i \subseteq E$ of all the events in the system belonging to the process and a set of messages $M_i$. Messages are partial orderings between events in different subsets (processes), and can thus be represented by pairs of events, i.e. $M_i \subseteq E \times E$. It is then up to the application to compute the transitive closure for the poset.

In the CausalViz application, the transitive closure is computed using a modified topological sort [CLRS01]. The objective of the algorithm is two-fold: (i) to derive the transitivity information for each event (i.e. the processes which have influenced it so far) and (ii) to assign the event to a discrete time slot. This is done by greedily consuming sequential events in each subset (i.e. process) of the poset until reaching an event with unresolved dependencies (i.e. a partial ordering to a previously unvisited event). When this happens, the algorithm moves on to the next process to continue from where it last left off. This is repeated until all events in the system have been visited. The current influence of each event is easily maintained and updated during this process, and illegal cyclic dependencies are trivially detected by checking whether the algorithm has cycled through all processes without visiting any new events.
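The procedure described above can be sketched as follows. This is not the CausalViz implementation (which is written in C++); the function name `order_poset`, the data layout, and in particular the rule for choosing a time slot (one more than the latest slot among an event's direct predecessors) are assumptions made for this illustration.

```python
def order_poset(processes, messages):
    """Assign a time slot and an influence set to every event.

    processes: {process id: [event ids in sequential order]}.
    messages: iterable of (send_event, receive_event) pairs.
    Returns (slot, influence), two dictionaries keyed by event id.
    Raises ValueError if the dependencies are cyclic.
    """
    senders = {}
    for send, receive in messages:
        senders.setdefault(receive, []).append(send)

    slot, influence = {}, {}
    position = {pid: 0 for pid in processes}  # next unvisited event
    remaining = sum(len(events) for events in processes.values())

    while remaining:
        progressed = False
        for pid, events in processes.items():
            # Greedily consume events until one depends on an event
            # that has not been visited yet.
            while position[pid] < len(events):
                event = events[position[pid]]
                deps = senders.get(event, [])
                if any(dep not in slot for dep in deps):
                    break  # unresolved dependency: try the next process
                if position[pid] > 0:
                    deps = deps + [events[position[pid] - 1]]
                # Assumed slot rule: one past the latest predecessor.
                slot[event] = 1 + max((slot[d] for d in deps), default=-1)
                # Influence: own process plus everything inherited.
                influence[event] = {pid}.union(*(influence[d] for d in deps))
                position[pid] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            # A full pass over all processes visited nothing new.
            raise ValueError("cyclic dependency in poset")
    return slot, influence
```

Given three hypothetical processes with events a1, a2; b1, b2, b3; and c1, c2, and the messages (a1, b2) and (b3, c2), the event c2 ends up influenced by all three processes, since the influence of the first process reaches it through the second.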

1.4.3 CiteWiz Extensions

The CiteWiz application described in Chapter 4 is largely based on the CausalViz framework with a few extensions. Instead of using a partially ordered set as the main data structure, CiteWiz uses a citation database and a user-defined hierarchical view of this database which is then used for visualization.

The implementation of the Growing Polygons visualization used in CiteWiz is the same as the CausalViz implementation, with some modifications to improve the scalability of the technique for large and highly connected networks (see Chapter 4 for details on these modifications).

Figure 1.4: The CausalViz application.

1.5 Contributions

The contents of this thesis are based mainly on a number of publications presented at international conferences in the area of information and software visualization. More specifically, Chapter 2 introduces our first 2D causality visualization technique, Growing Squares, and is based on the following paper:

Elmqvist, N., Tsigas, P. (2003): Growing Squares: Animated Visualization of Causal Relations. In Proceedings of the ACM Symposium on Software Visualization 2003 (SoftVis 2003), pp. 17–26.

Chapter 3 presents the Growing Polygons technique, which improves on some of the weaknesses of Growing Squares, and builds on the following paper:

Elmqvist, N., Tsigas, P. (2003): Causality Visualization Using Animated Growing Polygons. In Proceedings of the IEEE Symposium on Information Visualization 2003 (InfoVis 2003), pp. 189–196.


Figure 1.5: CausalViz system architecture (components: Application Manager, Poset, Animation Manager, GUI, XML, HasseViz, SquareViz, PolygonViz; edge labels: creates, controls, uses, renders).

Some of the comparative analyses of the two visualization techniques were taken from the following journal paper:

Elmqvist, N., Tsigas, P. (2004): Animated Visualization of Causal Relations Through Growing 2D Geometry. In Information Visualization, Vol. 3 (2004) No. 3, pp. 154–172 (Special Issue of Selected Papers from the ACM Symposium on Software Visualisation 2003), Palgrave Macmillan.

Finally, Chapter 4 is partially based on the following technical report:

Elmqvist, N., Tsigas, P. (2004): CiteWiz: A Tool for the Visualization of Scientific Citation Networks. Technical report CS:2004-05, Chalmers University of Technology, Göteborg.

This report discusses the adaptations and scalability modifications made to the Growing Polygons technique, and describes the Newton’s Shoulders diagram for constructing influence timelines of papers and authors in scientific communities. In addition, the taxonomy of citation database interaction described in Section 1.3.2 is presented in this report.


Chapter 2

Growing Squares

As described earlier, there is surprisingly little work on visualizations of causal relations besides various implementations of Hasse diagrams, a fact which is especially curious in light of the shortcomings of Hasse diagrams for understanding a distributed system. The fine granularity of Hasse diagrams defeats their use as overview tools, and they transfer the burden of maintaining transitive relations to the user herself. This means that a user studying the information flow in a distributed system visualized using a Hasse diagram might potentially have to backtrace every single message and process in order to get a clear picture of the influences in the system.

The Growing Squares visualization technique (first presented in [ET03b]) was designed to help the user quickly get an overview of the causal relations in a system by making use of animation, color and patterns in an intuitive way. The visual metaphor of the technique is that of “pools” of color spreading on a piece of paper as time progresses, each color and pool representing a specific process or node in the system. Messages in the system are shown as “channels” from one pool to another. Each color pool will start growing at the time its corresponding process is started, and accordingly stop growing when the process stops executing events. The channels representing messages from one process to another intuitively carry the color of their source with them, resulting in the destination pool receiving this color as well. However, like age rings on a tree, the color of the new influencing process will only be present in the destination process starting from when the message was received.

Figure 2.1 gives an example of a system with two processes, P0 and P1, colored blue and white, respectively. The color pools are represented as 2D squares which grow over time. At a certain time t, P0 sends a message to P1 (denoted by the arrow in the figure), establishing a causal relation between P1 and P0. For all times t′ > t, the color pool of process P1 now shows this influence from the blue P0 by means of a checkered pattern combining the two colors.

In order to visualize the transitive property of the causal relation (see the previous section), a similar color pattern scheme is used. In Figure 2.2, process P1 is sending a message to P2 (colored red) after having been influenced by a message from P0. Now, both the color of the source process (white from P1 itself) and any of its existing influences at the time of sending the message (blue from P0) are transferred to P2, making its texture from this time onwards a checkered pattern of all three colors. It is now easy to see that P2 is causally related to both P0 and P1.

Figure 2.1: Simple example of the Growing Squares technique with two processes.

Figure 2.2: Transitivity property of causal relations using Growing Squares.

Multiple influences from the same source process will increase the amount of the source process’s color in the texture of the destination process. Even if the checkered pattern makes it difficult to see the exact ratio, this fact can nevertheless be used as a visual indication that multiple influences have occurred.
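The color-transfer rule described above can be illustrated with a minimal Python sketch. It is only a model of the bookkeeping, not the CausalViz implementation: the class name `ColorPool`, the method `send_to`, the use of a counter to represent "amount of color", and the instantaneous delivery of messages are all assumptions made for illustration.

```python
from collections import Counter


class ColorPool:
    """Hypothetical model of one process square and the colors in its texture."""

    def __init__(self, name, color):
        self.name = name
        self.color = color
        # Multiset of colors in the checkered texture; the process's own
        # color is present once from the start.
        self.influences = Counter({color: 1})

    def send_to(self, dest):
        """Deliver a message (assumed instantaneous): the destination receives
        the sender's own color plus every influence the sender holds now."""
        dest.influences.update(self.influences)


p0 = ColorPool("P0", "blue")
p1 = ColorPool("P1", "white")
p2 = ColorPool("P2", "red")

p0.send_to(p1)  # as in Figure 2.1: P1 now shows blue and white
p1.send_to(p2)  # as in Figure 2.2: P2 now shows blue, white and red
p0.send_to(p2)  # a second influence from P0 increases the amount of blue in P2

print(sorted(p1.influences))
print(sorted(p2.influences))
print(p2.influences["blue"])
```

Counting repeated influences is one simple way to capture the statement that multiple influences increase the share of the source color; how the actual texture ratio is computed is not specified in the text.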

Having abandoned a traditional timeline, the Growing Squares method is dependent on animation to allow the user to view the entire execution of the system under study. Starting at t = 0, the user can advance the time in the system to observe the system execution in chronological order, or choose to view the situation at specific points in time. This is another radical difference from Hasse diagrams; Hasse diagrams are static in nature and do not benefit much from animation, whereas Growing Squares are dynamic and rely on animation to present the full data set to the user.

Figure 2.6 shows an example sequence consisting of 5 processes in a distributed system visualized using the Growing Squares technique. The state of the visualization is here shown for each discrete time unit (in practice, the animation is fluid and continuous between the time steps) starting at t = 1 and ending at t = 5, the end of the execution. Processes are laid out in a clockwise fashion with P0 at the top. Screenshot (a) at t = 1 shows how P1 sends a message to P0, starting it (it has zero size up until this time), and (b) at t = 2 depicts the two colors (green and black) in the process square of P0. In the same screenshot, P4 sends a message to P1, causing P1 in (c) at t = 3 to hold influences both directly from P4 and indirectly from P3 (i.e. an example of how the Growing Squares visualization conveys transitivity). In (d) at t = 4, two messages originating from the otherwise isolated P2 reach P0 and P4, its blue color showing in the outer square of these two processes in snapshot (e) at t = 5.

2.1 Design

In order for the Growing Squares visualization to be effective, users must be able to easily distinguish between the individual process colors in the system under study. Selecting a suitable color scale is thus an important aspect of the method, and we investigated the use of perceptually uniform color scales such as LOCS [LH92, LHMR92] for this purpose. However, we found that the continuous nature of LOCS was not well-suited to our problem since it made distinguishing between adjacent colors difficult, and the scale itself included an inordinate amount of dark colors. Instead, we opted for a simple color scale with the individual colors uniformly distributed over the RGB spectrum.¹
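The text does not specify how the colors were distributed over the RGB spectrum. As a hedged illustration only, the following Python sketch shows one possible scheme: it samples an evenly spaced grid of the RGB cube and picks n grid points at a regular stride. The function name `rgb_palette` and the grid-and-stride approach are assumptions, not the scheme used in CausalViz.

```python
import itertools


def rgb_palette(n):
    """Return n distinct (r, g, b) triples spread evenly over the RGB cube.

    Assumed scheme: build a k x k x k grid of evenly spaced channel values
    with k**3 >= n, then take every (k**3 / n)-th grid point.
    """
    if n < 1:
        raise ValueError("n must be positive")
    k = 2
    while k ** 3 < n:
        k += 1
    levels = [round(i * 255 / (k - 1)) for i in range(k)]
    grid = list(itertools.product(levels, repeat=3))  # k**3 candidate colors
    step = len(grid) / n  # step >= 1, so the chosen indices are distinct
    return [grid[int(i * step)] for i in range(n)]


for color in rgb_palette(5):
    print(color)
```

A discrete palette of this kind avoids the problem noted for continuous scales, where neighboring entries differ only slightly, although it makes no attempt at perceptual uniformity.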

One of the central features of the presented visualization technique is that it draws process squares with the checkered patterns containing all the colors of the processes that have influenced the process. If the number of influences is large, the on-screen space allocated for each color will be very small and thus hard to distinguish (see [WS91] for in-depth information on color perception). In order to still allow the visualization to be effective, we need a zoom function that allows the user to effortlessly view the graphical representation at different magnification levels. We have implemented a simple continuous zoom mechanism for this purpose; in the future, it may be extended to borrow techniques from the Pad [PF93] zoomable user interface and its descendants [BHP+96, BMG00].

It might be argued that using circles instead of squares would have been more in keeping with the metaphor of color pools spreading on a piece of paper. Our original intention was also to use circles, but we ultimately chose squares for a number of reasons: (i) the larger area of squares facilitates color recognition better than circles, (ii) the layout of process squares into grids is easier (no wasted space), and (iii) squares are faster to render and easier to texture map (besides, we felt it was more logical to have checkered squares rather than checkered circles).

The Growing Squares visualization makes use of animation to display the dynamic execution of the system under study. While it certainly is possible to maintain all message arrows and just draw the visualization at the end of the execution, this would result in many of these messages coinciding (as in Hasse diagrams) and thus being hard to separate from other messages, as well as being impossible to associate with a specific time. Animation solves these issues in a natural way.

Another design aspect of the Growing Squares visualization is finding suitable layout methods for arranging the individual processes. Many such layout strategies exist. For instance, if the data set represents the execution of a distributed algorithm in a network, the geographical location of the individual nodes can be used to position the squares in the visualization. Other alternatives include simple grid and circular layouts (see Figure 2.3) which may serve to minimize the amount of coinciding message arrows to a greater or lesser extent. In this chapter, we chose to ignore this aspect and selected a simple circular layout scheme that has the advantage of avoiding message arrows coinciding with each other or passing over processes.

¹ The RGB color model was chosen for simplicity, while a color model like HSV might be more suitable to human perception.

Figure 2.3: Growing Squares visualization with 20 processes.
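A circular layout of the kind mentioned above, with processes placed clockwise and P0 at the top, can be sketched in a few lines of Python. The function name `circular_layout`, the unit radius, and the y-up coordinate system are assumptions for illustration; the actual CausalViz layout code is not described in the text.

```python
import math


def circular_layout(n, radius=1.0):
    """Return (x, y) centers for n process squares on a circle.

    Processes are placed clockwise starting at the top (12 o'clock),
    in a coordinate system where y points upwards.
    """
    positions = []
    for i in range(n):
        angle = math.pi / 2 - 2 * math.pi * i / n  # start at top, go clockwise
        positions.append((radius * math.cos(angle), radius * math.sin(angle)))
    return positions


for i, (x, y) in enumerate(circular_layout(5)):
    print(f"P{i}: ({x:+.3f}, {y:+.3f})")
```

Because all centers lie on a circle, a straight message arrow between two processes is a chord of that circle and therefore cannot pass over the center of a third process.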

2.2 User Study

Our hypothesis was that the Growing Squares technique is faster and more efficient at providing an overview of the causal relations in a distributed system, and that the new technique scales better with system size than traditional methods. To test this, we conducted a formal comparative user study of the old Hasse diagram visualization and our new Growing Squares technique. The focus of this user study was to evaluate user performance on the “overview tasks”, i.e. tasks associated with the general comprehension of how a system works. We also wanted to get a subjective assessment of the two methods.


2.2.1 Subjects

In all, 12 users, four of whom were female, participated in this study. All users were carefully screened to have good computer skills and basic knowledge of distributed systems and general causal relations. In particular, knowledge of Hasse diagrams was required. Subject ages ranged from 20 through 50 years, and all had normal or corrected-to-normal vision.

2.2.2 Equipment

The study was run on an Intel Pentium III 866 MHz laptop with 256 MB of memory and a 14-inch display. The machine was equipped with an NVidia GeForce 2 Go graphics accelerator and ran Red Hat Linux 7.2.

2.2.3 Procedure

The experiment was designed as a two-way repeated-measures analysis of variance (ANOVA) with the independent variables “visualization type”, with two levels (Hasse diagrams versus Growing Squares), and “data density”, also with two levels. The two levels of data density were “sparse” and “dense” with 5 processes sending 15 messages and 30 processes sending 90 messages, respectively. The visualization type was a within-subjects factor, as was the data density. Each subject received the various task sets in different order to avoid systematic effects of practice.

The same set of four data sets was used for all subjects. Two were geared at the sparse case with 5 processes and 15 messages (one for each visualization type), and two at the dense case with 30 processes and 90 messages (see Table 2.1). The traces were all generated using a heuristic algorithm to avoid users taking advantage of special knowledge about real system traces. In the case of deducing inter-node causal relations, care was taken to ensure that the complexity of this was equivalent for both task sets of each density.

The evaluation procedure consisted of repeating overview tasks using Hasse diagrams and Growing Squares for first the sparse and then the dense data densities. The order of the visualization types was varied between subjects to minimize the impact of a learning effect. The repeated tasks for each density and visualization type are summarized in Table 2.2. Prior to starting work on each task set, subjects were given the chance to adjust the window size and placement to their liking. Subjects were informed that they should solve the tasks quickly and focus on using the visualization to get an overview of the system trace. The completion of each task was separately timed, except for the tasks Causality 1-3, which were timed together.

We enforced an 8-minute (480 seconds) time cap on the completion of each task in order to avoid excessive times skewing the results of the user study. Uncompleted or skipped tasks were set to the time cap for that particular task.

Since we were targeting overview tasks, it was not necessary for subjects to find a precise answer to each exercise. Instead, it was deemed sufficient if subjects named one of the processes in the top 20 %² for each category; i.e. for 30 processes, it was enough to pick one of the six processes that were the most influential, long-lived or influenced ones for the answer to be counted as correct. Only the Causality 1-3 tasks required a totally accurate answer.
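The scoring rules described above (the 480-second cap and the top-20 % acceptance criterion) can be summarized in a short Python sketch. The function names, the rounding up of the 20 % threshold, and the way ties are ignored are assumptions for illustration; the text only states the outcomes for 5 and 30 processes.

```python
TIME_CAP_S = 480  # 8-minute cap per task, as stated in the text


def accepted_answers(scores, top_percent=20):
    """Return the set of process ids counted as correct for an overview task.

    `scores` maps a process id to the measured quantity (e.g. duration).
    The number of accepted processes is top_percent of the system size,
    rounded up (an assumption); ties are not treated specially.
    """
    k = max(1, (top_percent * len(scores) + 99) // 100)  # integer ceiling
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def recorded_time(seconds, completed=True):
    """Time entered into the analysis: capped, or the cap itself if unsolved."""
    return min(seconds, TIME_CAP_S) if completed else TIME_CAP_S


dense = {f"P{i}": i for i in range(30)}   # hypothetical scores, 30 processes
sparse = {f"P{i}": i for i in range(5)}   # hypothetical scores, 5 processes
print(len(accepted_answers(dense)), len(accepted_answers(sparse)))
print(recorded_time(512.3), recorded_time(95.0), recorded_time(0, completed=False))
```

For the two system sizes used in the study this yields six acceptable processes in the dense case and a single one in the sparse case, matching the description.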

After having performed each task set for a density and visualization type, subjects were asked to give a subjective rating of the efficiency, ease-of-use, and enjoyability of the visualization technique. When all of the tasks were completed, the subjects responded to a final questionnaire comparing the two visualization techniques based on the previously-stated criteria (see Table 2.3).

Each evaluation session lasted approximately one hour. Subjects were given a training phase of ten minutes to familiarize themselves with the CausalViz application and the two visualization techniques. During this time, subjects were instructed in how to use the visualizations to solve various simple tasks.

Data Density | Processes | Messages
Sparse | 5 | 15
Dense | 30 | 90

Table 2.1: Experimental design. Both density and visualization factors were within subjects for all 12 subjects.

2.3 Results

After having conducted the user study, we analyzed the resulting test data. The results can be divided into two parts: the objective performance measurement, and the subjective ratings of the test subjects.

2.3.1 Performance

The mean times of performing a full task set (i.e. four tasks) using the Hasse diagrams and the Growing Squares visualizations were 416.58 (s.d. 268.99) and 334.79 (s.d. 230.86) seconds, respectively. This, however, is not a significant difference (F(1, 11) = 2.54, p = .139). The main effect for density was strongly significant (F(1, 11) = 30.99, p < .001), with means for the sparse and dense conditions of 222.96 (s.d. 77.24) and 528.42 (s.d. 272.94) seconds. Figure 2.4 summarizes the mean task results for the two visualizations across the two densities; error bars show one standard deviation above and below the mean. The figure also shows that the mean time for the task set was higher for the Hasse method across all densities. For the sparse conditions the visualization type was significant (F(1, 11) = 15.82, p = .002), with mean values of 259.50 (s.d. 75.23) and 186.42 (s.d. 62.46) seconds for the Hasse and Growing Squares visualizations. The Growing Squares method also gave better results for dense conditions; the mean times in Hasse and Growing Squares were 573.67 (s.d. 302.96) versus 483.17 (s.d. 243.94) seconds. This, however, was not a significant difference (F(1, 11) = 1.03, p = .332).

² This number was somewhat arbitrarily chosen, partly because it was felt to be an acceptable margin of error, and partly because 20 % out of 5 processes for the sparse data set translates to finding the single correct process for each task.

Task | Comments | Measure
Duration | Find the process with the longest duration. | Time
Influence 1 | Find the process that has had the most influence on the system. | Time
Influence 2 | Find the process that has been influenced the most. | Time
Causality 1-3 | Is process x causally related to process y? | Time
Q1 | Rate the visualization w.r.t. ease-of-use (1=very hard, 5=very easy). | Likert
Q2 | Rate the visualization w.r.t. efficiency (1=very inefficient, 5=very efficient). | Likert
Q3 | Rate the visualization w.r.t. enjoyability (1=very boring, 5=very enjoyable). | Likert

Table 2.2: Repeated tasks for each density and visualization type.

Task | Comments
PQ1 | Rank the visualizations w.r.t. ease of use.
PQ2 | Rank the visualizations w.r.t. efficiency.
PQ3 | Rank the visualizations w.r.t. enjoyability.

Table 2.3: Post-evaluation ranking questions.

The only exception where Hasse diagrams performed better than Growing Squares is the Duration subtask for dense systems, while our technique performed better than Hasse diagrams in all other subtasks across both densities. For the Duration subtask, the mean completion times for the sparse data set using Hasse diagrams were 30.92 seconds (s.d. 9.99) versus 21.17 seconds (s.d. 17.93) for the Growing Squares method, while the mean times for the dense set were 37.00 (s.d. 15.72) and 54.75 (s.d. 28.08), respectively. This, however, was not a significant difference for this subtask (F(1, 11) = 0.492, p = 0.498). For the Influence 1 subtask the mean completion times for the sparse data set using Hasse diagrams were 75.33 seconds (s.d. 50.71) versus 65.33 seconds (s.d. 35.93) for the Growing Squares method, while the mean times for the dense set were 234.67 (s.d. 141.81) and 157.17 (s.d. 66.85), respectively. The visualization type did not have a significant effect on the completion time for this subtask (F(1, 11) = 2.80, p = 0.122). Similarly, the Influence 2 subtask yielded mean completion times of 76.17 (s.d. 31.94) versus 47.17 (s.d. 31.65) for the sparse data set, and 165.58 (s.d. 159.21) versus 136.00 (s.d. 114.83) for the dense case. Again, the type of visualization did not have a significant effect on the completion time for this subtask (F(1, 11) = 1.062, p = 0.325). Finally, the Causality 1-3 subtask resulted in sparse mean completion times of 77.08 (s.d. 31.04) for Hasse diagrams and 52.75 (s.d. 13.32) for Growing Squares, whereas the dense means were 136.42 (s.d. 88.25) and 135.25 (s.d. 77.87), respectively. The type of visualization did not have a significant effect on the completion time for this subtask (F(1, 11) = 0.707, p = 0.418).
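As a reading aid, the reported means can be cross-checked against each other. The Python sketch below uses only the mean values quoted in this section and verifies (a) that the four subtask means in each condition add up to the task-set mean for that condition, and (b) that each main-effect mean equals the average of the two corresponding condition means. The variable names and the 0.01-second rounding tolerance are assumptions; treating a main-effect mean as the plain average of two cells relies on the fully within-subjects design with 12 subjects per cell.

```python
# Mean completion times in seconds, copied from the text.
# Order of subtasks: Duration, Influence 1, Influence 2, Causality 1-3.
SUBTASK_MEANS = {
    ("sparse", "Hasse"): [30.92, 75.33, 76.17, 77.08],
    ("sparse", "Squares"): [21.17, 65.33, 47.17, 52.75],
    ("dense", "Hasse"): [37.00, 234.67, 165.58, 136.42],
    ("dense", "Squares"): [54.75, 157.17, 136.00, 135.25],
}
TASK_SET_MEANS = {
    ("sparse", "Hasse"): 259.50,
    ("sparse", "Squares"): 186.42,
    ("dense", "Hasse"): 573.67,
    ("dense", "Squares"): 483.17,
}
MAIN_EFFECT_MEANS = {"Hasse": 416.58, "Squares": 334.79,
                     "sparse": 222.96, "dense": 528.42}
TOLERANCE = 0.01  # allow for rounding to two decimals in the reported values


def inconsistencies():
    """Return the conditions or factor levels whose reported means disagree."""
    problems = []
    for cell, parts in SUBTASK_MEANS.items():
        if abs(sum(parts) - TASK_SET_MEANS[cell]) > TOLERANCE:
            problems.append(cell)
    for level, reported in MAIN_EFFECT_MEANS.items():
        cells = [m for cell, m in TASK_SET_MEANS.items() if level in cell]
        if abs(sum(cells) / len(cells) - reported) > TOLERANCE:
            problems.append(level)
    return problems


print(inconsistencies() or "all reported means are mutually consistent")
```

The check finds no discrepancies: for example, 30.92 + 75.33 + 76.17 + 77.08 = 259.50 for the sparse Hasse condition, and (259.50 + 573.67) / 2 = 416.585, which agrees with the reported 416.58 to within rounding.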

The subjects’ comments revealed that one of the reasons for the absence of a statistically significant difference between visualizations in the dense condition was the similarity between colors. Much time was spent by subjects matching colors to each other and looking up process numbers in the color legend.

Subjects made little use of the animation controls in the Growing Squares visualization except to play it through once at the beginning of each task to gain a picture of the data set. Only a few of the subjects actively moved the timeline back and forth to solve various subtasks, and most preferred to leave the time setting at the end of the execution.

The fixed (circular) layout algorithm used in the user study turned out to be limiting when it came to comparing the size (i.e. duration) of individual processes. Users remarked that it would have been useful to be able to click and drag processes to arbitrary positions to facilitate comparison as well as to group processes into semantic clusters (i.e. clusters of the same perceived type).

2.3.2 Subjective Ratings

The subjects consistently rated Growing Squares above Hasse diagrams with respect to efficiency, ease-of-use and enjoyment. The mean response values to the five-point Likert-scale questions are summarized in Figure 2.5. The complete data analysis table is presented as Table 2.4.

The subjects’ responses to the efficiency question (Q2, Table 2.4) showed a higher rating for the Growing Squares visualization than Hasse diagrams in both sparse (means 3.83 (s.d. .39) and 2.75 (s.d. .97)) and dense data densities (means 3.13 (s.d. .68) and 1.58 (s.d. .67)). Both higher rating readings were significant (Friedman Tests, p = .0209 for the sparse case and p = .0039 for the dense case). The subjects’ response to the ease-of-use question (Q1, Table 2.4) also showed a higher rating for the Squares visualization in both sparse (means 3.92 (s.d. .67) and 2.67 (s.d. .89)) and dense data densities (means 2.79 (s.d. .78) and 1.46 (s.d. .66)). Both higher rating readings were significant (Friedman Tests, p = .0094 for the sparse case and p = .0015 for the dense case). The subjects’ response to the enjoyment question (Q3, Table 2.4) also showed a higher rating for the Squares visualization in both sparse (means 3.92 (s.d. .79) and 3.00 (s.d. .43)) and dense data densities (means 3.25 (s.d. .75) and 1.92 (s.d. .67)). Both higher rating readings were significant (Friedman Tests, p = .0094 for the sparse case and p = .0015 for the dense case).

Figure 2.4: Mean task completion times (in seconds) for all tasks across the Hasse and Growing Squares methods and across levels of density. Error bars show standard deviations.

Figure 2.5 shows, not surprisingly, that the density of the data set strongly influenced the subjects’ responses to each question for both visualizations. This difference is reliable for all but the enjoyability question (Friedman Tests). The subjects’ response to this question (Q3, Table 2.4) when using the Growing Squares visualization shows a higher rating when sparse data sets are considered (means 3.92 (s.d. .79) for sparse sets and 3.25 (s.d. .75) for dense sets), but this is not a significant difference (p > .05).

The final ranking questionnaire shows that most subjects preferred the Growing Squares technique over Hasse diagrams with regard to ease of use, efficiency, and enjoyment (Table 2.5). Overall, the results from this ranking are very favorable for the Growing Squares method.

2.4 Caveats of Growing Squares

The Growing Squares technique is based on animation, colors and patterns to improve the perception of causality in distributed systems, and the results from the user study show that the technique is consistently faster and more efficient than Hasse diagrams. This difference, however, is not statistically significant for the general case, although it is significant for the sparse data set case. While there clearly is room for improvement, the Growing Squares visualization technique is nevertheless an improvement over conventional Hasse diagrams.


Figure 2.5: Responses to Q1-Q3 5-point Likert-scale questions across sparse and dense data densities for the Hasse and Growing Squares methods.

Question | Hasse diagrams, sparse | Hasse diagrams, dense | Growing Squares, sparse | Growing Squares, dense
Q1. Ease-of-use rating | 2.67 (.89) | 1.46 (.66) | 3.92 (.67) | 2.79 (.78)
Q2. Efficiency rating | 2.75 (.97) | 1.58 (.67) | 3.83 (.39) | 3.13 (.68)
Q3. Enjoyability rating | 3.00 (.43) | 1.92 (.67) | 3.92 (.79) | 3.25 (.75)

Table 2.4: Mean (standard deviation) responses to 5-point Likert-scale questions. Reliability is defined as being significant at the .05 level.

However, as indicated by the user study, the Growing Squares technique has a number of issues. First and foremost, since the method is dependent on a simple color coding for each process in a system, it is often very difficult to distinguish individual processes in a large system due to the similarity of the colors. This problem is exacerbated by the fact that Growing Squares presents the influences of a single process as colored pixels in a checkered pattern on each square, meaning that each influence can become arbitrarily small due to limited screen space (this problem is partially solved using a continuous zoom mechanism, however). And finally, a Growing Squares visualization does not explicitly communicate the absolute timing of events or process startup or shutdown; this must be manually deduced by studying the animated execution of the system.

Question | Prefer GS?
PQ1 Rank visualizations w.r.t. ease-of-use. | 92 %
PQ2 Rank visualizations w.r.t. efficiency. | 83 %
PQ3 Rank visualizations w.r.t. enjoyability. | 92 %

Table 2.5: Subject responses to ranking the two visualizations.

Figure 2.6: Growing Squares visualization of the dynamic execution of a 5-process distributed system (panels (a) through (e)).


Chapter 3

Growing Polygons

Visualizing the causal relations in a system consisting of n processes using the Growing Polygons [ET03a] technique is done by placing n-sided polygons (so-called process polygons) representing the individual processes on the sides of a large n-sided polygon (the layout polygon). Instead of using a linear timeline, as in Hasse diagrams, the time parameter is mapped to the size of each process polygon so that they grow from zero to maximum size as time proceeds from the start to the end of the execution, just like in the Growing Squares technique. The visualization is animated to allow the user to study the dynamics of the execution, and the discrete time steps are shown as dashed or greyed-out “age rings” in the interior of each polygon. In addition to this, each process polygon is divided into triangular sectors, with every process in the system being assigned a color and a specific sector in the polygon. This sector also corresponds to the side where the process polygon is positioned on the layout polygon. Whenever the process represented by a particular polygon is active, the appropriate time segments of the associated sector in the polygon will be filled in with the process color. Messages between processes in the system are shown as arrows travelling from the source polygon to the destination, and will activate the corresponding sector in the destination polygon with the color of the source process. In other words, a message sent from process A to process B will contaminate A’s sector in B starting from the time the message was received.

Figure 3.1 shows an example of a simple 3-process system (consisting of processes P0, P1, and P2) where each process is represented by a triangle partitioned into three sectors, and with the process triangles positioned on the sides of a larger layout triangle. For each process triangle, the process’s own sector has been marked with a thick black outline, and the interior of each polygon has also been segmented to show the discrete time steps of the execution. In addition, the processes have been assigned the colors red, green, and blue, respectively. In this example, we see how P0 sends a message to P1 at t = 0 that reaches the destination process at time t = 1, establishing a causal relation between the two nodes. Notice how for all times t ≥ 1, P0’s sector within P1’s process triangle is now filled, signifying this influence. By studying the polygons at t = tend, i.e. the end of the execution, we can get a clear picture of the flow of information within the system.


Figure 3.1: Growing Polygons visualization with n = 3 (i.e. the process polygons are triangles).

As we ascertained earlier, causal relations are transitive, so if A → B and B → C, then A → C. Figure 3.1 also shows how this is expressed in the Growing Polygons visualization. At time t = 2, process P2 receives a message from P1. P1 has already been influenced by P0 in the previous interaction (in other words, there is already a causal relation between P0 and P1). Thus, the process triangle of P2 now shows causal influences in all of its process sectors, including the transitive dependency on P0, not just the direct dependency on P1 which sent the actual message.

The simple execution in Figure 3.1 also gives information about the absolute lifecycles of the three processes. By studying the filled segments of each process triangle’s own sector, we note that only process P0 executed from the start to the end of the system trace; processes P1 and P2 were kickstarted by external messages at times t = 1 and t = 2, respectively. In fact, unlike the Growing Squares technique, the new method allows users to deduce the exact timing of all events in a system since the age rings in the interior of each polygon are fixed to absolute times.
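The sector bookkeeping behind the Figure 3.1 example can be modelled with a small Python sketch. It is an illustration under stated assumptions rather than the CausalViz code: the names `ProcessPolygon` and `deliver` are hypothetical, each sector is reduced to the time from which it is filled, process termination is not modelled, and the send time t = 1 of the second message is assumed (the text only gives its arrival time, t = 2).

```python
class ProcessPolygon:
    """Hypothetical model of one process polygon with one sector per process."""

    def __init__(self, index, n):
        self.index = index
        # filled_since[s] is the time from which sector s is filled, or None
        # if process s has not (yet) influenced this process.
        self.filled_since = [None] * n

    def start(self, t):
        """Begin filling the process's own sector at time t if not yet running."""
        if self.filled_since[self.index] is None:
            self.filled_since[self.index] = t


def deliver(src, dst, t_send, t_recv):
    """Deliver a message sent at t_send and received at t_recv."""
    dst.start(t_recv)  # an incoming message kickstarts an idle process
    for sector, since in enumerate(src.filled_since):
        # Transfer the sender's own sector and every influence it held
        # at the time of sending (this is what makes the view transitive).
        if since is not None and since <= t_send:
            current = dst.filled_since[sector]
            if current is None or t_recv < current:
                dst.filled_since[sector] = t_recv


n = 3
polygons = [ProcessPolygon(i, n) for i in range(n)]
polygons[0].start(0)                                   # P0 runs from the start
deliver(polygons[0], polygons[1], t_send=0, t_recv=1)  # P0 -> P1
deliver(polygons[1], polygons[2], t_send=1, t_recv=2)  # P1 -> P2 (send time assumed)

for poly in polygons:
    print(f"P{poly.index}: {poly.filled_since}")
```

In this model P2 ends up with all three sectors filled from t = 2, mirroring the transitive dependency on P0, and the start times of the own sectors (0, 1, and 2) correspond to the lifecycles read off the age rings.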

Just like the Growing Squares technique, the Growing Polygons technique offers a view of the transitive closure of the node dependencies and influences, facilitating analysis of global information flow in the system (and not just locally, as for Hasse diagrams). The visualization is animated and can thus also avoid many of the message intersection problems of Hasse diagrams. In addition to this, by assigning not only a color but also a fixed polygon sector to each process, the Growing Polygons method largely remedies the difficulties of distinguishing colors that plague the Growing Squares technique. Thus, the new method is considerably more scalable than the old one since it is now enough that two similar colors are not placed in adjacent sectors for a user to be able to separate them.

Now let us study a full example to see the Growing Polygons visualization in action. Figure 3.5 shows a sequence of screenshots taken at the discrete time steps of the execution of a 5-process system (in the real visualization, these images are smoothly animated). The processes are laid out in clockwise order with P0 at the top right. In (a), at t = 1, we see that all processes except P0 are executing and sending messages (the process sector of P0 is empty). However, a message from P1 is just about to reach P0 and will activate it starting from this point in time. Screenshot (b) shows the subsequent situation at t = 2, where P0 now has begun executing and exhibits a causal dependence on the green process (P1) that started it, and where P4 similarly shows a dependence on P3 (P3’s sector in P4’s process polygon is filled in from time step 1 onwards). Moving to t = 3 in (c), we see more causal dependencies appearing in the process polygons of the various nodes, the transitive dependencies in both P1 (cyan from P3) and P3 (green from P1) being of special interest. We can also observe that process P2 appears to have stopped executing since it is no longer filling up its own process sector. Image (d) displays the situation one time step later (t = 4), where the two messages from the inactive P2 finally reach P0 and P4, respectively, and image (e) shows the final situation at t = 5, with the causal dependencies in the system plainly visible.

3.1 Design

One of the weaknesses of the original Growing Squares method that limited its scalability was the difficulty of distinguishing between different process colors. To remedy this problem, the Growing Polygons technique also assigns a unique triangular sector to each process. Nevertheless, for our method to work efficiently, adjacent process sectors should not have similar colors, or users can easily mistake one process for another. Just like in the Growing Squares case, we opted for a straightforward non-continuous distribution of colors across the RGB spectrum.

While our new method does not exhibit the same congestion of screen space that plagues Growing Squares, where a much-influenced process square simply cannot convey all of its influences in its limited screen space, there are instances where even Growing Polygons fails at this. For example, when visualizing a large system with many processes, the angle (θ = 360°/n) assigned to each process sector will be small, making it difficult to distinguish events early on in the execution. The same is also true if the time span of the execution is long, since the layout algorithm will then have to scale each time step to fit inside the allocated maximum size of each polygon. To cope with these two situations, the Growing Polygons visualization retains the simple continuous zoom mechanism of the Growing Squares technique, allowing users to zoom in arbitrarily close in order to distinguish details in the visualization.
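To make the two scalability limits concrete, the following Python sketch computes the sector angle θ = 360°/n and the radial space left for each time step. It is a rough illustration under stated assumptions: the function names are hypothetical, the maximum polygon radius is given in arbitrary screen units, and time steps are assumed to be scaled uniformly (the text only states that they are scaled to fit).

```python
def sector_angle_degrees(num_processes):
    """Angle of one process sector: theta = 360 / n degrees."""
    return 360.0 / num_processes


def ring_thickness(max_radius, num_time_steps):
    """Radial space per time step, assuming uniform scaling to the maximum radius."""
    return max_radius / num_time_steps


# The two densities used in the user study: 5 and 20 processes.
for n in (5, 20):
    print(n, "processes ->", sector_angle_degrees(n), "degrees per sector")

# Hypothetical polygon radius of 100 screen units with short and long executions.
for steps in (5, 50):
    print(steps, "time steps ->", ring_thickness(100.0, steps), "units per step")
```

Both quantities shrink inversely with the size of the system, which is why the zoom mechanism is needed for large or long-running executions.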

The decision to use animation in the Growing Polygons technique was mainly grounded in the wish to avoid a maze of criss-crossing message arrows (like in Hasse diagrams). At the end of the system execution, no message arrows at all are visible, facilitating easy study of the inter-process dependencies in the system. Animation allows the user to still see the dynamic execution of the system in an intuitive way, just like in the Growing Squares technique.

3.2 User Study

Our intention with the Growing Polygons technique was to provide an efficient way of viewing the flow of information and the node dependencies in a system of communicating processes. In order to check whether our method performs better than existing methods, we conducted a comparative user study between Hasse diagrams and Growing Polygons. The study involved test subjects that were deemed representative of the target audience, and consisted of having them solve problems using the two techniques. Timing performance and correctness were measured, as well as the subjective ratings of individual users.

3.2.1 Subjects

In all, 20 users, 15 of whom were male, participated in this study. All users were screened to have good computer skills and at least basic knowledge of distributed systems and general causal relations. Subject ages ranged from 20 through 50 years old, and all had normal or corrected-to-normal vision (one person claimed partial color blindness but was still able to carry out the test). Ten of the subjects had participated in our earlier user study of the Growing Squares technique.

3.2.2 Equipment

For this study, we used the same equipment as for the Growing Squares user study (see Section 2.2.2).

3.2.3 Procedure

As before, the experiment was a two-way repeated-measures analysis of variance (ANOVA) for the independent variables “visualization type” (Hasse diagrams versus Growing Polygons) and “data density” (sparse versus dense). The sparse data density consisted of system executions involving 5 processes and 15 messages, while the dense data density involved 20 processes and 60 messages (see Table 3.1). All subjects were given the same four task sets split into the two density classes. The system trace for each task set was generated using a simple randomized heuristic algorithm to avoid subjects taking advantage of special knowledge about the behavior of a particular distributed system. In addition, care was taken to ensure that the complexity of both system traces for a specific data density was roughly equivalent by removing ambiguities and ensuring that the number of indirect relations was the same.

The procedure consisted of solving two of the four task sets using conventional Hasse diagrams, and the other two using the Growing Polygons technique. Sparse task sets were solved first, followed by the respective dense sets. In order to minimize the impact of learning effects, half of the subjects used the Hasse diagrams first, while the other half used the Growing Polygons first. The task sets themselves consisted of four tasks that were directly based on our previous user study of Growing Squares (see Table 2.2 for an overview). Subjects were given the opportunity to freely adjust window size and placement prior to starting work on each task set. Furthermore, subjects were instructed to solve each task quickly but thoroughly, and were allowed to ask questions during the course of the procedure. Each individual task in a task set was timed separately, except for the tasks Causality 1-3, which were timed together. In addition, answers were checked and the correctness ratio was recorded for each task.

In order to avoid runaway times on troublesome tasks, completion times were limited to 10 minutes (600 seconds). If a test subject chose for some reason to skip a task, the completion time for that task was set to this cap.

After each completed task set, each subject was given a short questionnaire of three 5-point Likert-scale questions asking for their personal opinion on the usability, efficiency, and enjoyability of the visualization method they had just used (see tasks Q1 to Q3 in Table 2.2). The purpose of this questionnaire was to measure how users’ ratings of the visualizations changed depending on the data density. In addition, users also filled out a post-evaluation questionnaire after having completed all of the task sets, where they were asked to rank the two visualizations on the above criteria (see Table 3.2).

Each evaluation session lasted approximately 45 minutes. Prior to starting the evaluation itself, subjects were given a training phase of up to ten minutes where they were given instructions on how to use both visualization methods to solve various simple tasks.

Data Density    Processes    Messages
Sparse          5            15
Dense           20           60

Table 3.1: Experimental design. Both density and visualization factors were within subjects for all 20 subjects.

Task    Comments
PQ1     Rank the visualizations w.r.t. ease of use.
PQ2     Rank the visualizations w.r.t. efficiency for solving the following tasks:
        (a) Duration analysis
        (b) Influence importance (most influential)
        (c) Influence assessment (most influenced)
        (d) Inter-node causal relations
PQ3     Rank the visualizations w.r.t. enjoyability.

Table 3.2: Post-evaluation ranking questions.

3.3 Results

The analysis of the results we obtained from the aforementioned user study can be divided into the timing performance, the correctness, and the subjective ratings of the test subjects.

3.3.1 Performance

The mean times for solving a full task set (i.e. all four tasks) using Hasse diagrams and the Growing Polygons visualization were 433.90 (s.d. 378.59) and 251.85 (s.d. 174.88) seconds, respectively. This is a statistically significant difference (F(1, 19) = 20.118, p < .001). The main effect for density was also significant (F(1, 19) = 26.932, p < .001), with means for the sparse and dense conditions of 191.80 (s.d. 87.57) and 493.95 (s.d. 359.35) seconds.
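For readers unfamiliar with the F(1, 19) notation, the following Python sketch shows how such a statistic can be obtained for a within-subjects factor with two levels: the F value equals the square of the paired t statistic computed on per-subject differences, with 1 and n − 1 degrees of freedom. This is a generic textbook computation, not the analysis script used for the study; the function name and the four-subject sample values are purely hypothetical and are not data from the experiment. In a two-way design, each subject's times would first be averaged over the other factor.

```python
from math import sqrt
from statistics import mean, stdev


def paired_f_statistic(condition_a, condition_b):
    """F statistic for a two-level within-subjects factor.

    Each list holds one value per subject, in the same subject order.
    Returns (F, (df_effect, df_error)); F is the squared paired t statistic.
    """
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t * t, (1, n - 1)


# Hypothetical completion times (seconds) for four subjects; NOT study data.
hypothetical_a = [10.0, 12.0, 14.0, 16.0]
hypothetical_b = [8.0, 9.0, 13.0, 13.0]
f_value, dof = paired_f_statistic(hypothetical_a, hypothetical_b)
print(f_value, dof)
```

With 20 subjects, as in this study, the degrees of freedom come out as (1, 19), matching the form of the statistics reported above.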

Figure 3.2 summarizes the mean task results for the two visualizations across the two densities; error bars show the standard deviation above and below the mean. The figure also shows that the mean time for the task set was higher for the Hasse method across all densities. For the sparse condition, the mean completion times were 234.40 (s.d. 87.09) and 149.20 (s.d. 65.85) seconds for the Hasse and Growing Polygons visualizations. The Growing Polygons method also gave better results for dense conditions, with mean values of 616.05 (s.d. 550.60) seconds for the Hasse visualization versus 354.50 (s.d. 190.41) seconds for Growing Polygons.

The one exception where Hasse diagrams performed better than Growing Polygons was for the Duration subtask across both densities, with sparse set mean times of 25.75 (s.d. 10.39) for Hasse diagrams versus 33.95 (s.d. 17.47) for Growing Polygons, and for the dense set, 34.40 (s.d. 18.54) versus 72.35 (s.d. 36.06) seconds. This difference was also significant (F(1, 19) = 26.943, p < .001).

For the Influence 1 subtask, on the other hand, the mean completion times for the sparse data set were 58.50 seconds (s.d. 22.25) using Hasse diagrams versus 36.65 seconds (s.d. 17.93) for the Growing Polygons method, while the mean times for the dense set were 270.60 (s.d. 180.XX) and 169.70 (s.d. 140.72), respectively. This was a significant difference (F(1, 19) = 14.614, p = 0.001). Similarly, the Influence 2 subtask yielded mean completion times of 77.64 (s.d. 53.58) versus 34.35 (s.d. 30.47) for the sparse data set, and 184.10 (s.d. 207.05) versus 50.85 (s.d. 26.61) for the dense case. Again, this was a significant difference in favor of the Growing Polygons method (F(1, 19) = 14.170, p = 0.001). Finally, the Causality 1-3 subtask resulted in sparse mean completion times of 72.50 (s.d. 29.28) for Hasse diagrams and 44.25 (s.d. 19.68) for Growing Polygons, whereas the dense means were 144.30 (s.d. 116.37) and 61.60 (s.d. 40.88), respectively. This was also a significant difference (F(1, 19) = 18.896, p < 0.001).

[Bar chart. Categories: Hasse Sparse, Polygons Sparse, Hasse Dense, Polygons Dense. Vertical axis: Time (sec), 0 to 1200.]

Figure 3.2: Mean task completion times for all tasks across the Hasse and Growing Polygons methods and across levels of density. Error bars show standard deviations.

[Bar chart. Categories: Hasse Sparse, Polygons Sparse, Hasse Dense, Polygons Dense. Vertical axis: Correct answers (mean, s.d.), 0 to 7, with the maximum marked.]

Figure 3.3: Mean correctness for all tasks across the Hasse and Growing Polygons methods and across levels of density. Error bars show standard deviations.

3.3.2 Correctness

The average number of correct answers when solving a full task set (i.e. six tasks) using Hasse diagrams and the Growing Polygons visualization was 4.375 (s.d. 1.148) versus 5.625 (s.d. 0.667) correct answers, respectively. This is a significant difference (F(1, 19) = 46.57, p < .001). For the sparse data set, the mean correctness was 4.70 (s.d. 1.218) for Hasse diagrams and 5.75 (s.d. 0.716) for Growing Polygons, versus 4.05 (s.d. 0.999) and 5.50 (s.d. 0.607) for the dense case. In fact, the mean correctness of the Growing Polygons visualization is significantly better than for Hasse diagrams for all individual subtasks except for the Duration subtask, where Hasse performs better with a correctness ratio of 0.975 versus 0.950 for Growing Polygons. This, however, is not a significant difference (F(1, 19) = 0.322, p = .577). See Figure 3.3 for a diagram of the mean correctness values for all tasks.

3.3.3 Subjective Ratings

For the post-task questionnaire, the test subjects consistently rated Growing Polygons above Hasse diagrams in all regards, including efficiency, ease-of-use and enjoyment. The mean response values to the five-point Likert-scale questions are summarized in Figure 3.4. See Table 3.3 for the complete data analysis table.

The subjects’ responses to the ease-of-use question (Q1, Table 3.3) showed a higher rating for the Growing Polygons visualization than Hasse diagrams in both sparse (means 4.20 (s.d. .70) and 2.75 (s.d. .85), respectively) and dense data densities (means 3.75 (s.d. .79) and 1.90 (s.d. .91)). Both higher ratings were significant (Friedman Tests, p < .001 for both the sparse and dense cases). The subjects’ responses to the efficiency question (Q2, Table 3.3) showed a higher rating for the Growing Polygons visualization in both sparse (means 4.20 (s.d. .62) and 2.40 (s.d. .88)) and dense data densities (means 3.95 (s.d. .51) and 1.55 (s.d. .51)). Both higher ratings were significant (Friedman Tests, p < .001 for the sparse case and p < .001 for the dense case). The subjects’ responses to the enjoyment question (Q3, Table 3.3) also showed a higher rating for the Growing Polygons visualization in both sparse (means 4.20 (s.d. .62) and 2.95 (s.d. .39)) and dense data densities (means 4.10 (s.d. .64) and 2.00 (s.d. .73)). Both higher ratings were significant (Friedman Tests, p < .001 for the sparse case and p < .001 for the dense case).

The results from the post-task summary questionnaire can be found in Table 3.4, and clearly show that test subjects regard the Growing Polygons technique as superior to Hasse diagrams in all aspects except for duration analysis (task PQ2 (a)). However, as can be seen from this table, the overall user rankings are very convincingly in favor of our method.

[Three bar charts, one each for Q1, Q2, and Q3. Categories: sparse, dense. Series: Hasse, Polygons. Vertical axis: 0.00 to 5.00.]

Figure 3.4: Responses to Q1-Q3 5-point Likert-scale questions across sparse and dense data densities for the Hasse and Growing Polygons methods.

                           Hasse diagrams              Growing Polygons
Question                   sparse        dense         sparse        dense
Q1. Ease-of-use rating     2.75 (.85)    1.90 (.91)    4.20 (.70)    3.75 (.79)
Q2. Efficiency rating      2.40 (.88)    1.55 (.51)    4.20 (.62)    3.95 (.51)
Q3. Enjoyability rating    2.95 (.39)    2.00 (.73)    4.20 (.62)    4.10 (.64)

Table 3.3: Mean (standard deviation) responses to 5-point Likert-scale questions. Reliability is defined as being significant at the .05 level.

3.4 Caveats of Growing Polygons

While the Growing Polygons technique is clearly an improvement over both the classic Hasse diagram as well as our earlier alternative causality visualization technique, Growing Squares, there is certainly room for improvement. Specifically, for very large systems of causal relations with large numbers of processes spanning a long period of time, the Growing Polygons technique may perform poorly, simply due to the limited screen estate available to display a potentially huge amount of data. In order for our technique to be able to handle these extremes, we have to consider implementing mechanisms for space distortion and hierarchical node clustering into the technique.

Task    Comment             GP       Hasse    Undec.
PQ1     Ease-of-use         95 %     0 %      5 %
PQ2     Efficiency (avg)    80 %     11 %     9 %
        (a) Duration        35 %     40 %     25 %
        (b) Importance      90 %     5 %      5 %
        (c) Assessment      95 %     0 %      5 %
        (d) Causality       100 %    0 %      0 %
PQ3     Enjoyability        100 %    0 %      0 %

Table 3.4: Subject responses to ranking the two visualizations.

[Five screenshots, panels (a) to (e).]

Figure 3.5: Growing Polygons visualization of the dynamic execution of a 5-process distributed system.

Chapter 4

CiteWiz: Citation Network Visualization

One of the key tasks of scientific research is the management and study of existing work in a given field of inquiry. The specific nature of the tasks involved in this venture varies greatly depending on the situation and the role of the researcher; for a new student just entering a research area, the task is that of orientation within the existing work; for a reviewer, one of originality and correctness checking; for a conference organizer, one of chronological survey; and, finally, for an experienced scientist, one of staying abreast of new developments and identifying current hot topics in his or her area of choice. Researchers spend a considerable portion of their time on these tasks, ample evidence that it is in everyone’s best interest to streamline this process as much as possible, and that large time savings can be made.

The highly connected and highly contextual nature of citation networks and the large amounts of data to be displayed suggest that techniques of information visualization could successfully be brought to bear in this area.

In this chapter, we present CiteWiz, a tool for bibliographic visualization of the chronology and influences in networks of scientific articles. The primary visualization in CiteWiz is an implementation of the Growing Polygons [ET03a] causality visualization technique, suitably adapted to the context of citation data, but the architecture is sufficiently flexible to allow other visualizations to be plugged in. The tool was designed for use by researchers, scientists, and students alike, and its baseline features were established through extended discussions in a focus group consisting of such users. Guided by these discussions, we created a prototype implementation of the tool with a user interface that allows for normal browsing and filtering of the citation meta-data as well as building hierarchical views of the dataset for visualization. We have conducted a formal user study to assess the efficiency of the tool in comparison with standard web-based database interfaces. Our results indicate that CiteWiz is as efficient as standard database interfaces for low-level analysis tasks such as finding papers and correlating authors, and significantly more efficient than standard databases for higher-level analysis tasks related to overviews and influences of bibliographical data.

The original Growing Polygons technique was designed for visualization of general causal relations, so we modified it slightly to be able to handle articles and citations. We chose an article-centered approach (as opposed to an author-centered one), where the articles themselves are the active entities (represented by processes), and citations are the information-bearing messages between them. To allow the technique to cope with potentially huge datasets, we also improved its scalability in two different ways: we implemented multi-level process hierarchies for grouping sets of articles together, and we added a focus+context technique with variable time scale to handle long event histories. The visualization was accordingly supplemented with a number of interaction techniques to support these new features as well as interaction techniques targeted specifically at citation visualization; these include collapsing and expanding the group hierarchy, navigating in the citation network by following backward and forward references, and getting details-on-demand of the complete bibliographical data for a specific paper.

In addition to this, CiteWiz also contains a static influence visualization that renders a timeline of the articles or authors in the citation database, scaling their size and color depending on the number of citations and the citation density (see Figure 4.6). In this way, the authors (or articles) in the database form a “human pyramid” allowing users to easily see who the giants in the field are, and on whose work they rest.

This chapter begins by explaining how to model scientific citations as causal relations. We then describe the CiteWiz system, an application built on top of the CausalViz framework. We describe the two visualization techniques included in the application, an influence visualization and a static timeline diagram. We then present the formal comparative user study and the results we extracted from this. We end the chapter with conclusions and our plans for future work.

4.1 Citations as Causal Relations

A causal ordering is a general relation that relates two events where one is the cause of the other. We can interpret citations in scientific articles as causal orderings in at least two different ways; either with authors as the active entities (processes) and their papers as events, or with papers as the active entities and a single event marking the paper’s publication for each entity. For both cases, we represent citations by causal relations between the events. In this thesis, we have chosen the latter approach for the simple reason that the former causes problems with the visualization when authors combine to work together on a paper; thus, our visualization is fundamentally article-focused instead of author-focused.

Seeing that a citation in a scientific article can be modeled by a causal relation is quite straightforward; a citation implies that (a) the authors have read the cited paper (and thus, indirectly, that the cited paper existed before the citing paper), and that (b) the citing paper has a dependency on the cited paper. Admittedly, mutual citations cannot be represented and must be either removed entirely or broken arbitrarily. In this thesis, we will use the term influence, which is a slightly relaxed interpretation of causality in this context: if a paper A cites a paper B, the authors of A have been influenced (in some undefined way) by paper B, and this is reflected in the paper (put shortly, A has been influenced by B).
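The influence relation and the treatment of mutual citations can be illustrated with a small Python sketch. It is not taken from CiteWiz: the function names, the dictionary representation of the citation network, and the choice to remove mutual citations entirely (rather than break them arbitrarily) are assumptions made for this example.

```python
def remove_mutual_citations(citations):
    """Drop every citation A -> B for which B also cites A.

    citations maps a paper id to the list of paper ids it cites.
    """
    return {
        paper: [c for c in cited if paper not in citations.get(c, [])]
        for paper, cited in citations.items()
    }


def transitive_influences(citations):
    """For each paper, collect all papers that influenced it, directly or indirectly."""
    influences = {paper: set(cited) for paper, cited in citations.items()}
    changed = True
    while changed:  # repeat until no set grows any further (fixed point)
        changed = False
        for paper, sources in influences.items():
            inherited = set()
            for source in sources:
                inherited |= influences.get(source, set())
            if not inherited <= sources:
                sources |= inherited
                changed = True
    return influences


# Hypothetical network: A cites B, B cites C, and D and E cite each other.
raw = {"A": ["B"], "B": ["C"], "C": [], "D": ["E"], "E": ["D"]}
cleaned = remove_mutual_citations(raw)
influences = transitive_influences(cleaned)
print(influences["A"])
```

In this example, paper A ends up influenced by both B and C, while the mutual citations between D and E are discarded before the influences are computed.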

4.2 The CiteWiz Platform

The CiteWiz system is a modularized bibliographic visualization platform based on a central citation dataset and a number of dataset views that can be used as input for the available visualization techniques. The primary visualization technique in CiteWiz is an adaptation of the Growing Polygons causality visualization method, but the platform has been designed to be easily extensible with new visualizations. One such visualization extension is the Newton’s Shoulders diagram that provides a static timeline of the authors and articles in the database. Based on the taxonomy described in Section 1.3.2, we developed the tool to be primarily article-focused, meaning that we emphasize the visualization of articles and their interdependencies, but sufficient provisions exist for author-, group-, and event-focused user tasks as well.

An important point to note in the following description of the CiteWiz platform is that this is a system and not a visualization technique, and that many of the features (such as the tree view, the node-link arrows, and the navigation window) were added for convenience and flexibility, not necessarily to prove the purity of the platform.

4.2.1 Datasets and Views

CiteWiz has a central citation dataset that is used for all queries and visualizations. Each entry in the set is a name/value pair, with fields for the conventional attributes such as title, authors, source (i.e. journal or proceedings name), keywords, abstract, etc. Entries also have a list of references to other entries cited in the paper. The dataset is loaded from disk using a simple XML-based file format for citation meta-data that was designed for the InfoVis 2004 contest [FGP04]. This file format is basically a flat list of the bibliographical entries in the dataset.

Users can browse, filter, sort, and search the CiteWiz citation database. In addition, users can also build views of the dataset for visualization; these are essentially subsets of the central dataset with the extra capability of containing hierarchical groups of bibliographical entries. This makes it possible to build complex structures of nested groups according to some criteria relevant to the user; for instance, when studying a dataset containing citation data for a specific conference over a period of time, one might create groups for each conference year, and the papers could then be arranged in subgroups representing the different sessions for each conference. Other groupings are possible and depend on the user’s goals. For instance, when performing author-focused tasks, it might be useful to create groups for each author in the dataset and add their papers, allowing for easy study of author chronology and collaboration.
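The relationship between the flat central dataset and a hierarchical view can be sketched with two small Python data classes. This is a hypothetical model for illustration only: CiteWiz itself is a C++ application, and the class names, field names, and sample entries below are assumptions rather than its actual data structures or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """One bibliographical entry in the flat central dataset."""
    entry_id: str
    title: str
    authors: List[str]
    source: str
    references: List[str] = field(default_factory=list)  # ids of cited entries


@dataclass
class Group:
    """A node in a view: refers to entries by id and may nest further groups."""
    name: str
    entry_ids: List[str] = field(default_factory=list)
    subgroups: List["Group"] = field(default_factory=list)

    def all_entry_ids(self) -> List[str]:
        """Collect the ids in this group and, recursively, in all nested groups."""
        ids = list(self.entry_ids)
        for subgroup in self.subgroups:
            ids.extend(subgroup.all_entry_ids())
        return ids


# Hypothetical dataset: three papers from two years of one conference.
dataset = {
    "p1": Entry("p1", "Paper One", ["Author A"], "Conference X 2001"),
    "p2": Entry("p2", "Paper Two", ["Author B"], "Conference X 2001", ["p1"]),
    "p3": Entry("p3", "Paper Three", ["Author A", "Author B"],
                "Conference X 2002", ["p1", "p2"]),
}

# A view grouped by conference year and then by session.
view = Group("Conference X", subgroups=[
    Group("2001", subgroups=[Group("Session 1", ["p1"]),
                             Group("Session 2", ["p2"])]),
    Group("2002", subgroups=[Group("Session 1", ["p3"])]),
])
print(view.all_entry_ids())
```

Because the view stores only identifiers, several views (for example, one per author) can be built over the same dataset without duplicating the bibliographical data.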

Views can be saved to and loaded from disk using another straightforward XML format; each view file is associated with a specific dataset file, and uses the internal identifiers to refer to bibliographical entries in the dataset.

The CiteWiz tool does not currently contain any functionality for automatic construction of hierarchical views; all views must be manually defined. However, clustering algorithms for building views automatically could be worthwhile future extensions to the tool.

4.2.2 Implementation

The CiteWiz tool is implemented as a C++ application running under the Linux operating system, but should be easily portable to other platforms. It uses standard OpenGL for efficient 2D rendering, and the GTK+/GTK-- library for the graphical user interface components. Most of the data management and the visualization technique implementations are part of the CausalViz framework.

4.3 Influence Visualization

Views built by the user form the input for the visualization techniques supported by CiteWiz. As mentioned above, the primary visualization technique is currently the Growing Polygons [ET03a] method for visualization of general causal relations, suitably modified to be able to handle citation networks and the scalability issues associated with these. Please refer to Chapter 3 for a detailed description of this technique.

In our adaptation of the original technique, articles form the processes in the visualization (thus represented by article polygons), and citations are messages from a source (cited) article to a destination (citing) article. This mimics the information transfer implicit when authors reference another paper. Even if articles are more or less static once published, this article-focused approach gives us a way to easily see the influences and chronology of a set of articles, including global transitivity information for each article.

In order to make effective use of the Growing Polygons method in this context, we were forced to address two scalability issues in relation to (i) long execution times, and (ii) large quantities of visualized articles. For the former issue concerning time scalability, the problem is that visualizing a large citation network may result in very long chains of causality, and the visualization will then run out of space for displaying individual time segments. For the latter case, the quantity scalability issue comes from the fact that visualizing a sufficiently large number of articles means that each individual article gets assigned a very small polygon sector and it will thus be difficult to distinguish between neighboring sectors. Both of these issues can be partially addressed through zooming mechanisms, but this instead results in loss of overview.

Our solution for these concerns in the modified, more scalable version of the Growing Polygons method is two-fold: we introduce a focus+context [Fur86] technique based on adjustable linear time windows that lets the user concentrate on certain areas of the execution while still retaining the context of the surrounding history. Furthermore, we address the quantity concern by modifying the Growing Polygons technique to handle hierarchical views instead of flat article lists (this was our incentive for the distinction between datasets and views in the design of CiteWiz).

4.3.1 Linear Time Windows

As stated above, our solution to the time scalability problem is a focus+context technique based on a non-linear time scale and user-controlled linear time windows. Let T be the total number of time units in the execution we are studying. Each time window will then display k time units using a normal linear time scale (if k ≥ T, we have the standard Growing Polygons technique). A specific ratio r of the maximum radius Rmax of each article polygon is reserved for the time window, and the remaining space is distributed among the T − k time units outside of the time window. These peripheral time segments flanking the time window are called history panels. The user can control the parameters k and r, and can furthermore also control the index i, which is the index of the first time unit that is inside the time window. Normally, the user wants the window to show the k latest time steps in the execution, but it is useful to be able to change i to focus on different parts of the execution. Figure 4.1 shows an example of an article polygon with a linear time window centered on the middle of the time execution (i = 4).

The history panels flank the time window and provide the surrounding context to the user, including both future and past events (the panels are accordingly referred to as the future and past history panels). We distribute the remaining 1 − r ratio of the maximum polygon radius simply by allocating a fixed space to each time unit outside the window, equal to a fraction (1 − r)/(T − k) of the radius Rmax. A more intelligent space allocation scheme would assign recent time periods (i.e. those adjacent to the current location of the time window) more screen space than older history.
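As a minimal sketch of this allocation (not the CiteWiz implementation), the radial extent of every time unit can be computed from T, k, r, and i. The function name, the zero-based indexing of time units, the default Rmax of 1.0, and the example value T = 10 are assumptions made for illustration only.

```python
def time_unit_extents(T, k, r, i, r_max=1.0):
    """Return a list of (inner, outer) radii, one pair per time unit.

    Units i .. i+k-1 form the linear time window and share r * r_max
    equally. The remaining T - k units (the history panels) share
    (1 - r) * r_max equally. Indices are zero-based (an assumption).
    """
    if T < 1 or k < 1:
        raise ValueError("T and k must be positive")
    if k >= T:
        # The window covers the whole execution: plain linear time scale,
        # i.e. the standard Growing Polygons technique.
        widths = [r_max / T] * T
    else:
        if not 0 <= i <= T - k:
            raise ValueError("the time window must lie inside the execution")
        if not 0.0 < r < 1.0:
            raise ValueError("r must be strictly between 0 and 1")
        window_width = r * r_max / k
        history_width = (1.0 - r) * r_max / (T - k)
        widths = [
            window_width if i <= t < i + k else history_width
            for t in range(T)
        ]

    extents = []
    inner = 0.0
    for width in widths:
        extents.append((inner, inner + width))
        inner += width
    return extents


# Parameters of Figure 4.1 (i = 4, k = 2, r = 0.5) with an assumed T = 10.
example = time_unit_extents(T=10, k=2, r=0.5, i=4)
```

With these example values, each of the two window units receives a quarter of the radius, while each of the eight history units receives one sixteenth.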

A linear time window has three free parameters that can be controlled by the user: the starting index i of the window, the number of time segments k shown in the window, and the ratio r of the maximum radius used by the time window. In our implementation, these parameters are synchronized for all time windows (one for each article polygon), since independently controlled time windows do not make sense in the context of causality, where the timing of individual events cannot be decoupled and where we are mainly interested in comparing the state of different articles.

4.3.2 Hierarchical Views

[Figure: article polygon with an index axis marking i and k, a radius axis marking r Rmax and Rmax, and the time window and history panels labeled.]
Figure 4.1: Growing Polygons visualization with linear time windows (i = 4, k = 2, r = 0.5).

In order to allow the Growing Polygons technique to handle a large quantity of articles, we modify the visualization to be able to render hierarchical groups of articles instead of single articles. These correspond directly to the views of the central dataset built by the users. The view hierarchy is visualized by treating an article group as a normal article, except that the group will have the cumulative influences of all of its children. We derive these influences by a simple postorder traversal of the hierarchy, building the influence timelines of the internal nodes from the bottom up (i.e. starting with the articles in the leaves of the tree). The currently visible nodes (depending on how far the hierarchy has been expanded) are then rendered as normal article polygons, with the single exception that groups (i.e. non-leaves) have a drop shadow to signify that the polygon represents more than one article.
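The bottom-up derivation of group influences might be sketched as follows. This is an illustration under stated assumptions, not the CiteWiz code: the `Node` class, the representation of an influence timeline as a mapping from time unit to a set of influencing article names, and the choice of set union as the meaning of "cumulative" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An article (leaf) or an article group (node with children)."""
    name: str
    children: list = field(default_factory=list)
    # influences[t] = names of the articles influencing this node at time t
    # (assumed representation of an influence timeline).
    influences: dict = field(default_factory=dict)


def aggregate_influences(node):
    """Postorder traversal: build group timelines from the leaves up."""
    if not node.children:
        # Leaves keep the timeline that was computed for the article itself.
        return node.influences

    merged = {}
    for child in node.children:
        # Children are fully processed before their parent (postorder).
        for t, sources in aggregate_influences(child).items():
            merged.setdefault(t, set()).update(sources)
    node.influences = merged
    return merged


# Hypothetical example: a group g with two articles x and y.
x = Node("x", influences={0: {"x"}, 1: {"x", "y"}})
y = Node("y", influences={1: {"y"}, 2: {"z"}})
g = Node("g", children=[x, y])
aggregate_influences(g)
```

Because fresh sets are created for every group, expanding or collapsing a group later does not alter the timelines of its children.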

4.3.3 Interaction Techniques

Merely visualizing the article hierarchy is not enough; users must also be able to browse it in order for the visualization to be useful. In our modified version of the Growing Polygons technique, we provide two simple interaction techniques for doing this: users can either click directly in the visualization to expand and collapse article groups (using the left and right mouse buttons, respectively), or they can use a separate tree navigation window to study the structure of the hierarchy. The same tree window can also be used to search for the full or partial name of a specific article, and the tree will be expanded to the level of the article to show the search result.


In addition to these interaction techniques, we also provide an overview map window with a color legend and clickable fields for quickly jumping to a specific article polygon.

4.3.4 Parent-Child Visualization

The parent-child relationships in the view hierarchy can be indicated by drawing the chord on the layout polygon connecting the article polygons of the first and last child of each parent. The Growing Polygons diagram to the right in Figure 4.2 gives an example of this, where the article groups b, e, and f are shown with dashed chords enclosing their children. Figure 4.3 shows an actual screenshot of our implementation with a partially expanded hierarchy and the parent regions plainly visible (filled in with their respective colors).

[Figure: article hierarchy with nodes a to l (left); Growing Polygons diagram (right).]
Figure 4.2: Simple article hierarchy (left) visualized as an expanded Growing Polygons diagram (right). The dashed region in the hierarchy shows the level of expansion, and the dashed lines (chords) in the GP diagram show parent-child relationships.

4.3.5 Color Assignment

Color assignment for the modified hierarchical Growing Polygons technique is slightly different from that of the original technique. Even if we normally do not show the entire set of articles at the same time, we still need to statically allocate a fixed color to each article so that the colors remain invariant as the user expands and collapses the hierarchy. In addition, we need to assign colors to the interior nodes in the article hierarchy (i.e. the article groups), and this should ideally be done in such a way that the color of a parent node has some relation to the color of its children.

Figure 4.3: Modified Growing Polygons visualization with parent regions.

In our implementation, we achieve this by normalizing the HSV spectrum to a range [0, 1) and assigning intervals of this range through a simple top-down recursive traversal of the article tree. Each child is assigned an interval of the allocated color range proportional to the number of articles (not counting internal article groups) in its branch (see Figure 4.4 for an example of color assignment on an article hierarchy consisting of n = 8 articles). This ensures that article colors are evenly distributed across the spectrum. The article or article group itself chooses the center of the allocated color range as its own color. In this way, parents and children will at least potentially have a visual relation.
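The top-down interval assignment can be illustrated with a short sketch. The tuple-based tree encoding, the function names, and the example hierarchy are assumptions; the hierarchy is only loosely modeled on the percentages shown in Figure 4.4 and is not a reconstruction of it. Full saturation and value are likewise assumed when converting a hue to RGB.

```python
import colorsys

# A node is (name, children); a leaf (an actual article) has no children.
# Hypothetical hierarchy with n = 8 articles: b holds 2 of them, c holds 6.
EXAMPLE_TREE = (
    "a",
    [
        ("b", [("d", []), ("e", [])]),
        ("c", [("f", []), ("g", []), ("h", [("k", []), ("l", [])]),
               ("i", []), ("j", [])]),
    ],
)


def count_articles(tree):
    """Number of leaves (articles) in a branch; groups are not counted."""
    _, children = tree
    return 1 if not children else sum(count_articles(c) for c in children)


def assign_hues(tree, lo=0.0, hi=1.0, out=None):
    """Map each node name to ((lo, hi), hue), hue being the interval center."""
    if out is None:
        out = {}
    name, children = tree
    out[name] = ((lo, hi), (lo + hi) / 2.0)

    total = count_articles(tree)
    start = lo
    for child in children:
        # Each child gets a share proportional to the articles in its branch.
        share = (hi - lo) * count_articles(child) / total
        assign_hues(child, start, start + share, out)
        start += share
    return out


def hue_to_rgb(hue):
    """Convert a normalized hue to RGB at full saturation and value."""
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)


hues = assign_hues(EXAMPLE_TREE)
```

In this example, group b receives the interval [0, 0.25) and group c the interval [0.25, 1.0), so their own hues are 0.125 and 0.625. Because the intervals are fixed by the tree alone, the colors stay stable however far the hierarchy is expanded.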

In fact, this can be taken one step further by rendering the geometry representing the influences of an article group with a color gradient based on the interval allocated to the group instead of using a single, flat color. This gives users a visual cue that the polygon represents a group of articles and not a single one, and might also help in perceiving the parent-child relationship among nodes. However, colors can be difficult to compare and group visually, and there is no natural way to perceive color difference in a color spectrum, so we have chosen not to perform this step in our implementation.

4.3.6 Details-On-Demand

As suggested by both Shneiderman [Shn96] and Modjeska et al. [MTFF96], bibliographic visualization tools need to provide a mechanism to show the complete bibliographical data of an article. In CiteWiz, this is handled by a detail window that gives the full metadata of the currently selected article. In addition to this, we augment the currently selected node in the visualization with blue arrows pointing from the cited nodes, and with red arrows pointing to citing nodes (see Figure 4.5).


[Figure: tree with nodes a to n (left); color bars per tree level over the range 0.0 to 1.0, labeled a (100%), b (25%), c (75%), and h (25%) (right).]
Figure 4.4: Color assignment (right) for a simple article hierarchy (left) with n = 8 articles. The bars to the right show the assignment; filled-in bars represent actual articles.

4.4 Static Timeline Visualization

Beyond the modified Growing Polygons visualization described above, CiteWiz also contains another visualization informally referred to as a Newton’s Shoulders diagram¹. This visualization creates a static, non-interactive timeline of either articles or authors in the central CiteWiz citation database, displaying each entity as an icon on the timeline according to its publication date (or the date of the first publication, in the case of authors). The surface area of each icon is scaled proportionally to the number of citations the article or author has received (rounded up so that the icon conforms to a uniform grid). The timeline is split up into suitable time units (years or months), and each time segment is assigned space on the timeline equal to the size of the largest entity in the segment. The icons representing the entities for each time segment are then laid out using a greedy algorithm that places the entities in descending order of size within the allocated space on the timeline, always trying to minimize the distance to the centerpoint of the diagram. An example of such a Newton’s Shoulders diagram can be seen in Figure 4.6, which depicts a modest-sized citation database of some 1000 authors.
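The icon sizing and greedy placement for a single time segment could look roughly like the sketch below. It is a deliberately simplified illustration, not the layout algorithm used in CiteWiz: it assumes square icons whose side is a whole number of grid cells, one grid cell of area per citation, and a single row of icons per segment placed alternately around a centerline at coordinate 0. All names are hypothetical.

```python
import math


def icon_side(citations):
    """Side length in grid cells so that the icon area covers the citations.

    Rounding up keeps the icon aligned with the uniform grid; entities
    without citations still get a one-cell icon (an assumption).
    """
    return max(1, math.ceil(math.sqrt(citations)))


def layout_segment(entities):
    """Greedily place the entities of one time segment.

    entities: list of (name, citations) pairs.
    Returns (extent, placements): extent is the side of the largest icon
    (the space the segment needs along the timeline), and placements maps
    each name to (x_min, side) across the timeline.
    """
    # Largest icons first; ties keep their input order.
    ordered = sorted(entities, key=lambda e: icon_side(e[1]), reverse=True)

    placements = {}
    left = right = 0.0  # span currently occupied around the centerline
    for name, citations in ordered:
        side = icon_side(citations)
        if not placements:
            # The largest icon is centered on the centerline.
            x_min = -side / 2.0
            left, right = x_min, x_min + side
        elif abs(right + side / 2.0) <= abs(left - side / 2.0):
            # Placing it on the right keeps its center closer to the middle.
            x_min = right
            right += side
        else:
            x_min = left - side
            left = x_min
        placements[name] = (x_min, side)

    extent = max((side for _, side in placements.values()), default=0)
    return extent, placements


extent, placements = layout_segment([("A", 16), ("B", 4), ("C", 9), ("D", 1)])
```

A two-dimensional packing that also stacks small icons within the segment would use the allocated space better; the one-row version is only meant to show the descending-size, closest-to-center rule.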

As can be seen in Figure 4.6, we can orient the timeline vertically and use human figures for the entity icons, giving the impression of people standing on the shoulders of others. This is exactly the metaphor we had in mind when designing the visualization, and matches the intuition of the work of a researcher resting on the work of those who came before him. The diagram now tells us the relative chronology of researchers in a specific field, and instantly shows the most influential authors and their relationships (for instance, that George Robertson, Ben Shneiderman, Jock Mackinlay, and Stuart Card seem to be the “giants” of information visualization). Figure 4.7 shows a similar diagram for the articles in the same citation database, and we can note that the “Cone Trees” paper by Robertson et al. seems to be the most cited paper in the database, closely followed by Furnas’ work on generalized fisheye views.

¹ So named after Sir Isaac Newton’s famous quote in a letter to Robert Hooke in 1676, “If I have seen further, it is by standing on the shoulders of giants.”

Figure 4.5: Citation links displayed for a particular entity.

These diagrams can be modified to show additional dimensions by applying color to the entity icons. The metric to display this way can be chosen arbitrarily; one useful metric for authors could be citation density, which we define as the total number of citations for an author divided by the total number of publications written by the author (i.e. a kind of “average paper quality” metric). Another, slightly more complex, metric would involve weighting citations for an author or article by their age so that recently cited articles or authors get a stronger and more visible color than older ones, signifying that the article or author is involved in a “hot topic”.
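Both metrics are simple to compute. In the sketch below, citation density follows the definition given above, whereas the exponential decay with a five-year half-life used for the age-weighted metric is purely an assumed weighting, since the text does not prescribe one. Function and parameter names are hypothetical.

```python
def citation_density(citations_per_publication):
    """Total citations divided by the number of publications of an author.

    citations_per_publication: one citation count per publication.
    """
    if not citations_per_publication:
        return 0.0
    return sum(citations_per_publication) / len(citations_per_publication)


def recency_weighted_citations(citing_years, current_year, half_life=5.0):
    """Sum of citations, each weighted down by the age of the citing work.

    A citation loses half of its weight every `half_life` years (assumed
    exponential decay), so recently cited entities score higher.
    """
    return sum(
        0.5 ** ((current_year - year) / half_life) for year in citing_years
    )


density = citation_density([10, 0, 5])
hotness = recency_weighted_citations([2004, 1999], current_year=2004)
```

Either value can then be mapped to the saturation or brightness of the entity icon.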

4.5 User Study

The purpose of the CiteWiz citation visualizer was to provide researchers with additional tools for analyzing citation data beyond the standard low-level features available in most traditional database interfaces. In addition to providing basic support for sorting, searching, and filtering article data, CiteWiz also supports higher-level analysis tasks for exploring the structure and dependencies of citation networks using the visualizations in the tool. However, since standard database interfaces lack support for these high-level tasks completely, a comparative user study is unbalanced from the start. We chose the IEEE Xplore database as a good baseline database web interface for comparison. Our hypothesis was that CiteWiz would perform as well as IEEE Xplore for finding papers and correlating bibliographical data, and that CiteWiz would perform significantly better for higher-level tasks.

[Figure: vertical timeline from 1974 to 2003 with author names as entity icons.]
Figure 4.6: Newton’s Shoulders diagram of the authors in the IV04 contest citation database.

4.5.1 Subjects

In all, 10 test subjects, 9 of whom were male, participated in this study. All subjects were active researchers at our department, but were carefully screened to have no previous knowledge of the IEEE InfoVis citation database. Furthermore, all subjects had considerable previous experience in the use of citation database interfaces. Ages ranged from 25 to 40. All subjects had normal or corrected-to-normal vision.


4.5.2 Equipment

The study was run on an Intel Pentium III 1 GHz desktop computer with 512 MB of memory and a 19 inch color display. The machine was equipped with an NVidia Geforce 3 graphics card and ran Redhat Linux 9.

4.5.3 Procedure

We designed the test to be a between-subjects comparative study of a traditional database interface versus our CiteWiz citation visualizer. We selected three different tasks related to citation database interaction from our taxonomy presented earlier in this paper; see Table 4.1 for an overview of the experiment (the labels refer to Table 2.2). We designed tasks T1 and T8 to consist of low-level analysis tasks such as searching, filtering, and correlating basic bibliographical data; task T3, on the other hand, required a higher-level analysis of influence and structure of the citation network.

Completion time was capped at 15 minutes to avoid runaway tasks; subjects were given the option to abandon a troublesome task, in which case the completion time was set to the cap. Subjects were allowed a 5-minute training period prior to performing the test for both tools, and were asked to fill out a questionnaire after having completed it (Q1 to Q3 in Table 4.1).

We selected the IEEE Xplore web-based database interface as a suitable representative of traditional database interfaces. IEEE Xplore is widely used among scientists all over the world to access the bibliographical data and full texts of IEEE publications and supports all standard search and filtering features. Accordingly, we selected all of the papers of the IEEE InfoVis conferences from 1995 to 2002 as our test database (175 articles). Albeit a small dataset, this was a necessary delimitation for us to be able to use the same database for both tools. In order to remove all distractions, we were able to design our own search interface to the IEEE Xplore database (essentially a cleaner version of the standard IEEE Xplore Basic Search), allowing us to constrain searches to the InfoVis conference and provide a browseable list of the InfoVis proceedings sorted by year. The CiteWiz XML-based database, on the other hand, was adapted from a subset of the InfoVis 2004 contest database [FGP04].

4.6 Results

The main findings of the user study were the expected ones: that (i) there is no significant difference in efficiency between CiteWiz and IEEE Xplore for simple tasks involving finding papers and correlating basic citation data, and that (ii) CiteWiz is significantly more efficient for a higher-level task involving the study of dependencies and influences of a set of articles. The following sections give the details on timing, correctness, and the subjective ratings of the users.

4.6.1 Performance

Task  Description                                          Measure
T1    Find a particular paper.                             Time
T3    Find the most influential paper.                     Time
T8    Study author collaboration.                          Time
Q1    Rate the ease-of-use of the tool                     Likert
      (1=very hard, 5=very easy).
Q2    Rate the efficiency of the tool for the different
      tasks (1=very inefficient, 5=very efficient).
  (a) Find a paper.                                        Likert
  (b) Find the most influential paper.                     Likert
  (c) Study author collaboration.                          Likert
Q3    Rate the enjoyability of the tool                    Likert
      (1=very unpleasant, 5=very pleasant).

Table 4.1: User study tasks.

The mean times of solving a full task set (i.e. all three tasks) using IEEE Xplore and CiteWiz were 1202.20 (s.d. 158.35) and 485.20 (s.d. 72.82) seconds, respectively. This was a statistically significant difference (t(8) = −9.19, p < 0.001). For task T1, the mean completion times were 195.60 (s.d. 61.49) seconds for IEEE Xplore versus 200.40 (s.d. 23.89) seconds for CiteWiz. The marginally better performance by IEEE Xplore was not statistically significant (t(8) = 0.164, p = 0.875). No user managed to solve task T3 within the 900 second time cap using IEEE Xplore (two subjects completed the task, three abandoned the task); accordingly, the difference to the CiteWiz mean completion time of 154.60 (s.d. 38.12) seconds was also statistically significant (t(8) = −43.729, p < 0.001). Finally, for task T8, the mean completion times were 106.60 (s.d. 100.14) versus 130.20 (s.d. 24.98) seconds for IEEE Xplore and CiteWiz, respectively. Again, the marginally better performance by IEEE Xplore was not statistically significant (t(8) = 0.511, p = 0.623).
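As a plausibility check, the reported t statistics can be approximately recomputed from the summary statistics alone. The sketch assumes an independent-samples Student's t-test with pooled variance and five subjects per tool (which is what ten subjects in a between-subjects design and 8 degrees of freedom imply); it further assumes that all IEEE Xplore times for T3 were recorded as the 900 second cap, i.e. with zero standard deviation. The function name is hypothetical, and p-values are not computed here.

```python
import math


def pooled_t(mean_a, sd_a, mean_b, sd_b, n_a=5, n_b=5):
    """Student's t for two independent samples; returns (t, df).

    Uses the pooled variance estimate, which assumes equal variances in
    both groups. The sign follows mean_a - mean_b.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, df


# CiteWiz minus IEEE Xplore, using the means and s.d. reported in the text.
t_total, df = pooled_t(485.20, 72.82, 1202.20, 158.35)
t_task1, _ = pooled_t(200.40, 23.89, 195.60, 61.49)
t_task3, _ = pooled_t(154.60, 38.12, 900.00, 0.00)
t_task8, _ = pooled_t(130.20, 24.98, 106.60, 100.14)
```

Under these assumptions the statistics come out at roughly −9.20, 0.16, −43.72, and 0.51 with 8 degrees of freedom, which agrees with the reported values up to rounding of the published means and standard deviations.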

4.6.2 Correctness

No subjects using IEEE Xplore managed to correctly solve task T3 (even when exceeding the time cap), while all subjects using CiteWiz correctly solved T3. All subjects registered correct answers on all other tasks.


Question                  Xplore        CiteWiz
Q1. Ease-of-use           3.40 (1.67)   4.20 (1.10)
Q2. Efficiency
  (a) Find paper          4.40 (.55)    4.20 (.84)
  (b) Most influential    1.00 (.00)    4.60 (.55)
  (c) Collaboration       2.40 (.89)    3.20 (1.10)
Q3. Enjoyability          2.60 (.89)    4.20 (.84)

Table 4.2: Mean (standard deviation) responses to 5-point Likert-scale questions.

4.6.3 Subjective Ratings

The ratings from the post-test questionnaire overall show encouraging results; see Table 4.2 for an overview. Note especially the responses to questions Q2b and Q2c, which show subjective ratings strongly in favor of CiteWiz over IEEE Xplore. In addition, users consistently perceived the CiteWiz tool as more enjoyable to use than IEEE Xplore.

4.7 Discussion

Our expectations of the results of the user study were that CiteWiz and the IEEE Xplore tool would perform equally well at low-level tasks related to basic searching, sorting, and correlation of bibliographical data, and that higher-level tasks involving assessing influences and structure of the citation network would yield a significantly higher efficiency for CiteWiz. These expectations were fulfilled. Of course, it is certainly possible to improve standard database interfaces with better support for these higher-level analysis tasks, but this is not always practical; for instance, the IEEE publications database does not contain reference information (the ACM Digital Library does, however). Nevertheless, the purpose of this work was mainly to target the deficiencies of existing standard tools, and in this regard we succeeded.

The visual browsing and exploration features that CiteWiz provides are very hard to measure qualitatively in comparison to standard database interfaces, but the test subjects expressed great enthusiasm when exposed to this visualization and some were very eager to use the tool with citation data relevant to their area of research. In fact, CiteWiz itself was used for researching previous work for this paper, and revealed a few interesting articles we had not considered.

As a final note, the task set in the user study is limited, but this was a deliberate design decision due to the small feature set that the IEEE Xplore database provides. Choosing a more complex task set would give an unfair advantage to our tool, and would also punish the test subjects who used IEEE Xplore. We believe that the current task set comfortably shows that there is room for improvement, and that the techniques presented in this paper are viable alternatives.

4.8 Conclusions

We have described CiteWiz, an extensible platform for bibliographic visualization. The platform includes a modified version of the Growing Polygons method for visualizing causal relations. At the outset of the project, we conducted a formative evaluation using a focus group of active researchers, allowing us to formulate a taxonomy of the usage of citation databases. Guided by this taxonomy, we designed CiteWiz to emphasize the visualization of articles and their interdependencies. The modifications to the Growing Polygons technique were aimed primarily at adapting the method to citation networks, and included provisions for rendering hierarchies of articles rather than flat lists, and a focus+context technique with user-controlled time windows to more easily support long citation chains. Furthermore, we introduced interaction techniques to the tool to allow for expanding and collapsing the hierarchies, navigating forward and backward references in the network, and retrieving details-on-demand. In addition, the tool also contains another visualization technique called a Newton’s Shoulders diagram that constructs static timelines of articles or authors showing the causality and citations in a citation database. Finally, we also presented three usage scenarios for CiteWiz to highlight the wide range of uses possible for the tool.

4.9 Future Work

There exists a wide variety of possible extensions to the CiteWiz tool, not least the design of new bibliographic visualization techniques to provide alternate views of the dataset. An interesting such technique would be an overview visualization of an entire dataset, allowing users to see general trends and major features rather than individual articles. Furthermore, document clustering algorithms for automatic construction of hierarchical views would be a useful addition to the tool. We are also investigating various ways to build a web-based interface for CiteWiz and make it accessible on the Internet.


[Figure: vertical timeline from 1974 to 2004 with truncated article titles as entity icons.]
Figure 4.7: Newton’s Shoulders diagram of the articles in the IV04 contest citation database.


Chapter 5

Conclusions

Causality is central to human thinking, but the sheer complexity of most sequences of events appearing both in nature and in the world of computing makes overview and insight extremely difficult. The goal of this thesis has been to bring the methods of information visualization to bear on this problem in order to provide means with which users can study these complex causal relationships and draw conclusions from the data. In particular, we have done so by introducing two different visualization techniques for graphically representing causality on a computer. Furthermore, we have conducted a case study of visualizing citation networks, an instance of the causality visualization problem, and presented an adaptation of one of our techniques for solving it.

The Growing Squares method, on the one hand, was our first attempt at visualizing causality, and originated from our studies in the visualization of distributed systems. This technique represents processes as color-coded squares that grow in size as time progresses. Messages between processes carry the source color across to the destination, thus showing the causal influences of each process. Unfortunately, the use of a simple color-coding scheme and pattern hampered the scalability of the technique for large systems.

The Growing Polygons technique, on the other hand, was designed to solve these deficiencies and provide a more scalable alternative to Growing Squares. Here, we use n-sided polygons partitioned into triangular sectors to represent processes, analogously allowing them to grow from zero to full size over time. Each sector is assigned to a specific process and given a unique color, and is filled in for each process polygon that receives an influence from the process it represents.

Nevertheless, the results obtained from our user studies quite comfortably show that both the Growing Squares and the Growing Polygons methods are superior to Hasse diagrams in terms of performance, correctness, and the subjective opinion of the test subjects across all data densities (although Growing Squares is only significantly more efficient to use for the sparse density). The test subjects consistently ranked both techniques ahead of Hasse diagrams in all aspects except for measuring process duration. Our findings show that users are significantly more efficient and correct when using Growing Polygons to analyze the influences and check inter-process causal relations in a system (both sparse and dense).

Our intention with the design of the Growing Squares and Growing Polygons techniques was to provide better alternatives for causality visualization than existing techniques. We used Hasse diagrams as the baseline for our comparative user study because they are still the standard way of visualizing causal relations. However, the question is naturally where the Growing Polygons and Growing Squares techniques stand in relation to each other. While we have not performed a direct comparison between the two techniques, the Growing Polygons method is likely superior to the Growing Squares method. First of all, the Growing Polygons method achieves statistically significant improvement over Hasse diagrams in all subtasks (except the duration analysis subtask, which the Growing Squares method also failed at) and across all densities, something which the Growing Squares method does not manage for dense data sets. Second, the comments from the test subjects who also participated in the previous user study clearly indicate that the new method is significantly superior to the older one. Unfortunately, the nature of the work we conducted means that we cannot compare the two techniques directly.

In our user studies, all test subjects were well acquainted with Hasse diagrams prior to carrying out the experiments whereas they knew nothing of the new visualizations beforehand, yet they performed consistently better using the new techniques in almost all cases. This, we think, suggests that the Growing Squares and Growing Polygons methods are intuitive and easily accessible, and that the methods might become even more efficient to use with practice. The subjective ratings also support this belief.

In order to explore the applicability of our techniques for real problems, we also performed a case study of using the Growing Polygons technique for the visualization of scientific citation networks. This was perhaps the capstone of this research, and led us to construct a real system, CiteWiz, to visualize a citation database with real bibliographic data from the information visualization community. We improved the scalability properties of the Growing Polygons visualization by introducing hierarchical views to be able to handle large numbers of nodes, as well as linear time windows for managing long periods of time. Furthermore, we invented a new static timeline visualization to complement the Growing Polygons visualization, allowing users to study the chronology of all authors and articles in a citation database. While we have yet to perform formal user studies on the visualizations in the CiteWiz tool, we are confident the implementations and ideas are sound and we will strive to commence user testing at the earliest opportunity.

In conclusion, the positive feedback that we have received from our test subjects suggests that these kinds of alternate visualization methods of causal relations are indeed useful and worthwhile avenues for future research. By combining them with traditional methods such as Hasse diagrams, users are able to use the strengths of different methods to solve different problems. In addition, the ability to view systems of causal relations from different perspectives will greatly aid in understanding the mechanics of such a system.


Chapter 6

Future Work

Causality visualization is an interesting subfield of information visualization and has great potential for many different application areas, especially within software visualization (our original point of interest). The work presented in this thesis merely scratches the surface of this area, and much work remains to be done. More specifically, we are interested in investigating additional ways to improve the scalability properties of our visualizations, primarily when it comes to managing high numbers of processes. We are convinced that the current parent-child hierarchy visualization can be made more effective, perhaps by making use of animation to show group expansion and collapse. In addition, the improved Growing Polygons technique makes poor use of screen space, and we suspect that clever distortion of the spatial substrate, perhaps through fisheye [Fur86] techniques, would remedy this problem.
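To make the idea of distorting the spatial substrate concrete, the following is a minimal sketch of one possible distortion. It is not the method of [Fur86], nor anything implemented in this work: it uses the graphical fisheye transform commonly attributed to Sarkar and Brown, and the function names, the distortion factor d, and the one-dimensional layout bounds are all assumptions chosen for illustration.

```python
def fisheye(x, d=3.0):
    """Graphical fisheye transform for a normalized distance x in [0, 1].

    d >= 0 is a (hypothetical) distortion factor; d = 0 leaves x unchanged.
    """
    return (d + 1) * x / (d * x + 1)


def distort(pos, focus, lo, hi, d=3.0):
    """Map a 1D screen coordinate `pos` in [lo, hi] around a focus point.

    Coordinates near `focus` are spread out (magnified), while coordinates
    near the boundaries are compressed, so the overall extent is preserved.
    """
    if pos == focus:
        return pos
    bound = hi if pos > focus else lo
    x = (pos - focus) / (bound - focus)  # normalized distance in [0, 1]
    return focus + fisheye(x, d) * (bound - focus)


if __name__ == "__main__":
    # Evenly spaced positions along one axis, with the focus in the middle.
    for p in range(0, 101, 10):
        print(p, round(distort(p, focus=50, lo=0, hi=100), 1))
```

Applied independently to the x and y coordinates of each polygon, such a mapping would give the focused process more screen space while keeping all other processes visible; whether this suits the Growing Polygons layout is an open question.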

Further improvements to the techniques themselves aside, we are also interested in exploring additional application areas for them. Our case study of citation network visualization gave rise to a number of significant improvements to the techniques, and we believe that additional case studies would spawn similar modifications as well as improve the generality of the techniques. In particular, our development of the CiteWiz system poses a very interesting application of our techniques to an area that lies close to the heart of any scientist, and we would be very interested in continuing the work on this tool. The system can be easily extended with new visualizations of the citation database, and we are also considering different ways of making the tool accessible as an online service on the Internet.

On a more general note, visualizing general causality, or indeed something as concrete as the execution of a program (especially a distributed one), is truly a core information visualization problem in that the data is abstract and lacks a natural visual mapping. Thus, the choice of visual representation is entirely up to the designer. In this thesis, we have presented two visualization methods for this problem and motivated our choices with both basic theories taken from information visualization and empirical experiments, but there naturally exists a nearly unlimited number of alternative representations, many of which are bound to be superior in at least a few regards. One important research direction for the future is thus the continued development of new causality visualization techniques, perhaps building on the experiences gained from the Growing Squares and Growing Polygons techniques. No single visualization will ever be able to capture all the information a user might want to see, but by combining several different visualizations, we will be able to satisfy user needs much better.

This also hints at a deeper point: information visualization is still a young research field in that it lacks a unified theoretical framework for modelling visualization techniques and analyzing their properties. At the current state of the art, we are forced to resort to empirical testing and statistical analysis to compare the merits of two different visualization techniques. This is also true for the more general field of human-computer interaction, where computer scientists and psychologists alike are working hard to design a coherent and correct model for the interaction between man and machine. Regardless, attention must be directed at the fundamentals of the field, and not only at the invention of new visualization techniques.


Bibliography

[AriBC] Aristotle. Physics: Book II. 350 B.C. Translated by Richard Hooker (1993).

[BB93] Thomas Bemmerl and Peter Braun. Visualization of message passing parallel programs with the TOPSYS parallel programming environment. Journal of Parallel and Distributed Computing, 18(2):118–128, June 1993.

[Ber81] Jacques Bertin. Graphics and Graphic Information Processing. De Gruyter, Berlin, 1981.

[BHP+96] Benjamin B. Bederson, James D. Hollan, Ken Perlin, Jonathan Meyer, D. Bacon, and George W. Furnas. Pad++: A zoomable graphical sketchpad for exploring alternate interface physics. Journal of Visual Languages and Computing, 7:3–31, 1996.

[BMG00] Benjamin B. Bederson, Jon Meyer, and Lance Good. Jazz: An extensible zoomable user interface graphics toolkit in Java. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST 2000), pages 171–180, 2000.

[BW02] Ulrik Brandes and Thomas Willhalm. Visualization of bibliographic networks with a reshaped landscape metaphor. In Proceedings of the Symposium on Data Visualisation 2002, pages 159–164. Eurographics Association, 2002.

[CC92] Matthew Chalmers and Paul Chitson. Bead: Explorations in information visualization. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 330–337, 1992.

[Che99] Chaomei Chen. Visualising semantic spaces and author co-citation networks in digital libraries. Information Processing and Management, 35(3):401–420, 1999.

[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, second edition, 2001.


[CM84] William S. Cleveland and Robert McGill. Graphical perception: Theory, experimentation and application to the development of graphical methods. Journal of the American Statistical Association, 79(387):531–554, September 1984.

[CM97] Stuart K. Card and Jock Mackinlay. The structure of the information visualization design space. In Proceedings of the IEEE Symposium on Information Visualization 1997, pages 92–99, 1997.

[CM03] Chaomei Chen and Steven Morris. Visualizing evolving networks: Minimum spanning trees versus pathfinder networks. In Proceedings of the IEEE Symposium on Information Visualization 2003, pages 67–74, October 2003.

[CMS99] Stuart K. Card, Jock D. Mackinlay, and Ben Shneiderman, editors. Readings in information visualization: Using vision to think. Morgan Kaufmann Publishers, San Francisco, 1999.

[Den97] Peter J. Denning. The ACM digital library goes live. Communications of the ACM, 40(7):28–29, July 1997.

[ET03a] Niklas Elmqvist and Philippas Tsigas. Causality visualization using animated growing polygons. In Proceedings of the IEEE Symposium on Information Visualization 2003, pages 189–196, October 19–21, 2003.

[ET03b] Niklas Elmqvist and Philippas Tsigas. Growing squares: Animated visualization of causal relations. In Proceedings of the ACM Symposium on Software Visualization 2003, pages 17–26, June 11–13, 2003.

[FGP04] Jean-Daniel Fekete, Georges Grinstein, and Catherine Plaisant. InfoVis 2004 Contest: The History of InfoVis, 2004. http://www.cs.umd.edu/hcil/iv04contest/.

[Fur86] George W. Furnas. Generalized fisheye views. In Proceedings of the ACM CHI’86 Conference on Human Factors in Computer Systems, pages 16–23, 1986.

[GBL98] C. Lee Giles, Kurt Bollacker, and Steve Lawrence. CiteSeer: An automatic citation indexing system. In Digital Libraries 98 - The Third ACM Conference on Digital Libraries, pages 89–98, June 1998.

[HE91] Michael T. Heath and Jennifer A. Etheridge. Visualizing the performance of parallel programs. IEEE Software, 8(5):29–39, September 1991.

[Hea90] Michael T. Heath. Visual animation of parallel algorithms for matrix computations. In Proceedings of the Fifth Distributed Memory Computing Conference, pages 1213–1222, April 1990.


[HKW94] Matthias Hemmje, Clemens Kunkel, and Alexander Willett. Lyberworld – A visualization user interface supporting fulltext retrieval. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 249–259, 1994.

[Kes63] Michael M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14(1):10–25, January 1963.

[KPT99] Boris Koldehofe, Marina Papatriantafilou, and Philippas Tsigas. Distributed algorithms visualisation for educational purposes. In Proceedings of the 4th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, pages 103–106, June 27–July 1, 1999.

[KS98] Eileen Kraemer and John T. Stasko. Creating an accurate portrayal of concurrent executions. IEEE Concurrency, 6(1):36–46, January/March 1998.

[Lam78] Leslie Lamport. Time, clocks and the ordering of events in distributed systems. Communications of the ACM, 21(7):558–564, 1978.

[LH92] Haim Levkowitz and Gabor T. Herman. Color scales for image data. IEEE Computer Graphics and Applications, 12(1):72–80, January 1992.

[LHMR92] Haim Levkowitz, Richard A. Holub, Gary W. Meyer, and Philip K. Robertson. Color versus black and white in visualization. IEEE Computer Graphics and Applications, 12(4):20–22, July 1992.

[Mac86] Jock Mackinlay. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics, 5(2):110–141, 1986.

[MPT98] Yoram Moses, Zvi Polunsky, and Ayellet Tal. Algorithm visualization for distributed environments. In Proceedings of the IEEE Symposium on Information Visualization 1998, pages 71–78. IEEE, 1998.

[MRC95] Jock D. Mackinlay, Ramana Rao, and Stuart K. Card. An organic user interface for searching citation links. In Proceedings of ACM CHI’95 Conference on Human Factors in Computing Systems, volume 1 of Papers: Information Access, pages 67–73, 1995.

[MTFF96] David Modjeska, Vassilios Tzerpos, Petros Faloutsos, and Michalis Faloutsos. BIVTECI: A bibliographic visualization tool. In Proceedings of the 1996 Conference of the Centre of Advanced Studies on Collaborative Research, page 28, 1996.


[Nor88] Donald A. Norman. The Psychology of Everyday Things. Basic Books, New York, 1988. TS 171.4.N67.

[Nor93] Donald A. Norman. Cognition in the head and in the world. Cognitive Science, 17(1):1–6, January-March 1993.

[PF93] Ken Perlin and David Fox. Pad: An alternative approach to the computer interface. In Proceedings of Computer Graphics (SIGGRAPH 93), volume 27, pages 57–64, August 1993.

[Pla86] William Playfair. The Commercial and Political Atlas. London, 1786.

[RCM93] George G. Robertson, Stuart K. Card, and Jock D. Mackinlay. Information visualization using 3D interactive animation. Communications of the ACM, 36(4):56–71, April 1993.

[SBN89] David Socha, Mary L. Bailey, and David Notkin. Voyeur: Graphical views of parallel programs. In Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, ACM SIGPLAN Notices 24, pages 206–215, Madison, Wisconsin, January 1989.

[Shn96] Ben Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages, pages 336–343, September 3–6, 1996.

[SK93] John T. Stasko and Eileen Kraemer. A methodology for building application-specific visualizations of parallel programs. Journal of Parallel and Distributed Computing, 18(2):258–264, June 1993.

[Sma73] Henry G. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265–269, July-August 1973.

[Spe01] Robert Spence. Information Visualization. Addison-Wesley, Harlow, England, 2001.

[SR96] Mike Scaife and Yvonne Rogers. External cognition: How do graphical representations work? International Journal of Human-Computer Studies, 45(2):185–213, 1996.

[TSS98] Brad Topol, John T. Stasko, and Vaidy Sunderam. PVaniM: a tool for visualization in network computing environments. Concurrency: Practice and Experience, 10(14):1197–1222, December 1998.

[Tuf83] Edward R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 1983.


[WNB99] Colin Ware, Eric Neufeld, and Lyn Bartram. Visualizing causal relations. In Proceedings of the IEEE Symposium on Information Visualization 1999 (Late Breaking Hot Topics), pages 39–42, October 1999.

[WS91] Gunther Wyszecki and W. S. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley & Sons, second edition, 1991.


Appendix A

Growing Squares User Study

A.1 Script

0. Preparations

a) 4 blank notepad pages

b) 4 sets of task scenario cards

1. Introduction (5 minutes)

a) Experimenter welcomes subject

b) Experimenter gives brief explanation of study

i) Purpose: comparative evaluation of two different ways of visualizing causal relations

ii) Old way versus a new information visualization technique

iii) Each subject will use the two techniques for two different information densities: 2x2 tasks

iv) Hardware: Pentium III laptop with 3D accelerator

v) Software: information visualization application (CausalViz) running on the Linux operating system

vi) Study data will be reported anonymously in a Ph.D. thesis and possibly in an academic paper

2. Computer Training (5 minutes)

a) Experimenter demonstrates CausalViz’s UI

i) Main window used for controlling the visualization

ii) Animation controls and timeline for selecting positions in time

iii) Visualization windows for different views of the same data

iv) Zoomable interface: right-click and move to zoom, left-click and move to pan

b) Experimenter explains the information visualization technique relevant to the subject


i) Hasse diagrams are a traditional technique (animated). Processes as swimlanes, messages as arrows between them.

ii) 2D squares are a new visualization technique. Processes as growing squares, the color signifying influences from other processes at different points in time.

iii) (If applicable) 3D pyramids are a similar technique where the visualization is three-dimensional. Processes are growing 3D pyramids here, color signifies influences from other processes at different points in time.

iv) Causal relation: process A has received message from process B at time t0 ⇒ A causally related to B for all times t > t0.

c) Experimenter demonstrates how to solve common problems using both visualizations (or all three).

i) Duration comparison: find the processes that have the longest life time.

ii) Causal relation: find the processes that have influenced a given target process.

d) Subject is allowed up to 5 minutes of practice using a test scenario

i) Local file: practice.xml

ii) Subject decides when ready to proceed

3. Task Explanation (5 minutes)

a) Subject will use both the old and new visualization to allow for comparison

b) Subject will work with two different data sets of increasing density for each visualization

c) All in all, the subject will work with four different tasks for four different data sets

d) Subject is given a written booklet of subtasks for each data set

i) Each subtask is a common activity used in reality

ii) Use application visualization to find out solutions to task

iii) It is okay to skip or guess the solution of a subtask

e) Subject may use additional paper for sketching and quick notes

f) Subtasks will be timed by the experimenter

i) The idea is to work fast without being sloppy

ii) Better to be correct than to be fast

iii) Do not go back to previous subtasks

iv) Wait until indicated before proceeding to a new subtask

g) Subject will respond to a short questionnaire at the end of each task

h) Subject will respond to a final questionnaire at the end of the test

i) Questions to the experimenter are OK at any time


4. Task Execution (5-10 minutes per task) – this is repeated once for the old visualization technique and once for the new visualization technique

a) Data density

i) Sparse: 10 processes, 30 messages

ii) Dense: 30 processes, 90 messages

b) Experimenter loads the data sets

i) Local file (sparse, old): sparse1.xml

ii) Local file (sparse, new): sparse2.xml

iii) Local file (dense, old): dense1.xml

iv) Local file (dense, new): dense2.xml

c) Experimenter closes down windows of visualizations that should not be active

d) Subject is allowed to proceed with solving the task and write down the solutions

i) Experimenter will time each subtask and later correct performance

ii) Experimenter randomly generates 2 questions specific to the data set (see problem sheet)

e) Subject is asked to respond to the questionnaire for the specific task performed

5. Subject is asked to respond to the post-test questionnaire comparing the two visualization techniques

6. Conclusion

a) Experimenter thanks subject for participation in study.

b) Experimenter issues the agreed compensation to the subject.

A.2 Scenarios

A.2.1 Duration Comparison

You are analyzing the given distributed system to find out which processes take the most CPU time in the system. In order to do this, find the process that has the longest duration in the sequence (from start to finish).

A.2.2 Influence Importance

You now want to know which are the most important processes in the system. One way to figure this out is to see which processes influenced the most other processes (directly or indirectly). Find the process that has had the most influence on the system.


A.2.3 Influence Assessment

Some of the processes in the system are clearly “worker” processes or servers, which receive commands from other processes and in exchange perform some work. Identify these server processes by finding the process that was influenced by the most other processes (directly or indirectly).

A.2.4 Inter-Node Causal Relations

In order to ensure that the system is behaving properly, you want to check that some node x really has been influenced by another node y. Answer the following questions (remember, causal relations are transitive, so check indirect relations too):

1. Is process ____ causally related to (influenced by) ____? ☐ yes ☐ no

2. Is process ____ causally related to (influenced by) ____? ☐ yes ☐ no

3. Is process ____ causally related to (influenced by) ____? ☐ yes ☐ no
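The transitive check asked for in these questions can be sketched in a few lines of code. The sketch below is illustrative only and is not part of the CausalViz application: the message format (time, sender, receiver), the function names, and the assumption that messages are delivered instantaneously are all simplifications introduced here.

```python
from collections import defaultdict


def influence_sets(messages):
    """Return {process: set of processes that have influenced it}.

    `messages` is an iterable of (time, sender, receiver) tuples. Delivery is
    assumed to be instantaneous, a simplification made for this sketch.
    """
    influenced_by = defaultdict(set)
    for _, sender, receiver in sorted(messages):
        # The receiver inherits the sender itself plus everything that had
        # influenced the sender up to this point in time (transitivity).
        influenced_by[receiver] |= {sender} | influenced_by[sender]
    return dict(influenced_by)


def is_influenced_by(messages, x, y):
    """Answer "is process x causally related to (influenced by) y?"."""
    return y in influence_sets(messages).get(x, set())


# Hypothetical trace: P0 -> P1 at t=1, P1 -> P2 at t=2, P3 -> P0 at t=3.
messages = [(1, "P0", "P1"), (2, "P1", "P2"), (3, "P3", "P0")]

if __name__ == "__main__":
    print(is_influenced_by(messages, "P2", "P0"))  # indirect, via P1
    print(is_influenced_by(messages, "P2", "P3"))  # P3 acted too late
```

Processing the messages in time order matters: in the example trace, P3 only reaches P0 after P0 has already sent its message to P1, so P3 has no influence on P2 even though a path P3, P0, P1, P2 exists in the undirected sense.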

A.3 Post-Task Questionnaire

1. Please rate the visualization system according to ease of use.

Very hard = 1, Hard = 2, Medium = 3, Easy = 4, Very easy = 5

2. Please rate the visualization system according to efficiency.

Very inefficient = 1, Inefficient = 2, Neutral = 3, Efficient = 4, Very efficient = 5

3. Please rate the visualization system according to enjoyment.

Very unpleasant = 1, Unpleasant = 2, Neutral = 3, Pleasant = 4, Very pleasant = 5

A.4 Post-Test Questionnaire

1. Please select the visualization system that you liked more with respect to ease of use.


Hasse Diagrams = 1, Growing Squares = 2

2. Please select the visualization system that you liked more with respect to efficiency.

Hasse Diagrams = 1, Growing Squares = 2

3. Please select the visualization system that you liked more with respect to enjoyment.

Hasse Diagrams = 1, Growing Squares = 2


Appendix B

Growing Polygons User Study

B.1 Script

0. Preparations

a) 4 blank notepad pages

b) 4 sets of task scenario cards

1. Introduction (5 minutes)

a) Experimenter welcomes subject

b) Experimenter gives brief explanation of study

i) Purpose: comparative evaluation of two different ways of visualizing causal relations

ii) Old way versus a new information visualization technique

iii) Each subject will use the two techniques for two different information densities: 2x2 tasks

iv) Hardware: Pentium III laptop with 3D accelerator

v) Software: information visualization application (CausalViz) running on the Linux operating system

vi) Study data will be reported anonymously in a Ph.D. thesis and possibly in an academic paper

2. Computer Training (5 minutes)

a) Experimenter demonstrates CausalViz’s UI

i) Main window used for controlling the visualization

ii) Animation controls and timeline for selecting positions in time

iii) Visualization windows for different views of the same data

iv) Zoomable interface: right-click and move to zoom, left-click and move to pan

b) Experimenter explains the information visualization technique relevant to the subject


i) Hasse diagrams are a traditional technique (animated). Processes as swimlanes, messages as arrows between them.

ii) 2D polygons are the new technique, with colors used to signify process influences. Age rings on a tree, grows outwards. Sectors signifying different processes.

iii) Use of the mini-map for navigating the visualization.

iv) Use of the animation toolbar.

v) Causal relation: process A has received message from process B at time t0 ⇒ A causally related to B for all times t > t0.

c) Experimenter demonstrates how to solve common problems using both visualizations (or all three).

i) Duration comparison: find the processes that have the longest life time.

ii) Influence dominance: find the process that has the most influence in the system.

iii) Influence assessment: find the process that has been influenced the most in the system.

iv) Causal relation: find the processes that have influenced a given target process.

d) Subject is allowed up to 5 minutes of practice using a test scenario

i) Local file: practice.xml

ii) Subject decides when ready to proceed

3. Task Explanation (5 minutes)

a) Subject will use both the old and new visualization to allow for comparison

b) Subject will work with two different data sets of increasing density for each visualization

c) All in all, the subject will work with four different tasks for four different data sets

d) Subject is given a written booklet of subtasks for each data set

i) Each subtask is a common activity used in reality

ii) Use application visualization to find out solutions to task

iii) It is okay to skip or guess the solution of a subtask

e) Subject may use additional paper for sketching and quick notes

f) Subtasks will be timed by the experimenter

i) The idea is to work fast without being sloppy

ii) Better to be correct than to be fast

iii) Do not go back to previous subtasks

iv) Wait until indicated before proceeding to a new subtask

g) Subject will respond to a short questionnaire at the end of each task


h) Subject will respond to a final questionnaire at the end of the test

i) Questions to the experimenter are OK at any time

4. Task Execution (5-10 minutes per task) – this is repeated once for the old visualization technique and once for the new visualization technique. Interleave the use of the old and new techniques among different subjects to minimize learning effect.

a) Data density

i) Sparse: 5 processes, 15 messages

ii) Dense: 20 processes, 60 messages

b) Experimenter loads the data sets

i) Local file (sparse, old): sparse3.xml

ii) Local file (sparse, new): sparse4.xml

iii) Local file (dense, old): dense3.xml

iv) Local file (dense, new): dense4.xml

c) Experimenter closes down windows of visualizations that should not be active

d) Subject is allowed to proceed with solving the task and write down the solutions

i) Experimenter will time each subtask and later correct performance

ii) Experimenter randomly generates 2 questions specific to the data set (see problem sheet)

e) Subject is asked to respond to the questionnaire for the specific task performed

f) Maximum times for the various tasks are 2, 8, 6, and 6 minutes (cut-off time).

5. Subject is asked to respond to the post-test questionnaire comparing the two visualization techniques

6. Conclusion

a) Experimenter thanks subject for participation in study.

b) Experimenter issues the agreed compensation to the subject.

B.2 Scenarios

B.2.1 Duration Comparison

You are analyzing the given distributed system to find out which processes take the most CPU time in the system. In order to do this, find the process that has the longest duration in the sequence (from start to finish).


B.2.2 Influence Importance

You now want to know which are the most important processes in the system. One way to figure this out is to see which processes influenced the most other processes (directly or indirectly). Find the process that has had the most influence on the system.

B.2.3 Influence Assessment

Some of the processes in the system are clearly “worker” processes or servers, which receive commands from other processes and in exchange perform some work. Identify these server processes by finding the process that was influenced by the most other processes (directly or indirectly).
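The two influence scenarios above amount to counting, for every process, how many other processes it has influenced and how many have influenced it. The following sketch shows one way to compute both counts; it is an illustration only, and the message format (time, sender, receiver), the function name, and the assumption of instantaneous delivery are hypothetical rather than taken from the study software.

```python
from collections import defaultdict


def influence_summary(messages):
    """Return two dicts: (influence exerted, influence received) per process.

    `messages` is an iterable of (time, sender, receiver) tuples, assumed to
    be delivered instantaneously. Counts include indirect influence and
    exclude a process influencing itself through a cycle.
    """
    influenced_by = defaultdict(set)
    for _, sender, receiver in sorted(messages):
        # Receiver inherits the sender and all of the sender's influences.
        influenced_by[receiver] |= {sender} | influenced_by[sender]

    exerted = {p: 0 for p in influenced_by}
    received = {p: 0 for p in influenced_by}
    for target, sources in influenced_by.items():
        for source in sources - {target}:
            received[target] += 1
            exerted[source] += 1
    return exerted, received


# Hypothetical trace with four processes.
messages = [(1, "A", "B"), (2, "A", "C"), (3, "B", "D"), (4, "C", "D")]

if __name__ == "__main__":
    exerted, received = influence_summary(messages)
    print("most influential:", max(exerted, key=exerted.get))
    print("most influenced:", max(received, key=received.get))
```

In this small trace, A reaches every other process and D is reached by every other process, which corresponds to the answers a subject would be expected to read off the visualization.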

B.2.4 Inter-Node Causal Relations

In order to ensure that the system is behaving properly, you want to check that some node x really has been influenced by another node y. Answer the following questions (remember, causal relations are transitive, so check indirect relations too):

1. Is process ____ causally related to (influenced by) ____? ☐ yes ☐ no

2. Is process ____ causally related to (influenced by) ____? ☐ yes ☐ no

3. Is process ____ causally related to (influenced by) ____? ☐ yes ☐ no

B.3 Post-Task Questionnaire

1. Please rate the visualization system according to ease of use.

Very hard = 1, Hard = 2, Medium = 3, Easy = 4, Very easy = 5

2. Please rate the visualization system according to efficiency.

Very inefficient = 1, Inefficient = 2, Neutral = 3, Efficient = 4, Very efficient = 5

3. Please rate the visualization system according to enjoyment.

Very unpleasant = 1, Unpleasant = 2, Neutral = 3, Pleasant = 4, Very pleasant = 5


B.4 Post-Test Questionnaire

1. Which of the two visualization techniques did you like more with respect to ease of use?

☐ Hasse diagrams ☐ Growing Polygons ☐ Undecided

2. Please select the visualization system that you liked more with respect to efficiency.

a) Duration comparison (lifetime of processes): ☐ Hasse diagrams ☐ Growing Polygons ☐ Undecided

b) Influence importance (most influential process): ☐ Hasse diagrams ☐ Growing Polygons ☐ Undecided

c) Influence assessment (most influenced process): ☐ Hasse diagrams ☐ Growing Polygons ☐ Undecided

d) Inter-node causal relations (finding causal relations between processes): ☐ Hasse diagrams ☐ Growing Polygons ☐ Undecided

3. Which of the two visualization techniques did you like more with respect to enjoyment?

☐ Hasse diagrams ☐ Growing Polygons ☐ Undecided


Appendix C

CiteWiz User Study

C.1 Script

0. Preparations

a) 4 blank notepad pages

b) 1 task scenario booklet

1. Introduction (5 minutes)

a) Experimenter welcomes subject

b) Experimenter gives brief explanation of study

i) Purpose: comparative evaluation of two different ways of visualizing scientific citation networks

ii) Old way using a database search interface versus a new information visualization technique

iii) Each subject will use one of the two techniques to solve a set of task scenarios (between-subjects)

iv) Hardware: Pentium III desktop computer with 3D accelerator

v) Software: information visualization application (CiteWiz) running on the Linux operating system

vi) Study data will be reported anonymously in a Ph.D. thesis and possibly in an academic paper

c) Experimenter gives brief background on scientific citation networks

i) Scientific work builds on the work conducted by others

ii) Previous work that has influenced the author(s) is mentioned in the references of a scientific paper

iii) Scientific papers and their references (citations) together make up a large directed graph showing the influences and dependencies of the scientific work in the field

iv) Studying the citation network can give us interesting information, such as identifying important papers, successful researchers, and hot topics within the field


2. Computer Training (5 minutes)

a) Experimenter demonstrates the interface of the method the subject will use

b) Database interface

i) Searching – dialog for free search.

ii) Sorting – sorting of the current result set.

iii) Filtering – filtering the current result set according to some criteria (based on free text in the specific fields).

iv) Details-on-demand – calling up details on a specific paper.

c) CiteWiz visualizations

i) Main window with the citation database and the view.

ii) Construction of the view, adding entries, building groups, nesting groups.

iii) Creating a visualization from the current view.

iv) Interacting with the CiteWiz GP visualization.

A. Zoomable interface: right-click and move to zoom, left-click and move to pan

B. Sliders for linear time windows – segments, window width, and position sliders

C. Idea behind GP visualization technique – influences, colors, and polygons

D. “Horns” showing forward influences, sectors showing backwards influences

E. Interaction techniques – collapsing, expanding and selecting a node in the view hierarchy

F. Legend window – navigation and color mapping

G. Tree manager window – collapsing, expanding, and searching in the view hierarchy

v) Interacting with the CiteWiz Newton’s Shoulders diagrams for both authors and articles

d) Experimenter demonstrates how to solve common problems using both methods

i) Paper retrieval: finding a specific paper in the citation database

ii) Find related papers: following references backwards (and forward)

iii) Study author collaboration: compare collaboration between two authors

e) Subject is allowed up to 5 minutes of practice using a test scenario

i) Local file: iv04dataset.xml

ii) Subject decides when ready to proceed

3. Task Explanation (5 minutes)


a) Subject will use one of the two methods

b) Subject is given a booklet of subtasks for each data set

i) Each subtask is a common activity used in reality

ii) Use the application visualization to find the solutions to the task

iii) It is okay to skip or guess the solution of a subtask

c) Subject may use additional paper for sketching and quick notes

d) Subtasks will be timed by the experimenter

i) The idea is to work fast without being sloppy

ii) Better to be correct than to be fast

iii) Do not go back to previous subtasks

iv) Wait until indicated before proceeding to a new subtask

e) Subject will respond to a short questionnaire at the end of the test

f) Questions to the experimenter are OK at any time

4. Task Execution (5-10 minutes per task)

a) Experimenter loads the data set

i) Local file: infovis-dataset.xml

b) Subject is allowed to proceed with solving the task and writing down the solutions

i) Experimenter will time each subtask and later correct performance

c) Maximum time for each task is 15 minutes

5. Subject is asked to respond to the post-test questionnaire

6. Conclusion

a) Experimenter thanks subject for participation in study.

b) Experimenter issues the agreed compensation to the subject.

C.2 Scenarios

C.2.1 Find a Paper

You need to locate three specific papers. Please use the search feature of the tool to find these papers using the specified search terms. Please write down the full title of each paper.

C.2.2 Find the Most Influential Paper

You are trying to identify the most influential papers of the IEEE Information Visualization conferences. Use the search, browse and visualization facilities of the tool to identify the full title and author of the most influential paper for a specified year of the conference.
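As an illustration of what this scenario asks for, the following minimal sketch ranks the papers of one conference year by how often they are cited. It assumes that the raw citation count (the in-degree of a paper in the citation graph) is an acceptable proxy for influence, which is a simplification and not necessarily the measure used by the tool. The function name, the data layout, and the sample records are all hypothetical and not taken from the study data set.

```python
from collections import Counter


def most_cited_paper(papers, citations, year):
    """Return (title, authors, citation_count) for the most cited paper of `year`.

    papers:    dict mapping paper id -> {"title", "authors", "year"}
    citations: iterable of (citing_id, cited_id) edges of the citation graph
    Returns None when no paper was published in `year`.
    """
    # In-degree of each node = number of papers citing it.
    in_degree = Counter(cited for _citing, cited in citations)
    candidates = [pid for pid, meta in papers.items() if meta["year"] == year]
    if not candidates:
        return None
    # Ties are resolved in favor of the first candidate encountered.
    best = max(candidates, key=lambda pid: in_degree[pid])
    meta = papers[best]
    return meta["title"], meta["authors"], in_degree[best]


# Made-up example data (assumption, for illustration only).
papers = {
    "p1": {"title": "Paper A", "authors": ["Author 1"], "year": 1995},
    "p2": {"title": "Paper B", "authors": ["Author 2"], "year": 1995},
    "p3": {"title": "Paper C", "authors": ["Author 3"], "year": 1997},
    "p4": {"title": "Paper D", "authors": ["Author 1", "Author 3"], "year": 1998},
}
citations = [("p3", "p1"), ("p4", "p1"), ("p4", "p2"), ("p4", "p3")]

print(most_cited_paper(papers, citations, 1995))
```

Counting direct citations ignores indirect influence through chains of references, so a subject using the visualization may reasonably arrive at a different answer than this count suggests.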


C.2.3 Study Author Collaboration

You are studying the collaboration between different authors active in the IEEE InfoVis conferences. Study the relations between authors X and Y. Identify the set of authors that have co-authored papers with both X and Y.
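The set requested in this scenario can be read as the intersection of the co-author sets of X and Y. The sketch below shows that computation under the assumption that each paper is available as a plain list of author names. The function names and the sample author lists are hypothetical and do not come from the study data set.

```python
def coauthors(author_lists, author):
    """Return everyone who shares at least one paper with `author`."""
    result = set()
    for authors in author_lists:
        if author in authors:
            result.update(authors)
    result.discard(author)  # an author is not their own co-author
    return result


def shared_coauthors(author_lists, x, y):
    """Return the authors who have co-authored papers with both x and y."""
    # x is never in coauthors(x) and y is never in coauthors(y),
    # so the intersection excludes both of them automatically.
    return coauthors(author_lists, x) & coauthors(author_lists, y)


# Made-up example data (assumption, for illustration only):
# one list of author names per paper.
author_lists = [
    ["X", "A", "B"],
    ["Y", "B"],
    ["X", "Y", "C"],
    ["Y", "A"],
    ["D", "X"],
]

print(sorted(shared_coauthors(author_lists, "X", "Y")))
```

Here D has only written with X and is therefore left out, while C qualifies through a single paper written together with both X and Y.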

C.3 Post-Test Questionnaire

1. Rate the tool with respect to ease of use.

□ very hard □ hard □ medium □ easy □ very easy

2. Rate the tool with respect to efficiency of solving the different tasks.

a) Find a paper

□ very inefficient □ inefficient □ neutral □ efficient □ very efficient

b) Find the most influential paper

□ very inefficient □ inefficient □ neutral □ efficient □ very efficient

c) Study author collaboration

□ very inefficient □ inefficient □ neutral □ efficient □ very efficient

3. Rate the tool with respect to enjoyment.

□ very unpleasant □ unpleasant □ neutral □ pleasant □ very pleasant