An Examination of Grouping and Spatial Organization Tasks ...

11
An Examination of Grouping and Spatial Organization Tasks for High-Dimensional Data Exploration John Wenskovitch, Member, IEEE, and Chris North Abstract—How do analysts think about grouping and spatial operations? This overarching research question incorporates a number of points for investigation, including understanding how analysts begin to explore a dataset, the types of grouping/spatial structures created and the operations performed on them, the relationship between grouping and spatial structures, the decisions analysts make when exploring individual observations, and the role of external information. This work contributes the design and results of such a study, in which a group of participants are asked to organize the data contained within an unfamiliar quantitative dataset. We identify several overarching approaches taken by participants to design their organizational space, discuss the interactions performed by the participants, and propose design recommendations to improve the usability of future high-dimensional data exploration tools that make use of grouping (clustering) and spatial (dimension reduction) operations. Index Terms—Clustering, dimension reduction, spatialization, grouping, cognitive study. 1 I NTRODUCTION Sensemaking refers to a cognitive process for acquiring, representing, and organizing information in order to address a task, solve a problem, or make a decision [46,72]. Models with varying levels of information granularity have been proposed for approaching and solving sense- making problems [66, 67, 72], and these models represent strategies for addressing sensemaking problems in a variety of domains. For example, Pirolli and Card’s Sensemaking Process [67] is designed for sensemaking problems faced by intelligence analysts. Despite the spe- cific challenge addressed by each of these models, they all highlight the need to organize the data. Continuing the intelligence analyst example, they may work to understand the actors and motivations by grouping documents by location, by person, or by subplot. A fundamental behavior in sensemaking is the act of grouping sim- ilar observations in order to understand their properties, effectively forming a cluster. This organizational strategy is true both in paper- based sensemaking tasks [25, 87] and in tasks performed on electronic displays [2, 30]. Clusters therefore have a natural connection to sense- making. Clusters can also help to reduce clutter in a workspace, com- pressing similar observations into a group that requires less physical or screen space [57, 71]. Simplifying the workspace leads to further cognitive benefits, as humans struggle to think about more than a small number of observations or dimensions at one time [75]. Thus, using groups of items to perform analysis tasks can lead to improved mem- ory and recall by providing a simplified method of understanding the data [19]. Previous research has shown that humans use a variety of organizational principles to cluster information [22], even when ad- dressing the same task [2]. In order to identify clusters computationally, hundreds of clustering algorithms have been implemented, each with strengths and weaknesses [33]. Another technique for sensemaking, particularly relevant to mul- tidimensional datasets, is to embed the high-dimensional data in a low-dimensional projection in such a way that the structure of the high-dimensional space (e.g., clusters and outliers) is maintained in the projection [41, 55, 81]. Often, these projections embody a “proximitysimilarity” metaphor, in which the distance between ob- servations represents the similarity of those observations. As a result, John Wenskovitch and Chris North are with the Virginia Tech Department of Computer Science and the Discovery Analytics Center. E-mails: {jw87 | north}@cs.vt.edu. John Wenskovitch is also currently employed by Pacific Northwest National Laboratory. Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publication xx xxx. 201x; date of current version xx xxx. 201x. For information on obtaining reprints of this article, please send e-mail to: [email protected]. Digital Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx groups of similar observations form clusters within the projection. A number of interactive tools make use of such techniques for exploratory data analysis and sensemaking [9, 23, 29, 64, 75]. The intersection of these grouping and spatialization sensemaking techniques (and their corresponding clustering and dimension reduction algorithms) has become an area of interest for visualization research, including tools and techniques [20, 52, 59, 94], studies [10, 73], and surveys [32,83]. However, the behavior of users when performing these tasks on a dataset has not been thoroughly explored, with the most similar study to this work performing a post-hoc analysis of created groups after the sensemaking task [30]. In contrast, this work presents a study in which participants have been asked to perform spatial and grouping operations within a single sensemaking task, with the goal of understanding the organizational strategies of the participants and the interplay between the operations. In particular, we note the following contributions: 1. The design and execution of a study to understand the cognitive group and spatialization processes involved in organizing a dataset for a sensemaking task. 2. A discussion of organizational strategies and structures, grouping and spatial operations, decision making, and external knowledge displayed by study participants. 3. Design recommendations that are intended for both current and future tools, with the goal of better supporting user organizational processes for sensemaking in interactive visualization systems. We identified three overarching organizational strategies demon- strated by participants, saw tight interplay between spatial and grouping operations during sensemaking, and noted the creation of organizational spaces that were more complex than those currently supported by inter- active sensemaking tools. The results from this study can be used to inform the design of future tools for interactive visual data exploration. 2 BACKGROUND AND RELATED WORK This work was primarily inspired by “The Semantics of Clustering” by Endert et al [30]. In that study, participants were directed to per- form a spatial sensemaking task in a large display space, organizing a collection of documents with only manual layout capabilities. After completing the task and participating in a semi-structured interview, the participants were asked to draw the final cluster structure of their organizational space. While this study provides a useful starting point for understanding the spatial and grouping behaviors of analysts, the post-hoc analysis of clusters misses information about the development of clusters and the interplay between grouping and spatial actions. In- deed, research has demonstrated that categories that emerge during the analysis process are often shifting and ad-hoc, evolving throughout the course of the analysis to represent the information uncovered [5, 44].

Transcript of An Examination of Grouping and Spatial Organization Tasks ...

Page 1: An Examination of Grouping and Spatial Organization Tasks ...

An Examination of Grouping and Spatial Organization Tasks forHigh-Dimensional Data Exploration

John Wenskovitch, Member, IEEE, and Chris North

Abstract—How do analysts think about grouping and spatial operations? This overarching research question incorporates a numberof points for investigation, including understanding how analysts begin to explore a dataset, the types of grouping/spatial structurescreated and the operations performed on them, the relationship between grouping and spatial structures, the decisions analysts makewhen exploring individual observations, and the role of external information. This work contributes the design and results of such astudy, in which a group of participants are asked to organize the data contained within an unfamiliar quantitative dataset. We identifyseveral overarching approaches taken by participants to design their organizational space, discuss the interactions performed by theparticipants, and propose design recommendations to improve the usability of future high-dimensional data exploration tools that makeuse of grouping (clustering) and spatial (dimension reduction) operations.

Index Terms—Clustering, dimension reduction, spatialization, grouping, cognitive study.

1 INTRODUCTION

Sensemaking refers to a cognitive process for acquiring, representing,and organizing information in order to address a task, solve a problem,or make a decision [46, 72]. Models with varying levels of informationgranularity have been proposed for approaching and solving sense-making problems [66, 67, 72], and these models represent strategiesfor addressing sensemaking problems in a variety of domains. Forexample, Pirolli and Card’s Sensemaking Process [67] is designed forsensemaking problems faced by intelligence analysts. Despite the spe-cific challenge addressed by each of these models, they all highlight theneed to organize the data. Continuing the intelligence analyst example,they may work to understand the actors and motivations by groupingdocuments by location, by person, or by subplot.

A fundamental behavior in sensemaking is the act of grouping sim-ilar observations in order to understand their properties, effectivelyforming a cluster. This organizational strategy is true both in paper-based sensemaking tasks [25, 87] and in tasks performed on electronicdisplays [2, 30]. Clusters therefore have a natural connection to sense-making. Clusters can also help to reduce clutter in a workspace, com-pressing similar observations into a group that requires less physicalor screen space [57, 71]. Simplifying the workspace leads to furthercognitive benefits, as humans struggle to think about more than a smallnumber of observations or dimensions at one time [75]. Thus, usinggroups of items to perform analysis tasks can lead to improved mem-ory and recall by providing a simplified method of understanding thedata [19]. Previous research has shown that humans use a variety oforganizational principles to cluster information [22], even when ad-dressing the same task [2]. In order to identify clusters computationally,hundreds of clustering algorithms have been implemented, each withstrengths and weaknesses [33].

Another technique for sensemaking, particularly relevant to mul-tidimensional datasets, is to embed the high-dimensional data in alow-dimensional projection in such a way that the structure of thehigh-dimensional space (e.g., clusters and outliers) is maintainedin the projection [41, 55, 81]. Often, these projections embody a“proximity≈similarity” metaphor, in which the distance between ob-servations represents the similarity of those observations. As a result,

• John Wenskovitch and Chris North are with the Virginia Tech Department ofComputer Science and the Discovery Analytics Center. E-mails: {jw87 |north}@cs.vt.edu. John Wenskovitch is also currently employed by PacificNorthwest National Laboratory.

Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publicationxx xxx. 201x; date of current version xx xxx. 201x. For information onobtaining reprints of this article, please send e-mail to: [email protected] Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx

groups of similar observations form clusters within the projection. Anumber of interactive tools make use of such techniques for exploratorydata analysis and sensemaking [9, 23, 29, 64, 75].

The intersection of these grouping and spatialization sensemakingtechniques (and their corresponding clustering and dimension reductionalgorithms) has become an area of interest for visualization research,including tools and techniques [20, 52, 59, 94], studies [10, 73], andsurveys [32,83]. However, the behavior of users when performing thesetasks on a dataset has not been thoroughly explored, with the mostsimilar study to this work performing a post-hoc analysis of createdgroups after the sensemaking task [30]. In contrast, this work presentsa study in which participants have been asked to perform spatial andgrouping operations within a single sensemaking task, with the goal ofunderstanding the organizational strategies of the participants and theinterplay between the operations.

In particular, we note the following contributions:1. The design and execution of a study to understand the cognitive

group and spatialization processes involved in organizing a datasetfor a sensemaking task.

2. A discussion of organizational strategies and structures, groupingand spatial operations, decision making, and external knowledgedisplayed by study participants.

3. Design recommendations that are intended for both current andfuture tools, with the goal of better supporting user organizationalprocesses for sensemaking in interactive visualization systems.

We identified three overarching organizational strategies demon-strated by participants, saw tight interplay between spatial and groupingoperations during sensemaking, and noted the creation of organizationalspaces that were more complex than those currently supported by inter-active sensemaking tools. The results from this study can be used toinform the design of future tools for interactive visual data exploration.

2 BACKGROUND AND RELATED WORK

This work was primarily inspired by “The Semantics of Clustering”by Endert et al [30]. In that study, participants were directed to per-form a spatial sensemaking task in a large display space, organizing acollection of documents with only manual layout capabilities. Aftercompleting the task and participating in a semi-structured interview,the participants were asked to draw the final cluster structure of theirorganizational space. While this study provides a useful starting pointfor understanding the spatial and grouping behaviors of analysts, thepost-hoc analysis of clusters misses information about the developmentof clusters and the interplay between grouping and spatial actions. In-deed, research has demonstrated that categories that emerge during theanalysis process are often shifting and ad-hoc, evolving throughout thecourse of the analysis to represent the information uncovered [5, 44].

Page 2: An Examination of Grouping and Spatial Organization Tasks ...

Our goal with the study discussed in this work is to investigate groupingand spatial behaviors throughout the analysis process.

The card-based approach used in this study was inspired by collab-oration studies performed by Robinson [69] and Isenberg et al [42].As noted by Robinson, the intention of this strategy was to permitunderstanding without constraints that may be imposed by softwaretools. For example, there are a variety of methods by which participantsindicated groups during our study, including by stacking observations,by overlapping observations, by positioning observations so that cardswere touching, and by positioning observations relatively close to eachother. Similar studies also include affinity diagramming techniquesfor organization [62]. Other possible metaphors for indicating groupscould be colored labels or drawing boundaries around collections ofobservations. Further, Sellen and Harper suggest that using physicalartifacts in such studies can be useful for revealing how participantsmake use of their affordances to complete tasks, thereby providinginterface and interaction design guidelines [76].

2.1 Sensemaking and CognitionSensemaking is an iterative process, building up an internal representa-tion of information in order to achieve the user’s goal [72]. Distributedand embodied cognition share complementary roles in human sense-making. Distributed cognition refers to the idea that external spacescan be used to extend and support cognitive reasoning. Analysts canthus use objects or symbols as a means of externally encoding re-lationships [70]. “Space to Think” demonstrated that space plays ameaningful role in sensemaking, providing a large high-resolutiondisplay grid to permit analysts to organize hypotheses and evidencespatially [2]. Spatial memory has also been leveraged by Robertsonet al. for document arrangement in their Data Mountain system [68],and the role of spatial memory has been extended into 3D interfacesfor information retrieval tasks by Cockburn and McKenzie [17].

Embodied cognition [88] focuses on the integration of the physicalbody and the environment with internal resources, reflecting how thebody influences cognition. Embodied cognition allows analysts tooffload cognition and create understanding within their workspace,allowing physical navigation to provide more meaning to locations [3].“Space to Think” has been demonstrated to extend to more complexspaces which contain multiple displays and devices [15, 16, 39].

2.2 Dimension Reduction and Clustering ToolsInteractive dimension reduction (DR) and clustering are both activetopics in visual analytics research. In many of these systems designedto support user sensemaking and exploratory data analysis, a learningmodule responds to incremental user feedback, structuring and display-ing subsequent visual representations of the data in a way that reflectsthe goals of the analyst exploring the data [77]. One goal of this studyis to better design future tools in this space by carefully examining userinteractions and their motivations when applied to a dataset.

Interactions in high-dimensional data exploration systems can beorganized into two broad categories: explanatory and expressive [31].Explanatory interactions, or surface-level interactions, seek to under-stand the data without altering any underlying model. Such interactionsare often used to support low-level tasks such as finding extrema orretrieving a value [1]. In contrast, expressive interactions communi-cate some intent from the analyst to the system, resulting in a modelupdate. Parametric interactions are directly applied to model parame-ters, such as changing the weight on a dimension in Andromeda [75].Observation-level interactions are applied to individual data items in aprojection, which are then used to infer the analyst intent.

The semantic interaction work by Endert et al. [29] catalyzed muchof the current research into expressive interactions in DR systems. Toolsthat resulted from this research direction can be divided into those thatsupport quantitative and text data, though text data must be convertedinto a numeric representation for the algorithms to process. Quantitativetools such as Dis-Function [11] present similarity-based projections ofthe data, with a user interest weight vector applied to the dimensions ofthe dataset to influence the projection in response to user interactions.These learning techniques also support text data, as seen in systems like

StarSPIRE [9, 82] and Cosmos [24]. One key change for the text caseis creating a similarity computation based on a distance metric suchas Cosine distance to resolve issues with term sparsity [78] or Gowerdistance to adjust for missing terms [37]. In contrast, quantitativetools often make use of Euclidean distance [23, 75, 85], though otherdistance metrics are also seen in the literature. With this study, we seekto both identify and understand which distances are important to theorganizational structures created by the participants.

Interactive clustering supports similar data exploration processes,with analysts often seeking to find the clustering assignment that bestsuits their understanding or interpretation of the data [4, 56]. TheSemantic Interaction paradigm can also be applied to interactive clus-tering, with a goal of understanding these analysts and adapting theclustering to suit their intent [14,38,79]. Using the principle of “Assign-ment Feedback” as coined by Dubey et al. [26], analysts are often af-forded the ability to directly move observations between clusters [6,18],thereby supplying constraints on future iterations of the cluster assign-ments. Beyond direct interaction with observations, systems can alsosupport direct interaction with the clusters themselves, including op-erations such as merging and splitting clusters [8, 13, 40], removingclusters [6, 21, 38, 56], and expanding clusters [4, 6, 8, 21]. In this study,we identify the frequency of these cluster-oriented techniques and tiethem to the exploratory intent of the user.

2.3 Cognitive Dimension Reduction and ClusteringExisting research in both the cognitive science and visualization com-munities has also explored the understanding of human factors in therealm of DR and clustering, with a particular focus on both percep-tion and cognition of these techniques. With respect to the cognitivescience field, researchers are actively studying the effects of both spa-tializations and groupings. Baylis and Driver [7] consider the role ofproximity, finding that visual attention is not based solely on positionalinformation. Similarly, Kramer and Johnson [48] consider the influenceof Gestalt grouping principles of similarity, closure, and proximity inattention-based tasks, noting that the inclusion of distractors withingroups yields a negative performance effect. Gillam [36] discussesthe role of grouping on spatial layout in great detail, proposing newGestalt grouping principles such as common region and connectedness.Others have examined specific areas in cognitive grouping, such as thework of Zadeh on fuzzy sets [91] and prototype theory [92], Fisher’sinvestigation of conceptual clustering [34, 35], and Duncan’s study ofsuperimposition [27].

In the visualization community, Nonato and Aupetit examined theimpact of distortions on analytical tasks performed by users whenexploring projections of high-dimensional data [61]. Such distortionsare common features of these projections, as the reduction from a high-dimensional space to a 2D (or occasionally 3D) space necessitates aloss of information. Similarly, Lewis et al. explore the reliability ofhuman embedding evaluations, seeking to understand whether analystsagree on the quality of a projection as well as what types of embeddingstructures are favored by analysts [54]. In the realm of clustering, Lewiset al. also contribute a study that compares standard quality measuresfor clustering (e.g., Dunn’s Index [28] and Calinski-Harabasz [12],among others) to the interpretations generated by analysts who evaluatethe clustering assignments [53]. Sedlmair et al. introduce a taxonomyof visual cluster separation factors, examining the density, isotropy, andclumpiness of clusters in scatterplots, finding that visual inspection withvarious DR and clustering techniques failed to match data classificationsin more than half of the human inspection cases [74].

3 EXPERIMENTAL DESIGN

The overarching question that motivates this research focuses on theorganizational processes of analysts when performing exploratory dataanalysis. For the purpose of this study, we define an “analyst” as aperson with expertise in managing complex data, more particularlyhigh-dimensional quantitative data for the purpose of this study, as wellas familiarity with basic data science techniques such as projections andclustering. We wish to understand the cognitive processes that underliethe approach that analysts take when trying to find insight within an

Page 3: An Examination of Grouping and Spatial Organization Tasks ...

Fig. 1. A photo of the Killer Whale card in both the (left) labeled and(right) abstract datasets.

unfamiliar dataset. For example, when analysts are only affordedgrouping and spatialization actions, how will they organize a collectionof observations? Further, we wish to understand how analysts beginto explore a dataset, the types of group structures created and types ofgrouping operations performed, and the decisions that analysts makewhen exploring individual observations. This study was designed toinvestigate these components of the exploratory data analysis process.

3.1 ParticipantsWe recruited 16 participants from statistics and computer science disci-plines in an academic setting. These participants all self-reported somedegree of experience with either data science or exploratory data analy-sis, with this experience ranging from taking a course that included adata science component to developing new methods and tools for dataanalysis. The participant cohort included three undergraduate students,eleven graduate students, and two faculty members.

Using a between-subjects design, we divided the participants intotwo equal groups, each of which performed their organizational taskson a separate set of observations. The group that received the labeleddataset (participants referred to as L1–L8) received 17 index cards withan animal name and five dimensions that describe the animal (see Fig. 1left). The second group received the same data but in abstract form (seeFig. 1 right; participants referred to as A1–A8), with all animal-relatedcontextual information removed from the cards. In the original dataset,the attributes for the animals are normalized to a 0–100 range (thesubset of animals and dimensions selected in the study dataset madethe largest value on the cards 91). The study dataset is provided inthe supplemental material. Participants were asked to ignore the largenumbers on the cards; they were added for better video capture.

3.2 DatasetThe animals dataset provided to the participants is a reduced versionof that created by Lampert et al [51], selected because of its generalknowledge applicability to all potential participants. From the initialdataset, we rounded all decimal values to the nearest integer, andthen reduced the number of animals and the number of dimensions.We selected 17 animals with the foreknowledge that they could benaturally divided into three groups of five animals plus two outliers,but that alternate classifications and group assignments were possible.Five dimensions were selected to make the task challenging, but notoverwhelmingly difficult, with a dimension selected to describe each ofthe three groups, plus two additional noise dimensions. One potentialdivision of this dataset could be:

• Predators (described by Fierce): Bobcat, Grizzly Bear, Leopard,Polar Bear, Wolf

• Aquatic (described by Swims): Blue Whale, Killer Whale, Otter,Seal, Walrus

• Large Herbivores (described by Big): Cow, Deer, Giraffe, Hip-popotamus, Moose

• Outliers: Bat, SquirrelHowever, alternative natural groupings are possible. For example, thePolar Bear and Hippopotamus have Swims attributes that could placethem in the Aquatic group, or the Killer Whale could be a Predator(and the Bat has a similar Fierce attribute). The animals could also bedivided into two groups rather than three: Aquatic and Non-Aquatic, orFurry and Non-Furry.

3.3 Study ProcedureNoting the cognitive strain experienced by participants during a pilotstudy, we elected to limit the length of the study to one hour in order tominimize fatigue and frustration effects. Participants began the studyby responding to four questions in a Google Forms survey, describingtheir familiarity with dimension reduction and clustering algorithmsand answering two questions about exploratory data analysis.

Following this, they were provided with a short description of thetasks they must complete and the goals of the study, after which theysaw the dataset for the first time. In the task (referred to as OrganizationTask), participants were asked to organize the observations in any waythat they wished, though they were limited to grouping and spatial-ization operations (i.e., “place two observations in the same group”and “place two observations some distance apart”). Participants wereinstructed to think aloud in order to better capture their organizationalprocess [60,90]. In addition to notes written in real-time by the proctor,this portion of the session was video recorded for later review. Fi-nally, participants completed a second survey of open-ended questionsaddressing their thoughts related to their personal analysis process.

3.4 Research QuestionsAs previously noted, our overarching research question is to study theorganizational processes of analysts when approaching a sensemakingtask with an unfamiliar dataset. We divide our more specific researchquestions into four broad themes, which correspond to the subsectionsof the Results that follow in the next section:

1. Participant Analysis Process(A) How do analysts begin to evaluate an unfamiliar dataset,

and what actions do they take during this initial evaluation?(B) What are the overarching analysis strategies of study par-

ticipants throughout the complete process of transformingan unfamiliar dataset into an organized space?

2. Representations Created(A) What types of grouping and spatial structures are created

during the full analysis process?(B) How do the individual dimensions appear within the orga-

nizational spaces created by participants?3. Interactions with Representations

(A) What types of grouping and spatial operations are per-formed during the full analysis process?

(B) How do participants approach individual dimensions vscollections of dimensions when exploring the dataset?

4. The Effect of Domain Knowledge(A) What is the role of external information in the analysis

process?(B) How do the layouts created by the abstract condition par-

ticipants change when they are provided with labeled infor-mation?

4 RESULTS

Using notes taken during the session and video recordings, we reviewedthe actions of each participant throughout their analysis process. Dur-ing this review, we were purposefully searching for and noting thefrequency of some behaviors, such as the types of cluster interactionspreviously noted in Section 2.2. For the broader participant strategiesthat are described in this section, we used an open coding approach todescribe these events, which we later synthesized into general themes.

4.1 Participant Analysis ProcessIn this section, we discuss the analysis process undertaken by the studyparticipants, with foci on both how their analysis began and how theiroverall strategy developed.

4.1.1 Beginning the Analysis (RQ1A)There were two primary methods by which participants approachedthe Organization Task. The most common strategy, the Grid Method,was to begin by laying out the full dataset on the table, often in a gridpattern, in order to inspect the full dataset simultaneously. Slightlyless often, we saw the Stack Method, in which participants kept the

Page 4: An Examination of Grouping and Spatial Organization Tasks ...

Fig. 2. The radial layout created by Participant A2, in which each of theanimals is drawn towards its highest attribute with additional effects bythe other large attribute values. This figure (and the other animal layoutfigures within this work) are recreations of the index card layouts createdby the study participants.

index cards in a stack, inspecting each sequentially and determining itsoptimal location in the partially-organized space. Several participantsbegan by inspecting the top few cards in the stack before turning to oneof the two primary patterns. One participant in the labeled conditionlooked through the full stack to survey only the names of the animals,organizing the cards in the stack before following the sequential pattern.

Interestingly, there was a divide between the common approachesin the two groups. Seven of the eight participants who received theabstract dataset followed the Grid Method, while only two of the eightparticipants did so with the labeled dataset. Conversely, six of theeight participants who received the labeled dataset followed the StackMethod, while only one of the abstract dataset participants did so.

Only one of the participants in the abstract data condition primar-ily used spatialization actions to begin exploration the data. Usingthe Stack Method, this participant created a radial layout in whichobservations were drawn towards five points that corresponded to fullvalues of each of the dimensions (Fig. 2), in a manner reminiscent ofthe Dust & Magnet system [89] where dust observations were drawntowards radially-positioned magnet points of high attribute value. Theremainder of the abstract dataset participants performed mostly group-ing operations to explore the dataset, as did three of the labeled datasetparticipants. The remaining five labeled condition participants per-formed a blend of grouping and spatialization operations, to the degreethat neither category could clearly be considered a majority.

4.1.2 Overarching Strategies (RQ1B)We observed a tight coupling between spatialization and grouping ac-tions performed by analysts. Rather than adopting a purely group-firstor layout-first mentality, participants in this study switched between thetwo frequently. This was even true in cases where grouping or spatial-ization actions accounted for a great majority of the overall interactiontotal. Further, we note that this complex relationship develops overtime, where spatializations are used to drive grouping and groupingsare used to drive spatializations. This results in complex organizationalspaces that were produced by the participants in this study. We delveinto these issues further in this section.

There were three main strategies that participants performed whenapproaching the task (see Fig. 3), which we noted share some simi-larities to the investigative strategies observed by Kang et al. [43] intheir document-based study using Jigsaw. Though the switch fromdocuments to quantitative data made the precise interactions and moti-vations different, the strategies that we note here show patterns that aresimilar. For example, the Divide and Conquer strategy that we describenext resembles their Overview, Filter, and Detail strategy, while ourBottom-Up strategy is comparable to their Build from Detail.

The most common strategy seen is the Divide and Conquer Strat-egy, demonstrated in the photo in Fig. 4. In the first pass through the

Fig. 3. Representations of the three overarching strategies that partici-pants used to organize their sensemaking space. In these representa-tions, the Divide and Conquer Strategy begins with the Grid Method, andthe Incremental Layout Strategy begins with the Stack Method.

data, participants selected a single dimension and separate the obser-vations into a small number of groups. They then attempted to findmeaning within these smaller groups, either by selecting another dimen-sion to separate by or spatializing within the group. After structuringthe individual groups, they turned their attention to the full space andattempt to organize the large groups, with occasional refinement withinthe groups.

An example of this strategy is displayed by Participant L4 in Fig. 5.In Panel A at 4:47, she has binned the animals by size, creating seventemporary groups that increase in size from left to right across groupsand from bottom to top within groups. Panel B at 9:04 includes adimension for Swims, forming a structure that approximates a scatterplot. At this point, she identifies three main groups: swimming animals,sort-of swimming animals, and non-swimming animals. In Panel C at16:46, she has decided to separate the swimming animals group as shebegan to sort within each group by the Furry dimension. The globalSwims dimension still persists from bottom to top, but the global Sizedimension is now discrete across the two columns of groups. A localSize dimension was maintained vertically in the non-swimming animals

Fig. 4. A participant from the labeled condition midway through herorganizational process. Groups of animals are apparent, and she isbeginning to create spatial structures within the groups, following theDivide and Conquer strategy.

Page 5: An Examination of Grouping and Spatial Organization Tasks ...

Fig. 5. Five stages from the analysis produced by Participant L4. Theparticipant used the Divide and Conquer strategy, dividing the datasetinto groups to be individually analyzed and refined.

group. The Furry dimension was vertical in the larger swimming andsort-of swimming groups, but was horizontal in the non-swimminggroup (and was not clearly specified in the small swimming group). InPanel D at 21:03, the two groups of large animals were positioned closertogether (though kept as separate groups) and the sort-of swimminganimals group was flipped vertically because the Blue Whale, Moose,and Hippopotamus have similar Fierce and Solitary attributes. Thevertical axis of the smaller animals group (joined into a single group)now has a Fierce dimension with the non-swimming animals, thoughthe Swims dimension is still maintained at the top of the group. Finallyin Panel E at 31:21, the Fierce axis was rotated within the small animalsgroup, and the vertical axis has been replaced with a Solitary dimension.

The second most common strategy was the Incremental LayoutStrategy. This strategy was almost exclusive to the labeled data con-dition (A2 once again being the exception). Participants consideredeach observation one at a time, adding them to a continually grow-ing organizational space in the location which appeared most sensible.Each of these additions was often a grouping operation, but could alsobe a spatialization operation in some cases. Updates to the positionof observations already positioned did occur, but were infrequent. Asthe participants continued to add data, the physical size of the utilizedspace increased. After all observations were added to the space, theparticipants began a more thorough refinement process.

A third strategy, not as common as the first two but still implementedby multiple participants, was the Bottom-Up Strategy. Participantsbegan similarly to Divide and Conquer, laying out all of the observa-tions to view simultaneously. However, their next step was to beginto build small groups of two or three similar observations, usually byonly looking at a single dimension at first but then considering others.After many of the observations had been placed in small groups, spatialrelationships were created between the groups, often leading to theformation of larger groups.

Regardless of the approach strategy taken by participants, they allfollowed an incremental pattern that consisted of a period of organiza-tion followed by a period of reflection, after which the process repeated.Such incremental formalism has been demonstrated in previous studiesand system use cases [9, 29, 43, 77], but took on a different form in thisstudy. As noted previously and confirmed by the post-survey, partici-pants almost universally approached the organization by considering asingle dimension at a time. In doing so, they created an organizationalspace for a dimension, and then took a step back to consider the space.They then proceeded to a second dimension, introducing its effects intothe space gradually by individual animal or group, and then examinedthe global changes made to the structure. This alternating pattern ofsensemaking and synthesis segments connected the participants’ localinteractions to their global understanding of their organizational space.

4.2 Representations CreatedIn this section, we examine the grouping and spatial structures thatwere created by the participants during the course of their exploration.

4.2.1 Grouping and Spatial Structures (RQ2A)Both groups of participants were approximately equally likely to createhierarchies and cross-cutting groups in their organizational structures(4/8 labeled and 5/8 abstract). Many participants did create internalspatial structures within their groups, but only a small subset clearlydelineated groups within groups or groups overlapping groups (forexample, Participant A3 in Fig. 6). In many cases, these internal groupswere formed by breaking up a larger supergroup, though occasionallytwo subgroups were joined together to form children of a larger parentgroup. These overlapping groups occasionally represented “fuzzy” or“soft” cluster assignments [91].

Similarly, both groups of participants were equally likely to createorganizational structures in which the axes mapped to dimensions inthe data. This is contrary to the properties of many dimension reductionalgorithms, in which the axes have no direct mapping to the sourcedata. Often, such constructions resulted from participants’ behavior infocusing on a single dimension at a time and organizing the observationson a spectrum along one or more dimensions (see Fig. 7). As axes

Page 6: An Examination of Grouping and Spatial Organization Tasks ...

Fig. 6. The complex group cross-cutting created by Participant A3. Thisparticipant judged groups of abstract data by considering low, medium,and high values of dimensions A, B, and C (corresponding to Furry, Big,and Swims). As a result, the axes of the organization were important tothe group determinations.

were seen to be important, we see benefits from tools that communicatethe meaning of axes in a projection as seen in InterAxis [45] andAxiSketcher [50].

We also noted that both groups of participants identified outliersin the dataset that they were hesitant to place in any group. Nearthe beginning of their analysis, they referred to single-observationgroups as groups (or the seeds of a group), but as they continued tostructure observations in the space, they were more likely to refer tothese observations as not fitting well with the others. The participantswho created cross-cutting groups often had subgroups with just a singleobservation in their organizational structure, but they were clear toidentify those as equally belonging to two or more of the broadergroups (again see Fig. 6).

Spatial meaning internal to groups was also seen by both participantconditions. Often, the spatializations within the groups were designedto show differentiation within observations in the group (e.g., a sizetrend across the group), though occasionally the goal of the participantswas to show relationships between members of the group and otherparts of the space (e.g., an observation within the group that is quitesimilar to those in other groups). As a consequence of the second,participants in both groups were equally likely to create global spatialstructures that spanned the entire structure or governed large portionsof their organization.

Fig. 7. Participant L5 created a structure in which the axes were clearlyan important feature. Groups were refined based on the remaining threedimensions, but the majority of the structure was governed by the Swimsand Big dimensions with which she began her analysis. Attribute binsfrom the analysis of the Swims dimension are clearly still visible on thex-axis.

4.2.2 Complex Spaces (RQ2B)After considering several dimensions, participants began to create com-plex spaces. This was already seen through the hierarchical, cross-cutting set of groups created by Participant A3 (Fig. 6). Anotherexample is seen within the structure created by Participant A4 (Fig. 8),in which the participant created spectra for two of the dimensions andinfluence regions for the three remaining dimensions. The spectrawere not orthogonal, though the C dimension was aligned with thex-axis. These are represented by the solid arrows in Fig. 8. The threeregions likewise overlapped in some places but not others, indicatingportions of the space in which one dimension had a great deal of in-fluence in determining how the participant structured the layout andgroups. These are represented by the dashed arcs in the same figure.

This runs counter to the common method of creating dimensionally-reduced projections, in which the entire space is governed by a singleweight vector (for example, [9, 23, 75, 84]). Creating clusters thatcontain independent, internal weight vectors, as well as maintaining aglobal space, presents one solution to this challenge. For the examplepresented by A4, the low B region could be defined as a cluster thatstill maintains the influence of low A and high C. This finding presentsopportunities for the introduction of subspace clustering techniquesto support such complex spaces, as seen in previous works analyzinghigh-dimensional data [49, 63].

Further, these complex spaces are not limited to areas of attributeinfluence. Participants also created complex structures of hierarchi-cal, cross-cutting groups. For example, Participant A6 created severalcomplex sets of groups at various points in her analysis, two of whichare provided in Fig. 9. This participant’s bottom-up process of con-necting smaller clusters into larger groupings also reflects Gillam’sconnectedness findings for cognitive grouping [36].

4.3 Interactions with RepresentationsHaving established the created structures, we now examine the opera-tions performed on these structures during the course of analysis.

4.3.1 Grouping and Spatial Operations (RQ3A)We recorded instances in which participants performed four differenttypes of grouping operations: create, remove, join, and split. Theseare differentiated in Fig. 10. The most common method by whichparticipants in both study conditions approached the organizational taskwas to start with large groups and then subdivide. As a result, splittingan existing group into smaller groups was the most common groupingoperation performed. A majority of participants also joined groupstogether at some point in their analysis, usually when considering adimension for the first time and noticing new similarities among theobservations. Only two participants in the abstract condition (A2 andA6) performed operations to create an entirely new group from obser-vations previously in several other groups. None of the participants inthe abstract condition removed a group and allocated its members intoseveral other groups. In contrast, five of the eight participants in thelabeled condition created and three of the eight removed a group.

Spatial operations were occasionally the driving force behind groupcreation. Participants in both the labeled and abstract conditions oftenidentified collections of observations that were similar and positionedthem close together spatially before identifying that collection as agroup. Conversely, groups were often used to drive spatial operationsas well, especially when participants were refining group membershipsand identifying distances within groups.

Participants also reported that distances within groups were moreimportant to their structure than distances between groups. This waspartially a result of the cognitive difficulty in mentally computing adistance between groups of observations as opposed to computing adistance between a pair of observations. More meaningfully, partici-pants reported that these fine-grained differences between observationswithin a group were more relevant to their understanding of groupstructure than were the differences between the groups themselves. Inother words, it was enough for participants to say “these groups aredifferent,” but they felt the need to incorporate more spatial detail whensaying “this is why these observations within a group are different.”

Page 7: An Examination of Grouping and Spatial Organization Tasks ...

Fig. 8. Participant A4 created a complex space, creating spectra (solid line arrows) for dimensions A and C (Furry and Swims) and distinct regions ofhigh influence (dashed arcs) for dimensions B, D, and E (Big, Fierce, and Solitary). The smaller subfigures to the right of the main space display theeffect of each individual dimension on the overall projection with less clutter.

4.3.2 Decision Making (RQ3B)

With the exception of A2, all participants in both groups spent themajority of their analysis considering only a single dimension at atime, confirming the observation seen in previous studies that analystsstruggle to think high-dimensionally [3, 75]. Additionally, participantsfrequently processed attributes by either making binary decisions (e.g.,divide observations by an attribute value greater than or less than 50),or alternatively by creating a small number of bins to discretely groupobservations by a single dimension. The participants commented thatboth the binary decisions and the binning operations were intention-ally made so that they could focus their attention on subsets of theobservations rather than the entire collection.

Two of the participants in the abstract condition created featuresfrom combinations of provided features while exploring the data, inboth cases to reduce the amount of information that they were tryingto cognitively process. Participant A5 computed the median of allfive dimensions in order to perform an initial grouping, while A4computed the difference between the final two variables she had notyet considered. Two of the participants from the labeled conditionalso created features, but these came from domain knowledge of thelabeled animals instead. Participant L5 introduced both flying and speedfeatures into her layout, while the other created groups that incorporatedthe likelihood of finding these animals in a zoo. This last participant,L1, also reported that she focused primarily on the animal labels, anddid not consider the attribute values until making final refinements tothe space. Kopanas et al. found similar data transformations and featureinterpretations in their data mining study [47].

4.4 The Effect of Domain KnowledgeWe now look at the differences between the labeled and abstract groups,with a particular focus on the role of external information.

Fig. 9. Participant A6 created complex hierarchical and cross-cuttinggroup structures during her analysis.

4.4.1 External Knowledge (RQ4A)

Participants who received the labeled dataset often made use of theirexternal knowledge about the animals that was not contained on thecards. These behaviors in response to external and domain knowledgereinforce the “Data is Personal” work by Peck et al. [65], particularlythat personal experience drives decisions. Similar experience-driven be-haviors have been seen in the cognitive psychology field, as evidencedby the experiments of Zemel et al [93].

External knowledge influenced participants’ organizational spacesin three ways. First, introducing external knowledge allowed the partic-ipants to begin forming groups of animals prior to closely examiningthe data. For example, Participant L1 created a number of groupswithout even considering the data, including features corresponding toenvironment, diet, and the probability of locating the animal in Canada.Likewise, Participant L5 introduced a flight group for the Bat into herorganization, keeping it separate from all other animals as a resultwithout considering the attribute values on the card.

Second, external knowledge was used to structure and spatializewithin groups. For example, Participant L5 introduced a speed dimen-sion into her organization, which permitted her to break up the land-dwelling (and non-flying) animals into groups based on how quickly

Fig. 10. Grouping operations in the participant organizational strategies.Join operations destroy two original groups to create a new group, splitoperations destroy one original group to create two new groups, creationoperations take a portion of one or more groups to create a new group(while leaving the originals), and remove operations destroy one originalgroup and distribute its members to one or more existing groups.

Page 8: An Examination of Grouping and Spatial Organization Tasks ...

they moved. The final result of this introduction was a group that con-tained the Deer along with the Wolf, Leopard, and Bobcat. Occasion-ally, incorrect domain knowledge could also affect their organization,as seen when Participant L3 used external knowledge about the diet ofanimals to create transition groups. For example, the Hippopotamuswas used as a transition animal between Aquatic and non-Aquatic ani-mals, not because of its mid-range Swims score, but because of the factthat she felt this animal was likely to consume fish while spending timein water as well as plants when spending time on land.

Third, external knowledge about the animals was used as confir-mation, checking to see if the groups the participants created weresensible after performing a dimension-specific organizational step. Par-ticipant L2, for example, frequently questioned his positioning of theBat throughout his iterative organizational steps. Despite the attributeson the cards supporting his decisions, he continued to second-guess itsposition and occasionally to modify it based on his external knowledge.Participant L4 also reported scanning the names of the animals in thegroups regularly, searching for refinements that could be made to thestructure internal to groups based solely on those labels.

4.4.2 “The Big Reveal” (RQ4B)Participants who received the abstract dataset were informed of itsanimal dataset source after completing their organizational space. Theanimals cards were overlaid above the abstract cards, and participantswere asked to comment on the structure and relationships that werenow apparent after seeing the labeled data, effectively replicating theconfirmation use from the previous subsection.

In general, participants reported being satisfied with the layout thatthey created in the abstract data. Many had an Aquatic group of animalsdue to the high-C score, and they quickly picked up on this relationship.Similarly, participants who prioritized the D dimension saw groups ofpredators in their organization. The Bat often was a frustration pointfor participants, expressing dissatisfaction with its position in the spacesimilar to the reaction seen by participant L2.

When asked what they would change in their space given this newinformation, most participants reported that they would increase thedensity of the groups in their space, moving groups of similar animalscloser together after understanding their relationship to each other. Thiswas often seen in any Aquatic groups, Predator groups, and among theWhales. Interesting, this behavior was not necessarily true with theBears; participants were more willing to keep these animals separateand even to occasionally increase their separation despite the commongenus, often reporting that this was counter to their initial reactionwhen noting the existence of two bears.

The incorporation of external knowledge by the labeled conditionparticipants, and the changes made to the layouts of the abstract con-dition participants once the labeled information was provided, bothsupport the utility of semantic interaction in interactive systems. Thegoal of semantic interaction [29] is to permit the analyst to focus theirexploration on the observations themselves, rather than carefully exam-ining data values and fine-tuning the parameters of underlying models.Because the labeled group participants often brought in external knowl-edge, they were able to add additional detail to their organizationalstructures that was not contained on the index cards. The updates madeby the abstract condition participants after the presentation of labelsshow that their spaces would have been differently structured if theyhad access to this information throughout the study.

5 DISCUSSION

The results collected from this study have yielded valuable informationtowards how humans approach exploring and organizing data. Such in-formation can be used to guide the design of interactive visual analyticssystems in the future.

5.1 Post-SurveyAll participants provided responses to the survey that followed theOrganization Task, which addressed their interpretation of their strat-egy, easy and difficult parts of the analysis, and their thoughts on theusefulness and meaningfulness of grouping and spatialization actions.

Approaching the Task: Participants reported approaching the taskby considering a single dimension at a time, confirming observationsfrom the study. The difficulty that the participants experienced whenattempting to think high-dimensionally suggests the need for compu-tational support in similar organizational tasks. Each dimension wasselected by searching for features in the dataset that seemed to be mostuseful or representative, either due to the overall distribution, outliers,or common values. When considering each dimension, participantssought out commonalities between the observations, building largegroups or binning the observations in a spectrum and refining the bins.Participants who received the labeled set also used external knowledgeof the animals to create and organize groups.

Number of Interactions: When asked whether they thought theyperformed more grouping or spatialization interactions, participantsgave a variety of responses. Among those who believed they performedmore grouping operations, they noted that their overarching strategywas to isolate groups within the data in order to make future processingsimpler. Those who believed that they performed more spatializationinteractions generally reported that they were careful when refining theorganization in later stages, leading to that majority. Some participantsreported that they couldn’t determine which class was the majority.

Task Difficulty: Participants generally reported that the easiest partof the Organization Task was the beginning or the end of the analysis.Some reported that selecting a starting point, picking the initial groups,or creating the initial spatial structure was easiest. Others noted thatmaking final refinements within and between groups at the end ofthe analysis was easiest. Most participants reported the mid-stages oforganization to be the most challenging, needing to balance updatesto existing groupings of data when examining a new dimension withmaintaining existing spatial relationships. Participant A1 did reportthat the amount of data seen after laying out all of the cards initiallywas overwhelming, but that did not stop her from arbitrarily selecting astarting dimension for analysis.

Operation Usefulness: Participants were also evenly split betweenconsidering the grouping or the spatialization actions to be more use-ful or more meaningful. Those who felt more positively about thegroupings mentioned summarizing the big picture and making sense ofthe overall space visually, while those who felt more positively aboutthe spatializations thought that these actions made them more care-ful in their analysis. Both groups also mentioned that their operationpreference for this question impacted the other. Those who preferredgrouping actions noted that it made spatialization actions easier toperform, while those who preferred spatialization actions noted that itmade the task of creating meaningful groups easier.

A related observation while running the study is the role of termi-nology, particularly with the grouping operation. There were a numberof times when participants were clearly separating the observationsinto piles, but they were somewhat hesitant to define this organizationas a “cluster” or a “group.” Frequently, we found ourselves using avariety of terms when inquiring about the structures that the participantswere creating (e.g., “Do you consider this a group, or is it a cluster, oran organizational construct, or a bin, or a collection, or ...?”). To theparticipants, the terms “cluster” and “group” had a different, deepersemantic meaning than a simple “pile” of observations. In order tobe classified as a “cluster” or “group,” participants often wanted toperform multiple iterations of analysis to ensure that more than just asingle property defined the collection of observations.

5.2 Recommendations for Tool DesignTable 1 collects findings that were uncovered while running this study,each of which is described in further detail in the preceding sections.Each of these findings is accompanied by a design recommendationfor interactive visualization systems that support these interactionsin the study domain (high-dimensional quantitative data). Existingsystems in this area (e.g., Castor [84] and Pollux [85]) begin to addresssome of these interaction and design issues, but do not fully addressthem all. For example, neither system makes an effort to map animportant dimension to an axis, support is not provided for efficientcluster splitting, and both use a single weight vector to express the

Page 9: An Examination of Grouping and Spatial Organization Tasks ...

Table 1. A summary of the main findings uncovered by this study and corresponding design recommendations for interactive data exploration tools.

Finding Design RecommendationA tight coupling was seen between grouping and spatialization actions, frequentlyswitching between these operations. Participants formed groups in order to make futurespatializations easier, and also formed spatializations to make future groupings either.

Systems should be designed so that analysts can perform such operations at any timeduring the analysis process, as opposed to creating sequential phases. Algorithms shouldbe selected or designed to support these highly-incremental feedback loops.

The axes often had meaning in the organizational spaces created by participants, withone or more dimensions frequently parallel or orthogonal to the front of the table.

Aligning an important property (e.g., the first principal component or the dimensionwith the highest weight) to an axis can boost the interpretability of the projection.Alternatively, the meaning of the axes should be clearly communicated.

A common trend was to progress from large groups to small groups, splitting groupsrather than joining them.

While many types of cluster operations (e.g., joining, creating, removing) should besupported, an efficient interaction for splitting clusters should be a design priority.

Participants reported that distances within groups were more important to their structurethan distances between groups.

Algorithms that favor local structures (e.g., t-SNE, subspace clustering) may be betterrepresentations of an analyst’s understanding of data than those which do not.

To further reduce the complexity of the data, participants often binned the observationsinto smaller groups or separated the observations with a binary decision.

Potential means of designing such an efficient cluster-splitting interaction include auto-matically binning observations or separating them by an analyst-specified threshold.

Participants primarily explored one dimension at a time, expressing frustration withtrying to consider all dimensions at once.

Computational support is key for efficiently communicating multidimensional informa-tion to the analyst.

Participants often created complex spaces that combined dimension spectra with dimen-sion regions of influence.

Tools should support complex spaces (e.g., subspace clustering and other techniques thatfavor local structures) rather than using a single weight vector to express the full space.

Participants in the labeled group often brought their external knowledge into theirorganizational structure, adding additional animal properties that were not provided.

Systems should provide interactions for annotation or other notes, permitting analysts toinject information into the system.

full space. Further development of these systems to reflect the lessonslearned from this study will improve both analytical power and userexperience or these and similar tools.

The overarching organizational strategies demonstrated by partici-pants can also inform the choice and design of algorithms that supportanalysis in these systems. For example, participants who used the Di-vide and Conquer strategy created spaces that differed between groupsas of result of their separate analysis within each group. This suggeststhe benefit of algorithms that favor local structures, such as t-SNE [55]for projection layout and subspace clustering [58, 80] for grouping.

It is important to note that the goal of including such algorithms is tobetter enable this organizational process, not to compute the final results.There is great benefit to the exploration processes of the analysts, asthey continue to refine hypotheses and structure observations. Ananalyst who is presented with an organizational space created by thestudy participants will almost certainly walk away with conclusionsthat differ from the participant who created the space.

5.3 Limitations and Future WorkThe primary limitation of this work is that this study has only beentested on a single dataset rather than experimenting with datasets ofvarious sizes (cardinality of observations and dimensions), types (doc-uments), and levels of complexity (floating-point observations, con-flicting dimensions, and confounding variables). We recognize that theresults found in this study and the recommendations presented in Ta-ble 1 are limited in applicability to our domain of quantitative data. Fortime-series or ensemble data, these recommendations may not apply(and the analysis undertaken by the user will be different as well). Ad-ditionally, this study was limited to investigating how an analyst thinksabout and organizes information in a space that they create; however,we do not consider how an analyst will interpret an organizational spacethat is created for them by an analytical system. Such a follow-up studyis necessary to supplement the design recommendations from Sect. 5.2.

Future studies with other datasets can continue to explore this space.In particular, running this study with documents could yield differentresults, as participants will likely focus on abstract conceptual dimen-sions rather than term frequency-based computational dimensions, asthe dimensions are not obvious in the text. Likewise, we expect thatthere will be additional focus on grouping operations at the beginningof the analysis process as the participants search for common topicsor themes in the documents. Comparing the results of this work to adocument study represents a promising direction for future research.

This study yielded a number of additional questions worthy of in-vestigation. We plan to investigate the similarities between the finallayouts of all participants by encoding groups and relative distancesbetween observations into a high-dimensional dataset, and then visual-ize the result in a tool such as Andromeda [75] or SIRIUS [23]. Thesetools can also enable our understanding of which spatial or groupingproperties cause such similarities between the spaces. We can also

perform similar analyses on the individual layouts, enabling us to un-derstand bias in the layouts created by the participants. A follow-upstudy using a computational system could present this bias informationto the analyst in real time, in order to learn how participants react to thatinformation. Overall, this direction of research seeks to further supportthe cognitive side of human-AI collaboration and co-learning [86].

6 CONCLUSION

In this work, we experiment with a labeled and abstract set of datato examine how analysts approach and organize an unfamiliar dataset.We wish to understand the cognitive processes that underlie the ap-proach that analysts take when trying to find insight in data. We foundthat participants used groups to create spatial structures as well asspatial structures to form groups. Participants created hierarchies andcross-cutting groups in their organizational structures, and frequentlyapproached the task by creating large groups and subdividing them torefine additional structure. The complex spaces created by participantshint towards structures that should be supported in interactive applica-tions. We summarize a list of main findings and corresponding systemdesign recommendations in Table 1.

ACKNOWLEDGMENTS

The authors wish to thank those who participated in this study, as wellas the reviewers for their insightful comments.

REFERENCES

[1] R. Amar, J. Eagan, and J. Stasko. Low-Level Components of AnalyticActivity in Information Visualization. In IEEE Symposium on InformationVisualization, 2005. INFOVIS 2005., pp. 111–117, 2005.

[2] C. Andrews, A. Endert, and C. North. Space to Think: Large High-resolution Displays for Sensemaking. In Proceedings of the SIGCHIConference on Human Factors in Computing Systems, CHI ’10, pp. 55–64.ACM, New York, NY, USA, 2010. doi: 10.1145/1753326.1753336

[3] C. Andrews and C. North. Analyst’s Workspace: An Embodied Sense-making Environment for Large, High-Resolution Displays. In 2012 IEEEConference on Visual Analytics Science and Technology (VAST), pp. 123–131. IEEE, Oct 2012. doi: 10.1109/VAST.2012.6400559

[4] G. Andrienko, N. Andrienko, S. Rinzivillo, M. Nanni, D. Pedreschi, andF. Giannotti. Interactive Visual Clustering of Large Collections of Tra-jectories. In 2009 IEEE Symposium on Visual Analytics Science andTechnology, pp. 3–10, 2009.

[5] L. W. Barsalou. Ad Hoc Categories. Memory & Cognition, 11(3):211–227,1983.

[6] S. Basu, D. Fisher, S. M. Drucker, and H. Lu. Assisting Users withClustering Tasks by Combining Metric Learning and Classification. InTwenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[7] G. C. Baylis and J. Driver. Visual Parsing and Response Competition: TheEffect of Grouping Factors. Perception & Psychophysics, 51(2):145–162,1992.

Page 10: An Examination of Grouping and Spatial Organization Tasks ...

[8] L. Boudjeloud-Assala, P. Pinheiro, A. Blansch, T. Tamisier, and B. Ot-jacques. Interactive and Iterative Visual Clustering. Information Visualiza-tion, 15(3):181–197, 2016. doi: 10.1177/1473871615571951

[9] L. Bradel, C. North, L. House, and S. Leman. Multi-Model SemanticInteraction for Text Analytics. In 2014 IEEE Conference on Visual Ana-lytics Science and Technology (VAST), pp. 163–172, Oct 2014. doi: 10.1109/VAST.2014.7042492

[10] M. Brehmer, M. Sedlmair, S. Ingram, and T. Munzner. VisualizingDimensionally-Reduced Data: Interviews with Analysts and a Characteri-zation of Task Sequences. In Proceedings of the Fifth Workshop on BeyondTime and Errors: Novel Evaluation Methods for Visualization, BELIV’14, pp. 1–8. ACM, New York, NY, USA, 2014. doi: 10.1145/2669557.2669559

[11] E. T. Brown, J. Liu, C. E. Brodley, and R. Chang. Dis-Function: LearningDistance Functions Interactively. In 2012 IEEE Conference on VisualAnalytics Science and Technology (VAST), pp. 83–92, 2012.

[12] T. Calinski and J. Harabasz. A Dendrite Method for Cluster Analysis.Communications in Statistics-Theory and Methods, 3(1):1–27, 1974.

[13] J. Choo, C. Lee, C. K. Reddy, and H. Park. UTOPIAN: User-Driven TopicModeling Based on Interactive Nonnegative Matrix Factorization. IEEETransactions on Visualization and Computer Graphics, 19(12):1992–2001,2013.

[14] J. Chuang and D. J. Hsu. Human-Centered Interactive Clustering forData Analysis. In Conference on Neural Information Processing Systems(NIPS). Workshop on Human-Propelled Machine Learning, 2014.

[15] H. Chung and C. North. SAViL: Cross-Display Visual Links for Sensemak-ing in Display Ecologies. Personal Ubiquitous Comput., 22(2):409–431,Apr. 2018. doi: 10.1007/s00779-017-1091-4

[16] H. Chung, C. North, J. Z. Self, S. Chu, and F. Quek. VisPorter: FacilitatingInformation Sharing for Collaborative Sensemaking on Multiple Displays.Personal Ubiquitous Comput., 18(5):1169–1186, June 2014. doi: 10.1007/s00779-013-0727-2

[17] A. Cockburn and B. McKenzie. Evaluating the Effectiveness of SpatialMemory in 2D and 3D Physical and Virtual Environments. In Proceedingsof the SIGCHI Conference on Human Factors in Computing Systems, CHI’02, pp. 203–210. ACM, 2002. doi: 10.1145/503376.503413

[18] A. Coden, M. Danilevsky, D. Gruhl, L. Kato, and M. Nagarajan. A Methodto Accelerate Human in the Loop Clustering, pp. 237–245. SIAM, 2017.doi: 10.1137/1.9781611974973.27

[19] J. M. Curiel and G. A. Radvansky. Mental organization of maps. Journal ofExperimental Psychology: Learning, Memory, and Cognition, 24(1):202,1998.

[20] C. Ding and T. Li. Adaptive Dimension Reduction Using DiscriminantAnalysis and K-Means Clustering. In Proceedings of the 24th InternationalConference on Machine Learning, ICML ’07, pp. 521–528. ACM, NewYork, NY, USA, 2007. doi: 10.1145/1273496.1273562

[21] V. Dobrynin, D. Patterson, M. Galushka, and N. Rooney. SOPHIA: AnInteractive Cluster-Based Retrieval System for the OHSUMED Collection.IEEE Transactions on Information Technology in Biomedicine, 9(2):256–265, 2005.

[22] P. Dourish, J. Lamping, and T. Rodden. Building Bridges: Customisationand Mutual Intelligibility in Shared Category Management. In Proceedingsof the International ACM SIGGROUP Conference on Supporting GroupWork, GROUP ’99, pp. 11–20. ACM, New York, NY, USA, 1999. doi: 10.1145/320297.320299

[23] M. Dowling, J. Wenskovitch, J. Fry, S. Leman, L. House, and C. North.SIRIUS: Dual, Symmetric, Interactive Dimension Reductions. IEEETransactions on Visualization and Computer Graphics, 25(1):172–182,Jan 2019. doi: 10.1109/TVCG.2018.2865047

[24] M. Dowling, N. Wycoff, B. Mayer, J. Wenskovitch, S. Leman, L. House,N. Polys, C. North, and P. Hauck. Interactive Visual Analytics for Sense-making with Big Text. Big Data Research, 16:49 – 58, 2019. doi: 10.1016/j.bdr.2019.04.003

[25] S. M. Drucker, D. Fisher, and S. Basu. Helping Users Sort Faster withAdaptive Machine Learning Recommendations. In P. Campos, N. Graham,J. Jorge, N. Nunes, P. Palanque, and M. Winckler, eds., Human-ComputerInteraction – INTERACT 2011, pp. 187–203. Springer Berlin Heidelberg,Berlin, Heidelberg, 2011.

[26] A. Dubey, I. Bhattacharya, and S. Godbole. A Cluster-Level Semi-supervision Model for Interactive Clustering. In J. L. Balcazar, F. Bonchi,A. Gionis, and M. Sebag, eds., Machine Learning and Knowledge Dis-covery in Databases, pp. 409–424. Springer Berlin Heidelberg, Berlin,Heidelberg, 2010.

[27] J. Duncan. Selective Attention and the Organization of Visual Information.Journal of experimental psychology: General, 113(4):501, 1984.

[28] J. C. Dunn. Well-Separated Clusters and Optimal Fuzzy Partitions. Journalof cybernetics, 4(1):95–104, 1974.

[29] A. Endert, P. Fiaux, and C. North. Semantic Interaction for Sensemaking:Inferring Analytical Reasoning for Model Steering. IEEE Transactions onVisualization and Computer Graphics, 18(12):2879–2888, Dec 2012. doi:10.1109/TVCG.2012.260

[30] A. Endert, S. Fox, D. Maiti, S. Leman, and C. North. The Semantics ofClustering: Analysis of User-Generated Spatializations of Text Documents.In Proceedings of the International Working Conference on AdvancedVisual Interfaces, AVI ’12, pp. 555–562. ACM, New York, NY, USA,2012. doi: 10.1145/2254556.2254660

[31] A. Endert, C. Han, D. Maiti, L. House, S. Leman, and C. North.Observation-Level Interaction with Statistical Models for Visual Analytics.In 2011 IEEE Conference on Visual Analytics Science and Technology(VAST), pp. 121–130, 2011.

[32] A. Endert, W. Ribarsky, C. Turkay, B. W. Wong, I. Nabney, I. D. Blanco,and F. Rossi. The State of the Art in Integrating Machine Learning intoVisual Analytics. In Computer Graphics Forum, vol. 36, pp. 458–486.Wiley Online Library, 2017.

[33] V. Estivill-Castro. Why So Many Clustering Algorithms: A Position Paper.SIGKDD Explor. Newsl., 4(1):65–75, June 2002.

[34] D. H. Fisher. Improving Inference through Conceptual Clustering. InAAAI, vol. 87, pp. 461–465, 1987.

[35] D. H. Fisher. Knowledge Acquisition Via Incremental Conceptual Cluster-ing. Machine learning, 2(2):139–172, 1987.

[36] B. Gillam. Varieties of Grouping and Its Role in Determining SurfaceLayout. In T. Shipley and P. Kellman, eds., From Fragments to Objects:Segmentation and Grouping in Vision, chap. 8, pp. 247–264. ElsevierScience B.V., Amsterdam, The Netherlands, 2001.

[37] J. C. Gower. A General Coefficient of Similarity and Some of Its Properties.Biometrics, pp. 857–871, 1971.

[38] P. Guo, H. Xiao, Z. Wang, and X. Yuan. Interactive Local ClusteringOperations for High-Dimensional Data in Parallel Coordinates. In 2010IEEE Pacific Visualization Symposium (PacificVis), pp. 97–104, 2010.

[39] P. Hamilton and D. J. Wigdor. Conductor: Enabling and UnderstandingCross-Device Interaction. In Proceedings of the 32Nd Annual ACM Con-ference on Human Factors in Computing Systems, CHI ’14, pp. 2773–2782.ACM, New York, NY, USA, 2014. doi: 10.1145/2556288.2557170

[40] E. Hoque and G. Carenini. Interactive Topic Modeling for ExploringAsynchronous Online Conversations: Design and Evaluation of ConVisIT.ACM Trans. Interact. Intell. Syst., 6(1), Feb. 2016. doi: 10.1145/2854158

[41] S. Ingram, T. Munzner, and M. Olano. Glimmer: Multilevel MDS onthe GPU. IEEE Transactions on Visualization and Computer Graphics,15(2):249–261, March 2009. doi: 10.1109/TVCG.2008.85

[42] P. Isenberg, A. Tang, and S. Carpendale. An Exploratory Study of VisualInformation Analysis. In Proceedings of the SIGCHI Conference onHuman Factors in Computing Systems, pp. 1217–1226. ACM, 2008.

[43] Y. Kang, C. Gorg, and J. Stasko. How Can Visual Analytics AssistInvestigative Analysis? Design Implications from an Evaluation. IEEETransactions on Visualization and Computer Graphics, 17(5):570–583,2011.

[44] F. C. Keil. Concepts, Kinds, and Cognitive Development. mit Press, 1992.[45] H. Kim, J. Choo, H. Park, and A. Endert. InterAxis: Steering Scatterplot

Axes Via Observation-Level Interaction. IEEE Transactions on Visualiza-tion and Computer Graphics, 22(1):131–140, 2016.

[46] G. Klein, B. Moon, and R. R. Hoffman. Making Sense of Sensemaking 2:A Macrocognitive Model. IEEE Intelligent systems, 21(5):88–92, 2006.

[47] I. Kopanas, N. M. Avouris, and S. Daskalaki. The Role of DomainKnowledge in a Large Scale Data Mining Project. In Hellenic Conferenceon Artificial Intelligence, pp. 288–299. Springer, 2002.

[48] A. F. Kramer and A. Jacobson. Perceptual Organization and Focused Atten-tion: The Role of Objects and Proximity in Visual Processing. Perception& psychophysics, 50(3):267–284, 1991.

[49] H.-P. Kriegel, P. Kroger, and A. Zimek. Clustering High-DimensionalData: A Survey on Subspace Clustering, Pattern-Based Clustering, andCorrelation Clustering. ACM Trans. Knowl. Discov. Data, 3(1), Mar. 2009.doi: 10.1145/1497577.1497578

[50] B. C. Kwon, H. Kim, E. Wall, J. Choo, H. Park, and A. Endert.AxiSketcher: Interactive Nonlinear Axis Mapping of Visualizationsthrough User Drawings. IEEE Transactions on Visualization and Com-puter Graphics, 23(1):221–230, 2017.

Page 11: An Examination of Grouping and Spatial Organization Tasks ...

[51] C. H. Lampert, H. Nickisch, S. Harmeling, and J. Weidmann. Animalswith Attributes: A Dataset for Attribute Based Classification, 2009.

[52] H. Lee, J. Kihm, J. Choo, J. Stasko, and H. Park. iVisClustering: AnInteractive Visual Document Clustering via Topic Modeling. ComputerGraphics Forum, 31(3pt3):1155–1164, 2012. doi: 10.1111/j.1467-8659.2012.03108.x

[53] J. Lewis, M. Ackerman, and V. de Sa. Human Cluster Evaluation andFormal Quality Measures: A Comparative Study. In Proceedings of theAnnual Meeting of the Cognitive Science Society, vol. 34, 2012.

[54] J. Lewis, L. Van der Maaten, and V. de Sa. A Behavioral Investigation ofDimensionality Reduction. In Proceedings of the Annual Meeting of theCognitive Science Society, vol. 34, 2012.

[55] L. v. d. Maaten and G. Hinton. Visualizing Data using t-SNE. J. Mach.Learn. Res., 9:2579–2605, Sept. 2008.

[56] J. MacInnes, S. Santosa, and W. Wright. Visual Classification: ExpertKnowledge Guides Machine Learning. IEEE Computer Graphics andApplications, 30(1):8–14, 2010.

[57] R. Mander, G. Salomon, and Y. Y. Wong. A Pile Metaphor for SupportingCasual Organization of Information. In Proceedings of the SIGCHI Con-ference on Human Factors in Computing Systems, CHI ’92, pp. 627–634.ACM, New York, NY, USA, 1992. doi: 10.1145/142750.143055

[58] E. Muller, S. Gunnemann, I. Assent, and T. Seidl. Evaluating Clusteringin Subspace Projections of High-Dimensional Data. Proc. VLDB Endow.,2(1):12701281, Aug. 2009. doi: 10.14778/1687627.1687770

[59] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On Spectral Clustering: Analysisand an Algorithm. In NIPS, vol. 14, pp. 849–856, 2001.

[60] J. Nielsen, T. Clemmensen, and C. Yssing. What Goes on in People’sHeads?: Reflections on the Think-Aloud Technique. In Proceedings of theSecond Nordic Conference on Human-computer Interaction, NordiCHI’02, pp. 101–110. ACM, New York, NY, USA, 2002.

[61] L. G. Nonato and M. Aupetit. Multidimensional Projection for VisualAnalytics: Linking Techniques with Distortions, Tasks, and Layout En-richment. IEEE Transactions on Visualization and Computer Graphics,25(8):2650–2673, 2019.

[62] A. V. Pandey, J. Krause, C. Felix, J. Boy, and E. Bertini. Towards Un-derstanding Human Similarity Perception in the Analysis of Large Setsof Scatter Plots. In Proceedings of the 2016 CHI Conference on HumanFactors in Computing Systems, pp. 3659–3669, 2016.

[63] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimen-sional data: A review. SIGKDD Explor. Newsl., 6(1):90–105, June 2004.doi: 10.1145/1007730.1007731

[64] F. V. Paulovich, D. M. Eler, J. Poco, C. P. Botha, R. Minghim, and L. G.Nonato. Piece wise laplacian-based projection for interactive data ex-ploration and organization. In Computer Graphics Forum, vol. 30, pp.1091–1100. Wiley Online Library, 3 2011.

[65] E. M. Peck, S. E. Ayuso, and O. El-Etr. Data is Personal: Attitudes andPerceptions of Data Visualization in Rural Pennsylvania. In Proceedingsof the 2019 CHI Conference on Human Factors in Computing Systems,CHI 19. Association for Computing Machinery, New York, NY, USA,2019. doi: 10.1145/3290605.3300474

[66] P. Pirolli and S. Card. Information foraging. Psychological Review,106(4):643, 1999.

[67] P. Pirolli and S. Card. The Sensemaking Process and Leverage Points forAnalyst Technology Identified through Cognitive Task Analysis. Proceed-ings of International Conference on Intelligence Analysis, 5:2–4, 2005.

[68] G. Robertson, M. Czerwinski, K. Larson, D. C. Robbins, D. Thiel, andM. van Dantzich. Data Mountain: Using Spatial Memory for DocumentManagement. In Proceedings of the 11th Annual ACM Symposium onUser Interface Software and Technology, UIST ’98, pp. 153–162. ACM,New York, NY, USA, 1998. doi: 10.1145/288392.288596

[69] A. C. Robinson. Collaborative Synthesis of Visual Analytic Results. In2008 IEEE Symposium on Visual Analytics Science and Technology, pp.67–74, Oct 2008. doi: 10.1109/VAST.2008.4677358

[70] Y. Rogers and J. Ellis. Distributed Cognition: Alternative Framework forAnalysing and Explaining Collaborative Working. Journal of InformationTechnology, 9(2):119–128, Jun 1994. doi: 10.1057/jit.1994.12

[71] D. E. Rose, R. Mander, T. Oren, D. B. Ponceleon, G. Salomon, and Y. Y.Wong. Content Awareness in a File System Interface: Implementing the“Pile” Metaphor for Organizing Information. In Proceedings of the 16thAnnual International ACM SIGIR Conference on Research and Develop-ment in Information Retrieval, SIGIR ’93, pp. 260–269. ACM, New York,NY, USA, 1993. doi: 10.1145/160688.160735

[72] D. M. Russell, M. J. Stefik, P. Pirolli, and S. K. Card. The Cost Structure

of Sensemaking. In Proceedings of the INTERACT ’93 and CHI ’93Conference on Human Factors in Computing Systems, CHI ’93, pp. 269–276. ACM, New York, NY, USA, 1993. doi: 10.1145/169059.169209

[73] B. Saket, P. Simonetto, S. Kobourov, and K. Brner. Node, Node-Link,and Node-Link-Group Diagrams: An Evaluation. IEEE Transactions onVisualization and Computer Graphics, 20(12):2231–2240, Dec 2014. doi:10.1109/TVCG.2014.2346422

[74] M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A Taxonomy of VisualCluster Separation Factors. Computer Graphics Forum, 31(3pt4):1335–1344, 2012. doi: 10.1111/j.1467-8659.2012.03125.x

[75] J. Z. Self, M. Dowling, J. Wenskovitch, I. Crandell, M. Wang, L. House,S. Leman, and C. North. Observation-Level and Parametric Interactionfor High-Dimensional Data Analysis. ACM Transactions on InteractiveIntelligent Systems (TiiS), 8(2):15:1–15:36, June 2018. doi: 10.1145/3158230

[76] A. Sellen and R. Harper. Paper as an Analytic Resource for the Design ofNew Technologies. In CHI, vol. 97, pp. 319–326. Citeseer, 1997.

[77] F. M. Shipman and C. C. Marshall. Formality Considered Harmful: Expe-riences, Emerging Themes, and Directions on the Use of Formal Repre-sentations in Interactive Systems. Computer Supported Cooperative Work(CSCW), 8(4):333–352, 1999. doi: 10.1023/A:1008716330212

[78] A. Singhal et al. Modern Information Retrieval: A Brief Overview. IEEEData Eng. Bull., 24(4):35–43, 2001.

[79] O. Sourina and D. Liu. Visual Interactive Clustering and Querying ofSpatio-Temporal Data. In O. Gervasi, M. L. Gavrilova, V. Kumar, A. La-gana, H. P. Lee, Y. Mun, D. Taniar, and C. J. K. Tan, eds., ComputationalScience and Its Applications – ICCSA 2005, pp. 968–977. Springer BerlinHeidelberg, Berlin, Heidelberg, 2005.

[80] A. Tatu, F. Maa, I. Frber, E. Bertini, T. Schreck, T. Seidl, and D. Keim.Subspace Search and Visualization to Make Sense of Alternative Clus-terings in High-Dimensional Data. In 2012 IEEE Conference on VisualAnalytics Science and Technology (VAST), pp. 63–72, 2012.

[81] W. S. Torgerson. Theory and Methods of Scaling. Wiley, Oxford, England,1958.

[82] J. Wenskovitch, L. Bradel, M. Dowling, L. House, and C. North. TheEffect of Semantic Interaction on Foraging in Text Analysis. In 2018IEEE Conference on Visual Analytics Science and Technology (VAST), pp.13–24, 2018.

[83] J. Wenskovitch, I. Crandell, N. Ramakrishnan, L. House, S. Leman, andC. North. Towards a Systematic Combination of Dimension Reductionand Clustering in Visual Analytics. IEEE Transactions on Visualizationand Computer Graphics, 24(1):131–141, Jan 2018. doi: 10.1109/TVCG.2017.2745258

[84] J. Wenskovitch and C. North. Observation-Level Interaction with Clus-tering and Dimension Reduction Algorithms. In Proceedings of the 2ndWorkshop on Human-In-the-Loop Data Analytics, HILDA’17, pp. 14:1–14:6. ACM, New York, NY, USA, 2017. doi: 10.1145/3077257.3077259

[85] J. Wenskovitch and C. North. Pollux: Interactive Cluster-First Projectionsof High-Dimensional Data. In 2019 IEEE Visualization in Data Science(VDS), pp. 38–47, Oct 2019. doi: 10.1109/VDS48975.2019.8973381

[86] J. Wenskovitch and C. North. Interactive Artificial Intelligence: Designingfor the ‘Two Black Boxes’ Problem. Computer, 53(8):29–39, 2020. doi:10.1109/MC.2020.2996416

[87] S. Whittaker and J. Hirschberg. The Character, Value, and Management ofPersonal Paper Archives. ACM Trans. Comput.-Hum. Interact., 8(2):150–170, June 2001. doi: 10.1145/376929.376932

[88] M. Wilson. Six Views of Embodied Cognition. Psychonomic bulletin &review, 9(4):625–636, 2002.

[89] J. S. Yi, R. Melton, J. Stasko, and J. A. Jacko. Dust & Magnet: Multi-variate Information Visualization using a Magnet Metaphor. Informationvisualization, 4(4):239–256, 2005.

[90] K. A. Young. Direct from the Source: The Value of ’Think Aloud’ Datain Understanding Learning. The Journal of Educational Enquiry, 2005.

[91] L. A. Zadeh. Fuzzy Sets. Information and control, 8(3):338–353, 1965.[92] L. A. Zadeh. A Note on Prototype Theory and Fuzzy Sets. In Fuzzy Sets,

Fuzzy Logic, And Fuzzy Systems: Selected Papers by Lotfi A Zadeh, pp.587–593. World Scientific, 1996.

[93] R. S. Zemel, M. Behrmann, M. C. Mozer, and D. Bavelier. Experience-Dependent Perceptual Grouping and Object-Based Attention. Experimen-tal Psychology: Human Perception and Performance, 28(1):202, 2002.

[94] H. Zha, X. He, C. Ding, M. Gu, and H. D. Simon. Spectral Relaxationfor K-means Clustering. In Advances in Neural Information ProcessingSystems, pp. 1057–1064, 2002.