
3D Visualization of High-Dimensional Discrete Data

SeungJin Lim
Integrated Science and Technology, Marshall University, Huntington, WV, U.S.A.

Abstract: Effective visualization of high-dimensional data remains in demand in scientific data analysis. In this paper, we present interactive 3D hyperedges as a 3D metaphor for visualizing high-dimensional discrete datasets, with an emphasis on enhancing the separation of data instances in a dataset by means of statistical significance. The effectiveness of the proposed visual metaphor is demonstrated by examples throughout the paper.

Keywords: 3D visualization, high-dimensional data, 3D metaphor, frequent itemset mining

1. Introduction

High-dimensional data is prevalent in scientific data analysis. Examples include spatial co-located feature analysis in spatial databases [8], biomedical analysis of cellular proteins for early cancer detection using mass spectroscopy [4], [12], [5], and social network analysis on the Web [9], [3], to name a few. Visualization techniques and tools for such high-dimensional data are a necessity rather than a convenience in modern data analysis because of the high complexity and volume of the data.

Visualization of high-dimensional data can be achieved by a series of two-dimensional visualizations, one for every distinct pair of dimensions. This technique has been widely used because it is easy to understand and implement. An example is found in Weka [1], as shown in Figure 1, in which 25 (= 5 × 5) 2D scatter plots are created from the UCI Iris dataset. This dataset has four independent variables (petalwidth, petallength, sepallength and sepalwidth) and three class labels (Iris-setosa, Iris-versicolor and Iris-virginica), and is frequently used to benchmark classification algorithms. In the figure, it is easy to see that, even without running a classification algorithm, the petalwidth and petallength attributes together are capable of separating one class from another, whereas sepallength and sepalwidth do not separate the classes as well. Scatter plots, however, do not scale well to high-dimensional data because they require up to n^2 plots for an n-dimensional dataset.

Two alternative, more sophisticated visualization techniques are notable for their capability of handling high-dimensional data: Parallel coordinates [10], [13] and Circle segmentation techniques [6]. In Parallel coordinates, each dimension is represented as an axis, and all axes are drawn parallel to each other. The values of each dimension are drawn on the respective axis as points. Hence, the universe of the n-dimensional data is represented as n parallel axes. A data instance in the dataset is an n-dimensional vector (p1, p2, . . . , pn) forming a polyline with pi as a vertex on the i-th axis. As illustrated in Figure 2, Parallel coordinates generalize well to high-dimensional datasets by adding more axes. It is also clear in the figure that the first three axes do not separate the data well.

Fig. 1: Scatter plots for the UCI Iris dataset in Weka. The visual feedback from the plots indicates that petalwidth and petallength are good at separating the data into classes.

Fig. 2: Parallel coordinates for a 7-dimensional dataset. Image source: Wikipedia

The primary concern with Parallel coordinates, however, is the potential misinterpretation of a dataset D when D is rendered with different axis orderings. This is because the technique typically divides the screen left to right, with vertical axes evenly spaced horizontally. Hence, the technique unintentionally creates a distinct notion of left (or first) and right (or last) axes, as in traditional histograms and scatter plots. In most scientific domains, however, the axes in fact form a named tuple, i.e., an unordered data entity, like a set of attributes in relational databases. Putting unordered data into a traditionally ordered visualization framework may cause biased visual feedback. (We call this positional or orientational visual bias without a formal definition.) To mitigate this inherent bias, Parallel coordinates support manual reordering, rotation and scaling of the axes.

Fig. 3: Circle segmentation technique [6] for high-dimensional data

In the circle segmentation techniques [6], axes originate from the midpoint of the screen as radial lines, as illustrated in Figure 3. The figure shows an example of circle segmentation for an 8-dimensional dataset. Each data instance is represented as a concentric polygon whose vertices lie on the axes at distances corresponding to the numeric magnitudes of the data values. These techniques are less prone to biased visual feedback since the divided areas are circular and have no distinct notion of first and last axes.

These two techniques are mainly adopted in 2D visualization [7], [14], [11]. 3D visualization of high-dimensional data is still in its infancy. A major difficulty of 2D visualization techniques such as Parallel coordinates and Circle segmentation is that visual cues can quickly become cluttered as the amount of visual information increases. It is generally accepted that 3D can provide a better visualization platform than 2D, mainly because 3D is more natural to human cognitive processing for high-dimensional data. In addition, visual cues in 3D are far less cluttered than in 2D for the same amount of visual information.

We present a 3D visualization metaphor for high-dimensional discrete data in this paper. To minimize bias in visual feedback, the universe of the n-dimensional dataset is mapped onto a circle of n dimensions, represented by 3D objects, rather than onto rectangular parallel axes. Each data instance is modeled as a k-ary tuple, k ≤ n, where k is the number of values present in the instance, which is in turn modeled as a k-ary hyperedge. We also present a prototype of the proposed 3D metaphor. The key highlights of our work include:

1) Each k-ary tuple is modeled as a 3D k-hyperedge whose vertical height along the y-axis represents the statistical significance of the data instance within the dataset, while the constituent dimensions (i.e., axes) are arranged on the circumference of a circle as 3D objects on the x-z plane. This layout is designed not only to minimize positional or orientational visual bias but also to maximize the amount of visual information displayed on the screen.

2) Each hyperedge is equipped with multiple visual cues, i.e., color, brightness, thickness and height, that provide effective visual feedback on the statistical significance of each data instance.

3) The 3D metaphor is general enough to visualize any undirected, quantifiable many-to-many relationships among data instances.

4) The visual model is highly interactive in our implementation, allowing effective user-driven exploration of the embedded information. The user can start with a global summary of the input dataset, drill down to detailed local information by zooming in, and obtain different global synopses of the dataset by rotating and navigating through the underlying relationship network. All the 3D visual objects are clickable for this purpose.

The rest of this paper is organized as follows: the proposed 3D visual metaphor is presented in Sec. 2 with examples. A brief summary of our experimentation with the proposed metaphor is presented in Sec. 3. Finally, concluding remarks are presented in Sec. 4.

2. The 3D Visual Metaphor

The purpose of the proposed 3D visual metaphor is to enable the user to visually and interactively explore a given high-dimensional discrete dataset in search of patterns of high statistical significance. In the end, the user would appreciate the old axiom "a picture is worth a thousand words." We begin with a discussion of the visual representation of a high-dimensional space in the proposed metaphor.

2.1 High-Dimensional Space

Since we focus on the visualization of discrete datasets, a nominal representation of dimensions is sufficient. We do not need an ordinal representation because there is no implicit ordering among the dimensions. Should the dataset represent continuous data, an interval or ratio representation would be necessary. For this reason, a dimension is represented using a point geometry, such as a sphere with a minimal radius, rather than a line or an area.

Fig. 4: Objects of uniform shape and size arranged along the circumference of the circle, viewed from the +y-axis.

Fig. 5: Objects of different radii signifying different statistical importance.

For the layout of the 3D objects representing dimensions, the circular layout (as presented in the circle segmentation techniques) is preferred over rectangular layouts (as seen in Parallel coordinates) for two main reasons: first, to minimize the positional or orientational visual bias discussed earlier, and second, to maximize the information density for a given number of pixels on the screen. For the second point, consider a circle of radius r and a square of side 2r. The number of dimensions that can be represented on the square is proportional to the side length, i.e., 2r, whereas that of the circle is proportional to its circumference, 2πr. That is, a circle of radius r has π times the visual capacity of a square of side length 2r.

Fig. 6: Objects of different shapes. 25 presidential candidates are represented by red boxes and the 50 states by gray spheres.

Fig. 7: 3-ary hyperedges as simple triangles
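The circular arrangement of dimensions and the capacity comparison can be illustrated with a minimal Python sketch. It is not taken from the prototype: the function names `circular_layout` and `capacity_ratio`, the even angular spacing, and the choice of y = 0 for the layout plane are assumptions made for illustration.

```python
import math


def circular_layout(n, radius=1.0):
    """Return n (x, y, z) positions evenly spaced on a circle in the x-z plane.

    Even spacing and y = 0 are illustrative assumptions.
    """
    positions = []
    for i in range(n):
        angle = 2.0 * math.pi * i / n
        positions.append((radius * math.cos(angle), 0.0, radius * math.sin(angle)))
    return positions


def capacity_ratio(radius):
    """Ratio of the circumference (2*pi*r) to the side of the square (2r)."""
    return (2.0 * math.pi * radius) / (2.0 * radius)
```

Whatever the radius, `capacity_ratio` evaluates to π, which is the factor quoted in the text for the circle of radius r against the square of side 2r.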

a) Object shape, color and size: In our model, objects representing dimensions have the same shape and size by default, when there is no important semantic division among them, as shown in Figure 4. At times, however, dimensions may have different meanings in the dataset. A typical example is found in frequent itemset mining or information retrieval, in which items or words may have different statistical significance. Different object sizes can be employed in such a case, as illustrated in Figure 5. Dimensions may also be naturally divided into different semantic groups, in which case different object shapes can be used. For example, the objects in Figure 6 represent two distinct groups: US presidential candidates and states. The candidates are represented by red cubes and the states by gray spheres to reinforce visual feedback.

Fig. 8: A 3-tuple, with its constituent dimensions in the background, represented as a hyperedge of degree 3 whose red color signifies its high statistical significance.

2.2 High-Dimensional Data as Hyperedges

Our main focus in this work is to develop an effective 3D metaphor for n-dimensional data instances, where n is a reasonably large integer. We investigated several geometric shapes as options. One naive option would be a simple 3D polygon that has k edges for a k-ary tuple, as illustrated in Figure 7. The advantage of this option is its ease of implementation. However, it has a significant drawback: when there is a data instance d whose arity is close to n, the hyperedge of d may block other hyperedges of lower arity from the user's view.

A remedy to this problem is to use spherical edges for the polygon. One such example for a 3-ary hyperedge can be found in hyperbolic geometry, namely the hyperbolic triangle. The characteristics of the hyperbolic triangle are well studied in 2D space. However, extending the theory of the hyperbolic triangle to a higher-dimensional geometric shape in 3D space for the purpose of implementation requires substantially more effort.

An alternative approach, employed in our work, is to use Bezier curves for the edge shape, as discussed below.

b) Geometry of hyperedges: A 3D hyperedge has two parts in our metaphor: legs and a cover. A k-ary hyperedge has k legs plus a 3D surface covering the midpoint of the legs to augment the 3D look of the hyperedge. In designing and implementing such hyperedges, we aim to maximize the separation of hyperedges from each other for effective data analysis. We begin with the construction of legs.

Fig. 9: 3-tuples with peer 3-tuples of different statistical significance represented in different colors and heights

Each k-ary tuple is first mapped to a k-ary hyperedge of k legs, each of which is half of a quadratic Bezier curve standing vertically. A quadratic Bezier curve is defined by three control points: given control points p0, p1 and p2, the curve is the function of t ∈ [0, 1]

$$B(t) = (1 - t)^2 \, p_0 + 2t(1 - t) \, p_1 + t^2 \, p_2$$

where t is the curve parameter, running from p0 at t = 0 to p2 at t = 1.

Let I be a k-tuple whose constituent k dimensions i1, . . ., ik are located on the circumference of the circle, and let pm be the middle point of the k objects representing the k dimensions on the x-z plane. Then we can construct the hyperedge for the k-tuple as follows. First, take i1 as p0 of the corresponding quadratic Bezier curve. Compute the location of p2 such that p2 is the opposite point of p0 about pm. Next, set the y-axis value of pm to the statistical significance value of I. Now create a quadratic Bezier curve using the three control points p0, pm and p2, and take the half of the curve on the p0 side as the component curve for i1. Repeat these steps to construct the component half Bezier curves for the other dimensions. The resulting k half Bezier curves are joined at the peak point to yield a parabolic-shaped hyperedge whose height is automatically scaled by the statistical significance of the corresponding k-tuple. The hyperedge is further decorated with a polygon cover over the peak point for visual clarity.

Fig. 10: 3-tuples with peers at a different view angle
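The leg construction described above can be sketched in Python as follows. This is an illustrative reading, not the prototype's code: the function names, the `samples` parameter, and the use of the centroid for pm are hypothetical, and "opposite point of p0 about pm" is interpreted as the point reflection of p0 through pm on the x-z plane, which is the reading under which all k half curves meet at one common peak.

```python
def centroid_xz(points):
    """Middle point of the dimension objects on the x-z plane (y = 0)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, 0.0, sum(p[2] for p in points) / n)


def quadratic_bezier(p0, p1, p2, t):
    """Evaluate B(t) = (1-t)^2 p0 + 2t(1-t) p1 + t^2 p2 component-wise."""
    a, b, c = (1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2
    return tuple(a * u + b * v + c * w for u, v, w in zip(p0, p1, p2))


def hyperedge_legs(dimension_points, significance, samples=8):
    """Return one sampled half Bezier curve (a leg) per constituent dimension.

    Assumption: p2 is the reflection of p0 through pm on the x-z plane.
    """
    cx, _, cz = centroid_xz(dimension_points)
    pm = (cx, significance, cz)  # middle control point raised to the significance
    legs = []
    for p0 in dimension_points:
        p2 = (2.0 * cx - p0[0], 0.0, 2.0 * cz - p0[2])
        # Keep only the half of the curve on the p0 side: t in [0, 0.5].
        leg = [quadratic_bezier(p0, pm, p2, 0.5 * i / samples)
               for i in range(samples + 1)]
        legs.append(leg)
    return legs
```

Under this reading every leg ends at B(0.5), whose x and z coordinates equal those of pm and whose height is half the significance value, so the legs join at a single peak whose height grows linearly with the significance of the tuple.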

Figures 8 through 13 show examples of 3-tuples, 6-tuples and 10-tuples. Hyperedges for 3-tuples are constructed from three half Bezier curves with a cover at the joining point of each hyperedge, as shown in Figures 8, 9 and 10. Figures 11, 12 and 13 illustrate that tuples of higher arity can be constructed seamlessly using the same algorithm. Notice that each edge of the polygon cover of a hyperedge is also a Bezier curve.

The length of a hyperedge cover, measured from the peak along a leg, is about 80% of the leg length. We found that this percentage yields good visual feedback without the cover dominating or hiding other hyperedges. Once the location of each vertex of the hyperedge cover is determined, every pair of adjacent vertices serves as quadratic Bezier control points to generate the final hyperedge cover.

c) Visual cues: Visualization of tuples of arbitrary arity is challenging mainly because tuples can be very long and the number of tuples embedded in a dense database can be very large. Also, users may respond differently to the visual stimuli coming from a large number of tuples. For example, one user may be sensitive to variations in color but not in length, while another may be sensitive to length but not color. For these reasons, in designing a 3D metaphor for tuples we need to use multiple visual cues to reinforce the meaning of each tuple in context and to minimize any user-level distortion of the visual information. Visualization of tuples using multiple visual cues is generally more effective than using a single visual cue, as long as the cues are used consistently. In the proposed 3D metaphor for k-tuples, multiple visual cues are used to maximize the discriminative power of the metaphor: shape, color, illumination, height, and thickness. These visual cues are demonstrated in the figures throughout this paper.

Fig. 11: A group of 6-tuples in dark orange color. Their close proximity indicates a high level of constituent item sharing.

d) Relative location: Since the peak point of each component Bezier curve of a hyperedge is a function of its associated control points (i.e., item objects), hyperedges tend to be spatially cohesive when they share dimensions. For example, the four 6-tuples in Figure 11 form a clump. Their relatively close proximity indicates that the level of dimension sharing among them is high.

e) Color: The hue of an edge e representing a k-tuple is determined by the statistical significance of the tuple using the following formula:

$$\mathrm{hue}(e) = \frac{\mathit{max} - \mathrm{weight}(e)}{\mathit{max}}$$

where max is the maximum significance in the dataset and weight(e) is the statistical significance of the corresponding tuple. With this color scheme, the most significant edges are rendered in red and the least significant ones in blue, with yellow and green edges in between. Figures 9 and 10 show that highly significant 3-tuples are red and tall whereas less significant tuples are blue and short. Note in the figures that there are generally far fewer red, tall edges than blue, short edges in the background. Also note that the 3-tuple in Figure 8 is red while the 10-tuple in Figure 12 is green, because tuples of larger arity generally occur less frequently than those of smaller arity in a database.

Fig. 12: A 10-tuple in green color with three constituent items in background
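The hue formula can be tried out with a short Python sketch based on the standard-library `colorsys` module. The paper does not state how the normalized hue is mapped to a rendered color, so the `blue_hue` scaling below (mapping the value 1 to blue at two thirds of the HSV hue circle) and the function names are assumptions chosen only to reproduce the red-to-blue ordering described in the text.

```python
import colorsys


def edge_hue(weight, max_weight):
    """Normalized hue from the text: (max - weight(e)) / max, in [0, 1]."""
    return (max_weight - weight) / max_weight


def edge_rgb(weight, max_weight, blue_hue=2.0 / 3.0):
    """Map the normalized hue to RGB.

    Assumption: the value 1 is scaled to the HSV hue of blue (2/3), so that
    the most significant edges come out red and the least significant blue.
    """
    h = edge_hue(weight, max_weight) * blue_hue
    return colorsys.hsv_to_rgb(h, 1.0, 1.0)
```

With this scaling, an edge at the maximum significance is pure red, an edge at half the maximum is green, and an edge with zero significance is blue, with yellow falling between red and green.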

f) Illumination: Shading is natural to real 3D objects under illumination. We add a pseudo-shading effect to edges to augment their 3D look by applying varying rather than uniform illumination.

The edge illumination is defined as a function of t ∈ [0, 1]:

$$\mathrm{illumination}(t) = -\frac{(t - 0.5)^2}{0.5^2} + 1$$

where t is the curve parameter along the edge. This formula makes an edge brightest at the vertex point (t = 0.5) and gradually darker toward both endpoints, as illustrated in Figures 8 through 13.
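As a quick check of the illumination formula, the following Python sketch evaluates it and applies it to a color. The `shade` helper, which multiplies an RGB triple by the illumination factor, is a hypothetical illustration of how the factor might be used and is not described in the paper.

```python
def illumination(t):
    """Illumination factor 1 - ((t - 0.5) / 0.5)^2 for t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 1.0 - ((t - 0.5) / 0.5) ** 2


def shade(rgb, t):
    """Scale an RGB triple by the illumination factor (illustrative only)."""
    k = illumination(t)
    return tuple(k * c for c in rgb)
```

The factor is 1 at t = 0.5, 0.75 at t = 0.25 and t = 0.75, and 0 at both endpoints, matching the described falloff from the vertex toward the ends.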

g) Thickness: The thickness of an edge is proportional to the statistical significance of the corresponding tuple.

Fig. 13: The green 10-tuple stands out among other less significant, blue 10-tuples in the background.

2.3 User interaction

As discussed above, the circular layout is preferred for its superior visual capacity and the intuitive visual feedback it offers. However, its effectiveness is ultimately challenged when the number of visual objects becomes large: the objects become cluttered, and feedback from the visualization is compromised. A natural remedy to this problem is to integrate user interaction that lets the user navigate through multiresolution views. We follow the Visual Information-Seeking Mantra in this regard: overview first, zoom and filter, then details on demand, as suggested in [2], through the following user interaction functions:

h) Rotation: The user can change the view direction to any direction in 3D space, which has the effect of rotating the entire grid. In addition, the entire grid can rotate around the x-, y-, or z-axis, incrementally or automatically at the user's discretion, as shown in the figures. Continuous automatic rotation around a particular axis is useful for inspecting a global summary of the entire dataset from a particular perspective.

i) Zoom: The zoom function allows the user to zoom into any visual component, such as nodes, node labels and hyperedges, as well as the entire grid. Examples of zoomed views are shown in Figures 5 and 8 through 13 at various levels of magnification. Note that when the dataset is relatively large, node labels may be too small to read at a low magnification level. By zooming in, we can inspect all the node labels clearly without losing any information.

j) Filter: The user is able to filter nodes and hyperedges by their statistical significance. This function helps the user focus on significant patterns without being distracted by minor details, by filtering out less significant objects.

This filter function provides run-time pruning of tuples. Examples of filtered views are shown in Figures 8, 11, and 12. In the figures, hyperedges are pruned interactively using different minimum statistical significance thresholds, through which prominent 3-, 6- or 10-tuples are easily discovered. A pruning action can also be instantly revoked in our prototype.
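One possible way to support run-time pruning with instant revocation is a stack of thresholds over an unmodified edge set, sketched below in Python. The class `HyperedgeFilter`, its method names, and the rule that the most recent threshold determines the view are all assumptions for illustration; the paper does not describe how the prototype implements this feature.

```python
class HyperedgeFilter:
    """Threshold-based pruning of hyperedges with undo (illustrative sketch)."""

    def __init__(self, edges):
        # edges: mapping from a tuple of dimension names to its significance.
        self._all = dict(edges)
        self._thresholds = []  # stack of applied minimum-significance values

    def visible(self):
        """Edges passing the most recent threshold, or all edges if none is set."""
        if not self._thresholds:
            return dict(self._all)
        cutoff = self._thresholds[-1]
        return {e: s for e, s in self._all.items() if s >= cutoff}

    def prune(self, min_significance):
        """Apply a new minimum significance and return the visible edges."""
        self._thresholds.append(min_significance)
        return self.visible()

    def revoke(self):
        """Undo the most recent pruning action and return the visible edges."""
        if self._thresholds:
            self._thresholds.pop()
        return self.visible()
```

Because the full edge set is never modified, revoking a pruning action only pops a threshold and recomputes the view, which keeps the undo cheap.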

k) Query interface: Our model also provides a query interface. When the user clicks a node n, n's appearance changes to a wireframe shape in a different color, and detailed information about n is displayed at the top of the screen. The information includes the name of n and the support count of n in the given dataset. When the user clicks a hyperedge e, the support count of the corresponding tuple and the names of the constituent dimensions are displayed. Examples are shown in Figures 8 through 13.

3. Experiment

The prototype of the proposed visual model was implemented using the Java 3D API and tested with a number of real-world datasets as well as artificial datasets. Their sizes range up to ≈20,000 3D spheres/cubes and 10^4 hyperedges. The figures presented in this paper are taken from our experiments.

In particular, Figures 8 through 13 are generated from a market basket analysis dataset that comprises 100 dimensions, 1,126 2-ary hyperedges, and numerous k-hyperedges (k > 2). With this amount of visual information, exploration of the dataset for statistically significant patterns on a Pentium-based PC with 2GB of main memory was smooth and pleasant.

4. Conclusion and Future Work

A 3D visual metaphor for high-dimensional datasets is presented in this paper. It addresses the potential positional and orientational visual bias inherent in a rectangular layout of dimensions by arranging the dimensions on a circle. Each k-dimensional data instance is modeled as a k-ary hyperedge. The metaphor is implemented as high-quality visual objects and shows good separation of data instances.

While the effectiveness of our model is demonstrated through a handful of experiments, we caution the reader that efficiency in terms of memory requirements could be a concern in comparison to 2D multivariate visualization techniques, since we render information as high-quality 3D objects.

In the future, the proposed metaphor will be investigated for its applicability to visual data mining.

References

[1] Weka: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/.

[2] J. Abello and J. Korn. MGV: A System for Visualizing Massive Multidigraphs. IEEE TVCG, 8(1):21–38, Jan-Mar 2002.

[3] S. Chakrabarti. Mining the Web, Discovering Knowledge from Hypertext Data. Elsevier Science, 2003.

[4] M. Cohenford, S. Lim, C. Brown, M. Chaudhry, S. Sigdel, and B. Rigas. FT-IR Analysis of Normal and Malignant Mouse Colorectal Tissues; A Study Employing Chemometric Analysis. In Proc. of the Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy (PITTCON 2011), March 2011.

[5] M. A. Cohenford and B. Rigas. Cytologically normal cells from neoplastic cervical samples display extensive structural abnormalities on IR spectroscopy: Implications for tumor biology. Proc. of the National Academy of Sciences of the United States of America (PNAS), 95(26):15327–15332, December 22 1998.

[6] D. A. Keim. Designing Pixel-Oriented Visualization Techniques: Theory and Applications. IEEE TVCG, 6(1):59–78, Jan-Mar 2000.

[7] C. K.-S. Leung, P. P. Irani, and C. L. Carmichael. Wifisviz: Effective visualization of frequent itemsets. Data Mining, IEEE International Conference on, 0:875–880, 2008.

[8] Z. Lin and S. Lim. Optimal Candidate Generation in Spatial Co-location Mining. In Proc. of the 2009 ACM Symposium on Applied Computing, 2009.

[9] B. Liu. Web Data Mining, Exploring Hyperlinks, Contents, and Usage Data. Springer, 2007.

[10] R. Moustafa and E. Wegman. Multivariate continuous data–parallel coordinates. In Graphics of Large Datasets, Statistics and Computing, pages 143–155. Springer, 2006.

[11] M. E. J. Newman. Coauthorship networks and patterns of scientific collaboration. Proc. of the National Academy of Sciences of the United States of America, 101(1):5200–5205, April 6 2004.

[12] P. R. Srinivasa, B. S. Kramerb, and S. Srivastava. Trends in biomarker research for cancer detection. The Lancet Oncology, 2(11):698–704, November 2001.

[13] E. J. Wegman. Hyperdimensional Data Analysis Using Parallel Coordinates. Journal of the American Statistical Association, 85(411):664–675, 1990.

[14] L. Yang. Visual data mining. chapter Visual Exploration of Frequent Itemsets and Association Rules, pages 60–75. Springer-Verlag, Berlin, Heidelberg, 2008.
