
Version 9-Sep-99

Large Data Management for Interactive Visualization Design^1

Michael Cox
MRJ/NASA Ames Research Center

1 Introduction

We first distinguish the problem of big data collections from that of big data objects. Big data collections are aggregates of many data sets. These are not the focus of the current notes, but the issues in big data collections are summarized for completeness. Big data objects are just that – single data objects that are too large to be processed by standard algorithms and software on the hardware one has available. Big data objects may comprise multiple individual files – the collection may be referred to as a data set. However, if all of the files or pieces of a data set are intimately related, and analyzed together (e.g. the individual files of a time series), then we refer to the set of files as a single data object.

There is a growing literature on techniques to handle big data objects (though not all authors have thought of their techniques as being relevant for "big data"). To understand what techniques can be used and when, we discuss some differences among big data applications. We then discuss the varying architectures that have been and might be applied to manage big data objects for analysis and visualization. We then proceed to a taxonomy of the techniques possible to manage big data, and discuss each of these in turn.

1.1 Big data collections

Big data collections are aggregates of many data sets. Typically the data sets are multi-source, often multi-disciplinary, and heterogeneous. Generally the data are distributed among multiple physical sites, and are often multi-database (that is, they are stored in disparate types of data repositories). At any one site, the size of the data may exceed the capacity of fast storage (disk), and so data may be partitioned between tape and disk. Any single data object or data set within the collection may be manageable by itself, but in aggregate the problem is difficult. To accomplish anything useful, the scientist must request information from multiple sources, and each such request may require tape access at the repository. Over many scientists the access patterns may not be predictable, and so it is not always obvious what can be archived on faster storage (disk) and what must be off-loaded to tape. In addition, there are the standard data management problems (but aggravated) of consistency, heterogeneous database access, and of locating relevant data.

^1 The latest version of these course notes can be found at http://science.nas.nasa.gov/~mbc/home.html. The current notes build upon notes from previous SIGGRAPH courses, in particular [10] and [8]. Both can be found at the same URL.


The Earth Observing System (EOS), whose development is overseen by NASA Goddard, is an instructive example of the problem of big collections. The goal of EOS is to provide a long-term repository of environmental measurements (e.g. satellite images at various wavelengths; about 1K parameters are currently included in the requirements) for long-term study of climate and earth's ecosystems. The data are intended to be widely available not only to scientists, but to the general public. Thus, EOS must acquire, archive and disseminate large collections of distributed data, as well as provide an interface to the data. Estimates for the data volume that must be acquired, processed, and made available are from 1 to 3 TBytes/day. These data arrive in the form of individual data objects that vary from about 10 to 100 MBytes (average about 50 MBytes), and are acquired and processed by about 10 Distributed Active Archive Centers (DAACs).

1.2 Big data objects

Big data objects are typically the result of large-scale simulations in such areas as Computational Fluid Dynamics (CFD), Structural Analysis, Weather Modeling, and Astrophysics. These simulations typically produce multi-dimensional data sets. There is generally some 3D grid or mesh on which data values are calculated. Often, data values are calculated at 10 or 30 times as many time steps as are actually written to mass storage.

For example, in CFD curvilinear grids are commonly used. The grids themselves are regular lattices bent to conform to the structures around which flow is calculated (e.g. a wing). Multiple grids may be required to conform to all of the parts of the surface under study (e.g. the wing and fuselage). In other disciplines other grids and meshes are employed. Simulations may be either steady, in which case time is not modeled, or unsteady, in which case there are solutions at multiple time steps. The grids or meshes themselves may move and/or change during the simulation. In CFD, the results are referred to as unsteady grids and are required, for example, to conform to flap movement on the wing of a plane.

Even steady calculations today result in data sets around 1 GByte (e.g. 32 million nodes on a CFD curvilinear grid). It is common to generate hundreds or thousands of time steps as the result of an unsteady simulation, leading quickly to TB-scale data sets.

Typically these huge data sets are not analyzed while the supercomputer generates them (it is generally not cost-effective to use supercomputer cycles for human interaction). Rather, data sets are post-processed. Post-processing may involve user-driven visualization algorithms such as (in CFD) streamlines, streaklines, examination of cutting planes, etc., and also may involve off-line calculations such as vortex-core extractions.

It is clear that data sets of hundreds of GBytes are too large to fit in the main memory of anything but a supercomputer. It is the rare installation that can afford supercomputer time for post-processing, and so these data must be disk-resident during analysis. But hundreds of GBytes is too large for local disk except on the best-resourced server-class workstations. For extremely large single data objects (in the 500 GByte range) the data may not even fit entirely on remote mass storage! Coping with such extreme data sets is the venue of algorithms and architectures for managing big data objects.


2 Important differences among big data applications

To date a number of techniques have been published whose goal is to cope with the problems of big data objects. We abstract the common themes of these techniques and examine them in turn. However, we must first look at some differences between visualization (and data analysis) applications, because some properties of the application determine which techniques may or may not be productive. In general the questions that must be asked of the application are:

- What is the data analysis model?

- Can the data be queried, or must they be browsed?

- Can the data be directly rendered, or must the data be algorithmically traversed?

- Are the data themselves static or dynamic? In particular, do the data comprise static fields or do they comprise dynamic fields calculated on demand by the application?

- Is there an appropriate algorithm for the dimensionality and organization of the data?

- How large is the data set?

- Is there a data analysis time budget?

2.1 Data analysis process – postprocessing vs. steering

There are generally two models or processes for data analysis, which we might call the postprocessing model and the computational steering model.

Postprocessing is the more common model in large-scale computational simulation. A supercomputer or otherwise large and expensive machine simulates some phenomenon, taking advantage of fast floating point and extremely large main memory sizes. In a separate step, the simulation writes data to mass storage. For time-varying simulations, 1/10 or 1/30 of the time steps are actually stored (the other time steps are generally required as intermediate results in the simulation for high fidelity of final results). Then, in a separate phase (today generally on a less expensive machine) the data are analyzed (post-processed).

Alternatively, parts of the research community have pursued techniques that allow the scientist to interact directly with the simulation, and even steer the computation. The reasoning is that the problem of "big data" can be avoided by not generating the data! Historically, scientists have peeked non-intrusively at the time step data as they have been written to disk, and have shut down simulations gone awry. Those who advocate computational steering generally envision more proactive (and intrusive) monitoring and modification of running codes [36]. Historically, scientists have consistently chosen to use supercomputer cycles for scientific computation rather than data exploration at (human) interactive rates. Although it seems unlikely scientists will suddenly change their views, there may yet be techniques discovered that do allow interaction with running simulations without consuming substantially more supercomputer cycles. It does seem very likely that as scientists find personal computers and workstations sufficient for their specific problems, techniques of computational steering will increasingly be used. On a single-user desktop machine it makes much more sense for the scientist to interact directly with the running simulation, and it makes much less sense for the program to save data to disk for later postprocessing.

2.2 Applications that query vs. those that browse

A major difference between applications is in the kind of question asked during post-processing. When the questions and the form of the answers are well known in advance, it may be possible simply to extract the data-dependent answers with minimal user involvement. In fact, user involvement may be simply a query to the data for an answer that matches a specific question and specific parameters. When the data are not that well understood, when the field is not that well-developed, or when algorithms to process queries do not yet exist, browsing or navigation must be employed.

Feature extraction and scientific data mining are two areas of research based on the query paradigm. A simpler example is the extraction of locally maximum pressures; a more intricate example is vortex-core extraction from large CFD data sets. Feature extraction and data mining techniques tend to be more developed for non-research applications (for example in the design of aircraft) than they are in research applications. In general, feature extraction and data mining work better for engineering than they do for science.

It may be impossible to support the query paradigm when the questions or answers are not understood, but it may also be impossible if the algorithms to extract the requisite information do not yet exist. As a simple example, it may not even be obvious which summary statistics in a post-processed data set are of interest. A more interesting example is isosurface extraction. Segmentation algorithms are those that can extract interesting surfaces off-line. In medical work, algorithms to extract the surfaces of the kidney from CT data are known. These surfaces can be represented as triangle meshes stored for later viewing, and the original "big data set" can be set aside. Of course, if the static data extracted (the triangle meshes) are still too large for storage, rendering and display, then those data may be amenable to off-line surface decimation and multiresolution techniques.^2

On the other hand, some isosurfaces are not amenable to off-line segmentation. For example, an infinite family of isosurfaces may be present in the data. If it is not possible to determine in advance which of these are interesting, it is obviously not possible to extract them all off-line. As another example, segmentation algorithms for the heart in CT data are not known, and so it is not possible to extract the heart and store it as a set of surface meshes. In both examples, it is not possible to store the extracted surfaces, and so the original data must be retained and manipulated interactively. The desired isosurfaces must be extracted on-line, interactively at user request. In addition, for both examples, surface decimation and level-of-detail algorithms are uninteresting since the data cannot be stored as surfaces.

In such cases where off-line feature extraction or data mining or segmentation is not possible, a common technology for browsing is exploratory visualization. Most visualization software and systems allow the user to apply very generic visualization techniques in order to understand the underlying data. While browsing is of course useful in engineering as well as in science, browsing is essential in the latter.

^2 There are many decimation and level-of-detail techniques, e.g. multiresolution, for the representation of surface meshes. The current notes touch upon such techniques for 3-dimensional meshes in scientific visualization, but leave the summary of the rich graphics literature on surfaces to other sources.

The paradigm required by the application affects the management of data. Query-based paradigms in general allow significant off-line processing to be done so that answers to user questions can be delivered quickly. This off-line processing is in general possible because the questions and desired answers are known in advance. The requirement for on-line browsing makes off-line processing difficult. It may be unclear what questions to ask, and even when those are known, the algorithms to derive answers from the data may not yet exist.

2.3 Direct rendering of the data vs. algorithmic traversal

For some applications it is possible to render data directly, that is, to produce pixels of visualizations directly from the data. Volume rendering is one example of this. The user (or software) chooses a transfer function that maps data values directly to pixel colors. During volume rendering, the data are reconstructed, resampled, and mapped to pixel colors. Directly rendering 3D scenes (e.g. for architectural walkthroughs or virtual worlds) is another example.

A common technique employed for data that can be rendered directly is to reduce, by one of several means, the total amount of data actually touched or rendered. A common approach is to down-sample the underlying data and to reconstruct and resample based on output resolution and viewer position. For example, polygons of an architectural walkthrough may be aggregated into simpler polygons, and these rendered instead if they are sufficiently far from the viewer – i.e. if only a few pixels of the initial polygons can be seen anyway. This is the standard level-of-detail approach followed in terrain rendering and flight simulators. These view-dependent and output-resolution-dependent techniques are possible because the rate at which the data must be sampled is driven directly by screen pixel size.
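
As a concrete illustration of the view-dependent idea (a minimal sketch only; the function, the pixel thresholds, and the single-object bounding-sphere model are assumptions for illustration, not taken from these notes), a level of detail can be selected from an object's projected size on the screen:

    import math

    def select_lod(distance, object_radius, fov_y, screen_height_px,
                   thresholds=(256, 64, 16)):
        """Pick a level of detail (0 = finest) from an object's projected size.

        distance         -- distance from the viewer to the object
        object_radius    -- bounding-sphere radius of the object
        fov_y            -- vertical field of view, in radians
        screen_height_px -- output resolution in pixels
        thresholds       -- projected heights (pixels) above which successively
                            finer levels are used (assumed values)
        """
        # Approximate height of the object on screen, in pixels.
        frustum_height = 2.0 * distance * math.tan(fov_y / 2.0)
        projected_px = (2.0 * object_radius / frustum_height) * screen_height_px
        for level, t in enumerate(thresholds):
            if projected_px >= t:
                return level
        return len(thresholds)                  # coarsest level

    # Example: a 1-meter object 100 m away, 60-degree FOV, 1024-pixel-high display.
    print(select_lod(100.0, 0.5, math.radians(60.0), 1024))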

However, many visualization techniques employ algorithmic traversal of the underlying data. Frequencies in the data are not directly seen by the viewer – rather, they are interpreted by algorithms. An obvious example is CFD particle tracing. A particle is injected into a field of velocity vectors and is integrated through the field as if it were smoke injected into the real flow. A particle may traverse the data set arbitrarily, and it is absolutely incorrect to restrict the particle's traversal by output screen resolution. If this were done, the particle may very well touch pixels significantly different than those it would otherwise have. A second example is the display of cutting planes through a data set. Consider a field derived by some nonlinear function of the raw data. Even if a large section of the cutting plane resolves to a single pixel color over a few pixels, the derived field must be reconstructed and then resampled before that color can be correctly chosen.
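
The following sketch (a hypothetical field, nearest-node sampling, and forward Euler integration, chosen purely for brevity; production tracers use proper interpolation and higher-order integration) shows why a particle trace is decoupled from the output image: the path depends only on the data it traverses.

    import numpy as np

    def trace_particle(velocity, seed, dt=0.1, n_steps=100):
        """Advect one particle through a regularly sampled 3D velocity field.

        velocity -- array of shape (nx, ny, nz, 3), one vector per grid node
                    (unit grid spacing assumed)
        seed     -- starting position in grid coordinates
        The returned path can wander anywhere in the volume; it is determined
        by the data alone and is independent of any output image resolution.
        """
        path = [np.asarray(seed, dtype=float)]
        nx, ny, nz, _ = velocity.shape
        for _ in range(n_steps):
            p = path[-1]
            i, j, k = (int(round(c)) for c in p)       # nearest-node sampling, for brevity
            if not (0 <= i < nx and 0 <= j < ny and 0 <= k < nz):
                break                                  # the particle left the grid
            path.append(p + dt * velocity[i, j, k])    # forward Euler step
        return path

    # Toy field: uniform flow in +x.
    vel = np.zeros((32, 32, 32, 3))
    vel[..., 0] = 1.0
    print(len(trace_particle(vel, seed=(1.0, 16.0, 16.0))))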

In these examples the results of traversal are user- and graphics-independent. The data are translated into graphics primitives by traversal, and these primitives are unrelated to screen resolution and viewer position. For such visualization techniques, where direct rendering of the data is not possible, it is not obvious that data reduction based on output resolution and camera position is possible.


2.4 Static vs. dynamic fields

We have already noted that most big data objects are the result of computational simulation. Simulations in general tend to write as few parameters as possible to mass storage. Given the choice between writing out parameters that can later be derived and writing out (say) more time steps of an unsteady computation, scientists in general choose to write out more time steps.

We refer to the parameters actually stored as static fields and those that must be computed during post-processing (at browse or feature-extraction time) as dynamic fields or derived fields. Examples of static fields from CFD are density, momentum, and energy (5 fields). During post-processing it is common to derive vorticity, pressure, and more than 50 additional fields.

Derived fields may be linear functions of the raw data, but most derived fields are nonlinear functions of the underlying static fields. Therein lies the rub, as nonlinear derived fields present great difficulty for many of the data reduction techniques that have been proposed by the research community. For example, multiresolution methods generally store integrated (average) values at lower resolutions, and some authors propose that these lower-resolution data sets can be traversed directly for visualization. However, a derived value over the average of a field is not the same as the average of the derived values over the same field! In particular, consider a velocity field and the derived vorticity (which is the curl of velocity). If vorticity is calculated over a lower-resolution field of average values, the result is very different than if vorticity were calculated over the raw underlying field and then averaged to produce a lower-resolution data set.

Evidently the original data must be reconstructed before the nonlinear derived field is calculated! Schemes that do not reconstruct the original data (or that do not otherwise solve this difficult problem) before traversal by the visualization algorithm provide the wrong answers to their users. From another viewpoint, schemes that work only on static fields in CFD (5 fields) work on less than 10% of the fields of interest to the CFD scientist (5/55). From yet another viewpoint, this problem exists regardless of the resolution of the output device, and regardless of the user's camera position during the visualization.
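
The order-of-operations problem is easy to demonstrate numerically. The notes' example is vorticity; the sketch below uses an even simpler nonlinear derived field (kinetic energy per unit mass) over a hypothetical 1D field, but the conclusion is the same: deriving from block-averaged data is not the same as deriving at full resolution and then averaging.

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.normal(size=64)          # fine-resolution stand-in for a stored static field

    def kinetic_energy(v):
        """A nonlinear derived field: kinetic energy per unit mass."""
        return 0.5 * v ** 2

    def downsample(v, factor=4):
        """Block-average to a lower resolution (what a multiresolution level stores)."""
        return v.reshape(-1, factor).mean(axis=1)

    derived_of_average = kinetic_energy(downsample(u))   # derive from the coarse level
    average_of_derived = downsample(kinetic_energy(u))   # derive at full resolution, then average

    print(np.max(np.abs(derived_of_average - average_of_derived)))   # clearly nonzero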

2.5 Dimensionality and organization of data

Aside from the more fundamental differences between application requirements of the previous sections, there are practical differences as well. There are of course natural differences between applications in the dimensionality of the data (1D, 2D, 3D, etc.), but there are also more (perhaps) artificial differences in the stored organization of the data. Algorithms designed for feature extraction or interactive data browsing are in general targeted at specific user problems. Algorithms tend to work with restricted dimensionality of data (e.g. 3D only) and in general tend to work with specific schemes of data storage (e.g. on regular grids only). Some of the different organizations are defined briefly below. Always check the type of data organization for which a particular big data algorithm has been designed to work.

There is a fundamental difference between data sets with implicit storage and addressing and those with explicit storage and addressing. In implicitly addressed data, the relationships between vertices, edges, and faces are implicit in the data structure. Finding a vertex, edge, or face can be done with a deterministic address calculation, usually by calculating an offset in a multi-dimensional array. In explicitly addressed data, the relationships between vertices, edges, etc., must be stored explicitly. Finding a vertex, edge, or face usually requires traversal of the data (in particular, by following pointers either in memory or on disk).
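
A minimal sketch of the two addressing styles (the array shapes, field names, and the tiny tetrahedral mesh are illustrative assumptions):

    import numpy as np

    # Implicit storage and addressing: a regular grid held as a multi-dimensional
    # array. Neighbours are found by index arithmetic alone; no connectivity is stored.
    nx, ny, nz = 64, 64, 64
    density = np.zeros((nx, ny, nz), dtype=np.float32)

    def node_offset(i, j, k):
        """Flat offset of node (i, j, k) in row-major storage."""
        return (i * ny + j) * nz + k

    # Explicit storage and addressing: an irregular (here tetrahedral) mesh must
    # store its connectivity; finding a cell's vertices means following references.
    vertices = np.array([[0.0, 0.0, 0.0],        # vertex coordinates in physical space
                         [1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
    tetrahedra = np.array([[0, 1, 2, 3]])        # each cell lists its vertex indices

    def cell_vertices(cell_id):
        return vertices[tetrahedra[cell_id]]

    print(node_offset(1, 2, 3), cell_vertices(0).shape)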

In CFD in particular, there is a distinction between the address space in which the data are manipulated and the address space in physical (Euclidian) space. The former is generally referred to as computational space and is in terms of the storage data structures, for example three indices into a 3-dimensional array of data. The latter is generally referred to as physical space and is in terms of coordinates in Euclidian space, for example three floating point values representing a point in 3-dimensional space.

A regular grid usually denotes a multi-dimensional array of the underlying data, where storage and addressing are implicit. Rectilinear grids in medical imaging are regular grids. Addressing in a regular grid in computational space is typically done with a tuple of three indices into a multi-dimensional array. However, a regular grid in computational space may not be "regular" in Euclidian 3-space. Curvilinear grids in CFD are generally represented as a pair of regular grids: parameter values are stored in a 3-dimensional array, while a node-by-node mapping from computational space to physical space is stored in a separate 3-dimensional array.
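
A curvilinear grid can then be sketched as the pair of regular arrays just described (the array shapes and the choice of five stored parameters are assumptions for illustration):

    import numpy as np

    ni, nj, nk = 8, 8, 8

    # Parameter values, addressed implicitly in computational space by (i, j, k).
    q = np.zeros((ni, nj, nk, 5), dtype=np.float32)    # e.g. density, momentum, energy

    # Node-by-node mapping from computational space to physical (Euclidian) space:
    # a second regular array holding an (x, y, z) coordinate for every node.
    xyz = np.zeros((ni, nj, nk, 3), dtype=np.float32)

    i, j, k = 2, 3, 4
    values_at_node = q[i, j, k]        # look-up in computational space
    position_of_node = xyz[i, j, k]    # where that node sits in physical space
    print(values_at_node.shape, position_of_node.shape)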

Irregular grids store and address the cells of a data set explicitly. The most common irregular grid comprises lists of the vertices and edges of the grid (edges usually specified by reference to an array of the vertices), and the faces of the tetrahedra of the volume if the data are 3-dimensional.

2.6 How large is the data set – quantity becomes quality

In many fields the domain scientist traditionally insists that errors "cannot be" introduced by data analysis (in particular by data traversal and visualization). The computer scientist has historically insisted that data analysis and visualization "must be" interactive. However, as data sets have increased beyond anything remotely manageable with previous techniques, it is interesting to see examples in both communities where such "hard" requirements have softened. On the ASCI program in particular, domain scientists profess that lossy data analysis techniques are of interest. And there are numerous examples (in the literature and in available software) of computer scientists who offer systems for data visualization at less than 5 Hz.

Acceptable error and acceptable non-interactivity are functions of data size. When all data fit in main memory, both scientists agree that error must be low and interactivity high. As the data spill over onto local disk, and then onto a RAID disk farm, the computer scientist tends to accept non-interactivity. As the data spill over onto the RAID disk farm and then onto the mass storage system that is typically only affordable at a government lab, the domain scientist begins to accept error.

We have not yet discussed the techniques that have been applied to reduce the effective size of extremely large data sets, but we can discuss their potential tradeoffs in terms of data reduction efficacy and potential error introduced. An estimate of both, based on reports in the literature and on some guesswork, appears in the following table.

Techniques                   Data reduction potential   Error introduced
Memory hierarchy             2x - 100x                  0
Indices                      2x - 100x                  0
Write-a-check                10x - 50x                  0
Compression                  10x - 100x ??              Arbitrary
Computational steering       Arbitrary ??               0 ??
Feature extraction           Arbitrary                  0 ??
Multi-resolution browsing    Arbitrary                  Arbitrary
View-dependent techniques    Arbitrary                  Arbitrary

The techniques grouped at the top of the table have zero error, but we can only expect them to reduce data size by some constant. For example, combining best-case expectations for the first three techniques, we may get a 250x reduction in data size. ASCI today produces data sets of 10 TBytes. It is clear that these fixed-reduction techniques are not sufficient. This is exactly why domain scientists are willing to consider lossy data reduction strategies.

There are two techniques in the table that are claimed to offer arbitrary data reduction (computational steering and feature extraction). It is unknown in general what error they introduce into the data (such an evaluation must be domain-specific).

The two techniques at the bottom of the table provide potentially arbitrary data reduction, with arbitrary error. These two techniques are active areas of research – presumably because they do offer arbitrary data reduction for projects such as ASCI that require something better than fixed-reduction approaches. One shortcoming of much of the published research on these two techniques, however, is that the error introduced is generally not characterized. In fact, very few tools have been developed to help the researcher characterize error in a new data reduction algorithm.

Borrowing from Pang ([35]), we can define three spaces in which error might be characterized and quantified:

- Image-level

- Data-level

- Feature-level

To characterize and quantify error, we must compare the data-reduced set with the original data set. If the comparison is at the image level, we compare the images that result from visualization. There are several methods we may use to compare images (a small sketch follows the list):

- Simple visual inspection (unfortunately too common in the literature).

- Summary statistics of the images (e.g. RMS).

- Taking and evaluating the difference between the two images.

- Transforming both images and comparing in the transform space (e.g. Fourier or wavelet analysis).
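
For instance, the second and third methods can be combined in a few lines (a hedged sketch; the images here are synthetic stand-ins rather than actual renderings):

    import numpy as np

    def rms_difference(image_a, image_b):
        """Root-mean-square pixel difference between two images of the same shape."""
        diff = image_a.astype(np.float64) - image_b.astype(np.float64)
        return np.sqrt(np.mean(diff ** 2))

    def difference_image(image_a, image_b):
        """Signed difference image, useful for seeing where the error is located."""
        return image_a.astype(np.float64) - image_b.astype(np.float64)

    # Stand-ins: in practice these would be renderings of the same view from the
    # original data and from the data-reduced set.
    rng = np.random.default_rng(1)
    original = rng.random((256, 256))
    reduced = original + 0.01 * rng.standard_normal((256, 256))
    print(rms_difference(original, reduced))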

If the comparison is at the data level, we compare the reduced data with the original data. Again there are several methods that have been reported:


- Summary statistics (again, e.g. RMS).

- Taking and evaluating the differences between the two data sets.

Finally, if the comparison is at the feature level, we compare the features that can be deduced from the data-reduced and the original data. The comparisons may be:

- Domain-specific feature comparisons (e.g. vortex cores from Computational Fluid Dynamics), or

- Domain-independent features (e.g. isosurface comparisons).

While it is difficult to characterize exactly the "information" that has been lost via lossy data reduction, more should be done in this area than has been. Feature-level comparisons are probably better than data-level comparisons. Data-level comparisons are probably better than image-level comparisons. Any quantitative metrics are almost certainly better than simple visual comparisons (especially when those comparisons are not performed by the domain scientist).

2.7 Data analysis time budget

The Department of Energy's Accelerated Strategic Computing Initiative (ASCI) produces, and plans to produce, what are quite probably the largest data sets in the world, and produces and plans to produce more of these than anyone else. Today's "average" computer simulation generates 350 GBytes of data, the large simulations currently generate up to about 10 TBytes, and it is expected that in 2004 the average simulation will generate over 12 TBytes of data [19]. The magnitude of these data results in a qualitative difference in data analysis.

The architecture of ASCI's production system (following Heermann [18]) is shown in Figure 1. Note first about this architecture that there are no details of visualization data flow itself! The ASCI architecture represents a pipelined production environment that encompasses not only visualization but also the analysis required before new simulations are begun, and the supercomputer simulations themselves. An important point that Heermann makes about the ASCI architecture and requirements is that "entire system throughput is as important as the efficiency of a single system component." In particular, techniques may not improve the interactivity of visualization (i.e. for browsing) at the expense of total throughput through the visualization stage! This creates another difference between applications: "interactivity vs. total processing time/cost." For many applications, computer scientists focus on techniques that may be expensive in off-line processing time but that result in better interactive on-line processing. For the ASCI application, this focus on interactivity of visualization is shifted to a focus on high bandwidth and low latency through the visualization stage of the ASCI cycle.

3 System architectures

The abstract data flow from simulation output to visualization image is shown in Figure 2.


Figure 1: The ASCI system architecture. [Diagram: the ASCI cycle through Simulation, Visualization, and Analysis stages.]

Figure 2: Abstract data flow in visualization of big data. [Diagram: simulation (or data acquisition) produces a large data set; on one path, data traversal generates geometry (e.g. triangles) which is rendered to an image; on the other, traversal + transfer function renders the data directly to an image.]


The process begins with data acquisition, in general the output of computational simulation. This generates a very large data set which is written to mass storage. The data may be accessed for visualization in two scenarios. On the left in Figure 2, an algorithm traverses the data. The algorithm generates geometry which is then rendered to an image (or animation) and displayed. This is the standard scenario in which polygonal surfaces are generated for rendering by graphics hardware. On the right in Figure 2, data are accessed and rendered directly. This is the standard scenario for volume rendering, where the data are "displayed directly".

The traversal rate may or may not be coupled to the rendering rate. For example, the data set may be traversed off-line, and geometry generated and stored for later perusal. As another example, traversal may generate an isosurface that the user examines and manipulates for many frames before requesting a new isosurface.

The rendering rate may not be coupled to the image display rate. Several researchers are exploring the idea of generating many images from the data, from different points of view and at different resolutions, and then allowing exploration by image interpolation (or reprojection). This is an application of the now-popular idea in the graphics community of "image-based rendering".

Given these scenarios, the architectural question is "where are the partitions drawn between different machines?" For example, are all steps from data to pixels computed on a single machine? Or is data traversal done on one machine and rendering on another? Or perhaps reduction allows data to be accessed remotely from mass storage, and traversed and rendered on a desktop workstation?

3.1 Supercomputer with the graphics workstation

Perhaps the oldest visualization system architecture is shown in Figure 3. This architecture was employed at NASA Ames Research Center (and most likely at other sites as well) around 1985 or 1986.

The process begins with simulation, which generates very large data sets and writes them directly to mass storage. Simulation is on the supercomputer, as are the big fast disks to which the very large data sets are written. In this architecture, the supercomputer also traverses the data when the scientist wishes to perform post-processing. This traversal generates geometry which is shipped via fast network to a graphics workstation whose job is primarily to provide fast rendering. This architectural model provided initial demand for SGI graphics workstations. Architecturally, this is an expensive solution, using supercomputer time for what has become possible more recently on high-end graphics workstations.

Figure 3: System architecture of supercomputer / workstation. [Diagram: the supercomputer runs the simulation (or data acquisition), holds the big fast disks, and performs traversal using big fast memory; geometry (e.g. triangles) travels over a fast network to the workstation, which renders the image.]

3.2 Supercomputer with the heroic workstation

Until recently the most common high-end visualization architecture combined a supercomputer with a "heroic workstation", as shown in Figure 4.

In this architecture, the user "writes a check" for the largest high-end workstation that can be purchased (or perhaps only afforded). The supercomputer completes its simulation, and writes data to its own mass storage. Typically the data are copied to the high-end workstation's own disks (which can be formidable – sometimes a high-end graphics workstation is a data server in its own right – often the disks are large RAID). The workstation then performs traversal, generation of geometry, and rendering.

Figure 4: System architecture of supercomputer / heroic workstation. [Diagram: the supercomputer runs the simulation (or data acquisition) and writes to its big fast disks; data travel over a fast network to the workstation's own big fast disks; the workstation performs traversal using big fast memory, generates geometry (e.g. triangles), and renders the image.]

This solution requires a high-end graphics workstation that is a "complete" package: it must support large disk and memory configurations, both in capacity and bandwidth, while also supporting graphics rendering as fast as (or faster than) the fastest desktop workstation. The disk and memory capacity and bandwidth this package must support generally match commercial server capabilities. The graphics capabilities have historically exceeded those on the desktop. Heroic workstations in this class have historically been extremely expensive, and the budgets to procure such machines are decreasing. This combines with extremely competitive (and cheap) PC desktop graphics to put pressure on the "complete" package. We shall see over the next several years if this historically "complete" machine can maintain market viability.

However, note that even in this architecture, which relies on brute capacity and bandwidth to solve the big data problem, there is opportunity for data reduction. Between the supercomputer mass storage and the high-end workstation mass storage, data can be compressed. Even within the workstation, if the compressed data could be traversed directly, or if pages of compressed data were read from disk and decompressed into memory, there would be savings at least of disk footprint and bandwidth. Most of the opportunities discussed later in these notes are applicable even to the "write-a-check" architecture.

3.3 Supercomputer, commercial data server, workstation PC

The recent advent of very fast but inexpensive graphics workstation PCs, combined with decreasing budgets, has driven many visualization workers to alternative architectures. The foundation of these new architectures is the workstation PC on the desktop. The difficulty of these new architectures for very large data sets is a decrease in memory and disk capacity and bandwidth. The "complete package" of the "heroic workstation" – fast graphics, big fast disks, big fast memory – is not available in the PC workstation marketplace, and it seems likely that there will never be sufficient market to sustain a "complete package" based on commodity PC components. New architectural solutions to the problems of big data must be found.

Two speculative architectures that take advantage of commodity components are shown in Figures 5 and 6. In Figure 5 the data set is moved from the supercomputer to a commercial server with large capacity and RAID-class bandwidth. Current CPUs used on commercial servers compete with those of high-end graphics workstations, and so both data serving and calculation (traversal) are possible on these servers. In this architecture, the commercial server generates geometry that is sent over a fast network to a desktop workstation with fast graphics.

Figure 5: System architecture #1: supercomputer / commercial server / workstation PC. [Diagram: the supercomputer runs the simulation (or data acquisition) and holds big fast disks; data travel over a fast network to the commercial server's big fast disks; the server performs traversal using big fast memory and sends geometry (e.g. triangles) over a fast network to the workstation PC, which renders the image.]

Fast, big-capacity commercial servers are on the market today, as are workstation PCs with graphics capability that is up to the requirements of scientific visualization. The biggest component risk in this architecture is currently the network between the server and the desktop. Our own shopping experience within the Data Analysis group at NASA Ames has been that the networking bandwidths required between server and workstation PCs are available but are not commodity.

An alternative to the architecture of Figure 5 is shown in Figure 6. This architecture again takes advantage of fast commercial servers and fast desktop graphics workstations, but provides data from the server to the desktop rather than geometry. It also employs any and all data reduction techniques on the data before shipping these across the network to the desktop. This architecture is clearly only as good as the data reduction techniques between server and desktop: data reduction is the topic of the remainder of these notes. Of course, data reduction is not restricted to this architecture or to the path between server disk and workstation memory: it may be employed to reduce the footprint of data storage on any disk, and may be employed as well between the supercomputer and server.

Figure 6: System architecture #2: supercomputer / commercial server / workstation PC. [Diagram: the supercomputer runs the simulation (or data acquisition) and holds big fast disks; data travel over a fast network to the commercial server's big fast disks; the server applies data reduction techniques and ships reduced data over a fast network to the workstation PC, which performs data reconstruction + data traversal, generates geometry (e.g. triangles), and renders the image.]

4 Techniques

Eight techniques for coping with very large data sets can be identified:

- Memory hierarchy. These techniques share the property that they treat very large data sets as a virtual space that need not be memory-resident.

- Indexing. These techniques organize the data so that requisite data can be found and retrieved quickly.

- Write-a-check. While most researchers and practitioners are increasingly constrained by budget, there are applications for which the data are so large and budgets large enough to mitigate the problem of big data by buying the biggest systems available for data analysis.

- Computational steering. These techniques are currently more research than practice. There appear to be different definitions of computational steering in the literature and in workshop discussion, but the general idea is to avoid generation of large data sets by data analysis and "discovery" made during the computation.

- Compression. These techniques attempt to reduce the data to a smaller representation, either with loss (lossy) or without (lossless).

- Multiresolution browsing with drill-down. These techniques apply now-popular methods to represent and manipulate data hierarchically. The idea is that higher levels of the hierarchy retain important information but are smaller and easier to manipulate.

- Feature extraction and data mining. These techniques enable on-line queries by providing off-line processing that extracts relevant features or information from very large data sets. The idea is generally that the results of off-line processing are smaller and easier to manipulate.

- View-dependent techniques. These techniques share the property that they attempt to reduce arbitrarily large data sets to within some constant factor of the number of pixels on the visualization screen.

In the following subsections, each of these techniques is discussed in turn. Examples from the literature are used to demonstrate each technique and to help illuminate and enumerate the alternative approaches that may be employed. However, please note that the specific papers cited are intended to be illustrative of specific features of each technique, and the total collection of papers is not intended as a comprehensive survey of the literature.


4.1 Memory hierarchy

The memory hierarchy is a useful abstraction for developing systems solutions to the problem of big data. At the top of the memory hierarchy is the most expensive but fastest memory (e.g. registers on the CPU). Below this is less expensive but slightly slower memory (e.g. first-level cache), and so on (e.g. second-level cache, main memory, local disk, remote disk, remote tape). Data may be stored and retrieved in blocks. When the blocks are variable-sized, they are referred to as segments. When they are of fixed size they are referred to as pages. Segments may be further organized into pages (paged segments).

The idea is to retrieve only the segments or pages (or pages of segments) that will be needed by analysis or visualization, thus saving the memory that would otherwise be required for the whole data set (demand driven). This approach can also save memory and disk bandwidth. Ideally, a good demand-driven paging or segmentation strategy does not increase the footprint of the data on disk. Not all strategies for segmenting the data are low-overhead in terms of disk footprint. Doubling a 100-GB data set so that it can be analyzed on a low-end workstation is perhaps acceptable for some environments and applications, but it is clearly far superior to allow analysis by a low-end workstation with only a small increase in mass storage requirements. There are some applications for which doubling the data set is simply not feasible for long-term storage. On the other hand, there are also applications for which doubling the data set may be far more acceptable than doubling the pre-processing time (e.g. ASCI). In addition, storage organization and proper selection of the parameters of paging (such as page size) are important details requiring attention in memory hierarchy implementations.
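
A minimal sketch of demand-driven paging follows (the flat-file layout, page size, and LRU eviction policy are assumptions for illustration, not a description of any particular system):

    import collections
    import numpy as np

    class PagedField:
        """Demand-driven access to a large scalar field stored in fixed-size pages
        in a flat binary file. Only the pages actually touched are read, and a
        small LRU cache bounds the memory footprint. The file layout, page size,
        and dtype are assumptions for this sketch, not a prescribed format."""

        def __init__(self, path, page_elems=1 << 16, max_pages=64, dtype=np.float32):
            self.path, self.page_elems, self.max_pages = path, page_elems, max_pages
            self.dtype = np.dtype(dtype)
            self.cache = collections.OrderedDict()      # page_id -> ndarray

        def _page(self, page_id):
            if page_id in self.cache:
                self.cache.move_to_end(page_id)         # mark most recently used
                return self.cache[page_id]
            with open(self.path, "rb") as f:            # read just this one page
                f.seek(page_id * self.page_elems * self.dtype.itemsize)
                data = np.fromfile(f, dtype=self.dtype, count=self.page_elems)
            if len(self.cache) >= self.max_pages:
                self.cache.popitem(last=False)          # evict least recently used
            self.cache[page_id] = data
            return data

        def __getitem__(self, index):
            page_id, offset = divmod(index, self.page_elems)
            return self._page(page_id)[offset]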

Demand-driven strategies may be combined with judicious scheduling so that other work may be done while the data are read from disk (pipelining), and it may be possible to predict accesses so that requests to load data can be issued while previous data are still being processed (prefetching).

Demand-driven paging and segmentation are most efficacious when not all of the data are required (sparse traversal). Sometimes sparse traversal is inherent in the visualization (or analysis) algorithms (e.g. particle tracing in CFD). Sometimes algorithms can be designed explicitly for sparse traversal of the data (e.g. some isosurface algorithms discussed later in these notes). Derived fields are a particular difficulty for paging strategies; most applications/tools simply pre-compute all derived data that will be needed (in particular assuming that the entire data set fits in local memory). Lazy evaluation is a technique that may help manage big derived fields.

Examples of memory hierarchy techniques for big data can be further classified.

Sequential segments for time-varying data Lane (who now goes by the name Kao) employed one segment per time step for unsteady flow visualization [22].

Paging using operating-system facilities Naive reliance on operating-system virtual memory to manage very large data sets is demonstrably bad. The UNIX system call mmap() potentially offers an alternative and has been explored for CFD post-processing by [15] and [9]. Both report that it results in better performance than simple reliance on virtual memory; the latter reports that it is inferior to a user implementation that manages disk I/O and memory explicitly (as was demonstrated for database implementations 20 years ago).
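
For illustration only (the file name, shape, and dtype are assumed, and this is not the implementation used in the cited work), numpy's memmap exposes the same mmap() mechanism and shows the appeal: slices can be taken from a file-backed array and only the touched pages are read.

    import numpy as np

    # Create a small stand-in file; in practice the simulation output already
    # exists and is far too large to read into memory whole.
    nx, ny, nz = 64, 64, 64
    np.arange(nx * ny * nz, dtype=np.float32).tofile("solution.raw")

    # Map the on-disk array into the address space without reading it all; the
    # operating system pages data in only as slices are actually touched.
    field = np.memmap("solution.raw", dtype=np.float32, mode="r", shape=(nx, ny, nz))

    slab = np.array(field[nx // 2])      # faults in only the pages backing this slab
    print(slab.mean())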


Application-controlled paged segments The addition of paging to the segment-management algorithm of [22] was found by [9] to result in good performance that does not significantly degrade on smaller-memory machines. In more recent work at NASA Ames, we have achieved interactive visualization rates by paging segments of unsteady data.

Lazy evaluation of derived fields Globus first explored lazy evaluation of derived fields, and found it an effective mechanism to reduce memory requirements while visualizing derived values [15].

Lazy evaluation of derived fields + caching Moran further quantified the gain and extended this work to include caching of derived fields [29]. From that paper, we note that "caching can improve the performance of a visualization based on a lazy field that is expensive to evaluate, but ... can hinder the performance when evaluation is cheap."
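
A sketch of the idea (the block size, derivation function, and caching policy are illustrative assumptions; see [29] for the actual study):

    import numpy as np

    class LazyDerivedField:
        """Evaluate a derived field on demand, one block at a time, optionally
        caching blocks already derived. Whether caching pays off depends on how
        expensive the derivation is (cf. [29])."""

        def __init__(self, static_field, derive_fn, block=4096, use_cache=True):
            self.static_field, self.derive_fn = static_field, derive_fn
            self.block, self.use_cache = block, use_cache
            self.cache = {}

        def __getitem__(self, i):
            b, offset = divmod(i, self.block)
            if self.use_cache and b in self.cache:
                return self.cache[b][offset]
            lo = b * self.block
            derived = self.derive_fn(self.static_field[lo:lo + self.block])
            if self.use_cache:
                self.cache[b] = derived
            return derived[offset]

    # Hypothetical usage: a derived quantity computed from a stored static field.
    stored = np.linspace(0.0, 1.0, 1 << 20, dtype=np.float32)
    lazy = LazyDerivedField(stored, derive_fn=lambda x: np.sqrt(x) * 2.0)
    print(lazy[12345], lazy[12346])      # the second access hits the cached block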

Sparse traversal The paging, caching, and lazy evaluation techniques all take advantage of the fact that many algorithms in visualization sparsely traverse the data set. A viable research direction in managing big data is clearly to search for algorithms that result specifically in sparse traversal.

Indexed pages One means of achieving sparse traversal is to provide an index into the pages of the data, so that exactly the pages required can be found and retrieved. This approach has been used explicitly by some (cf. [43], [24]) and implicitly by others (cf. [4]).

Paged index If the index itself is too large, it may also be paged (as was done implicitly by [4]).

Paging in data flow architectures A common problem with data flow visualization architectures is that they have exorbitant memory requirements. Song showed that if the granularity of modules is made sufficiently small, data flow architectures can be significantly more memory-efficient [40]. Schroeder has extended this basic approach and developed an architecture and working system for paging in the Visualization Toolkit [37].

4.2 Indexing

Indexing is a technique whereby the data are pre-processed for later accelerated retrieval. Any search structure can serve as an index, but the standard search structures have been octrees, k-d trees, interval trees, and home-brewed data structures in both 3 and 4 dimensions. When applied to the problem of big data, an index may allow sparse traversal of the data, thereby saving memory space and bandwidth and saving disk bandwidth. Some indices significantly increase the storage requirements of the data, and thus are less desirable for the management of big data. Those indices that require little storage are obviously preferable for most environments.

Adapting a classification scheme from Cignoni [6], we can distinguish three general types of indices. Space-based methods partition physical space. Value-based methods partition the parameter values at the nodes (or cells) of the data set. Seed-based methods provide "starting points" for run-time traversal. To date, the only seed-based methods evident in the literature appear to be for isosurface generation.

Most of the work on indices for scientific data has (perhaps unfortunately) been for volume rendering and isosurface generation. Cignoni claims that space-based methods for isosurfaces are better over regular grids because spatial coherence can be exploited, while value-based methods for isosurfaces are better for irregular grids where spatial coherence is difficult to exploit [7]. However, it is clear from the literature that the value-based methods increase the size of the data set by a factor of 2x or more, while the space-based methods increase data set size by perhaps 15%. The seed-based methods appear comparable to the space-based methods as measured by increase in data set size; increases reported range from 1% to 8%.

4.2.1 Coping with large indices

When the size of the index is too large, it may be possible to build the index over clusters of cells (pages or segments) rather than over the individual cells themselves. This approach not only may reduce index size, but also has the advantage that the pages or segments themselves may be compressed for storage on disk. Pages may be reconstructed during traversal, or in some cases visualization algorithms may be applied to compressed data directly (this is discussed in the sections on compression and multiresolution).

Other applications of an index over pages or segments of data include user browsing with later drill-down, and progressive refinement of network transmission or retrieval from disk. Browsing over the index may allow the user to find areas of interest within the data set, with subsequent requests to retrieve and examine the underlying data (drill-down). Browsing may also be enhanced by progressive refinement, whereby index (summary) data are displayed while the user moves through the data set, but the data themselves are retrieved and displayed when the user stops browsing. (Both techniques are discussed further in the section on multiresolution.)

An alternative approach to cope with indices that are too large to be memory-resident is to page the index itself. This may be done explicitly, or implicitly by storing the index directly with the data.

4.2.2 Value and seed indices for isosurface extraction

Isosurfaces have been the goal of most development in value and seed indices. When the isosurfaces are a priori known, it is more efficacious to extract those of interest off-line (e.g. kidney or bone segmentation) and store the triangulated surfaces. However, when the underlying data have continuous isosurfaces with no obvious segmentation, the user must browse the isosurfaces of the data set to find features of interest. The idea behind the value and seed indices is that such browsing can be made more interactive by building an index off-line that can accelerate isosurface generation when new values are requested. Acceleration comes primarily from sparse traversal of the underlying data. In general, most techniques have been developed so that when the user requests an isosurface value incrementally different than the last isosurface value requested, the new isosurface can be generated quickly. That is, most techniques have been developed to exploit coherence between isosurface requests.


The value index algorithms work essentially by building a data structure to maintain two lists and the relationships between them (the maximum and minimum ranges for the underlying cells of the data). The seed index algorithms work by providing the seed cells from which all cells (that contain the isosurface) can be found for any isosurface query.
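
In the simplest possible form (a sketch only; real value indices organize the min/max lists into search structures such as interval trees or span spaces rather than scanning them, and the cell layout here is assumed), a value index looks like this:

    import numpy as np

    def build_value_index(cell_values):
        """Per-cell minimum and maximum of the scalar field: the two 'lists'."""
        return cell_values.min(axis=1), cell_values.max(axis=1)

    def cells_crossing(isovalue, index):
        """Ids of only those cells whose [min, max] range spans the isovalue,
        so that the extraction pass can traverse the data sparsely."""
        cmin, cmax = index
        return np.nonzero((cmin <= isovalue) & (isovalue <= cmax))[0]

    # Hypothetical data: 1000 cells with 8 scalar samples each (e.g. hexahedron corners).
    values = np.random.default_rng(3).random((1000, 8))
    index = build_value_index(values)
    print(len(cells_crossing(0.95, index)), "of", len(values), "cells contain the isovalue")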

Problems with value and seed indices As previously noted, the value index data structures explode data size, while the seed index data structures increase data set size more parsimoniously.

A critical problem with indices in general, and with isosurfaces in particular, is that they provide only sparse traversal on a single parameter (i.e. the range of a single value within each cell). Some data sets store 50 parameters (e.g. typical data sets generated at Lawrence Livermore for the ASCI project) and would require an index for each one.

A second problem with indices in general, and with isosurfaces in particular, is that they do not provide sparse traversal over derived parameters (in any obvious way). In the data sets commonly in use at NASA Ames, only 5 parameters are stored and on a regular basis another 50 are derived at run-time.

4.2.3 Examples of indexing from the literature

Spatial indices – sparse traversal in 3D and 4D Spatial indices work by organizing the points in physical space. Wilhelms applied an octree to data volumes, storing with each internal node the min/max interval spanned by the sub-volume. This allowed efficient pruning during isosurface generation and volume rendering while increasing storage by about 15% [45]. Wilhelms later extended this work to handle time-varying data [46]. Others have since applied octrees and other spatial data structures in similar ways (cf. [27]).
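The following is a minimal sketch of this style of spatial index, assuming a cubic scalar volume with power-of-two side length; it illustrates the min/max pruning idea only and is not a reproduction of any cited implementation.

import numpy as np

class OctreeNode:
    """Octree node storing the min/max of the scalar values beneath it."""

    def __init__(self, volume, origin, size):
        self.origin, self.size = origin, size
        block = volume[origin[0]:origin[0]+size,
                       origin[1]:origin[1]+size,
                       origin[2]:origin[2]+size]
        self.vmin, self.vmax = float(block.min()), float(block.max())
        self.children = []
        if size > 2:                       # stop at 2^3 leaf blocks
            half = size // 2
            for dx in (0, half):
                for dy in (0, half):
                    for dz in (0, half):
                        child_origin = (origin[0]+dx, origin[1]+dy, origin[2]+dz)
                        self.children.append(OctreeNode(volume, child_origin, half))

    def leaves_spanning(self, isovalue):
        """Yield leaf blocks whose value range spans the isovalue; prune the rest."""
        if not (self.vmin <= isovalue <= self.vmax):
            return
        if not self.children:
            yield self.origin, self.size
        else:
            for child in self.children:
                yield from child.leaves_spanning(isovalue)

# Usage: hand the surviving leaf blocks to marching cubes or a volume renderer, e.g.
#   root = OctreeNode(volume, (0, 0, 0), 64)
#   blocks = list(root.leaves_spanning(0.5))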

Value indices – sparse traversal in 3D and 4D Value indices for isosurface generation have a rich literature. Most of the methods work essentially by storing two lists over the cells of the data set (minimum and maximum values) along with references between them. Finding an isosurface is then reduced to the problem of looking up the cells whose range covers the desired value. Most of the work has been in 3D ([14], [13], [26], [39], [6], [4]). Shen appears to be the first to extend value indices to 4D [38]. As previously mentioned, storage for a value index over the entire data set increases storage requirements by at least a factor of 2x. The 4D approach of [38] might be profitably combined with the approach in [4] to page the value index itself. However, work that does so has evidently not appeared in the literature.
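A naive sketch of the two-list idea follows; real systems use interval trees or the span space for better asymptotics, and the isosurface extraction step itself is elided. The cell identifiers and the form of cell_ranges are assumptions of the illustration.

import bisect

class ValueIndex:
    """Look up cells whose [vmin, vmax] interval contains a queried isovalue."""

    def __init__(self, cell_ranges):
        # cell_ranges: iterable of (cell_id, vmin, vmax)
        self.by_min = sorted(cell_ranges, key=lambda c: c[1])
        self.mins = [c[1] for c in self.by_min]

    def cells_containing(self, isovalue):
        cut = bisect.bisect_right(self.mins, isovalue)   # cells with vmin <= isovalue
        return [cid for cid, vmin, vmax in self.by_min[:cut] if vmax >= isovalue]

# index = ValueIndex([(0, 0.1, 0.4), (1, 0.3, 0.9), (2, 0.6, 0.7)])
# index.cells_containing(0.65)   ->  [1, 2]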

Seed indices – sparse traversal in 3D and 4D Seed indices appear to be more promising as data-reducing data structures than value indices. The idea is to store only the subset of the cells from which all other cells may be found (the “seed” cells) [20], [1]. The storage requirements of Itoh are not explicitly noted in the paper, but appear to be in the 10% to 20% range [20]. Bajaj in particular shows storage requirements that increase by only 1% to 8% [1].
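The propagation step behind the seed-index idea can be sketched as a flood fill from the seed cells through face-adjacent cells that also span the isovalue; computing a small seed set off-line is the hard part and is assumed to have been done already. The regular-grid layout and the names below are assumptions of the illustration, not details of [20] or [1].

from collections import deque

def propagate_isosurface_cells(seed_cells, cell_range, dims, isovalue):
    """Yield surface-crossing cells reachable from the seeds, visiting nothing else.

    cell_range maps a cell index (i, j, k) to its (vmin, vmax); dims is the grid size.
    """
    ni, nj, nk = dims
    spans = lambda c: cell_range[c][0] <= isovalue <= cell_range[c][1]
    queue = deque(c for c in seed_cells if spans(c))
    visited = set(queue)
    while queue:
        i, j, k = queue.popleft()
        yield (i, j, k)                      # hand this cell to marching cubes, etc.
        for di, dj, dk in ((1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)):
            n = (i+di, j+dj, k+dk)
            if (0 <= n[0] < ni and 0 <= n[1] < nj and 0 <= n[2] < nk
                    and n not in visited and spans(n)):
                visited.add(n)
                queue.append(n)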

Indexing over pages Ueng built an octree over segments of the data, using values at intermediate nodes for reduced-resolution volume rendering [43]. The system allowed the user to drill down into areas of interest based on the results of the reduced-resolution images. The paper unfortunately does not report the increase in storage requirements for this spatial indexing.

Leutenegger explored the same idea with an R-tree of the underlying segments and claimed that it would give better performance than an octree (but did not provide experimental results on real data sets) [24]. However, the scheme requires about 2.5x the disk space for storage of the tetrahedra in unstructured grids (though it requires only about 12% more storage space for vertices).

Paged index Chiang appears to be the first to build a value index over pages (rather than over the individual cells) [4]. The results appear promising, reducing the index size as a function of the underlying page size. This work also implicitly “paged the page table” (i.e. paged a value index for isosurface generation) by intermixing the value index with the data. Those results also appear promising in terms of the reduction of I/O (i.e. the sparsity of traversal).

4.3 Write-a-check

For some applications, the size of the problem is so large that even when data analysis is done on a cluster of the largest workstation-class machines commercially available, there is still a large impedance mismatch between data size and memory capacity. An instructive example can be found in recent work by Painter et al. at Los Alamos [34]. They used roughly $10M of equipment to demonstrate volume rendering of 1 billion cell volumes at 3 to 5 Hz. The equipment comprised a 16-pipe RealityMonster, 30 fiber-channel disk interfaces, and collections of 72 GB RAID disks (that supported bandwidth of 2.4 GB/s to a single file).

Some comparison of scales is useful to understand the size of the data problem that the ASCI project has undertaken. $10M is to 10 TB as a $10K workstation is to 10 GB. And even if bandwidth scaled linearly (which it does not) this would imply $10K desktop bandwidth of 2.4 MB/s. So, even with $10M of equipment, the ASCI project has at least as large a problem as workers analyzing supercomputer output on a standard workstation.
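Written out, the proportionality argument of the previous paragraph is simply

\[
\frac{\$10\text{M}}{10\ \text{TB}} \;=\; \frac{\$10\text{K}}{10\ \text{GB}} \;=\; \$1\ \text{per MB},
\qquad
2.4\ \text{GB/s} \times \frac{\$10\text{K}}{\$10\text{M}} \;=\; 2.4\ \text{MB/s}.
\]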

4.4 Computational steering

There are a number of techniques that have been labeled “computational steering”. At the most basic level, nonintrusive taps into running code periodically write out time steps for perusal by the scientist. If the simulation has gone awry, the scientist can cancel the job, fix the problem, and restart. A somewhat more sophisticated version of computational steering envisions direct taps into running code for analysis tools to analyze the progress and direction of the simulation. In the most advanced vision, the data analysis tools may be used to modify parameters of the running simulation, and actually change its course.
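A minimal sketch of the most basic form, a nonintrusive tap that periodically writes out time steps, might look as follows; the .npy dump format, the dump interval, and the names are assumptions of the illustration.

import numpy as np

def run_simulation(state, step_fn, n_steps, dump_every=100, prefix="dump"):
    """Advance the simulation, periodically dumping the state for off-line perusal."""
    for step in range(n_steps):
        state = step_fn(state)                       # advance the simulation
        if step % dump_every == 0:                   # nonintrusive periodic tap
            np.save(f"{prefix}_{step:06d}.npy", state)
    return state

# A more intrusive variant would also read a small parameter file at each dump
# interval, letting analysis tools modify the running simulation's course.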

As has already been noted, scientists have historically chosen to apply supercomputer cycles to simulations (rather than to interactive sessions). Infrastructure and funding have also supported this paradigm. It is not clear that usage patterns of supercomputers will change any time soon. However, what was once possible only on a supercomputer has become progressively more feasible on a desktop workstation (and even PC workstation). It is likely that advances in computational steering will be utilized first (and possibly only) by the scientist doing simulations directly on his or her own desktop computer.

4.5 Compression

There are two types of data that may be compressed: the grid or mesh data comprising the physical points or cells themselves, and the solution or parameter data comprising the values at those physical points or cells. Some data sets (in particular rectilinear volume data) may have only solution/parameter data. Others (in particular curvilinear grids, unstructured grids) have both data describing the physical points themselves (and sometimes their relationships – explicit vs. implicit addressing) as well as the solution/parameter data. The straightforward compression schemes (DCT, Fourier) have been applied only to solution/parameter data. The multiresolution schemes (discussed in the next section) have been applied to both grid/mesh and solution/parameter data.

Compression can be lossy or lossless. While there may be applications for lossy compression, many scientists will not even consider tools that visualize lossy data.3 In particular, CFD researchers and design engineers report that lossless compression is a requirement, and it is well-known that lossless compression in medical imaging is absolutely essential. On the other hand, many scientists in such fields as CFD do report that lossy schemes that losslessly preserve features of interest are of interest to them. However, note that faithful derived fields over lossy raw data are very difficult to ensure. While scientists may accept feature-preserving lossy compression in principle, the guarantee must be extended to the derived fields the same scientists commonly use.

Some compression schemes are applied to the entire data set, some are applied to sub-blocks of the data set. The latter are more amenable to progressive transmission and refinement (and therefore browsing and drill-down) and also to paging and other memory-hierarchy schemes. Some work has been done on applying visualization operators (e.g. volume rendering) directly to the compressed data. When this is not possible the data must be reconstructed before traversal/rendering. While reconstruction requires memory bandwidth proportional to the data traversed, it still saves disk capacity and bandwidth.

Compression has been reported as a technique in several contexts:

• Compressed storage with reconstruction before traversal. Two user-paradigms are apparent in previous work: progressive refinement, and browsing with drill-down. With the former, lower-fidelity data are traversed and displayed while the user changes viewpoint within the data, but then higher-fidelity data are traversed and displayed while the user viewpoint remains constant. With the latter, lower-fidelity data are traversed and displayed by the user, who explicitly chooses to select, traverse, and display higher-fidelity data when an interesting “feature” or “region” is detected.

3We note that some scientists working on ASCI report that their big data problems are so severe that the scientists are willing to work with reduced-fidelity data.

• Feature extraction and/or data mining. In such cases, compression is achieved by representing the data by underlying core features. In this case the compression is believed to retain important features from the raw data. We distinguish between feature extraction techniques which are inherently domain-specific (e.g. vortex-core extraction in CFD) and data mining techniques which are inherently general (e.g. maximum divergence over a vector field). Feature extraction and data mining are also inherently multiresolution in nature, with the idea that some features are retained across scales of the data.

• Lossy compression. The proposal in some work is that “high frequencies” in the data are somehow less important than “lower frequencies”, which are believed to carry more of the important information content. Under these circumstances the belief is that the data may be compressed to eliminate the “higher frequencies”, thus reducing data size.

Before discussing the types of scientific data compression that have been explored, we first address error metrics for lossy compression.

Error metrics for lossy compression The biggest gap in the data-reduction literature for scientific data has been in quality error metrics for lossy compression. A common metric is that of “image quality”. This is probably the worst of all possible metrics to use. In general, it is not at all obvious that an image that “looks OK” has the same information content as the image generated from high-fidelity data (i.e. data which retains its “high frequencies”). “Image quality”, if used, must be tied to the underlying information for which the scientist or engineer is searching.

Other metrics such as Signal-to-Noise (SNR) or Root-Mean-Square (RMS) error are indeed better, but still do not demonstrate that fundamental “information” has not been lost in the compression. User studies are one way to address this problem, but these admittedly are extremely time-consuming and also have the problem that they offer only statistical evidence of success.
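For concreteness, the two generic metrics just mentioned might be computed as follows; SNR conventions vary, and this sketch uses signal power over error power, in decibels.

import numpy as np

def rms_error(original, reconstructed):
    """Root-mean-square error between the original and the lossily reconstructed field."""
    diff = np.asarray(original, dtype=float) - np.asarray(reconstructed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def snr_db(original, reconstructed):
    """Signal-to-noise ratio in dB (signal power over mean squared error)."""
    signal_power = float(np.mean(np.asarray(original, dtype=float) ** 2))
    noise_power = rms_error(original, reconstructed) ** 2
    return 10.0 * np.log10(signal_power / noise_power)

As argued above, neither number by itself guarantees that the information a scientist is looking for has survived compression.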

Error-based progressive refinement The work by Laur is an early example of metric-based progressive refinement of volume-rendered scientific data [23]. In this work an octree was built over the underlying volume data, and mean values were employed at intermediate nodes to volume-render reduced-resolution images. Associated with each node was an RMS error of the averages, and so the reduced-resolution renderings were data based, not image based. While the user manipulated the viewpoint, only reduced-resolution renderings within some data error tolerance were calculated. When the user stopped moving the viewpoint the implementation progressively refined the rendering with higher-quality data closer to the leaves of the octree. Wilhelms extended this work to provide more arbitrary integration functions at intermediate nodes, and more arbitrary error functions [46].
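A much-simplified sketch of the data-based refinement criterion (not the actual implementation of [23] or [46]) follows; the node layout, the stopping size, and the names are assumptions of the illustration.

import numpy as np

class MeanErrorNode:
    """Octree node storing the mean of its block and the RMS error of using that mean."""

    def __init__(self, volume, origin, size):
        block = volume[origin[0]:origin[0]+size,
                       origin[1]:origin[1]+size,
                       origin[2]:origin[2]+size]
        self.origin, self.size = origin, size
        self.mean = float(block.mean())
        self.rms = float(np.sqrt(np.mean((block - self.mean) ** 2)))
        self.children = []
        if size > 2:
            half = size // 2
            for dx in (0, half):
                for dy in (0, half):
                    for dz in (0, half):
                        self.children.append(MeanErrorNode(
                            volume, (origin[0]+dx, origin[1]+dy, origin[2]+dz), half))

    def splat_list(self, tolerance):
        """Return (origin, size, mean) blocks to render at this data error tolerance."""
        if self.rms <= tolerance or not self.children:
            return [(self.origin, self.size, self.mean)]
        blocks = []
        for child in self.children:
            blocks.extend(child.splat_list(tolerance))
        return blocks

# Interactive use: render splat_list(loose_tolerance) while the viewpoint moves,
# then progressively re-render with a tighter tolerance once the user stops.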

Lossless compression on full data sets Fowler provides lossless compression of 3D data by applying a technique common in 2D image compression: Differential Pulse-Code Modulation (DPCM). This technique works by predicting each voxel (pixel) from previously seen values and encoding the differences between predicted and actual values with an entropy-coding scheme such as Huffman coding [12]. Fowler reports 2:1 compression without loss.
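A sketch of the DPCM idea for integer volume data follows; the previous-voxel predictor and the use of zlib as a stand-in for the entropy coder are assumptions of the illustration, not details of [12]. The round trip is exact for integer-valued data such as 8- or 16-bit scans.

import zlib
import numpy as np

def dpcm_compress(volume):
    """Store only prediction residuals; residuals cluster near zero and compress well."""
    flat = volume.astype(np.int32).ravel()
    residuals = np.empty_like(flat)
    residuals[0] = flat[0]
    residuals[1:] = flat[1:] - flat[:-1]          # previous-voxel predictor
    return volume.shape, volume.dtype, zlib.compress(residuals.tobytes())

def dpcm_decompress(shape, dtype, blob):
    residuals = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
    flat = np.cumsum(residuals)                   # undo the differencing exactly
    return flat.astype(dtype).reshape(shape)

# vol = np.random.randint(0, 256, size=(64, 64, 64), dtype=np.uint8)
# shape, dtype, blob = dpcm_compress(vol)
# assert np.array_equal(vol, dpcm_decompress(shape, dtype, blob))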

Zhu uses multiresolution analysis to achieve lossless compression [49]. The solution/parameter data are first wavelet-compressed and the lower-resolution representation is used to detect “structures which persist across scales of the wavelet decomposition.” These are then used in building an octree partition of the volume. Each subtree is then wavelet-compressed for transmission. The ordering of transmission is driven by a model of the human visual system – the intent is to transmit coefficients from most visible to least visible.

Lossy compression on data pages The work by Ning is an early example of compressing the underlying pages of data (rather than the whole data set) using vector quantization [33]. Compression was reported as 5:1. However, the resulting images shown are quite striking in their distortion, and no results were reported showing that the underlying information was not also distorted (only “visual” evidence is presented).

Yeo also reports compression on underlying pages of the data, with results between 10:1 and 100:1 [47]. Quality loss is reported by signal-to-noise ratio (SNR). While this metric is preferable to visual metrics, it is still not entirely clear how SNR corresponds to the fidelity of the underlying information for which a scientist is searching.

Operators on compressed data Several papers have appeared on volume rendering directly in frequency space on Fourier-transformed scalar data ([25], [28], [41]). However, these authors were not concerned with data compression per se, and generally have not reported Fourier-transformed data sizes (one does report data size of 2x the original [25]). Chiueh appears to be the first to operate in a transformed domain (Fourier) while also achieving compression [5]. The authors do so by compressing blocks of data (thereby minimizing the global effect of signals across the data), and they render directly from these blocks. The scheme is attractive in that it may be combined with memory hierarchy techniques. However, the 30:1 compression ratios reported come at the apparent expense of serious degradation in image fidelity for non-linear transformation functions.

Compression of unsteady data Ma combines difference-encoding between octree sub-trees of an unsteady data set with image-based rendering: only those sub-trees that change are re-rendered [27]. Rendering in this case is volume rendering.
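A much-simplified sketch of the underlying idea (detect which blocks changed between time steps and re-render only those) follows; the real system of [27] difference-encodes octree sub-trees and composites with image-based rendering, which is not reproduced here. The block size and render_block are assumptions of the illustration.

import numpy as np

def changed_blocks(prev_step, next_step, block=16):
    """Yield the origins of blocks whose contents differ between two time steps."""
    for i in range(0, next_step.shape[0], block):
        for j in range(0, next_step.shape[1], block):
            for k in range(0, next_step.shape[2], block):
                sl = (slice(i, i+block), slice(j, j+block), slice(k, k+block))
                if not np.array_equal(prev_step[sl], next_step[sl]):
                    yield (i, j, k)

# for origin in changed_blocks(step_t, step_t_plus_1):
#     render_block(step_t_plus_1, origin)     # re-render only what changed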

4.6 Multiresolution

Most work in multiresolution analysis (in particular using wavelets) has been over triangulated surfaces. The techniques that have been developed have had success in Computer Aided Design (CAD), architectural walkthroughs, and exploration and display of virtual worlds. However, most of this work over 2D surfaces is not particularly useful for scientific data. With the latter, surfaces are rarely explicitly defined – at best they are discovered at run-time. That is, most work over 2D surfaces applies to static geometry that is defined and later used. This is rarely the case in scientific visualization.

More recently work has been undertaken which specifically applies multiresolution techniques to scientific data. The applications to date have been primarily for browsing and drill-down, and progressive transmission and refinement. Most applications have been lossy, a few have provided lossless representations. Multiresolution analysis has also been used for feature extraction from scientific data, and there has been more recent work on information metrics for multiresolution compression – metrics that attempt to preserve the underlying information content of the data, or that attempt to preserve significant features of the data.

Visual error metrics for multiresolution representation of scientific data The dearth of robust error metrics for lossy compression of scientific data is suffered as well by most work in multiresolution analysis of scientific data. The use of image quality as sole metric is even more ubiquitous in this literature than it is in the literature for lossy compression of scientific data. From one paper in the field, “... comparing the structures found in the original image with those computed from the optimized data set containing only 19% of the vertices, we find only very small details missing...” But were those “small details” the important ones?! While image quality may be sufficient for many applications in computer graphics, it is an insufficient metric when the application must ensure aerodynamic stability of a plane under design or the correct diagnosis of a serious illness.

Better error metrics for multiresolution representation of scientific data Standard error metrics reported in the literature are Root-Mean-Square (RMS) error and Signal-to-Noise Ratio (SNR) (cf. [23], [7]). A nagging difficulty with any generic error metric has already been discussed: how does 1% or 2% error relate to the information content of the data themselves? Wilhelms, perhaps recognizing the difficulty of defining the appropriate integration function (using a spatial index) and the appropriate integration error, developed a software architecture that allowed many integration functions and error metrics to be set by the user [46]. Trotts treats multiresolution error similarly – by ensuring that the aggregate error in multiresolution tetrahedral decimation over a grid does not exceed a user-specified tolerance [42].

Zhu presents an error metric driven by the data frequencies possibly visible to the human eye [49]. This is an interesting direction that may tie data errors to information content. Bajaj introduces multiresolution coarsening (and refinement) based on an error metric that preserves critical points and their relationships (in scalar data) [3], [2]. The feature preservation is intended to maintain correct calculation of isosurfaces at all scales of the multiresolution representation.

Example applications of multiresolution Several authors have employed multiresolution for feature extraction (or data mining) and for progressive refinement. The common data mining operation appears to be “edge detection.” Examples of these techniques can be found in [44], [17], and [31].

The first approaches to multiresolution were over rectilinear grids (cf. [30], [44], [17], [31]). An alternative that works on irregular grids is to compress the solution/parameter data only (leaving the grid/mesh uncompressed) [49]. A more recent approach to multiresolution has been to tetrahedralize a grid and then either refine the lowest-resolution representation, or coarsen the highest-resolution representation (cf. [16], [32], [48], [7]).

4.7 Feature extraction and data mining

All of the algorithms so far discussed have been primarily data reduction techniques. An alternative approach is to devise algorithms that compress the data specifically by extracting (off-line) the important answers (which presumably require significantly less storage than the raw data). Such feature extraction techniques are in general very domain-specific (cf. [21]). While these notes elide this promising direction, the scientist or engineer faced with very large data should consider feature extraction and data mining techniques a viable possibility.

4.8 View-dependent techniques

There are many techniques from computer graphics to reduce either the data that need be touched, or the computation, by using the viewer's point of view. Level-of-Detail (LOD) modeling, occlusion culling, view frustum culling, etc., are examples of this approach. The risk of using such techniques for data culling in scientific visualization has always been that important information may be lost by image-space data reduction.

Crawfis offers an interesting argument for (and provides an interesting example of) using view-dependent techniques to manage large data [11]. He argues that especially for large data visualization, the scientist sees 1 - 50 MB of pixels, but has 50 MB - 10 TB of data: why not process only the data that are “visible”? The standard concern about this approach is that fidelity to the underlying data must be ensured (which is difficult). Crawfis argues that visualization does not maintain fidelity to the underlying data anyway (!), and that interactivity is the more important component of discovery. Based on this philosophy, their system applies image-based rendering to the 2D cross-sections of a splat volume renderer to accelerate visualization within some small cone of the viewing frustum. During the time that the user peruses this small cone (at interactive rates), their system uses dead reckoning to take additional cross-sections that will be used from subsequent viewpoints.
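A sketch of the “process only what is visible” idea follows (this is not the system of [11]); it simply restricts work to data blocks whose centers fall inside a small cone about the view direction. The block centers, camera parameters, and cone half-angle are all assumptions of the illustration.

import numpy as np

def blocks_in_view_cone(block_centers, eye, view_dir, half_angle_deg=15.0):
    """Return the block centers that lie within a cone about the view direction."""
    centers = np.asarray(block_centers, dtype=float)
    d = np.asarray(view_dir, dtype=float)
    d = d / np.linalg.norm(d)
    to_blocks = centers - np.asarray(eye, dtype=float)
    dist = np.linalg.norm(to_blocks, axis=1)
    dist[dist == 0] = 1e-12                 # a block at the eye is trivially visible
    cos_angle = (to_blocks @ d) / dist
    return centers[cos_angle >= np.cos(np.radians(half_angle_deg))]

# visible = blocks_in_view_cone(centers, eye=(0, 0, 0), view_dir=(0, 0, -1))
# Only 'visible' blocks are traversed/rendered; neighbouring cones could be
# prepared speculatively (dead reckoning) while the user stays inside this one.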

5 Summary

In these notes we have discussed three major issues in the management of very large data sets for interactive analysis and visualization: application requirements and differences, end-to-end systems architectures, and major techniques and algorithms.

The major techniques have been classified as: memory hierarchy, data indices, write-a-check, computational steering, compression, multiresolution browsing with drill-down, and feature extraction and data mining. We have reviewed some of the more representative literature.

Finally, we have discussed error metrics for lossy compression of scientific data, and for multiresolution representation and analysis. We have emphasized, even over-emphasized, that a major gap in the literature has been in good metrics for data loss due to compression or multiresolution representation.

References

[1] BAJAJ, C. L., PASCUCCI, V., AND SCHIKORE, D. Fast isocontouring for improved interactivity. In 1996 Symposium on Volume Visualization (October 1996), pp. 39–46.

[2] BAJAJ, C. L., PASCUCCI, V., AND SCHIKORE, D. Visualization of scalar topology for structural enhancement. In Proceedings of Visualization ’98 (October 1998), pp. 51–58.

[3] BAJAJ, C. L., AND SCHIKORE, D. Topology preserving data simplification with error bounds. Computers and Graphics (Spring 1998). Special issue on simplification.

[4] CHIANG, Y. J., SILVA, C. T., AND SCHROEDER, W. J. Interactive out-of-core isosurface extraction. In Proceedings of Visualization ’98 (October 1998), pp. 167–174.

[5] CHIUEH, T., YANG, C., HE, T., PFISTER, H., AND KAUFMAN, A. Integrated volume compression and visualization. In Proceedings of Visualization ’97 (October 1997), pp. 329–336.

[6] CIGNONI, P., MARINO, P., MONTANI, C., PUPPO, E., AND SCOPIGNO, R. Speeding up isosurface extraction using interval trees. IEEE Transactions on Visualization and Computer Graphics 3, 2 (April - June 1997), 158–.

[7] CIGNONI, P., MONTANI, C., PUPPO, E., AND SCOPIGNO, R. Multiresolution representation and visualization of volume data. IEEE Transactions on Visualization and Computer Graphics 3, 4 (October - December 1997).

[8] COX, M. Managing big data for scientific visualization. In ACM SIGGRAPH ’98 Course 2, Exploring Gigabyte Datasets in Real-Time: Algorithms, Data Management, and Time-Critical Design (August 1998).

[9] COX, M., AND ELLSWORTH, D. Application-controlled demand paging for out-of-core visualization. In Proceedings of Visualization ’97 (October 1997), pp. 235–244.

[10] COX, M., AND ELLSWORTH, D. Managing big data for scientific visualization. In ACM SIGGRAPH ’97 Course 4, Exploring Gigabyte Datasets in Real-Time: Algorithms, Data Management, and Time-Critical Design (August 1997). Los Angeles CA.

[11] CRAWFIS, R. Parallel splatting and image-based rendering. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.

[12] FOWLER, J. E., AND YAGEL, R. Lossless compression of volume data. In 1994 Symposium on Volume Visualization (October 1994), pp. 43–50.

[13] GALLAGHER, R. S. Span filtering: An optimization scheme for volume visualization of large finite element models. In Proceedings of Visualization ’91 (October 1991), pp. 68–75.

[14] GILES, M., AND HAIMES, R. Advanced interactive visualization for CFD. Computing Systems in Engineering 1 (1990), 51–62.

[15] GLOBUS, A. Optimizing particle tracing in unsteady vector fields. NAS RNR-94-001, NASA Ames Research Center, January 1994.

[16] GROSSO, R., LUERIG, C., AND ERTL, T. The multilevel finite element method for adaptive mesh optimization and visualization of volume data. In Proceedings of Visualization ’97 (October 1997), pp. 387–394.

[17] GUO, B. A multiscale model for structure-based volume rendering. IEEE Transactions on Visualization and Computer Graphics 1, 4 (December 1995), 291–301.

[18] HEERMANN, P. D. Production visualization for the ASCI one teraflops machine. In Proceedings of Visualization ’98 (October 1998), pp. 459–462.

[19] HEERMANN, P. D. ASCI visualization: One teraflops and beyond. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.

[20] ITOH, T., AND KOYAMADA, K. Automatic isosurface propagation using an extrema graph and sorted boundary cell lists. IEEE Transactions on Visualization and Computer Graphics 1, 4 (December 1995), 319–.

[21] KENWRIGHT, D. N., AND HAIMES, R. Automatic vortex core detection. IEEE Computer Graphics and Applications 18, 4 (July/August 1998), 70–74.

[22] LANE, D. UFAT: A particle tracer for time-dependent flow fields. In Proceedings of Visualization ’94 (October 1994), pp. 257–264.

[23] LAUR, D., AND HANRAHAN, P. Hierarchical splatting: A progressive refinement algorithm for volume rendering. In Computer Graphics (Proceedings SIGGRAPH) (July 1991), pp. 285–288. Vol. 25, No. 4.

[24] LEUTENEGGER, S. L., AND MA, K. L. Fast retrieval of disk-resident unstructured volume data for visualization. In DIMACS Workshop on External Memory Algorithms and Visualization (May 1998).

[25] LEVOY, M. Volume rendering using the Fourier projection-slice theorem. In Proceedings of Graphics Interface ’92 (May 1992), pp. 61–69.

[26] LIVNAT, Y., SHEN, H.-W., AND JOHNSON, C. R. A near optimal isosurface extraction algorithm using the span space. IEEE Transactions on Visualization and Computer Graphics 2, 1 (March 1996), 73–84.

[27] MA, K.-L., SMITH, D., SHIH, M.-Y., AND SHEN, H.-W. Efficient encoding and rendering of time-varying volume data. NASA/CR-1998-208424 98-22, ICASE, 1998.

[28] MALZBENDER, T. Fourier volume rendering. ACM Transactions on Graphics 12, 3 (July 1993), 233–250.

[29] MORAN, P., AND HENZE, C. Large field visualization with demand-driven calculation. In Proceedings of Visualization ’99 (October 1999).

[30] MURAKI, S. Approximation and rendering of volume data using wavelet transforms. IEEE Computer Graphics and Applications 13, 4 (July 1993), 50–56.

[31] MURAKI, S. Multiscale volume representation by a DoG wavelet. IEEE Transactions on Visualization and Computer Graphics 1, 2 (June 1995).

[32] NEUBAUER, R., OHLBERGER, M., RUMPF, M., AND SCHWIRER, R. Efficient visualization of large-scale data on hierarchical meshes. In Proceedings of Visualization in Scientific Computing ’97 (1997), Springer Wien.

[33] NING, P., AND HESSELINK, L. Fast volume rendering of compressed data. In Proceedings of Visualization ’93 (October 1993), pp. 11–18.

[34] PAINTER, J., MCCORMICK, P., AND MCPHERSON, A. Reality Monster volume rendering. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.

[35] PANG, A., AND LODHA, S. Towards understanding uncertainty in terascale visualization. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.

[36] USELTON, S. Panel: Computational steering is irrelevant to large data simulations. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.

[37] SCHROEDER, W. A multi-threaded streaming pipeline architecture for large structured data sets. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.

[38] SHEN, H. W. Isosurface extraction from time-varying fields using a temporal hierarchical index tree. In Proceedings of Visualization ’98 (October 1998), pp. 159–166.

[39] SHEN, H.-W., HANSEN, C. D., LIVNAT, Y., AND JOHNSON, C. R. Isosurfacing in span space with utmost efficiency (ISSUE). In Proceedings of Visualization ’96 (October 1996).

[40] SONG, D., AND GOLIN, E. Fine-grain visualization in data flow environments. In Proceedings of Visualization ’93 (October 1993), pp. 126–133.

[41] TOTSUKA, T., AND LEVOY, M. Frequency domain volume rendering. In Computer Graphics (Proceedings SIGGRAPH) (August 1993), pp. 271–278. Vol. 27, No. 4.

[42] TROTTS, I. J., HAMANN, B., JOY, K. I., AND WILEY, D. F. Simplification of tetrahedral meshes. In Proceedings of Visualization ’98 (October 1998), pp. 287–295.

[43] UENG, S. K., SIKORSKI, C., AND MA, K. L. Out-of-core streamline visualization on large unstructured meshes. IEEE Transactions on Visualization and Computer Graphics 3, 4 (October - December 1997).

[44] WESTERMANN, R. A multiresolution framework for volume rendering. In Proceedings of the 1994 Symposium on Volume Visualization (October 1994), pp. 51–57.

[45] WILHELMS, J., AND GELDER, A. V. Octrees for faster isosurface generation. ACM Transactions on Graphics 11, 3 (July 1992), 201–227.

[46] WILHELMS, J., AND GELDER, A. V. Multi-dimensional trees for controlled volume rendering and compression. In Proceedings of the 1994 Symposium on Volume Visualization (October 1994), pp. 27–34.

[47] YEO, B. L., AND LIU, B. Volume rendering of DCT-based compressed 3D scalar data. IEEE Transactions on Visualization and Computer Graphics 1, 1 (March 1995).

[48] ZHOU, Y., CHEN, B., AND KAUFMAN, A. Multiresolution tetrahedral framework for visualizing regular volume data. In Proceedings of Visualization ’97 (October 1997), pp. 135–142.

[49] ZHU, Z., MACHIRAJU, R., FRY, B., AND MOORHEAD, R. Wavelet-based multiresolutional representation of computational field simulation datasets. In Proceedings of Visualization ’97 (October 1997), pp. 151–158.
