A Vertex Program for Efficient Box-Plane Intersection

Christof Rezk Salama and Andreas Kolb

Computer Graphics and Multimedia Systems Group, University of Siegen, Germany

Email: {rezk,kolb}@fb12.uni-siegen.de

VMV 2005, Erlangen, Germany, November 16–18, 2005

Abstract

Object-order texture-based volume rendering decomposes the volume data set into stacks of textured polygons. The performance of such hardware-based volume rendering techniques is clearly dominated by the fill-rate and the memory bandwidth, while only very little workload is assigned to the vertex processor. In this paper we discuss a vertex program which efficiently computes the slicing for texture-based volume rendering. This novel technique enables us to balance the workload between vertex processor, fragment processor and memory bus. As a result we demonstrate that the performance of texture-based volume rendering can be efficiently enhanced by trading an increased vertex load for a reduced fragment count. As an application we suggest a novel approach for empty space skipping for object-order volume rendering.

1 Introduction

In computer graphics, algorithms for the synthesis of virtual images can be categorized into object-order or image-order approaches. Image-order approaches, such as raytracing, decompose the target image into elements (usually pixels¹) and traverse the virtual scene to compute its contribution to the final color of each element. Object-order approaches, such as rasterization, on the other hand split the virtual scene (the source) into geometric primitives (points, lines, triangles) and then compute how these primitives contribute to the final image, e.g. by "splatting" the primitive onto multiple pixels in screen space at once.

Direct volume rendering can be performed with both approaches, efficiently accelerated by modern graphics processing units (GPUs). The traditional texture-based volume rendering algorithm is an object-order approach. It decomposes the volume data into one or more stacks of polygonal slices which are usually composited in back-to-front order onto the image plane. Modern graphics hardware also allows direct volume rendering to be implemented as an image-order approach, basically by performing ray-casting within a large loop that samples the volume successively along the viewing ray in the fragment shader.

¹ Elements smaller than pixels may be used to account for sub-pixel accuracy and anti-aliasing.

If we compare both implementations, we find that slice-based volume rendering might be considered the "brute-force" approach, which relies solely on the fill-rate and the high fragment throughput of the rasterization unit. Ray-casting on the other hand employs optimization techniques such as empty-space skipping and early-ray termination. The bottom line, however, is that the brute-force approach is still advantageous in terms of performance, while GPU-based ray-casting has several clear advantages when rendering iso-surfaces or sparse volumes.

The performance limit for both approaches, however, is the same: it is either the fill-rate or the memory bandwidth. Neither of these approaches is geometry-limited. The major work is done in the fragment processor and only a negligible computational load is assigned to the vertex processor. In this paper, we demonstrate that the performance of object-order volume rendering can be improved by moving the necessary slice decomposition, which is usually done on the CPU, onto the vertex processor. This gives us the flexibility which is necessary to load-balance the rendering process for maximum performance. We demonstrate that this is effective for optimizing bandwidth and texture cache coherency. Finally, we introduce an implementation of empty-space skipping for slice-based volume rendering using an octree decomposition of the volume data.

In the following section, we will point the reader to important related work in the field of real-time volume graphics. Section 3 describes the original 3D-texture based approach to volume rendering and examines its limitations. In Section 4 our algorithm for plane-box intersection is explained. Section 5 describes the implementation of this algorithm in a vertex program. This implementation is the basis of our straightforward object-order algorithm for empty-space skipping using an octree decomposition, which is described in Section 6. The results of our implementation are analyzed in Section 7. Finally, Section 8 concludes our work.

2 Related Work

Real-time volume graphics has a long history in computer graphics, starting with the first straightforward CPU-based image-order approaches in 1984 [?]. Perhaps the most important purely CPU-based method was the shear-warp algorithm introduced by Lacroute and Levoy in 1994 [?]. This approach is also the basis for efficient GPU-based implementations, such as 2D-multitexture based volume rendering [?]. With the increasing availability of hardware-accelerated 3D textures, many volume rendering techniques have utilized this feature for different purposes. For an overview of state-of-the-art object- and image-order volume rendering techniques, we suggest reading the SIGGRAPH 2004 course notes on Real-Time Volume Graphics [?], which include many implementation details that are usually omitted from scientific papers. The idea of slicing simplices has been examined in different contexts before, by Reck et al. [?] for tetrahedra and by Lensch et al. [?] for prisms.

GPU-based image-order techniques have been developed only recently with the introduction of graphics boards that support true conditional branches and loops of variable length in the fragment processor. Raycasting has been implemented both in a combined CPU/GPU solution [?] and a purely GPU-based solution [?]. Another important GPU-based raycasting algorithm for isosurface rendering has been introduced recently by Hadwiger et al. [?]. It includes many advanced concepts, such as empty space skipping and non-photorealistic draw-styles, in a unified framework.

Although the raycasting technique seems to be a more straightforward way of implementing volume graphics on a GPU, such approaches still suffer from a series of problems, most of them arising from the fact that graphics hardware has not been designed for image-order approaches in the first place. However, it is very likely that these limitations will be overcome in the near future.

Nevertheless, object-order approaches also still suffer from several limitations in practice. If 3D textures are used, one common problem is the limited amount of graphics memory, which often forces the programmer to split the volume data set into several smaller chunks (usually called bricks), each of which fits entirely into local video memory. The 3D texture based approach and its current limitations will be explained in detail in the following section.

3 3D Texture-Based Volume Graphics

Current graphics hardware does not support truly three-dimensional rendering primitives. In order to leverage such hardware for direct volume rendering, the volumetric object must be decomposed into polygonal primitives. However, most graphics boards do support three-dimensional texture objects, which allow the texture image to be "cut" out of a 3D texture block using trilinear filtering.

With 3D textures, the volume is usually split into viewport-aligned slices. These slices are computed by intersecting the bounding box of the volume with a stack of planes parallel to the current viewport. In consequence, viewport-aligned slices must be recomputed whenever the camera position changes. Figure 1 illustrates the rendering procedure. During rasterization, the transformed polygons are textured with the image data obtained from a solid texture block by trilinear interpolation. During fragment processing the resulting polygon fragments are blended semi-transparently into the frame buffer and finally displayed on screen.

Figure 1: Polygons are computed by intersecting the bounding box with a stack of planes parallel to the viewport. The textured polygons are blended in back-to-front order into screen space to create the final rendering.

The intersection calculation for generating the slice polygons is done on the CPU, and the resulting polygons are transferred to the graphics processor for each frame. In this case, the vertex processor is only used to transform each incoming vertex with one affine matrix, which comprises the modeling, viewing, and perspective transformations. If the volume data set is small enough to fit into local video memory, the overall rendering process is clearly fill-rate limited. This means that the major workload is performed by the fragment processor, which splits each polygon into fragments and obtains the trilinearly filtered texel values from local video memory. If the arithmetic computation in the fragment program is not too complex, the fill-rate is mainly determined by texture cache coherency and the memory bandwidth between the fragment processor and local video memory.
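
For comparison, the conventional CPU-side slicing loop can be sketched roughly as follows; this is a minimal illustration, not the authors' code, and Box, Vec3, Polygon, intersectBoxPlane and drawTexturedPolygon are hypothetical types and helpers standing in for the application's own intersection and upload/draw routines:

/* Conventional approach: intersect the box with every slicing plane
   on the CPU and send the resulting polygons to the GPU each frame. */
void renderSlicesCPU(const Box* box, Vec3 viewDir,
                     float dMin, float dMax, float dIncr)
{
    for (float d = dMax; d >= dMin; d -= dIncr) {   /* back-to-front  */
        Polygon poly;                               /* 3 to 6 vertices */
        if (intersectBoxPlane(box, viewDir, d, &poly))
            drawTexturedPolygon(&poly);             /* upload + draw   */
    }
}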

3.1 Bricking

Difficulties arise when the volume data set does not fit entirely into local video memory. The texture block must then be divided into smaller bricks, each of which is rendered separately. In this case the rendering process is limited by the memory bandwidth between the GPU and the host memory, and the GPU stalls until the required texture data has been fetched from host memory. To make matters worse, bricking increases the overall memory required for storing the volume data set. This is because correct interpolation across brick boundaries requires one plane of voxels to be duplicated at the boundary between any two bricks [?].
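
To get a feeling for this overhead, consider cubic bricks of edge length b that overlap their neighbors by one voxel plane (a rough estimate that ignores the volume boundary): each brick stores b³ voxels but covers only about (b − 1)³ unique voxels, so the storage grows roughly by the factor (b/(b − 1))³, i.e. about 10% for b = 32, 21% for b = 16 and almost 50% for b = 8.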

The size of the bricks has a significant influence on the overall performance. In order to optimize cache coherency, the bricks should be kept small enough to fit into the texture cache. On the other hand, the bricks should not be too small, otherwise the duplicated voxels at the brick boundaries would significantly increase the memory required for storing the volume. To make matters worse, a large number of bricks results in a higher number of intersection calculations for the CPU, a higher number of vertices which must be transferred to the GPU for each frame, and thus a deterioration of the bandwidth problem.

Figure 2: Intersecting a box with a plane. The resulting polygon has between 3 and 6 vertices. Symmetric cases are omitted.

To this end we present a vertex program which shifts the entire intersection calculation between the bounding box of the volume and a stack of slice planes into the vertex processor. We will see that such a vertex program minimizes the amount of data that must be transferred from host memory to the GPU. This allows us to render significantly smaller bricks and thus increases overall performance.

4 Cube-Slice Intersection

The intersection between a box and a plane results in a polygon with 3 to 6 vertices (assuming that the plane actually intersects the box). The different cases are illustrated in Figure 2. Our approach is to compute such intersection polygons directly in the vertex processor. A vertex program, however, can only modify existing vertices. It can neither insert new vertices into the stream nor remove vertices from the stream. As a consequence, we design a vertex program that always receives 6 vertices and outputs 6 vertices. If the intersection polygon consists of fewer than 6 vertices, the vertex program will generate one or more duplicate vertices (i.e., two identical vertices with an edge of length zero in between).

Intersecting an edge of the box with the slice plane is easy if the plane is given in Hessian normal form,

    \langle \vec{n}_P \circ \vec{x} \rangle = d                                                              (1)

with n_P denoting the normal vector of the plane and d the distance to the origin. For viewport-aligned slicing, the normal vector n_P is simply the viewing direction. An edge between two vertices V_i and V_j of the bounding box can be described as

    E_{i \to j}:\; X(\lambda) = V_i + \lambda\,(V_j - V_i) = V_i + \lambda\,\vec{e}_{i \to j}, \qquad \lambda \in [0, 1]          (2)

Note that the vector e_{i→j} does not have unit length in general. The intersection between the plane and the straight line spanned by E_{i→j} is simply calculated by

    \lambda = \frac{d - \langle \vec{n}_P \circ V_i \rangle}{\langle \vec{n}_P \circ \vec{e}_{i \to j} \rangle}                  (3)

The denominator becomes zero only if the edge is coplanar with the plane. In this case, we simply ignore the intersection. We have found a valid intersection only if λ lies in the range [0, 1]; otherwise the plane does not intersect the edge.
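
Expressed in C, the per-edge test of Equation (3), including the coplanarity and range checks, amounts to only a few operations (a minimal sketch; Vec3 and the helpers dot, sub, add and scale are assumed to be provided by the application):

/* Intersect the edge Vi -> Vj with the plane dot(n, x) = d (Equation 3).
   Returns 1 and writes the intersection point if lambda lies in [0, 1]. */
int intersectEdgePlane(Vec3 vi, Vec3 vj, Vec3 n, float d, Vec3* out)
{
    Vec3  e     = sub(vj, vi);            /* edge vector, not unit length */
    float denom = dot(n, e);
    if (denom == 0.0f)                    /* edge coplanar with the plane */
        return 0;                         /* ignore this intersection     */

    float lambda = (d - dot(n, vi)) / denom;
    if (lambda < 0.0f || lambda > 1.0f)   /* plane misses the edge        */
        return 0;

    *out = add(vi, scale(e, lambda));     /* X(lambda) = Vi + lambda * e  */
    return 1;
}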

The main difficulty in performing the intersection calculation in the vertex processor is to maintain a valid ordering of the intersection points, so that the result forms a valid polygon. To understand the slicing algorithm, let us assume for now that we have one vertex V0 that is closer to the camera than all other vertices, as displayed in Figure 3 (left). Vertex V7 is then identified as the vertex lying on the opposite corner across the cube's diagonal. In the following we will refer to the vertex indices given in Figure 3 (left).

If V0 is the front vertex and V7 is the back vertex, there are exactly three independent paths from V0 to V7, as marked in Figure 3 (left) by the solid lines in different shades of gray. In this context, independent means that these paths do not share any vertices other than the start and the end vertex. Each path consists of a sequence of three edges {E1, E2, E3}, e.g. E1 = E0→1, E2 = E1→4 and E3 = E4→7 for the light gray path. For a given front vertex, we can construct these three paths uniquely by requiring that the vectors corresponding to E1, E2 and E3 of each path form a right-handed system.

Figure 3: Left: The vertices are numbered sequentially. There always exist three independent paths from the front vertex V0 to the back vertex V7, as marked by the solid lines. Right: The intersection point of the dotted line must be inserted between the intersection points from the solid lines.

Now imagine we are sweeping a viewport-parallel plane from front to back through the box in Figure 3 (left). The first vertex that the plane touches is V0. Before this happens, we do not have any valid intersection with the box. The last vertex that the plane touches, if we proceed from front to back, is vertex V7. After that, we will not have any valid intersection anymore. As a consequence, any viewport-aligned plane that intersects the box will have exactly one unique intersection point along each of the three paths. In the case that our intersection polygon has only three vertices, they will be exactly those intersection points with the three paths. As a result, we can compute three of the possible six intersection points Pi by checking intersections with a sequence of edges, respectively:

P0 = Intersection with E0→1 or E1→4 or E4→7

P2 = Intersection with E0→2 or E2→5 or E5→7

P4 = Intersection with E0→3 or E3→6 or E6→7

Now, let us consider where the remaining intersection points must lie if our polygon has more than three vertices. We will first examine the light gray dotted edge E1→5 in Figure 3. If there exists a valid intersection with this edge, then it must be inserted between the intersection points with the light gray path and the medium gray path, as can be easily seen in Figure 3 (right). If an intersection with the dotted edge does not exist, we simply set the point equal to P0, which is the intersection point with the light gray path. The other dotted edges can be treated analogously, resulting in the remaining three intersection points:

P1 = Intersection with E1→5, otherwise P0

P3 = Intersection with E2→6, otherwise P2

P5 = Intersection with E3→4, otherwise P4

We have now determined all six intersection points of the plane with the box in a sequence that forms a valid polygon. It is now easy to check that the same sequence also works if the front edge or the front face of the box is coplanar with the viewing plane. We simply select one of the front vertices as V0 and set V7 to the opposite corner. Remember that we ignore any intersections with an edge that is coplanar with the plane.

01 void main(
02     int2 Vin : POSITION,
03
04     // updated per cube
05     uniform float3 vecTranslate,
06     uniform float dPlaneStart,
07
08     // updated per frame
09     uniform float4x4 matModelViewProj,
10     uniform float3 vecView,
11     uniform int frontIndex,
12
13     // const: never updated
14     uniform float dPlaneIncr,
15     uniform int nSequence[64],
16     uniform float3 vecVertices[8],
17     uniform int v1[24],
18     uniform int v2[24],
19
20     // output variables
21     out float4 VertexOut : POSITION,
22     out half3 TexCoordOut : TEXCOORD0
23 )
24 {
25
26     float dPlaneDist = dPlaneStart + Vin.y * dPlaneIncr;
27
28     float3 Position;
29
30     for(int e = 0; e < 4; ++e) {
31
32         int vidx1 = nSequence[int(frontIndex * 8 + v1[Vin.x*4+e])];
33         int vidx2 = nSequence[int(frontIndex * 8 + v2[Vin.x*4+e])];
34
35         float3 vecV1 = vecVertices[vidx1];
36         float3 vecV2 = vecVertices[vidx2];
37
38         float3 vecStart = vecV1 + vecTranslate;
39         float3 vecDir = vecV2 - vecV1;
40
41         float denom = dot(vecDir, vecView);
42         float lambda =
43             (denom != 0.0) ? (dPlaneDist - dot(vecStart, vecView)) / denom : -1.0;
44
45         if((lambda >= 0.0) && (lambda <= 1.0)) {
46             Position = vecStart + lambda * vecDir;
47             break;
48         } // if(...
49
50     } // for(...
51
52     VertexOut = mul(matModelViewProj, float4(Position, 1.0));
53     TexCoordOut = 0.5 * (Position) + 0.5.xxx;
54     return;
55 }

Listing 1: Cg vertex program for box-plane intersection.

5 Implementation

The algorithm for computing the correct sequence of intersection points as described in the previous section has been implemented as a vertex program. The Cg code for the program is given in Listing 1. The program has been designed for slicing a high number of equally-sized and equally-oriented boxes with a stack of equidistant planes. Care has been taken to minimize the number of state changes and the amount of data transferred to the graphics board for each frame.

The input stream of vertices to calculate one intersection polygon is specified in Listing 2. The x-coordinate of the vertex is an index that specifies which of the six possible intersection points should be computed. The y-coordinate of the vertex is the index of the plane that is used for intersection. As the plane index is constant for one polygon, it could alternatively be specified as a separate parameter in the vertex stream (e.g. as a texture coordinate). However, current hardware implementations do not support vertices that have only one coordinate, so we decided to incorporate the plane index into the y-coordinate of the vertex.

glBegin(GL_POLYGON);
    glVertex2i(0, nPlaneIndex);
    glVertex2i(1, nPlaneIndex);
    glVertex2i(2, nPlaneIndex);
    glVertex2i(3, nPlaneIndex);
    glVertex2i(4, nPlaneIndex);
    glVertex2i(5, nPlaneIndex);
glEnd();

Listing 2: OpenGL example vertex stream for calculating one intersection polygon. Each vertex simply stores the index of the intersection point to be calculated and the index of the plane.

In this implementation we assume that all the boxes have the same size and orientation, although simple modifications to the program would allow arbitrary sizes and orientations at the cost of a slightly larger number of state changes. In our case each box consists of the same set of vertices and a translation vector vecTranslate (line 05 in Listing 1), which is specified once for each box to be rendered. The vertices of one box are kept in a constant uniform vector array vecVertices[8] (line 16) and are never changed.

Besides the usual modelview-projection matrix (line 09), we specify for each frame the index of the front vertex with respect to the viewing direction in the uniform parameter frontIndex (line 11). Since all our boxes are equally oriented, the front index will not change during one frame. Additionally, we set the uniform parameters vecView (line 10) to the normal vector n_P of the plane and dPlaneIncr (line 14) to the distance between two adjacent planes. In line 26, the correct distance d for the plane equation is computed.
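
How frontIndex is determined is not shown in the listing. One straightforward possibility on the host side (a sketch, assuming corners holds the eight untranslated box corners and viewDir is the slicing normal pointing away from the camera) is to pick the corner with the smallest projection onto the viewing direction:

/* The front vertex is the box corner closest to the camera, i.e. the
   corner with the smallest projection onto the viewing direction.    */
int computeFrontIndex(const Vec3 corners[8], Vec3 viewDir)
{
    int   front = 0;
    float dMin  = dot(corners[0], viewDir);
    for (int i = 1; i < 8; ++i) {
        float d = dot(corners[i], viewDir);
        if (d < dMin) { dMin = d; front = i; }
    }
    return front;   /* used to index the nSequence permutation table */
}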

The constant uniform index array nSequence (line 15) stores the permutation of vertex indices with respect to the given index of the front vertex frontIndex (see lines 32 and 33). As described in the previous section, several edges must be checked for intersection in sequence, according to the index of the intersection point.

In order to calculate the intersection points P1, P3 and P5, we must first check for an intersection with the dotted edge, and if this intersection does not exist we must check for an intersection with the corresponding path (solid line, Figure 3). Hence, the maximum number of edges that must be tested for intersection is four. This is done within the for-loop that starts in line 30. For the intersection points P0, P2 or P4, we have to check only three edges. In this case the program breaks out of the for-loop when the intersection point is found after a maximum of three iterations.

The two constant index arrays v1 and v2 store the indices of the start and end vertices of the edges that must be tested successively for intersection. They are indexed by the intersection index Vin.x from the vertex stream in combination with the current iteration count e (lines 32 and 33).
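
The contents of v1 and v2 are not listed in the paper, but a possible layout can be reconstructed from the edge sequences of Section 4; the tables below are such an illustrative reconstruction (vertex numbering of Figure 3), not the authors' exact data. Row k holds the up to four edges tested for intersection point Pk: for the odd indices the dotted edge comes first, followed by the corresponding path, while the rows of the even indices are padded by repeating their last edge.

/* Possible edge tables derived from Section 4 (illustrative only).
   v1[k*4+e] / v2[k*4+e] give the start / end vertex of the e-th edge
   tested for intersection point Pk.                                  */
static const int v1[24] = {
    0, 1, 4, 4,    /* P0: E0->1, E1->4, E4->7  (last entry = padding) */
    1, 0, 1, 4,    /* P1: E1->5, then the P0 path                     */
    0, 2, 5, 5,    /* P2: E0->2, E2->5, E5->7                         */
    2, 0, 2, 5,    /* P3: E2->6, then the P2 path                     */
    0, 3, 6, 6,    /* P4: E0->3, E3->6, E6->7                         */
    3, 0, 3, 6     /* P5: E3->4, then the P4 path                     */
};
static const int v2[24] = {
    1, 4, 7, 7,
    5, 1, 4, 7,
    2, 5, 7, 7,
    6, 2, 5, 7,
    3, 6, 7, 7,
    4, 3, 6, 7
};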

Lines 32 and 33 compute the correct vertex indices of the edge that must be tested for intersection. The vertices are fetched from the constant uniform array vecVertices in lines 35 and 36. Lines 38 and 39 compute the correct start point and the edge vector for the current edge, taking into account the local translation of the box. Line 41 calculates the denominator of Equation 3. If the denominator is not equal to zero (which means that the edge is not coplanar with the plane), the λ value for the edge is computed as in Equation 3. Line 45 tests whether we have a valid intersection. If this is true, the program breaks out of the for-loop. Finally, line 52 transforms the resulting intersection point into screen space and line 53 calculates the texture coordinate for the vertex. The texture coordinates in this example are obtained by simply scaling the vertex position to the range [0, 1]. Alternatively, texture coordinates could be specified by another uniform parameter similar to vecVertices.

This admittedly intricate implementation allows one box to be intersected with several parallel planes using one single function call that feeds a predefined vertex buffer into the graphics pipeline.
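
The paper does not name the exact call; one way to realize such a single call is to keep the (intersection index, plane index) pairs for all planes of one box in a static vertex array and issue them with glMultiDrawArrays (a sketch using the fixed-function vertex array API of that hardware generation; buffer setup and program binding are omitted, and first/count could equally be precomputed once):

#include <GL/gl.h>   /* assumes an OpenGL 1.4+ header or extension loader */

/* vertexData holds 6 two-component vertices per slicing plane:
   x = intersection index 0..5, y = plane index.                  */
void drawAllSlices(const GLint* vertexData, GLsizei numPlanes,
                   GLint* first, GLsizei* count)
{
    for (GLsizei i = 0; i < numPlanes; ++i) {
        first[i] = i * 6;    /* each polygon starts 6 vertices later */
        count[i] = 6;        /* always 6 vertices per polygon        */
    }
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_INT, 0, vertexData);
    glMultiDrawArrays(GL_POLYGON, first, count, numPlanes);
    glDisableClientState(GL_VERTEX_ARRAY);
}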

6 Empty-Space-Skipping

As an application of the described vertex program, we have implemented an empty space skipping approach based on an octree decomposition of the volume data, similar to the one proposed in [?] for isosurface extraction. At each level, the volume block is subdivided into eight equally sized sub-blocks. Each node of the octree stores the minimum and the maximum scalar value contained in its sub-block. Typical transfer functions for volume visualization map a significant portion of the scalar data range to complete transparency. The octree structure allows us to completely skip sub-blocks that do not contribute to the final image because all of their voxel values are mapped to transparent. This is the well-known idea of empty-space skipping borrowed from ray-casting approaches.
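
A minimal sketch of such an octree node and the corresponding skip test might look as follows (assuming, for simplicity, a threshold-style transfer function like the ones used in Section 7 that maps all scalar values below some threshold to zero opacity; a general transfer function would instead be queried over the whole [min, max] range):

typedef struct OctreeNode {
    float minValue, maxValue;       /* scalar range of this sub-block      */
    struct OctreeNode* child[8];    /* NULL for leaf nodes                 */
    /* ... handle of the 3D texture / brick stored for this sub-block ...  */
} OctreeNode;

/* The sub-block contributes nothing to the image if its entire
   scalar range is mapped to zero opacity.                        */
int isEmpty(const OctreeNode* n, float transparentBelow)
{
    return n->maxValue < transparentBelow;
}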

During rendering, the octree is traversed top-down by the CPU. Depth sorting of the sub-blocks is necessary to ensure correct back-to-front ordering. This is efficiently achieved during the octree traversal: each octree node traverses its eight children in the correct sequence with respect to the current viewing direction. Before a sub-block is traversed, our implementation uses the precomputed minimum and maximum scalar values to check whether the sub-block will contribute to the final image or not. Additionally, we can decide at each node of the octree whether we store the 3D texture data for this level or store separate 3D textures for each sub-level, in order to optimize the size of the 3D textures for texture cache coherency.
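
The depth-sorted traversal itself can be written recursively. The sketch below assumes that bit k of a child index selects the upper half of the parent block along axis k, and that viewDir is the slicing direction pointing away from the eye (for perspective projection, the direction from the eye to the node center can be used instead); renderBrick is a hypothetical routine that slices one brick with the vertex program of Section 5:

/* Render the octree back-to-front with respect to viewDir,
   skipping sub-blocks that are completely transparent.      */
void renderOctree(const OctreeNode* n, Vec3 viewDir, float transparentBelow)
{
    if (n == NULL || isEmpty(n, transparentBelow))
        return;                          /* empty-space skipping       */

    if (n->child[0] == NULL) {           /* leaf: render this brick    */
        renderBrick(n);                  /* feeds the slicing program  */
        return;
    }

    /* farMask has bit k set if viewDir points towards the upper half
       along axis k, i.e. that half lies farther away from the eye.   */
    int farMask = (viewDir.x > 0.0f ? 1 : 0)
                | (viewDir.y > 0.0f ? 2 : 0)
                | (viewDir.z > 0.0f ? 4 : 0);

    /* i = 0 visits the farthest child, i = 7 the nearest one. */
    for (int i = 0; i < 8; ++i)
        renderOctree(n->child[i ^ farMask], viewDir, transparentBelow);
}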

7 Results

The presented approach for empty space skipping has been implemented on an NVIDIA GeForce 6 board with 256 MB of local video memory. The data set used for performance measurement was a 16-bit CTA² data set of size 512 × 512 × 96 showing blood vessels in the human brain. In medical practice, 3D angiography data sets are usually visualized by setting the soft tissue to completely transparent. Angiography data is thus ideal for evaluating empty space skipping techniques. Example images are given in Figure 6.

The vertex program described in the paper was supplemented by two different fragment programs:

1. a simple shader for direct volume rendering using dependent texture lookups for postclassification (2 texture lookups per fragment)

2. a fragment shader for direct volume rendering with postclassification and Phong shading. This fragment shader estimates the gradient vector on-the-fly using central differences and multiple texture lookups per fragment (8 texture lookups per fragment).

We have tested different levels of subdivision and measured the overall performance. In order to obtain optimal image quality, a high number of slice images has been used to render the volume. Figure 4 shows the performance using the first fragment shader (direct rendering without shading) for different levels of subdivision. We have tested each configuration with two different transfer functions, one which sets all voxels below one third of the scalar range to completely transparent (the gray bar) and another which sets all voxels below one half of the scalar range to zero (the white bar). The second transfer function allows a higher number of blocks to be skipped. Such transfer functions are typically chosen for visualizing angiography data.

² CTA = computed tomography angiography

[Figure 4: bar chart "Direct Volume Rendering", frames per second from 0 to 20, one pair of bars for each configuration: no subdivision, 2 × 2 × 1, 4 × 4 × 2, 8 × 8 × 4, 16 × 16 × 4, 32 × 32 × 8, 64 × 64 × 8 and 64 × 64 × 16 bricks.]

Figure 4: Performance in frames per second for direct volume rendering with postclassification via dependent texture lookup. The white and gray bars show results for different transfer functions.

[Figure 5: bar chart "Volume Shading and Gradient Estimation", frames per second from 0 to 4, with the same brick configurations.]

Figure 5: Performance in frames per second for direct volume rendering with on-the-fly gradient estimation and Phong shading. The white and gray bars show results for different transfer functions.

As displayed in Figure 4, our empty-space-skipping technique does not improve the performance for low subdivision levels. Due to the large size of the blocks, it is not likely that one block can be completely skipped during traversal. If we increase the depth of the octree, we reach optimum performance by subdividing into 32 × 32 × 8 bricks. For higher subdivision levels, the rendering process becomes dominated by the increasing load on the vertex processor, which is no longer outweighed by the reduced number of fragments to be processed. In the worst case, where we have a fully opaque volume, empty space skipping is not applicable at all and the increased vertex load of our algorithm could significantly degrade the performance. However, since we are using an octree decomposition which is independent of the texture storage, we can stop the octree traversal at any time and generate the slice images for the current level.

The second fragment shader significantly increases the load on the fragment processor by computing gradient vectors on the fly using central differences and 8 texture samples per fragment (one for the scalar value, six for the central differences and one for the transfer function). In this case the performance gain of our empty space skipping approach is very high, as displayed in Figure 5. We have measured a performance gain of about 150% for the 16 × 16 × 8 and the 32 × 32 × 8 configurations compared to not applying empty space skipping.

8 Conclusion

We have presented a vertex program which efficiently calculates intersections between a box and a stack of equidistant parallel planes. The approach has been used to implement empty-space skipping for object-order texture-based volume rendering. We have demonstrated that applying our empty-space-skipping technique significantly increases the performance by balancing the workload between the vertex and the fragment processor.


Figure 6: Resulting images and octree decomposition (32 × 32 × 8) of the example angiography data set for different transfer function settings.
