Comparing Hierarchical Data Structures for Sparse Volume Rendering with Empty Space Skipping

Empty space skipping can be efficiently implemented with hierarchical data structures such as k-d trees and bounding volume hierarchies. This paper compares several recently published hierarchical data structures with regard to construction and rendering performance. The papers that form our prior work have primarily focused on interactively building the data structures and only showed that rendering performance is superior to using simple acceleration data structures such as uniform grids with macro cells. In the area of surface ray tracing, there exists a trade-off between construction and rendering performance of hierarchical data structures. In this paper we present performance comparisons for several empty space skipping data structures in order to determine if such a trade-off also exists for volume rendering with uniform data topologies.


Introduction
Ever growing data set sizes resulting from scientific simulations also result in an increased demand for more efficient rendering algorithms for volumetric data sets.While strategies like adaptive mesh refinement (AMR) [WWW * 19] help to reduce the memory space required for the data domain, simulations to a large degree still use uniform grid topologies, so that volume rendering with structured grids is still an active area of research.Structured grids also result from medical imaging techniques such as computer tomography (CT) or magnetic resonance imaging (MRI).Data sets from the clinical context in practice often have low resolution (for such reasons as to avoid unnecessarily exposing patients to radiation), but on the other hand often consist of multiple modalities like fiber tracks or blood vessels that were derived or segmented from the original CT or MRI data and manifest as separate volume channels.Those derived volume channels are often sparse in nature.
In all the described settings, it is beneficial if the user can explore the 3-d data set by means of interaction.A part of this interaction is adjusting the color and alpha transfer function that maps scalar input from the volume field to (RGB) color and opacity.When the alpha transfer function changes, so does the overall amount of empty space as well as the spatial arrangement of non-empty voxels.
A number of recent publications from our research group [ZSL18,ZHL19,ZSL19,ZML19] has therefore concentrated on the construction performance of spatial indices for direct volume rendering and proposed several data structures that can -depending on the size of the data set -be built at interactive rates.As the construc- † zellmann@uni-koeln.de,Department of Computer Science, University of Cologne tion times reported in the papers vary significantly, in this paper we present a thorough performance comparison of the techniques.
The construction algorithm from [ZHL19] is based on the linear bounding volume hierarchy (LBVH) algorithm [LGS * 09] and shows significantly improved construction rates compared to the other more involved data structures.In the field of real-time ray tracing with surface geometry, the LBVH data structure is generally known for its inferior culling properties but superior construction performance.The data structures from [ZSL18] and [ZSL19] especially employ very intricate construction schemes that use a cost function comparable to the surface area heuristic (SAH) [Wal07] known from surface ray tracing.In that field, it is generally accepted that the quality obtained with SAH is usually higher than that of LBVH, which uses the middle split heuristic.The construction time for hierarchies built with SAH however is generally higher than the construction time required when using a simpler heuristic.
While the various papers from our prior work individually proved the effectiveness of the respective data structures in general, they only compared them with naïve ray marching without empty space skipping, or with simple data structures like structured grids with macro cells.This paper tries to fill the gap in this respect and provides a comparison of the various data structures against each other.This is done by testing the data structures using different combinations of data sets and transfer functions.This paper shall be understood as an addendum to our prior work: the benchmark results that constitute the main contribution of this paper are meant to serve as a guideline and to give further insight into when to use which of the data structures we proposed in our recent papers on empty space skipping.

Related Work
Interactivity is an important property for scientific visualizations and can be aided in several ways, such as via remote rendering [ZAL12,SH15], level-of-detail techniques [LCCK02] or adaptive sampling of the data domain [MUWP19].Empty space skipping is one of the traditional optimization strategies for direct volume rendering (DVR) algorithms and is often implemented using hierarchical data structures [LMK03,HBJP12,LCDP13].A general overview on empty space skipping and other optimization techniques can e.g.be found in the state-of-the-art report by Beyer et al. [BHP14].Notable systems or system approaches are those by Hadwiger et  Hardware accelerated ray tracing uses acceleration data structures that are described by a whole corpus of research papers that is primarily focused on surface ray tracing, e.g.[WH06, Wal07, Kar12, GHFB13, MB18].While some researchers have used acceleration data structures to implement volume rendering algorithms in the past [ZHL17], with the advent of hardware accelerated ray tracing on modern GPUs it is likely that the two research branches that are concerned with surface ray tracing on the one hand, and with volume ray casting on the other hand, are going to converge in the future.
While interactive hierarchy rebuilds have gained a significant amount of attention from the real-time ray tracing community in the past [LGS * 09, PL10, MB18], interest from scientific visualization researchers has only recently started to focus on this topic.Our group has recently proposed a number of empty space skipping data structures [ZSL18, ZHL19, ZSL19, ZML19] that are primarily aimed at fast spatial index reconstruction and are explained in detail in the following section.While our approach is focused on rebuilding the hierarchy whenever the data or the transfer function changes, another approach is to reuse or adapt an already existing acceleration data structure.This can e.g.be achieved by means of min-max range queries [WFKH07, KTW * 11, Wal19].An alternative approach to update an existing data structure is the one by Schneider et al. [SR17] who use Fenwick trees, which bare similarities to the summed area tables that our techniques use as auxiliary data structures.

Construction Algorithms for Empty Space Skipping Hierarchies
In this section we briefly summarize the various construction algorithms from our recent papers.While the linear bounding volume hierarchy algorithm is solely targeted towards GPUs, the k-d construction algorithm we have targeted towards both multi-core and GPU systems.The hybrid grid construction algorithm is based on the multi-core CPU variant of the k-d tree construction algorithm but could generally also be implemented on the basis of the GPU variant.

Linear bounding volume hierarchies
The LBVH algorithm was initially introduces by Lauterbach et al. [LGS * 09].Our implementation [ZHL19] adapts this algorithm to volume rendering with uniform grids.The implementation uses NVIDIA CUDA.We first decompose the volume into bricks of size 8 3 and then run a CUDA kernel where each thread is responsible for one voxel.Each thread determines if the voxel is visible w.r.t. the current transfer function.The threads responsible for one brick then vote if the whole brick is visible by atomically updating a flag in on-device shared memory.This per-voxel operation is followed by a number of operations with O(n) work complexity-where n is the number of bricks-each of which is carried out in a single CUDA kernel.We first perform compaction since we are only interested in the non-empty bricks.Then we assign 30-bit 3-d Morton codes to the non-empty bricks that we use to sort the bricks using a parallel O(n) GPU algorithm from the Thrust library [BH11].The Morton order implicitly defines a hierarchy over the bricks that just spatially splits the respective child nodes in the middle.The split positions can be read from the bit codes of the Morton indices and can be efficiently found using Karras' algorithm [Kar12].A final CUDA kernel traverses the hierarchy from each leaf node up to the root node and assembles the respective axis-aligned bounding box of each inner node encountered along that path.

k-d trees
The k-d tree construction algorithms from [ZSL18] and [ZSL19] are loosely based on original work by Vidal et al. [VMD08] which employs a summed area table to quickly determine the occupancy (i.e. the number of non-empty voxels) inside a volumetric region bound by an axis-aligned bounding box.We first introduce the multi-core parallel variant of the construction algorithm that will produce the exact same results as the algorithm by Vidal et al. (apart from certain parameters such as halting criteria etc. that we might set differently) and then briefly describe the adaptation of this algorithm to the GPU.

Multi-core construction
With the x64 multi-core CPU implementation that was first presented in [ZSL18] and later refined in [ZSL19], the volume is first decomposed into bricks of size 32 3 .A preclassified version where each voxel is only associated with a flag telling whether it is empty or not is derived from that whenever the transfer function changes.We then build a three-dimensional summed area table (a "summed volume table", SVT) over the binary preclassification for each brick.By choosing a brick size of 32 3 , an SVT will fit exactly into the L1 cache of an x64 CPU.The respective SVTs for each brick are built in parallel.
This SVT construction phase is followed by a second algorithmic phase where a k-d tree is built in a top-down fashion.We first determine a tight bounding box for the root node by shrinking the axis-aligned bounding box of the whole volume to tightly fit the non-empty voxels according to the current transfer function.This can be done by querying the SVTs to determine the occupancy inside the boxes.Occupancy queries are performed in parallel for each SVT that the bounding box overlaps.Therefore, local bounding boxes are computed in parallel per SVT and the result is later combined to form the overall bounding box.The shrinking procedure just compares the occupancy inside smaller boxes to the occupancy of the parent bounding box; if the occupancy is the same, shrinking was valid.A cost function is used to determine an optimal splitting plane by using sweeping and by inspecting the volume of the (tight) bounding box to the left and the right of the candidate planes.Certain halting criteria favor either shallow or deep trees-both of which we evaluate in Section 4-and can be set as parameters to the construction algorithm.The targeted size (height; number of nodes) of the hierarchy one can expect to influence both construction time and rendering performance.

Construction on GPUs
The CPU construction algorithm-would it be ported without modifications-is not well suited for execution on GPUs, at least not if the whole algorithm would be carried out in a single GPU kernel.In [ZSL19] we therefore proposed an adapted version of the algorithm that performs the plane sweeping and top-down construction phase of the algorithm on the CPU, while the procedure that finds tight bounding boxes around non-empty voxels is offloaded to the GPU.
The plane sweeping procedure thus involves starting two GPU kernels for both the left and the right half-space and for each plane tested against.To reduce the number of kernel calls, we use a binning approach.Binning has another crucial advantage: strategically aligning the bin boundaries on the same raster imposed by the SVTs, we will never consider plane candidates that would split an SVT into two halves and can thus just precompute the tight bounding box inside that macro cell once when the transfer function changes (as opposed to each time that we consider another plane candidate).We do this by initially computing SVTs that are however immediately discarded as soon as the local bounding box has been determined.In contrast to the CPU, where we optimized for L1 memory, on the GPU we explicitly perform the computations in shared memory, so that a macro cell will have a size of 8 3 .
We order the resulting bounding boxes on a z-order Morton curve.In order to determine a tight bounding box for a spatial region spanning multiple macro cells during splitting, we perform a parallel reduction on the GPU using Thrust.Since this is a 1-d operation, we need to check if the macro cells we consider are actually inside the volumetric region of interest.To minimize the number of macro cells to test, we first conservatively determine the first and last cell in the list that will definitely fall inside the region of interest by using the Morton code order of the list.Although the algorithm is per se not very well suited for GPUs due to the top-down recursion, we still achieve good GPU utilization during splitting as we perform the reduction twice for each candidate plane, and also because of the smaller size and thus larger amounts of the macro cells to reduce over.

Hybrid grids
Our benchmarks suggested that shallow k-d trees built with the original halting criteria proposed by Vidal et al. would result in effective space skipping data structures, but on contemporary hardware, deeper trees with a leaf node size of approximately 8 3 would perform even better.The approach by Vidal et al. was originally intended to be used in a way where macro blocks would be rendered using an outer loop, which calls a volume rendering shader per block.In [ZML19] we evaluated the outer loop approach against full ray traversal on the GPU; we found that full ray traversal with relatively deep trees would outperform the outer loop approach during rendering, but also that the construction times for deep trees was significantly higher than that for shallow trees.
We therefore proposed an alternative rendering strategy, which would combine a shallow k-d tree with only relatively few leaves with a global grid of macro cells.As the k-d tree construction algorithm effectively uses macro cells-each SVT can be thought of as a macro cell that stores occupancy information about its volumetric region-deriving a grid from that and sending it over to the GPU is straightforward and comes at no recognizable storage overhead.Instead of the original size of 32 3 that the construction algorithm uses, we found macro cells of size 16 3 in general to perform better and thus resample the grid to that size, which can still be done in constant time using the SVTs.We adapted the ray marching algorithm to perform full traversal per ray until we find a k-d tree leaf node, and inside that one use the global grid to skip over empty space with finer granularity.Our benchmarks suggest that the data structure is helpful in certain cases, specifically when the data sets are large (i.e.near the amount of available texture memory), or when empty space manifests as wholes inside the volume.Construction performance however was literally the same as that for shallow k-d trees.

Performance Comparisons
The main focus of this paper is a thorough comparison of the various data structures w.r.t.construction and rendering performance.To achieve this, we integrated them into the Virvo volume rendering library [SWWL01] and use the Visionaray library [ZWL17] to implement the low level ray tracing algorithms like k-d tree or BVH traversal.

Test setup
Our test system consists of an eight-core Intel Xeon Gold 5122 CPU system with 3.60 GHz clock frequency and an NVIDIA Titan-V graphics card with 12 GB GDDR video memory.The system is equipped with 96 GB DDR memory.For the k-d tree and hybrid grid construction algorithms, we deactivate simultaneous multithreading (marketed by Intel under the name "hyperthreading") and assign a single core to each thread using the numactl tool that comes with the Linux distribution installed on our test system.
For the evaluation we use the data sets and transfer function settings from Figure 1.Note that Figure 1 does not depict the exact view points we used for the evaluation.Rather than that, we use an orthographic camera and a view point that is zoomed in so that Figure 1: Data sets with spatial dimensions and occupancy (percentage of voxels that are visible given a certain transfer function) used to evaluate our method.We pick a number of different settings with transfer functions that favor different types of spatial arrangements of the visible voxels.Table 1: Statistics for the various data sets we use for the evaluation.We report the number of nodes and the height of the respective tree structures: Linear bounding volume hierarchies with leaves of size 8 3 .Shallow k-d trees where a leaf is created when the volume of the node gets below 10 % of the root node's volume during splitting; k-d trees with leaf volume less or equal 8 3 and a maximum leaf node size of 32 3 ; k-d trees with leaf volume less or equal 8 3 and a maximum leaf node size of 128 3 ; the k-d trees are built with different algorithms depending on whether they are built on the CPU or the GPU; on the GPU we use four bins to determine candidate split planes (cf.Section 3.2.2) the volume fills the whole rendering window.We use a CUDAbased renderer with the absorption plus emission model and postclassification transfer function lookups; for our tests we deactivate gradient shading and early-ray termination.For the benchmarks we use a rotating camera animation to account for and average out ef-fects like unfavorable offsets and strides when accessing 3-d textures from certain viewing angles.We render images into viewports of size 1024 × 1024 pixels.In addition to the rendering performance, we also evaluate the tree construction performance.

Data sets
We use a variety of different data set and transfer function combinations that are depicted in Figure 1.We deliberately choose combinations with a varying number of non-empty voxels.Since all algorithms are in-core, the largest data sets we test with have a size of 2048 3 voxels, which amounts to 8 GB when voxels are stored with 8 bit precision.The Richtmeyr-Meshkov instability specifically has two prominent large regions that are either empty, or not empty but totally homogeneous.We expect the different data structures to adapt differently to the large amount of empty space.The Menger Sponge data set we consider a hard case for most empty space skipping data structures as the empty space is contained inside the volume, a spatial arrangement which is known to be particularly ill suited for typical k-d tree builders.

Parameters
We consider three different types of k-d trees: We build shallow trees where we only split a node if its volume is above 10 % of the root node's volume (that is the original halting criterion proposed by Vidal et al. [VMD08]).This setup will create large volume chunks which will potentially contain lots of empty space.The two remaining setups employ a halting criterion where a node is split whenever its volume is above 8 3 and will result in deeper trees.
The greedy heuristic will in certain cases accept a local optimum and thus stop the recursion due to the cost function (cf. Figure 2), even though the resulting leaf's extent is quite large and still bounds a substantial amount of empty space.That in particular happens when empty space is contained inside the volume and was already pointed out by Vidal et al. [VMD08] This generally undesirable behavior can be mitigated by enforcing a split when the leaf node would otherwise exceed a certain size.We perform benchmarks for configurations where we enforce a maximum leaf node size of either 32 3 or 128 3 ; when the cost function reports a leaf that exceeds  For our tests, we disable gradient shading and early-ray termination.For the absolute timing results cf.Table 2.
the maximum size, we just split it in two at the middle position of the axis where the spatial extent is largest and continue the recursion.Note that this procedure will also enforce splitting large areas of homogeneous or non-empty space (like it is present in the Richtmeyr-Meshkov and Menger Sponge data sets).The increase in tree depth due to this behavior may be potentially undesirable, which is why we include results for two different maximum leaf node sizes.See Table 1 for some statistics regarding the tree height and number of nodes when building the data structures for our test data sets.
Like in [ZSL19], the GPU construction algorithm uses binning, while the CPU construction algorithm does not.We set the number of bins to 4. See [ZSL19] for a thorough evaluation of the influence of binning on construction and rendering performance.We report the rendering performance in Figure 3 and the construction performance in Figure 4.The absolute numbers plotted in the two figures can also be found in Table 2.

Discussion and Limitations
Our benchmarks suggest that construction with the LBVH algorithm will be very fast, but as this algorithm makes the least informed decision as of the position where to perform the split will generally result in inferior spatial indices.Out of the several data structures that use SVTs to find tight bounding boxes, we see a clear correlation between the depth of the resulting tree and the construction performance, but also observe that deeper trees will generally be superior to more shallow ones.
The problem with the algorithms accepting local optima can only partially be mitigated using a maximum leaf size.In particular, this strategy might fail when the data set is huge and extra splits of homogeneous space that would otherwise have been bound by a single leaf cause the resulting spatial index to be deeper and contain far more leaf nodes than without using this strategy.This behavior can be seen in Figure 2 for the Richtmeyr-Meshkov data set, where the homogeneous space to the bottom of the data set is split excessively and to no avail.In the future, we intend to investigate alternative strategies to circumvent this problem.l−g(0ec.)

Aneurism
Figure 4: Build rates for the various data sets and hierarchy construction algorithms.Note the logarithmic scale of the y-axis.For the absolute timing results cf.Table 2.
Our benchmarks also give an overview of how effective the algorithms are given a certain amount of empty space and specific spatial arrangements.We already noted that empty space inside nodes is particularly hard to find for the construction algorithms.The Menger Sponge data set e.g.suffers from this problem and one can see that hardly any of the algorithms is effective at skipping empty space in this situation.
While the hybrid grid algorithm will usually not outperform the deep k-d tree construction algorithms, its construction rate is in general superior to the latter.Hybrid grids-for larger data setsenable frame rates that fall in-between those of shallow and deep kd trees.For that reason, we consider them an interesting data structure.As the shallow k-d tree construction algorithm is in most cases almost as fast as the LBVH algorithm, it might be interesting to combine this construction algorithm with the hybrid grid algorithm instead of using k-d tree construction on the CPU.

Conclusion
In this paper, we thoroughly compared the various algorithms to construct spatial indices for empty space skipping and structured volumes we proposed in our prior work.We proved that the construction algorithms that make a more informed decision than just splitting in the middle along one access result in superior space skipping hierarchies.The paper may serve as a guideline for pracitioners to decide which algorithm to choose depending on the specific problem.A general limitation of the top-down construction algorithms is that the greedy heuristic may accept local optima, which can be partially mitigated by enforcing a maximum leaf node size.We however showed that this strategy might fail in certain situations and consider an alternative solution to that problem interesting future work.

Figure 2 :
Figure 2: Enforcing a maximum leaf size to avoid unfavorable local optima.Top row, left: Xmas tree data set with bounding box overlay.Building a deep k-d tree for this data set and transfer function combination will cause the splitting heuristic to accept a local optimum resulting in a single very large leaf node (middle).Enforcing a maximum leaf node size can mitigate this issue (right).If the volume consists of regions that are non-empty (bottom row, left), enforcing a maximum leaf size can however have the undesirable effect of the hierarchy growing exceptionally deep and wide and splitting regions containing no empty space (middle).A sensible choice for the maximum leaf size depends on the data set and the transfer function (right).We consider investigating this issue and potentially finding better solutions interesting future work.

Figure 3 :
Figure3: Rendering performance obtained with our benchmarks.We render rotating orthographic views with a 1024 × 1024 pixel viewport.For our tests, we disable gradient shading and early-ray termination.For the absolute timing results cf.Table2.
al. [HBJP12, HAAB * 18] or the grid-based solution that can be found in OSPRay [WJA * 17].The system by Hadwiger et al. and the OSPRay implementation represent two different approaches to fundamentally solve the same problem;