
ParaView Remote Rendering Guidelines
Version 1.0
Srijith Rajamohan and Nicholas F. Polys
Advanced Research Computing, Virginia Tech
Feb 6, 2015






Introduction

ParaView allows remote parallel rendering for scalable and fast rendering of large datasets. Currently this feature is enabled on both the Blueridge and Hokiespeed clusters. This document intends to provide some usage guidelines for both structured and unstructured meshes. These guidelines pertain to the choice of filters and the order in which they should be applied for optimal performance. The use of the D3 filter [1] for unstructured meshes is illustrated. Also, some pointers are provided on how to partition data for remote visualization on Blueridge.

Timing results from two cases are provided below to showcase the scaling properties of ParaView. It must be pointed out that run-times depend on the data format, the type and number of filters applied in ParaView [1], and the load on the processors. Not all file formats can be expected to scale in a cluster environment. The profiling information presented below is intended to allow users to make informed decisions for their particular use case.

1. Methodology

ParaView was profiled using both structured and unstructured meshes on Blueridge, which is currently equipped with 4 GPU nodes, br131 - br134. The GPUs are Tesla K40s belonging to the Kepler architecture. The files were located on the Lustre [8] file system and rendering was done in client-server mode. Remote rendering was set up in accordance with the instructions provided in the document 'Parallel ParaView Instructions' located at http://www.arc.vt.edu/resources/vis/ParaviewInstructions.pdf. All simulations were automated using the Python API and timing was done using the Python 'time' module. This included loading the file, applying the filters, and rendering the time-steps in the simulation. The process was scripted to ensure accuracy in timing and reproducibility. Visualization jobs are restricted to 2 nodes and 32 processors using the 'Vis_q' queue on Blueridge; the case using 48 processors for the structured mesh was intended only as a demonstration of scalability.
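The timing harness itself was not published with this document, but a minimal sketch of such a script, written against the paraview.simple Python API, is shown below; the file path and the choice of a clip filter are placeholders rather than the actual test setup.

```python
import time
from paraview.simple import *

start = time.time()

# Load the dataset (placeholder path); OpenDataFile picks a reader
# based on the file extension.
reader = OpenDataFile('/path/to/dataset.vtm')

# Apply a representative filter and show the result.
clip = Clip(Input=reader)
view = GetRenderView()          # returns (or creates) a render view
Show(clip, view)

# Render every time-step in the series.
for t in reader.TimestepValues:
    view.ViewTime = t
    Render(view)

print('Total runtime: %.4f s' % (time.time() - start))

# Clean up between repetitions, as described in the text.
Delete(clip)
Delete(reader)
```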

The timing results displayed in Tables 1 and 2 include the time taken to load the files, apply the filters, and render the time-series data. After timing results are obtained, the filters and files are deleted. This process is repeated three times, and the results are noted in columns four, five, and six of Tables 1 and 2. The scaling properties of structured and unstructured meshes are investigated for remote rendering; to help illustrate this, the averaged runtimes are plotted in Fig. 4 for the structured mesh and Fig. 5 for the unstructured mesh.

2. Rendering test cases

This section summarizes the two test cases that were used to profile remote rendering using ParaView on Blueridge.

2a. Structured mesh


The results of remote rendering using a structured mesh with 3,449,952 points and 3,145,728 hexahedral cells are shown below in Table 1. The structured data comes from 96 processors, with each processor contributing one file per time-step, for a total of 9700 files. The cumulative size of the files was observed to be 28 GB. The data was opened using the 'XMLMultiBlockDataReader' provided in the ParaView Python API. Visualization of time-dependent structured mesh data scales well with the addition of more processors, albeit not linearly. Slight variations can be expected in run-times based on processor load and network delays. The dataset, which was made available to us by Dr. Scott King [5], consists of viscosity, stress, temperature, velocity, and vorticity information that illustrates the time evolution of the planet Venus. The dataset consists of 100 time-steps, and the visualization included the streamline, slice, and glyph filters in ParaView (Fig. 1a). The structure and quality of the mesh can be observed in Fig. 1b.
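A rough sketch of that pipeline in the paraview.simple API follows; the file path and the 'velocity' vector array name are assumptions, and glyph/stream-tracer property names vary somewhat between ParaView versions.

```python
from paraview.simple import *

# Multiblock time-series reader named in the text above (placeholder path).
reader = XMLMultiBlockDataReader(FileName=['/path/to/case.vtm'])

# Slice through the domain with an axis-aligned plane.
slc = Slice(Input=reader)
slc.SliceType.Normal = [0.0, 0.0, 1.0]

# Arrow glyphs on the slice.
glyphs = Glyph(Input=slc, GlyphType='Arrow')

# Streamlines traced through the full volume; 'velocity' is an assumed name.
streams = StreamTracer(Input=reader)
streams.Vectors = ['POINTS', 'velocity']

for src in (slc, glyphs, streams):
    Show(src)
Render()
```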

2b. Unstructured mesh

This exercise was repeated for an unstructured mesh with 2,067,633 points and 12,790,098 tetrahedral cells over 18 time-steps. The unstructured data exists as a single XDMF file of size 11 GB, which includes the data corresponding to all 18 time-steps. The mesh was generated using Constructive Solid Geometry (CSG), and data in the form of random noise was added to the mesh as a time-series. A clip filter was applied to the data in ParaView, and the profiling results are displayed in Table 2. Fig. 2a displays the solution through a clip plane and Fig. 2b illustrates the dense unstructured spatial discretization. Although a structured mesh is implicitly decomposed across processes by ParaView, this does not currently happen for an unstructured mesh; the D3 filter [2] must be applied explicitly.

If the data is already decomposed into 'N' partitions and is in a format that ParaView can read in parallel [3], rendering can be performed on up to 'N' processors.

Figure 1a - Structured mesh solution
Figure 1b - Structured mesh


If the data is contained in a single large file, and assuming it will fit into the memory available on a single node, running the D3 filter has the effect of creating load-balanced partitions of the data. These can then be saved in the partitioned '*.pvtu' file format [3]. The disadvantage of this approach is that it tends to generate a large number of files.
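A sketch of this workflow in the Python API is given below (paths are placeholders). Note that the script must itself be run with the desired number of parallel processes, since D3 creates one partition per process.

```python
from paraview.simple import *

# Read the single large unstructured file (placeholder path).
reader = OpenDataFile('/path/to/unstructured.xmf')

# Redistribute the data into load-balanced partitions, one per MPI process.
d3 = D3(Input=reader)

# Write the partitioned data out as '*.pvtu' for later parallel runs.
writer = CreateWriter('/path/to/partitioned.pvtu', d3)
writer.UpdatePipeline()
```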

Load balancing can be checked by coloring the loaded dataset by 'vtkProcessId'. This colors the mesh by process ID, which is essentially the MPI rank of the process that owns that portion of the data (Fig. 3).
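Continuing the sketch above, this check might look as follows; the ColorBy helper and the exact array name can vary slightly between ParaView versions.

```python
# Color the D3 output by the MPI rank that owns each piece.
rep = Show(d3)
ColorBy(rep, ('POINTS', 'vtkProcessId'))
Render()
```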

Figure 2a - Unstructured mesh solution
Figure 2b - Unstructured mesh

Figure 3 - Unstructured mesh partitioning using D3


When the partitioned data is read back, the pieces are distributed among however many processes are spawned, irrespective of the number of partitions that D3 created. In this experiment the dataset was partitioned into 16 pieces for a total of 326 files. However, one cannot make use of more processes than there are data partitions.

3. Results

A summary of the runtimes, along with the number of nodes and processors used, is shown below.

3a. Results for structured mesh

The results in Table 1 indicate strong scaling properties, as can be observed from the fact that the total runtime goes down with an increase in the number of processors from 1 to 48. Fig. 4 illustrates the general trend in scaling on the Blueridge GPU-equipped nodes. A speedup of roughly 8x is observed when the number of processors is increased from 1 to 48.

It should be noted for the two cases with 16 processors in rows 5 and 6 that intranode processing results in faster rendering, as this avoids the overhead associated with internode communication.

Table 1. Profiling information for a structured mesh with 3,449,952 points and 100 time-steps

Number of processors    Nodes    Processors per node    Run 1 (s)    Run 2 (s)    Run 3 (s)
1                       1        1                      1035.7882    1025.2994    989.8697
2                       1        2                      1113.3154    1118.5557    944.7309
4                       1        4                      549.1688     525.448      549.9214
8                       1        8                      320.3603     324.0742     317.4129
16                      1        16                     198.6750     185.6521     165.5376
16                      2        8                      214.7970     204.7585     209.7485
32                      2        16                     122.6229     122.0907     123.2164
48                      3        16                     116.7357     113.2833     111.9469

Note: Be wary of the large number of data files generated with D3.
Note: Internode rendering is always slower than intranode rendering.


3b. Results for unstructured mesh

The results of rendering the unstructured mesh are shown in Table 2, and the scaling properties are illustrated using average total runtimes in Fig. 5. The mesh was decomposed into 16 pieces for the runtimes shown in the first six rows of Table 2. A speedup of roughly 6 was observed when the number of processors was increased from 1 to 32. However, it can be noticed that the runtimes for 16 processors in row 5 and 32 processors in row 6 are roughly the same. Adding more processors has no impact because ParaView does not implicitly run D3 on an unstructured mesh to distribute data to the 16 additional processes. When the original data is manually partitioned using D3 into 32 pieces and the experiment is rerun, as can be seen in row 7, runtimes go down from around 41 s to 26 s.

Figure 4 - Scaling for structured mesh (averaged total runtimes in seconds vs. number of processors: 1, 2, 4, 8, 16, 32, 48)


Table 2. Profiling information for an unstructured mesh with 2,067,633 points and 18 time-steps

Number of processors    Nodes    Processors per node    Run 1 (s)    Run 2 (s)    Run 3 (s)
1                       1        1                      252.4529     256.6944     243.1041
2                       1        2                      217.6702     242.4578     215.1074
4                       1        4                      133.8851     151.6513     145.6353
8                       1        8                      88.4758      86.5699      91.1222
16                      1        16                     41.1658      41.1668      40.4798
32 (16 partitions)      2        16                     41.4443      40.2669      40.7539
32 (32 partitions)      2        16                     26.7415      25.8786      25.6409

Figure 5 - Scaling for unstructured mesh (averaged total runtimes in seconds vs. number of processors: 1, 2, 4, 8, 16, 32)


4. Notes

The following are guidelines/best practices that can be used to ensure optimal performance when performing remote visualization using ParaView. Note that these are general guidelines for handling various file formats and choosing filters; the user may observe slight variations depending on file size, number of files, processor load, and the type and order of filters in the visualization pipeline. ParaView is designed to take advantage of Gustafson's law [7] as opposed to Amdahl's law [6], and as such it benefits from a greater number of processors as the data size is increased [4]; however, strong scaling can still be observed to a certain extent in the problems tested.

4a. Notes on file size and partitioning

In general, a large number of small files, as opposed to a single big file, tends to bog down a file system and can affect performance not only for the user but for everyone sharing that file system.

It is important to note that not all file readers work in parallel. If ParaView is run in parallel and a reader that does not support parallel I/O is used, every process will read the entire dataset [3]. This is both slow and inefficient as far as the memory footprint is concerned. In this case, it is recommended to follow the approach outlined in Section 2b of applying the D3 filter explicitly to the data and writing it out as partitioned '*.pvtu' files.
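Reading the pre-partitioned output back is then straightforward (placeholder path); the '*.pvtu' reader works in parallel, so each process reads only its own pieces.

```python
from paraview.simple import *

# The partitioned reader distributes pieces across the parallel processes.
reader = OpenDataFile('/path/to/partitioned.pvtu')
Show(reader)
Render()
```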

It was recommended in [1] to use one processor per 500,000 cells for unstructured meshes and one processor per 1 million cells for structured meshes; by that rule, for example, the 12,790,098-cell unstructured mesh in Section 2b would call for roughly 26 processors. However, stronger scaling than this rule suggests, as illustrated in Tables 1 and 2, was observed in our testing of both structured and unstructured meshes.

4b. Note on Filter usage in ParaView

Application of filters to large structured and unstructured data should be done with care. Filters that generate new data or extract a subset of the existing data tend to increase memory usage. Also, one should be wary of filters that convert structured data to unstructured data, which has a higher memory footprint [1]. The following filters write out unstructured data that is roughly the same size as the input.

• Append Datasets
• Append Geometry
• Clean
• Clean to Grid
• Connectivity
• D3
• Delaunay 2D/3D
• Extract Edges
• Linear Extrusion
• Loop Subdivision
• Reflect
• Rotational Extrusion
• Shrink
• Smooth
• Subdivide
• Tessellate
• Tetrahedralize
• Triangle Strips
• Triangulate

Additional notes:
• The D3 filter is not implicitly applied to an unstructured dataset; it must be applied explicitly in the pipeline.
• Load balancing must be checked after this is done by viewing 'vtkProcessId' on the dataset.
• Use caution when applying these filters to large structured and unstructured files.
• Check memory usage on the clusters.
• A larger number of time-steps can exacerbate these effects.

It is recommended to use, where possible, filters that reduce the dimension of the data (e.g. from 3D to 2D) early in the visualization pipeline, as shown in the sketch below. Also, there are filters such as the 'Temporal Interpolator' that have to load multiple time-steps at once, which can double (or more) the memory usage [1]. Any filter that must compute a temporal metric can take an impractical amount of time for large datasets.
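As an illustration of ordering filters this way, the hedged sketch below slices first and contours afterwards; the path, the array name, and the isosurface value are placeholders.

```python
from paraview.simple import *

reader = OpenDataFile('/path/to/large_volume.vtm')   # placeholder path

# Reduce the 3D volume to a 2D slice first...
slc = Slice(Input=reader)

# ...then run the more expensive filter on the much smaller slice.
contour = Contour(Input=slc)
contour.ContourBy = ['POINTS', 'temperature']   # assumed array name
contour.Isosurfaces = [0.5]                     # assumed value

Show(contour)
Render()
```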


References

1. ParaView User's Guide, Version 4, 2012.
2. http://www.paraview.org/ParaView3/Doc/Nightly/www/py-doc/paraview.simple.D3.html
3. http://www.paraview.org/Wiki/ParaView:FAQ#Which_readers.2Ffilters_work_in_parallel_.3F
4. Big Data Analysis with ParaView, http://extremecomputingtraining.anl.gov/files/2013/07/DOEPV.pdf
5. http://www.geophys.geos.vt.edu/sking07/VT_Geodynamics/Welcome.html
6. Amdahl, G. M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20). AFIPS Press, Reston, Va., 1967, pp. 483-485.
7. Gustafson, J. L. Reevaluating Amdahl's law. Commun. ACM 31, 5 (May 1988), 532-533. http://doi.acm.org/10.1145/42411.42415
8. Braam, Peter J. "The Lustre storage architecture." (2004).