
Sparse Matrices in Matlab*P:

Design and Implementation

Viral Shah and John R. Gilbert

{viral,gilbert}@cs.ucsb.edu*

Department of Computer Science, University of California, Santa Barbara

Abstract. Matlab*P is a flexible interactive system that enables computational scientists and engineers to use a high-level language to program cluster computers. The Matlab*P user writes code in the Matlab language. Parallelism is available via data-parallel operations on distributed objects and via task-parallel operations on multiple objects. Matlab*P can store distributed matrices in either full or sparse format. As in Matlab, most matrix operations apply equally to full or sparse operands. Here, we describe the design and implementation of Matlab*P's sparse matrix support, and an application to a problem in computational fluid dynamics.

Introduction

Matlab is a widely used tool in scientific computing. It began in the 1970s as an interactive interface to EISPACK and LINPACK. Today, Matlab encompasses several modern numerical libraries such as ATLAS and FFTW, rich graphics capabilities for visualization, and several toolboxes for such domains as control theory, finance, and computational biology.

Almost all of today's supercomputers are based on parallel architectures. Companies such as IBM, Cray, and SGI sell supercomputers with proprietary interconnects. Commodity clusters are omnipresent in research labs today. However, the tools used to program them are still predominantly Fortran and C with MPI or OpenMP.

Matlab*P brings interactivity to supercomputing. There have been several efforts in the past to parallelize Matlab. The parallel Matlab survey [6] discusses most of these projects. Perhaps the most notable project that provides a large-scale integration of parallel libraries with a Matlab interface is NetSolve [2]. NetSolve provides an interface by invoking RPC calls through a special Matlab function, whereas Matlab*P takes a unique approach to parallelization. The Matlab*P language is a superset of the Matlab

* This material is based on research sponsored by Air Force Research Laboratories under agreement number AFRL F30602-02-1-0181. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon.


language, and parallelism is propagated through programs using the dlayout object, p. Where these systems use the same underlying packages, we believe that they can all achieve similar performance. However, the systems otherwise have different design goals, and it is unfair to compare them.

Sparse matrices may have dimensions in the millions and enough non-zeros that they cannot fit on one workstation. Sometimes the sparse matrices are themselves not too large, but due to the fill-in caused by intermediate operations (e.g., LU factorization), it becomes necessary to distribute the factors over several processors. Iterative methods may be a better way to solve such large sparse systems. The goal of sparse matrix support in Matlab*P is to allow the user to perform operations on sparse matrices in the same way as in Matlab.

1 User’s View

In addition to Matlab's sparse and dense matrices, Matlab*P provides support for distributed sparse (dsparse) and distributed dense (ddense) matrices. The system design of Matlab*P and operations on ddense matrices are described elsewhere [12, 7].

The p operator provides for parallelism in Matlab*P. For example, a random parallel dense matrix (ddense) distributed by rows across processors is created as follows:

>> A = rand (100000*p, 100000)

Similarly, a random parallel sparse matrix (dsparse), also distributed across processors by rows, is created as follows (an extra argument specifies the density of non-zeros):

>> S = sprand (1000000*p, 1000000, 0.001)

We use the overloading facilities in Matlab to define a dsparse object. The Matlab*P language requires that all operations that can be performed in Matlab be possible with Matlab*P. Our current implementation provides a working basis, but is not quite a drop-in replacement for existing Matlab programs.

Matlab*P achieves parallelism through polymorphism. Operations on ddense matrices produce ddense matrices. But once initiated, sparsity propagates. Operations on dsparse matrices produce dsparse matrices. An operation on a mixture of dsparse and ddense matrices produces a dsparse matrix unless the operator destroys sparsity. The user can explicitly convert a ddense matrix to a dsparse matrix using sparse(A). Similarly, a dsparse matrix can be converted to a ddense matrix using full(S). A dsparse matrix can also be converted into a Matlab sparse matrix using S(:,:) or p2matlab(S). In addition to the data-parallel SIMD view of distributed data, Matlab*P also provides a task-parallel SPMD view through the so-called "MM-mode".
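For example, the following session (our illustration of these rules, with hypothetical sizes) mixes the two types and converts between them:

>> S = sprand (1000*p, 1000, 0.01);  % dsparse
>> A = rand (1000*p, 1000);          % ddense
>> B = S .* A;                       % dsparse .* ddense: sparsity survives, B is dsparse
>> C = full (S);                     % explicit conversion to ddense
>> D = sparse (A);                   % explicit conversion to dsparse
>> M = p2matlab (S);                 % back to a sequential Matlab sparse matrix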

Matlab*P currently also offers some preliminary graphics capabilities to help users visualize dsparse matrices. This is based upon the parallel rendering for ddense matrices [5]. Again, this demonstrates the philosophy that Matlab*P should feel like Matlab. Figure 1 shows the spy plots (plots of the non-zero structure of a matrix) of a web crawl matrix in Matlab*P and in Matlab.

Fig. 1. Matlab and Matlab*P spy plots of a web crawl dsparse matrix

2 Data Structures and Storage

Matlab stores sparse matrices on a single processor in a Compressed Sparse Column (CSC) data structure [10]. The Matlab*P language allows for matrices to be distributed by block rows or block columns. This is already the case for ddense matrices [12, 7]. The current implementation supports only one distribution for dsparse matrices: by block rows. This is a design choice to prevent the combinatorial explosion of argument types. Block layout by rows makes the Compressed Sparse Row (CSR) data structure a logical choice to store the sparse matrix slice on each processor. The choice to use a block row layout is not arbitrary, but based on the following observations:

– The iterative methods community largely uses row-based storage. Since we believe that iterative methods will be the methods of choice for large sparse matrices, we want to ensure maximum compatibility with existing code.

– A row-based data structure also allows efficient implementation of matvec (sparse matrix-dense vector product), which is the workhorse of several iterative methods such as Conjugate Gradient and Generalized Minimal Residual.

By default, a dsparse matrix in Matlab*P has the block row layout which would be obtained by ScaLAPACK [3] for a ddense matrix of the same dimensions. This allows for roughly the same number of rows on each processor. The user can override this block row layout in a couple of ways. The Matlab sparse function takes arguments specifying a vector of row indices i, a vector of column indices j, a vector of non-zero values v, the number of rows m, and the number of columns n as follows:

>> S = sparse (i, j, v, m, n)

By using a layout vector which specifies the number of rows on each processor, instead of the scalar m which is simply the number of rows, the user can create a dsparse matrix with the desired layout:

>> S = sparse (i, j, v, layout, n)

The block row layout of a dsparse matrix can also be changed after creation with:

>> changelayout (S, newlayout)
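For illustration (a hypothetical session on 4 processors, with i, j, v as above), the layout vector must sum to the total number of rows:

>> layout = [400 300 200 100];           % rows per processor; sums to m = 1000
>> S = sparse (i, j, v, layout, 1000);   % dsparse with a custom block row layout
>> changelayout (S, [250 250 250 250]);  % rebalance to equal-sized blocks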

The CSR data structure stores whole rows contiguously in a single array on each processor. If a processor has nnz non-zeros, CSR uses an array of length nnz to store the non-zeros and another array of length nnz to store column indices, as shown in Figure 2. Row boundaries are specified by an array of length m + 1, where m is the number of rows on that processor.


Fig. 2. Compressed Sparse Row (CSR) data structure
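As a concrete illustration (ours) of the three arrays, consider a processor that owns the 3 × 4 slice below; indices are 1-based as in Matlab:

    A = [ 1  0  2  0        non-zeros:    [ 1 2 3 4 5 6 ]
          0  0  3  0        column index: [ 1 3 3 1 2 4 ]
          4  5  0  6 ]      row pointer:  [ 1 3 4 7 ]

Row i occupies positions rowptr(i) through rowptr(i+1) - 1 of the first two arrays, so row 1 holds positions 1-2, row 2 holds position 3, and row 3 holds positions 4-6.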

Assuming a 32-bit architecture and using double precision floating point values for the non-zeros, an m × n real sparse matrix with nnz non-zeros uses up about 12nnz + 4m bytes of memory: 8 bytes per non-zero value plus 4 bytes per column index account for the 12nnz term, and the 4-byte row pointers for the 4m term. Support for complex sparse matrices will be available soon in Matlab*P.
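For the CFD matrix of Section 4 (m = 20,625 and nnz = 3,812,450), for example, this formula gives 12 × 3,812,450 + 4 × 20,625 ≈ 45.8 MB.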

It would be simple to modify this data structure to allow some slack in each row so that element-wise insertion, for example, could be efficient. However, the current implementation uses the simplest possible data structure for robustness and efficiency in matrix and vector operations.

3 Operations and Implementation

In this section, we describe the implementation of several sparse matrix operations in Matlab*P. All experiments are performed on a cluster of 2.6 GHz Pentium IV Xeon processors with 3 GB RAM and a gigabit Ethernet interconnect.

3.1 Parallel Sorting


Fig. 3. Starching

Fig. 4. Scalability of sparse (Altix). [Plot omitted: time (secs) against the number of processors (2 through 16), for matrices with 1e7 nnz (4/row) and 1e6 nnz (32/row).]

Sorting of ddense vectors is an important building block for several parallel sparse matrix operations in Matlab*P. Sorting is an extremely important primitive for parallel irregular data structures in general. Several parallel sorting algorithms have been surveyed in the literature [4, 8]. Cluster computers have distributed memory, and the interconnect typically has high latency and low bandwidth compared to shared memory computers. As a result, it is extremely important to minimize the communication while sorting. We are experimenting with SampleSort with different sampling techniques, and with other median-based sorting ideas. Although we already have an efficient parallel sorting algorithm in Matlab*P, we are still trying to improve it, given the importance of fast sorting. We will describe results from our efforts in a future paper.

Several sorting algorithms produce a data distribution that is different from the input distribution. Hence a reshuffling of the sorted data is often required. We refer to this process as starching, as shown in Figure 3. The distribution after sorting is on the left, whereas the desired distribution is on the right in the bipartite graph. The weights on the edges of the bipartite graph show the communication required between pairs of processors during starching. This step is required to ensure consistency of Matlab*P's internal data structures.
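Since sorted data is already globally ordered, the starching schedule follows from prefix sums alone. A minimal sequential sketch (ours, not Matlab*P's internal code) that computes the edge weights of the bipartite graph:

function T = starch_schedule (have, want)
    % have(i): element count on processor i after sorting
    % want(j): element count processor j should own afterwards
    np = length (have);
    T = zeros (np);                 % T(i,j) = elements sent from i to j
    hi = [0 cumsum(have(:)')];      % global index range held by each source
    wi = [0 cumsum(want(:)')];      % global index range owned by each target
    for i = 1:np
        for j = 1:np
            lo = max (hi(i), wi(j));
            up = min (hi(i+1), wi(j+1));
            T(i,j) = max (0, up - lo);   % overlap of the two ranges
        end
    end

For example, starch_schedule([3 5 2], [4 3 3]) reports that processor 2 keeps 3 elements, sends 1 to processor 1, and sends 1 to processor 3.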

3.2 Constructors

There are several ways to construct parallel sparse matrices in Matlab*P:

1. matlab2pp converts a sequential Matlab matrix to a distributed Matlab*P matrix. If the input is a sparse matrix, the result is a dsparse matrix.


function s = sparse (i, j, v)
    % Sort by column index first, then by row index. Since the second
    % sort is stable, the triples end up ordered with the row number as
    % the primary key and the column number as the secondary key.
    [j, perm] = sort (j);
    i = i(perm); v = v(perm);
    [i, perm] = sort (i);
    j = j(perm); v = v(perm);
    % Redistribute so that each processor owns whole rows in the block
    % row layout, then build the local CSR structures.
    starch (i, j, v);
    s = assemble (i, j, v);

Fig. 5. Implementation of sparse

2. sparse – sparse works with [i, j, v] triples, which specify the row index, the column index, and the non-zero value, respectively. If i, j, v are ddense vectors with nnz non-zeros, then sparse assembles a sparse matrix with nnz non-zeros. If there are duplicate [i, j] indices, the corresponding values are summed (a small session illustrating this follows the list). The pseudocode for sparse is shown in Figure 5. In our implementation, the two sorts are performed simultaneously, using row numbers as the primary key and column numbers as the secondary key. The starch phase here is similar to the starching used in the parallel sort, except that it redistributes the vectors so that row boundaries do not overlap among processors and the required block row distribution for the sparse matrix is achieved. The assemble phase actually constructs a dsparse matrix and fills it with the non-zero values. Figure 4 shows the scalability of sparse on an SGI Altix 350. Although performance on commodity clusters cannot be as good as on an Altix, our initial experiments do indicate good scalability on commodity clusters too. We will report more detailed performance comparisons in a future paper.

3. spones, speye, spdiag, sprand, etc. – Some basic functions implicitly construct dsparse matrices.
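To illustrate the duplicate-summation semantics of sparse mentioned in item 2, which makes it convenient for assembling, e.g., finite element matrices, here is a small sequential session (our illustration):

>> i = [1 2 2 3]; j = [1 1 1 3]; v = [10 20 30 40];
>> full (sparse (i, j, v, 3, 3))

ans =

    10     0     0
    50     0     0
     0     0    40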

3.3 Matrix Arithmetic

One of the goals in designing a sparse matrix data structure is that, wherever possible, it should support matrix operations in time proportional to flops. As a result, arithmetic on dsparse matrices is performed using a sparse accumulator (SPA). Gilbert, Moler and Schreiber [10] discuss the design of the SPA in detail. Matlab*P uses a separate SPA for each processor.
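A SPA combines a dense value array, dense occupancy flags, and a list of occupied indices, so that each update costs O(1) and only the occupied entries are touched when gathering the result. A minimal sequential sketch (ours, following the description in [10]) that adds two sparse rows given as column/value pairs:

function [cols, vals] = spa_add_rows (c1, v1, c2, v2, n)
    val  = zeros (1, n);          % dense accumulator values
    occ  = false (1, n);          % dense occupancy flags
    list = [];                    % occupied columns, in insertion order
    pairs = [c1 c2; v1 v2];       % 2-by-k: column index over value
    for t = pairs                 % scatter both rows into the SPA
        c = t(1);
        if ~occ(c)
            occ(c) = true;
            list(end+1) = c;      %#ok<AGROW>
        end
        val(c) = val(c) + t(2);
    end
    cols = sort (list);           % gather occupied entries in column order
    vals = val(cols);

For example, spa_add_rows([1 3], [1 2], [3 4], [5 6], 4) returns cols = [1 3 4] and vals = [1 7 6].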

3.4 Indexing, Assignment and Concatenation

The syntax of matrix indexing in Matlab*P is the same as in Matlab. It is of the form A(p, q), where p and q can each be a range (1:n), a permutation vector, or a scalar. Depending on the context, however, this can mean different things.

>> B = A(p,q)

In this case, the indexing is done on the right side of "=", which specifies that B is a submatrix of A. This is the subsref operation in Matlab.

>> B(p,q) = A

On the other hand, indexing on the left side of "=" specifies that A should be stored in a submatrix of B. This is the subsasgn operation in Matlab.

If p and q are both integers, A(p, q) directly accesses the dsparse data structure. If p or q are vectors or a range, A(p, q) calls find and sparse. find is the reverse of sparse: it converts the matrix from CSR to [i, j, v] format. In this format, it is very easy to find the [i, j] pairs which satisfy the indexing criteria. The resulting submatrix is then assembled by simply calling sparse.
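A sequential sketch (ours) of the mechanism behind B = A(p, q), for index vectors p and q without repeated entries:

[i, j, v] = find (A);              % convert to [i, j, v] triples
[keepi, ri] = ismember (i, p);     % rows selected, with their new row index
[keepj, rj] = ismember (j, q);     % columns selected, with their new column index
k = keepi & keepj;                 % triples that survive the indexing
B = sparse (ri(k), rj(k), v(k), length (p), length (q));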

Matlab also supports horizontal and vertical concatenation of matrices. The following code, for example, concatenates A and B horizontally, C and D horizontally, and finally concatenates the results of these two operations vertically.

>> S = [ A B; C D ]

The basic primitives find and sparse are used to provide support for concatenation operations in Matlab*P.
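A sequential sketch (ours) of horizontal concatenation in terms of these primitives; vertical concatenation is analogous, with the row indices shifted instead:

function S = horzcat_sketch (A, B)
    % S = [A B]: shift B's column indices past the columns of A
    [ia, ja, va] = find (A);
    [ib, jb, vb] = find (B);
    S = sparse ([ia; ib], [ja; jb + size(A, 2)], [va; vb], ...
                size (A, 1), size (A, 2) + size (B, 2));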

3.5 Matvec

The matvec operation multiplies a dsparse matrix by a ddense column vector, producing a ddense column vector as a result. Matvec is the kernel for many iterative methods.

For the matvec y = Ax, we have A and x distributed across processors by rows. The submatrix of A at each processor will need a piece of x depending upon its sparsity structure. When matvec is invoked for the first time on a dsparse matrix A, Matlab*P computes a communication schedule for A and caches it. When more matvecs are performed using A, this communication schedule does not need to be recomputed, which saves some computing and communication overhead, at the cost of the extra space required to save the schedule. Matlab*P also overlaps the communication and computation during matvec. This way, each processor starts computing the result of the matvec as soon as it receives a piece of the vector from any other processor. Figure 6 also shows how matvec scales in Matlab*P, since it forms the main computational kernel for conjugate gradient.
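The schedule can be read off directly from the CSR column indices: a processor needs exactly those entries of x whose global indices appear among the column indices of its rows. A sequential sketch (ours) of this computation:

function need = matvec_schedule (colind, rowdist)
    % colind:  column indices of this processor's non-zeros (CSR cols array)
    % rowdist: (np+1)-vector; processor q owns x(rowdist(q) : rowdist(q+1)-1)
    np = length (rowdist) - 1;
    need = cell (np, 1);
    wanted = unique (colind);            % distinct entries of x we touch
    for q = 1:np
        in_q = wanted >= rowdist(q) & wanted < rowdist(q+1);
        need{q} = wanted(in_q);          % entries to fetch from processor q
    end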

Communication in matvec can be reduced by partitioning the graph of the sparse matrix. If fewer edges cross between processors, less communication is required during matvec. Matlab*P can use several of the available tools for graph partitioning. However, by default, Matlab*P does not perform graph partitioning during matvec. The philosophy behind this decision is similar to that in Matlab: reorganizing data to make later operations more efficient should be possible, but not automatic.


Fig. 6. Time per iteration of CG, scalability of matvec. [Plot omitted: time per iteration of CG (secs) against problem size (grid3d), for Matlab and Matlab*P.]

3.6 Solutions of Linear Systems

Matlab solves the linear system Ax = b with the matrix division operator, x = A\b. In sequential Matlab, A\b is implemented as a polyalgorithm [10], where every test in the polyalgorithm is cheaper than the next one.

1. If A is not square, solve the least squares problem.
2. Otherwise, if A is triangular, perform a triangular solve.
3. Otherwise, test whether A is a permutation of a triangular matrix (a "morally triangular" matrix); if so, permute it and solve it.
4. Otherwise, if A is Hermitian and has positive real diagonal elements, find a symmetric minimum degree ordering p of A, and perform the Cholesky factorization of A(p, p). If successful, finish with two sparse triangular solves.
5. Otherwise, find a column minimum degree order p, and perform the LU factorization of A(:, p). Finish with two sparse triangular solves.
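A sequential sketch (ours) of this dispatch in the Matlab language. Step 3 is omitted for brevity, the least squares branch is illustrated via the normal equations rather than Matlab's actual sparse QR, and the backslashes at the leaves act on triangular or factored matrices, where \ reduces to substitution:

function x = backslash_sketch (A, b)
    [m, n] = size (A);
    if m ~= n
        x = backslash_sketch (A' * A, A' * b);  % 1. least squares (illustrative)
    elseif istriu (A) || istril (A)
        x = A \ b;                              % 2. already triangular
    elseif isequal (A, A') && all (real (diag (A)) > 0)
        p = symamd (A);                         % 4. symmetric minimum degree
        [R, flag] = chol (A(p, p));             %    attempt sparse Cholesky
        if flag == 0
            x(p, 1) = R \ (R' \ b(p));          %    two sparse triangular solves
        else
            x = lu_solve (A, b);                %    not positive definite
        end
    else
        x = lu_solve (A, b);                    % 5. general square case
    end

function x = lu_solve (A, b)
    q = colamd (A);                             % column minimum degree order
    [L, U, P] = lu (A(:, q));                   % sparse LU with partial pivoting
    x(q, 1) = U \ (L \ (P * b));                % two sparse triangular solves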

Different issues arise in parallel polyalgorithms. For example, morally triangular matrices and symmetric matrices are harder to detect in parallel. One also expects to be able to use iterative methods. Designing the right polyalgorithm for \ in parallel is an active research problem. For now, Matlab*P uses a parallel general direct sparse solver for \, which is SuperLU DIST [14] by default, although a user can choose to use MUMPS [1] instead.

The open question at this point is whether Matlab*P should use preconditioned iterative methods to solve sparse linear systems instead of direct methods. Currently, iterative methods are not usable as a black box, and hence not yet suitable as a default for Matlab*P.

3.7 Iterative methods - Conjugate Gradient

Conjugate Gradient is an iterative method used to solve a symmetric positive definite system of equations. The same code is used for Matlab*P and Matlab, except that the input is dsparse in the Matlab*P case. In Figure 6, grid3d(k) is a routine from the meshpart [9] toolbox which returns a k^3 × k^3 symmetric positive definite matrix A with the structure of the k × k × k 7-point grid.
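A minimal unpreconditioned CG in the Matlab language (our sketch; the identical source runs under Matlab*P when A and b are distributed):

function x = cg_sketch (A, b, tol, maxit)
    % The matvec A*d is the only operation that touches the
    % (possibly distributed) sparse matrix.
    x = zeros (size (b));
    r = b;  d = r;  rho = r' * r;
    for k = 1:maxit
        q = A * d;                     % matvec: the computational kernel
        alpha = rho / (d' * q);
        x = x + alpha * d;
        r = r - alpha * q;
        rho_new = r' * r;
        if sqrt (rho_new) <= tol * norm (b), break; end
        d = r + (rho_new / rho) * d;
        rho = rho_new;
    end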

4 An Application in Computational Fluid Dynamics

Fig. 7. Geometry of the Hele-Shaw cell (left). The heavier fluid is placed above the lighter one. Either one of the fluids can be the more viscous one. Mesh point distribution in the computational domain (right). A Chebyshev grid is employed in the y-direction, and compact finite differences in the z-direction.

We are using a prototype version of Matlab*P in collaboration with a number of domain scientists for applications in computational science and engineering. We describe one such application here.

Goyal and Meiburg [11] are studying the influence of viscosity variations on the density-driven instability of two miscible fluids. The two fluids, of different density and viscosity, are in a vertical Hele-Shaw cell, as shown in Figure 7. This problem is used to model porous media flows and finds applications in enhanced oil recovery, fixed bed regeneration, and groundwater flows.

Figure 7 shows the discretization of the problem, which yields an algebraic system of the form Aφ = σBφ. The eigenvalue σ represents the growth rate of the perturbations, while the eigenvector φ reflects the shape of the perturbations. A positive (negative) eigenvalue indicates unstable (stable) behavior. The system has a 5 × 5 block structure reflecting the 5 variables at each mesh point (3 velocity components u, v and w, relative concentration of the heavier fluid c, and pressure p).

A discretization of 165 × 25 points turns out to be sufficient for this problem. Since we solve for 5 variables at each grid point, the matrix A is of size 20,625 × 20,625. The number of non-zeros is 3,812,450. The matrix is unsymmetric, both in values and in nonzero structure, as shown in the spy plots in Figure 8. In order to calculate the largest eigenvalue, we use the power method with shift and invert in Matlab*P.

Fig. 8. Spy plots of the matrices A and B

function lambda = peigs (A, B, sigma, iter)
    % Power method with shift and invert: the dominant eigenvalue of
    % inv(A - sigma*B)*B corresponds to the eigenvalue of A*phi = lambda*B*phi
    % closest to the shift sigma.
    [m, n] = size (A);
    C = A - sigma * B;
    y = rand (n*p, 1);             % random distributed starting vector
    for k = 1:iter
        q = y ./ norm (y);         % normalize
        v = B * q;
        y = C \ v;                 % distributed sparse direct solve
        theta = dot (q, y);        % estimate of 1 / (lambda - sigma)
        res = norm (y - theta * q);
        if res <= 0.0001, break; end
    end
    lambda = 1 / theta + sigma;

Fig. 9. Matlab*P code for power method with shift and invert

The original non-Matlab*P code used LAPACK with ARPACK [13], while the Matlab*P code uses SuperLU DIST with the power method, as shown in Figure 9. We use a guess of 0.1 to initialize the power method, and it converges to 0.0194, which is enough precision for linear stability analysis. We use a cluster with 16 processors to solve the generalized eigenvalue problem. Each node has a 2.6 GHz Pentium Xeon CPU, 3 GB of RAM, and a gigabit Ethernet connection. Results are presented in Table 1.

Table 1. Time to solve the generalized eigenvalue problem

No. of processors    Time (seconds)
 4                   90
 8                   39
16                   33

As a next step, we want to incorporate a variable viscosity net flow through the Hele-Shaw cell, bringing the potentially destabilizing effects of viscous fingering into play, so that the possibility of complex interactions between density- and viscosity-driven instabilities arises. The existence of a more complex flow field necessitates a finer grid and a larger domain size for the linear stability calculations, as compared to the case discussed above, where the two fluids were essentially at rest with respect to each other. We expect to require about 10 times more computing resources (CPU and memory) to tackle these challenges.

5 Conclusion

The implementation of sparse matrices in Matlab*P is work in progress. Currently available functionality includes constructing sparse matrices, performing element-wise arithmetic and indexing operations on them, multiplying a sparse matrix by a dense vector, and solving linear systems. This level of functionality allows us to implement several algorithms such as conjugate gradient and the power method.

Much remains to be done. A complete implementation of sparse matrices requires matrix-matrix multiplication and several factorizations (Cholesky, QR, SVD, etc.). Improvements in the sorting code can lead to general improvements in many parts of Matlab*P. It is also important to make existing graph partitioners, such as Meshpart and ParMetis, available in Matlab*P. Several preconditioning methods also need to be implemented for Matlab*P, since iterative methods may well be the way to solve large linear systems.

The goal of sparse matrix support in Matlab*P is to provide an interactive environment for users to perform operations on large sparse matrices in parallel, while being compatible with Matlab. Our current implementation is ready to be used for simple real-life problems.

6 Acknowledgements

David Cheng made useful contributions to the sorting code. We had several useful discussions with Parry Husbands, Per-Olof Persson, Ron Choy, Alan Edelman, Nisheet Goyal, and Eckart Meiburg.


References

1. Patrick Amestoy, Iain S. Duff, and Jean-Yves L'Excellent. Multifrontal solvers within the PARASOL environment. In PARA, pages 7-11, 1998.

2. D. Arnold, S. Agrawal, S. Blackford, J. Dongarra, M. Miller, K. Seymour, K. Sagi, Z. Shi, and S. Vadhiyar. Users' Guide to NetSolve V1.4.1. Innovative Computing Dept. Technical Report ICL-UT-02-05, University of Tennessee, Knoxville, TN, June 2002.

3. L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.

4. Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith, and Marco Zagha. A comparison of sorting algorithms for the Connection Machine CM-2. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, pages 3-16. ACM Press, 1991.

5. Oskar Bruning, Jack Holloway, and Adnan Sulejmanpasic. Matlab*P visualization package. 2002.

6. Long Yin Choy. Parallel Matlab survey. 2001. http://theory.csail.mit.edu/~cly/survey.html.

7. Long Yin Choy. MATLAB*P 2.0: Interactive supercomputing made practical. M.S. thesis, EECS, 2002.

8. D. E. Culler, A. Dusseau, R. Martin, and K. E. Schauser. Fast parallel sorting under LogP: From theory to practice. In Proceedings of the Workshop on Portability and Performance for Parallel Processing, Southampton, England, July 1993. Wiley.

9. John R. Gilbert, Gary L. Miller, and Shang-Hua Teng. Geometric mesh partitioning: Implementation and experiments. SIAM Journal on Scientific Computing, 19(6):2091-2110, 1998.

10. John R. Gilbert, Cleve Moler, and Robert Schreiber. Sparse matrices in MATLAB: Design and implementation. SIAM Journal on Matrix Analysis and Applications, 13(1):333-356, 1992.

11. Nisheet Goyal and Eckart Meiburg. Unstable density stratification of miscible fluids in a vertical Hele-Shaw cell: Influence of variable viscosity on the linear stability. Journal of Fluid Mechanics (to appear), 2004.

12. P. Husbands and C. Isbell. MATLAB*P: A tool for interactive supercomputing. The Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999.

13. R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users' Guide: Solution of Large Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia, 1998.

14. Xiaoye S. Li and James W. Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Trans. Math. Softw., 29(2):110-140, 2003.