
A New Parallel Architecture for Sparse Matrix Computation Based on Finite Projective Geometries

Narendra Karmarkar
AT&T Bell Laboratories, Murray Hill, NJ 07974
Indian Institute of Technology, Bombay, India

Abstract

Many problems in scientific computation involve sparse matrices. While dense matrix computations can be parallelized relatively easily, sparse matrices with arbitrary or irregular structure pose a real challenge to the design of highly parallel machines.

In this paper we propose a new parallel architecture for sparse matrix computation based on finite projective geometries. The mathematical structure of these geometries plays an important role in defining the pattern of interconnection between memories and processors, as well as in solving several difficult problems arising in parallel systems (such as load balancing, data routing, and memory-access conflicts) in an efficient manner.

1. Introduction

1.1 Application domain

The architecture described in this paper was motivated by certain types of problems arising in scientific computation, such as linear programming, solution of partial differential equations, signal processing, simulation of non-linear electronic circuits, and non-linear programming. A large body of research in parallel computation is directed towards finding the maximum amount of parallelism available in a given task. Fortunately, in the types of applications we are interested in, there is plenty of intrinsic parallelism available at a fine-grain level. The real difficulties arise not in finding parallelism, but in exploiting it efficiently.

1.2 Difficulties in exploiting parallelism

These difficulties arise at roughly three different levels: the architectural level, the hardware operational level and the software level. Some of these difficulties are briefly described below.


1.2.1 Problem-independent interconnection topology

Each computational problem usually has an

associated topology that is most naturally suited to it. In the case of some important problem one may be able to justify building a machine whose interconnection pattern is dedicated to solving that particular problem. On the other hand, one can base the architecture on a general interconnection network that can simulate any topology efficiently. A number of interconnection networks have been proposed and explored, differing in the trade-off they make between generality and efficiency.

1.2.2 Sparse matrices with irregular non-zero structure

While dense matrix computations can be parallelized

relatively easily on traditional pipelined vector machines (e.g. Cray), as well as on many other parallel architectures, many scientific applications give rise to sparse matrices with an arbitrary or irregular pattern of non-zero locations. Such problems pose a real challenge to the design of highly parallel machines. On the other hand, very efficient data structures and programming techniques have been developed for sparse matrix computation on sequential machines. When evaluating the effectiveness of a parallel architecture on a sparse matrix problem, it is important to compare its performance with the best sequential method for solving the same problem, rather than comparing some algorithm on the two architectures.

1.2.3 Load balancing

It is necessary to distribute the computational load among processors as evenly as possible to obtain high efficiency.



1.2.4 Routing algorithm

The movement of data through the network

interconnecting various hardware elements of the system, such as processors and memories, needs to be governed by a routing algorithm. Here the issues are congestion, delays, the ability to find conflict-free paths, the amount of buffering needed at intermediate nodes, and the speed of the routing algorithm itself.

1.2.5 Memory accesses

A memory access conflict arises when two processors try to access the same memory location at the same time. One needs to either make sure that memory access conflicts do not arise or provide a mechanism for resolving them. Another serious difficulty regarding memory accesses is caused by a mismatch of bandwidth, i.e., memory is unable to supply data at the rate the processor needs. This has been a problem even for single-processor machines. Typically, processors are faster than memories since they can be pipelined. In the conventional fetch-decode-execute mode, it is difficult to pipeline memory accesses.

1.2.6 Difficulties in programming

Some parallel architectures require the user to decompose the task, assign the subproblems to processors, and also program the communication between processors. In the case of a problem involving a regular square grid of nodes, it is easy to decompose the problem into isomorphic subproblems. Unfortunately, many real problems involve irregular grids or boundaries, sharp transitions in the functions being computed, etc. The task of decomposing such problems can be quite tedious. It is desirable to have the compiler do an efficient mapping of the user's problem onto the underlying hardware. Another issue that comes up is the level of granularity at which the parallelism is to be exploited. If one restricts oneself to the parallelism available at a coarse-grain level, it is easier to exploit, but as one goes to finer levels, there is more parallelism available. The real challenge is to provide the ability to exploit even fine-grain parallelism.

1.3 An architecture based on geometric subspaces

There have been a number of parallel architectures

proposed in the last decade. Typically one first decides on a method of interconnecting the hardware elements such as processors and memories. Several algorithms are then found to control the operation of the system, such as moving data through the network, assigning and

balancing the load on processors, etc. A question naturally arises: is it possible to define a parallel system in which the most elementary instruction you can give to the system as a whole automatically results in coherent, conflict-free operation and uniform, balanced use of all the resources such as processors, memories and the wires connecting them, and can one express the computation specified by the user in terms of a sequence of such instructions? Thus an instruction for the system should be more than just a collection of instructions for the individual elements of the system. It should have three further properties: first, when individual elements of the system follow such an instruction, there should be no conflicts or inefficient use of the resources. Secondly, the collection of such system-instructions should be powerful enough so that the computation specified by the user can be expressed in terms of these instructions. Furthermore, the instruction set should have a structure that permits this process of mapping user programs onto the underlying architecture to be carried out efficiently. It seems that the mathematical structure of objects known as finite geometries is eminently suited for defining such a parallel system for solving the types of problems in scientific computation described earlier. The work reported here began at first as a mathematical curiosity, to explore how the structure and symmetry of finite geometries can be exploited to define the interconnection pattern between memories and processors, assign load to processors and access memories in a conflict-free manner. Since then it has grown in several directions: how to design application algorithms for various problems in scientific computation, how best to map a given computation onto the architecture, how to design the hardware both at the system level and the VLSI chip level, etc. These issues will be addressed in a series of papers that document the study performed by several researchers at AT&T Bell Laboratories and the Indian Institute of Technology. This paper, which is the first in the series, describes the mathematical concepts underlying the new architecture and two applications important in scientific computation, namely matrix-vector multiplication and Gaussian elimination for sparse matrices. The other papers describe the compiler [DHI 89], simulation results [DKR 91] and aspects of hardware design. In addition to showing how the mathematical structure of finite geometries helps in solving many difficult problems in parallel system design, our work is guided by the objective that even the first implementation based on these concepts should result in a practical machine that many scientists and engineers would want to use.



Consequently, the first version of the compiler and hardware are limited to applications that involve execution of a data-flow graph whose symbolic structure remains fixed but which is executed with different numerical values several times during the course of the algorithm. As an example, let us consider the problem of electronic circuit simulation, a problem which is notoriously difficult to parallelize. A single execution of the simulator typically involves several hundred time steps having the same data-flow graph; each time step, which solves non-linear algebraic equations, involves several linear-system solutions having the same symbolic structure; and each linear system, if solved by an iterative method like conjugate gradient, involves several multiplications with a sparse matrix with fixed non-zero structure. Since the non-zero pattern of the matrix depends only on the structure of the electronic circuit being simulated, a single execution of the simulator may involve several thousand matrix-vector multiplications having the same data-flow graph.

Typical numbers of iterations of the same data-flow graph for several other applications are compiled in Table #1.

    Problem class                        Problem #    Repetition count of
                                                      data-flow graph
    LP problems                              1               1,428
                                             2              12,556
    Partial differential equations           1              16,059
                                             2               6,299
                                             3               7,592
    Fractional hypergraph covering /
    control systems                          1                 708
                                             2               1,863
                                             3               7,254

Table #1. Iterations on the same data-flow graph.

2. Description of the Architecture

2.1 Host processors and the attached processor

The machine proposed here is meant to be used as an attached processor to a general-purpose machine referred to as the host processor. The two processors share a common global memory. The main program runs on

the host processor. Computationally intensive subroutines or macros that have a fixed symbolic structure but are executed several times with different numerical values are to be carried out on the attached processor. Since the two processors share the same memory, it is not necessary to communicate large amounts of data between the processors. Only certain structural information, such as the base addresses of arrays, needs to be communicated from the host processor to the attached processor before invoking a subroutine to be executed on the attached processor.

2.2 Interconnection scheme based on subspaces of a projective geometry

A finite geometry of dimension d consists of a finite

set of points S, and a collection of subsets of S associated with each integer i < d which constitute subspaces of dimension i. Thus subsets associated with dimension 1 are called lines, those associated with dimension 2 are called planes, etc. These subsets have intersection properties similar to the properties of the familiar 3-dimensional (infinite) geometry, e.g. two points determine a unique line, three non-collinear points determine a plane, two lines in a plane intersect in a point, etc. In section 3 we will define the class of geometries we are going to use more precisely. The geometric structure is used for defining the interconnection between memories and processors as follows. Given a finite geometry of dimension d, choose a pair of dimensions d_m, d_p < d. Put the processors in the system in one-to-one correspondence with all subspaces of dimension d_p, put the memory modules in one-to-one correspondence with all subspaces of dimension d_m, and connect a processor and a memory module if the corresponding subspaces have a non-trivial intersection.
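As a concrete illustration, the toy sketch below (our own example, not from the paper; the names build_interconnection and fano_lines are ours) derives this incidence-based interconnection for the smallest projective plane, PG(2,2), with memory modules corresponding to points and processors to lines.

```python
# Toy sketch: derive processor-memory links from a finite geometry whose
# subspaces are given as explicit point sets (assumed representation).
def build_interconnection(memory_subspaces, processor_subspaces):
    """Connect processor p to memory m iff their subspaces intersect."""
    links = []
    for p, proc_pts in enumerate(processor_subspaces):
        for m, mem_pts in enumerate(memory_subspaces):
            if set(proc_pts) & set(mem_pts):      # non-trivial intersection
                links.append((p, m))
    return links

# PG(2,2) (the Fano plane): memories <-> points (0-dim subspaces),
# processors <-> lines (1-dim subspaces).
fano_lines = [(0,1,2), (0,3,4), (0,5,6), (1,3,5), (1,4,6), (2,3,6), (2,4,5)]
memories = [(i,) for i in range(7)]
links = build_interconnection(memories, fano_lines)
print(len(links))          # 21 links: each of the 7 processors sees 3 memories
```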

2.3 Memory system

The memory system of the attached processor is partitioned into n modules denoted by M_1, M_2, ..., M_n. The type of memory access possible in one machine cycle in this architecture is between the two extremes of random access and sequential access and could be called structured access. Only certain combinations of words can be accessed in one cycle. These combinations are designed using certain symmetries present in the projective geometry so that no conflicts can arise either in accessing the memory or in sending the accessed data through the interconnection network. The total set of allowed combinations is powerful enough that a sequence of such structured



accesses is as effective as random accesses. On the other hand, such a structured-access memory has a much higher bandwidth than a random-access memory implemented using comparable technology. In section 3, a mathematical characterization of the allowed access patterns is given. Furthermore, these individual access patterns can be combined to form certain special sequences, again using the symmetries present in the geometry. Each such sequence, called a perfect sequence, defines the operation of the hardware for several consecutive machine cycles at once, and has the effect of utilizing the communication bandwidth of the machine fully. As a result, it is possible to connect the memories and processors in the system so that the number of wires needed grows linearly with respect to the number of processors and memories in the system.

2.4 Rule for load-assignment

The assignment of computational load to processors is done at a fine-grain level. Consider a binary operation that takes two operands a and b as inputs and modifies one of them, say a, as output:

    a ← a ∘ b .

Suppose the operand a belongs to the memory

module M_i and the operand b belongs to the memory module M_j. Then we associate an index-pair (i, j) with this operation. Similarly, with a ternary operation we associate an index-triplet. If we number the processors as P_1, P_2, ..., P_n, then the processor P_l that is responsible for doing a binary operation with associated index pair (i, j) is given by a certain function that depends on the geometry:

    l = f(i, j) .

Thus two operations having the same associated index-pair (or triplet) always get assigned to the same processor. Furthermore, the function used for load-assignment is compatible with the structure of the geometry, i.e. the processor numbered f(i, j) is connected to memory modules i and j.

Since the assignment of operations to processors is determined entirely by the location of the data, the compiler has indirect control over load-balancing by exploiting the freedom it has in assigning memory locations to intermediate operands in the data-flow graph and in moving operands from one memory module to another by inserting move(i, j) instructions if necessary. (Even random assignment of memory locations tends to distribute the load evenly as the size of the data-flow graph increases.)
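The following toy sketch (our own illustration on the Fano plane, with hypothetical names) realizes one such function f(i, j): the processor assigned to an index pair is the unique line through the two corresponding points, so it is automatically connected to both memory modules.

```python
# Toy sketch of the load-assignment rule l = f(i, j) on the Fano plane:
# memory modules <-> points 0..6, processors <-> lines.
fano_lines = [(0,1,2), (0,3,4), (0,5,6), (1,3,5), (1,4,6), (2,3,6), (2,4,5)]

def make_assignment_rule(lines):
    through = {}            # (point i, point j) -> index of the line through them
    for l, pts in enumerate(lines):
        for a in pts:
            for b in pts:
                if a != b:
                    through[(a, b)] = l
    return lambda i, j: through[(i, j)]

f = make_assignment_rule(fano_lines)
print(f(1, 4), f(4, 1))     # the same processor for either operand order
```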

2.5 Instructions for the system and its elements

A system-instruction decides what action each of the

individual hardware elements is to perform in a single machine cycle (or over a sequence of machine cycles, in the case of instructions corresponding to perfect sequences of conflict-free patterns). Each hardware element, such as an arithmetic processor, a switch or a memory module, has its own instruction set. The compiler takes a data-flow graph as input and implicitly uses system-level instructions to produce as output a collection of programs, one for each element of the system, consisting of a list of instructions drawn from the instruction set of that element. Each hardware element of the system has local memory for storing the instruction sequence to be used by that element. The initial loading of these instruction sequences for each subroutine to be executed on the attached processor is done by the host. Once loaded, the instruction sequence may typically get executed many times before being overwritten by the instruction sequence for some other subroutine. Depending on the size of the local instruction memories, instruction sequences for several subroutines can reside simultaneously in these memories. Since the instruction sequences are accessed sequentially, a hardware implementation of the instruction storage memory can take advantage of sequential access.

2.6 Pipelining of memory accesses

The instruction sequence for each memory module specifies a list of addresses of operands along with a read/write bit. The instruction sequence is stored in the same module. Since the address sequence is known in advance, full pipelining is possible in decoding addresses, accessing bits in memory, and in dynamic error detection and correction. Operands being written can also be placed in a shift register whose length depends on the number of stages in the pipeline, so that consistent values of operands can be supplied. Alternatively, the compiler can ensure that any memory location that is modified is not immediately accessed for reading (within as many machine cycles as the length of the pipeline for writing).

In a large-scale implementation each memory module can be connected to a separate local disk, to provide for secondary storage in addition to the secondary storage of the host processor.

2.7 Arithmetic processors

Arithmetic processors are capable of performing the basic arithmetic and logical operations. The instruction



sequence to be followed by a processor (created by the compiler) can also contain move(i, j) operations, simply for moving operands from memory module M_i to M_j, as well as no-ops. In the case of pipelined operations taking more than one machine cycle, execution of the corresponding instruction means initiating the operation. It is up to the compiler to ensure that the delay in the availability of the results is taken into account when the data-flow graph is processed. The instruction sequence to be followed by a processor is stored in its own local instruction memory. The instruction sequence does not contain addresses of operands, but only the type of operation to be performed (including no operation). There is no concept of fetching an operand: the processor simply operates on whatever data flows on its dedicated links. Each processor has its own local memory. In an application such as multiplication of a sparse matrix by a vector, the matrix elements can be stored in the local memories of the processors and the input and output vectors can be stored in the shared, partitioned global memory.

2.8 Compiler

Given a sequence of instructions, many rearrangements of the sequence are possible that produce the same end result. An optimizing compiler for a sequential machine seeks to perform such rearrangements with the goal of reducing the sequential complexity of the program.

In the case of a parallel machine, the number of possible rearrangements is even larger because of the greater degree of freedom available in assigning operations to processors, in moving the data through the interconnection network, etc. In order to make this freedom more visible and available to the compiler, the input to the compiler is expressed in terms of a data-flow graph that represents a partial order to be satisfied by operations, based on dependencies. The first version of the compiler [DHI 89] is restricted to applications that involve repeated execution of the same data-flow graph with different numerical values. The compiler does extensive processing of the data-flow graph to re-express the computation in balanced, conflict-free chunks as much as possible. Any remaining inefficiency shows up in the form of holes in the perfect patterns, which result in no-ops in the instruction sequences for processors.

3. Application of Finite Projective Geometries

3.1 Projective spaces over finite fields

In this section we briefly review the concept of a

projective space over a finite field and introduce some notation used later. Consider a finite field F = GF(s) having s

elements, where s is a power of a prime number p, s = p^k, and k is a positive integer.

A projective space of dimension d over the finite field F, denoted by P_d(F), consists of the one-dimensional subspaces of a (d + 1)-dimensional vector space F^{d+1} over the finite field F. Elements of this vector space can be represented as (d + 1)-tuples (x_1, x_2, ..., x_{d+1}) where each x_i ∈ F. Clearly, the total number of such elements is s^{d+1} = p^{k(d+1)}. Two non-zero elements x, y ≠ 0 of this vector space are said to be equivalent if there exists a λ ∈ GF(s) such that x = λy. Each equivalence class gives a point in the projective space. Hence the number of points in P_d(F) is given by

    p_d = (s^{d+1} − 1) / (s − 1) .

An m-dimensional projective subspace of P_d(F) consists of all one-dimensional subspaces of an (m + 1)-dimensional subspace of the vector space. Let b_0, b_1, ..., b_m be a basis of the latter vector subspace. The elements of the vector subspace are given by

    x = Σ_{i=0}^{m} α_i b_i ,   where α_i ∈ F .

Hence the number of such elements is s^{m+1}, and the number of points in the corresponding projective subspace is

    p_m = (s^{m+1} − 1) / (s − 1) .

Let r = d − m. Then an (m + 1)-dimensional

vector subspace of F^{d+1} can also be described as the set of all solutions of a system of r independent linear equations a_i · x = 0, where a_i ∈ (F^{d+1})*, the dual of F^{d+1},

i = 1, ..., r. This vector subspace and the corresponding projective subspace are said to have co-dimension r.

Let Ω_l denote the collection of all projective subspaces of dimension l. Thus Ω_0 is the set of all points in the projective space, Ω_1 is the set of all lines, Ω_{d−1} is the set of all hyperplanes, etc. For n ≥ m,



define

    φ(n, m, s) = [(s^{n+1} − 1)(s^n − 1) ⋯ (s^{n−m+1} − 1)] / [(s^{m+1} − 1)(s^m − 1) ⋯ (s − 1)] .

Then the number of l-dimensional projective subspaces of P_d(GF(s)) is given by

    φ(d, l, s) .

The number of points in an l-dimensional projective subspace is given by

    φ(l, 0, s) .

The number of distinct l-dimensional subspaces through a given point is

    φ(d − 1, l − 1, s)

and the number of distinct l-dimensional subspaces, l ≥ 2, containing a given line is

    φ(d − 2, l − 2, s) .

More generally, for 0 ≤ l < m ≤ d, the number of

m-dimensional subspaces of P_d(GF(s)) containing a given l-dimensional subspace is

    φ(d − l − 1, m − l − 1, s)

and the number of l-dimensional subspaces contained in a given m-dimensional subspace is

    φ(m, l, s) .
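The counting function φ is easy to compute; the sketch below (our own helper, with an assumed name phi) evaluates the formula above and reproduces two counts used later in the paper: the projective plane of order 2 has 7 points and 7 lines, and P_4(GF(2)) has 155 lines and 155 planes.

```python
# Sketch of the subspace-counting function phi(n, m, s) defined above:
# the number of m-dimensional projective subspaces of P_n(GF(s)).
def phi(n, m, s):
    num = den = 1
    for i in range(m + 1):
        num *= s ** (n + 1 - i) - 1      # (s^(n+1)-1)(s^n-1)...(s^(n-m+1)-1)
        den *= s ** (m + 1 - i) - 1      # (s^(m+1)-1)(s^m-1)...(s-1)
    return num // den                    # always an integer

print(phi(2, 0, 2), phi(2, 1, 2))        # 7 7    : Fano plane points and lines
print(phi(4, 1, 2), phi(4, 2, 2))        # 155 155: lines and planes of P_4(GF(2))
```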

3.2 Interconnection scheme based on two-dimensional geometry

In this section, we introduce the simplest scheme

that is based on a two-dimensional projective space P_2(F), where F = GF(s), also called the projective plane of order s.

The number of points in a projective plane of order s is

    φ(2, 0, s) = (s^3 − 1) / (s − 1) = s^2 + s + 1 .

The number of lines is given by

    φ(2, 1, s) = [(s^3 − 1)(s^2 − 1)] / [(s^2 − 1)(s − 1)] = s^2 + s + 1 ,

which is the same as the number of points.

Each line contains (s + 1) points and through any point there are (s + 1) lines. Every distinct pair of points determines a line and every distinct pair of lines intersects in a point. Note that there are no exceptions to the latter rule in a projective geometry, since there are no parallel lines.

Let n = s^2 + s + 1. Given n processors and a memory system partitioned into n memory modules M_1, M_2, ..., M_n, a method of interconnecting them based on the incidence structure of a finite projective plane can be devised as follows.

Put the memory modules in one-to-one correspondence with points in the projective space, and processors in one-to-one correspondence with lines. A memory module and a processor are connected if the corresponding point belongs to the corresponding line. Thus each processor is connected to (s + 1) memory modules and vice versa. Consider a binary operation with associated index pair (i, j). Suppose i ≠ j. Then the memory modules M_i and M_j correspond to a distinct pair of points in the projective space. This pair of points determines a line in the projective space that corresponds to some processor p_l. The binary operation is assigned to this processor.

If i = j or if the operation is unary, we have some freedom in assigning the operation. A ternary operation cannot be directly performed in this scheme.
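For concreteness, the following sketch (our own construction; the function name projective_plane is ours) builds PG(2, s) for a prime s from homogeneous coordinates and derives the point/line incidence that defines the processor-memory wiring.

```python
# Sketch: construct the projective plane PG(2, s) for a prime s and list,
# for each processor (line), the s+1 memory modules (points) it is wired to.
from itertools import product

def projective_plane(s):
    def normalize(v):                      # scale so the first non-zero entry is 1
        k = next(i for i, x in enumerate(v) if x)
        inv = pow(v[k], -1, s)             # inverse in GF(s) = Z/sZ, s prime
        return tuple(x * inv % s for x in v)
    points = sorted({normalize(v) for v in product(range(s), repeat=3) if any(v)})
    # each point, read as dual coordinates d, also names the line {p : d.p = 0}
    lines = [[p for p in points if sum(a * x for a, x in zip(d, p)) % s == 0]
             for d in points]
    return points, lines

points, lines = projective_plane(3)             # order 3: n = 13 points and 13 lines
print(len(points), len(lines), len(lines[0]))   # 13 13 4  (each line has s+1 points)
```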

3.3 Example of an application of the two-dimensional scheme

Even this simple scheme can be used to construct

practically useful devices such as a fast matrix-vector multiplier for sparse matrices with an arbitrary or irregular sparsity pattern. Suppose we have a p × q matrix A and we wish to compute the matrix-vector product

    y = Ax .

First, the indices 1 to q of the vector x and indices 1

to p of the vector y are assigned to logical memory modules M_1, ..., M_n by means of two functions, say f and g, which can be hashing functions,

i.e. let μ = f(i) and ν = g(j) ,

where μ, ν ∈ Ω_0. Then x(i) is stored in the memory module M_μ and y(j) is stored in the memory module M_ν. Now consider the following multiply-and-accumulate operation, corresponding to a non-zero element A(j, i) of the matrix A:

    y(j) ← y(j) + A(j, i) * x(i) .

We associate the index-pair (μ, ν) with this operation. Assuming μ ≠ ν, let p_l be the processor corresponding



to the line passing through the pair of points corresponding to memory modules M_μ and M_ν. The matrix entry A(j, i) is stored in the local memory of the processor p_l. Note that the processor p_l is connected to memory modules M_μ and M_ν because the line corresponding to processor p_l contains the points corresponding to memory modules M_μ and M_ν. The input operands x(i) and y(j) are sent from the partitioned shared memory to the processor p_l through the interconnection network. The processor p_l also receives A(j, i) from its local memory and performs the arithmetic operation. The output y(j) is shipped back through the interconnection network to the memory module M_ν in the partitioned shared global memory. This method is very effective if the same matrix A is used for several matrix-vector multiplications, a typical situation in many iterative linear-algebraic methods occurring in applications such as linear programming or the solution of partial differential equations.
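A toy sketch of this scheme on the Fano plane is given below (our own illustration; the hash functions f and g and the name spmv_plan are hypothetical choices, not from the paper): each non-zero A(j, i) is routed to the processor owning the line through the hashed points μ = f(i) and ν = g(j), where the multiply-accumulate is carried out.

```python
# Toy sketch: distribute the non-zeros of a sparse y = A x over the 7
# processors of the Fano plane and carry out the multiply-accumulates.
fano_lines = [(0,1,2), (0,3,4), (0,5,6), (1,3,5), (1,4,6), (2,3,6), (2,4,5)]
line_through = {(a, b): l for l, pts in enumerate(fano_lines)
                for a in pts for b in pts if a != b}

def spmv_plan(nonzeros, x, n_rows):
    """nonzeros: {(j, i): A[j][i]}.  Returns y and the per-processor op counts."""
    f = lambda i: i % 7                      # hypothetical hash for x-indices (mu)
    g = lambda j: (3 * j + 1) % 7            # hypothetical hash for y-indices (nu)
    y, work = [0.0] * n_rows, [0] * len(fano_lines)
    for (j, i), a in nonzeros.items():
        mu, nu = f(i), g(j)
        proc = line_through.get((mu, nu))    # None if mu == nu: compiler's free choice
        if proc is not None:
            work[proc] += 1
        y[j] += a * x[i]                     # the multiply-accumulate done "at" proc
    return y, work

print(spmv_plan({(0, 1): 2.0, (1, 0): -1.0, (1, 2): 4.0}, [1.0, 2.0, 3.0], 2))
```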

3.4 Perfect access patterns

We will first define a perfect access pattern for a two-dimensional geometry.

Let the number of points (and hence the number of lines) in the geometry be n.

A perfect access pattern is a collection of n ordered pairs of points

    P = {(a_1, b_1), (a_2, b_2), ..., (a_n, b_n) | a_i ≠ b_i, a_i, b_i ∈ Ω_0, i = 1, ..., n}

having the following properties:

1. The first members {a_1, a_2, ..., a_n} of all the pairs form a permutation of all points of the geometry.

2. The second members {b_1, b_2, ..., b_n} of all the pairs form a permutation of all points of the geometry.

3. Let l_i denote the line determined by the i-th pair of points, i.e.

       l_i = (a_i, b_i) .

   Then the lines {l_1, l_2, ..., l_n} determined by these n pairs form a permutation of all lines of the geometry.

    i      Point pairs       Corresponding lines
    1      (p_1, q_1)        l_1 = (p_1, q_1)
    2      (p_2, q_2)        l_2 = (p_2, q_2)
    ...    ...               ...
    n      (p_n, q_n)        l_n = (p_n, q_n)

Table #2. Perfect access patterns for 2-d geometry.

Clearly, if one schedules a collection of binary operations corresponding to such a set of index-pairs for simultaneous parallel execution, then we have the following situation:

1. There are no read or write conflicts in memory accesses.
2. There is no conflict or waiting in processor usage.
3. All processors are fully utilized.
4. The memory bandwidth is fully utilized.

Hence the name perfect pattern. Recall that a memory module is connected to a

processor if the point α corresponding to the memory module belongs to the line β corresponding to the processor. Thus we can denote this connection by the ordered pair (α, β).

Let C denote the collection of all processor-memory connections, i.e.

    C = {(α, β) | α ∈ Ω_0, β ∈ Ω_1, α ∈ β} .

If (a_i, b_i) is one of the pairs in a perfect pattern P and l_i is the corresponding line, we say that the perfect pattern P exercises the connections (a_i, l_i) and (b_i, l_i).

A sequence of perfect access patterns is called a perfect sequence if each connection in C is exercised the same number of times collectively by the patterns contained in the sequence. If such a perfect sequence is packaged as a single instruction that defines the operation of the machine over several machine cycles, it leads to uniform utilization of the communication bandwidth of the wires connecting processors and memories.

Perfect access patterns and perfect sequences can be easily generated using the group-theoretic structure of projective geometries, as described in the next section.
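As a quick sanity check of this definition, the sketch below (our own helper; the name is_perfect_pattern is ours) tests the three properties for a candidate list of pairs over a plane given as a list of lines.

```python
# Sketch: verify the three defining properties of a 2-d perfect access pattern.
def is_perfect_pattern(pairs, lines, n_points):
    line_through = {(a, b): l for l, pts in enumerate(lines)
                    for a in pts for b in pts if a != b}
    firsts = sorted(a for a, _ in pairs)                      # property 1
    seconds = sorted(b for _, b in pairs)                     # property 2
    used = sorted(line_through[(a, b)] for a, b in pairs)     # property 3
    return (firsts == list(range(n_points)) and
            seconds == list(range(n_points)) and
            used == list(range(len(lines))))

# Fano plane with points 0..6 and lines given as the translates of {0, 1, 3} mod 7.
lines = [tuple(sorted((k + d) % 7 for d in (0, 1, 3))) for k in range(7)]
pairs = [(k, (k + 1) % 7) for k in range(7)]                  # a candidate pattern
print(is_perfect_pattern(pairs, lines, 7))                    # True
```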



3.5 Group-theoretic structure of projective spaces

Recall that points in the projective space of

dimension d over GF(s) were defined as rays through the origin in the vector space of dimension (d + 1) over GF(s), which contains s^{d+1} elements. Since s = p^k is a power of a prime number, so is s^{d+1} = p^{k(d+1)}. Hence there is a unique finite field with s^{d+1} elements. One might suspect that there may be some relation between the finite field GF(s^{d+1}) and the projective space P_d(GF(s)), which would give the projective space additional structure based on the multiplication operation in GF(s^{d+1}), besides its geometric structure. Indeed, there is such a relation, and we want to elaborate on it further.

If p is a prime number, then GF(p^m) contains GF(p^n) as a subfield if and only if n | m. Therefore GF(s^{d+1}) (where s = p^k) contains GF(s) as a subfield and the degree of GF(s^{d+1}) over GF(s) is (d + 1). Hence GF(s^{d+1}) is a vector space of dimension (d + 1) over GF(s). Each non-zero element of this vector space determines a ray through the origin and hence a point in the projective space P_d(s).

Let G* be the multiplicative group of non-zero elements in GF(s^{d+1}) and H* the multiplicative group of non-zero elements in GF(s). Clearly H* is a subgroup of G* and both groups are cyclic. Let x ∈ G*. The ray through the origin determined by x consists of the points

    {λx | λ ∈ GF(s)} .

This is precisely the coset of H* in G* determined by x, together with the origin.

This establishes a one-to-one correspondence between points of the projective space P_d(s) and elements of the quotient group G*/H*. This correspondence allows us to define a multiplication operation in the projective space.
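A minimal sketch of this correspondence for d = 2, s = 2 is shown below (our own example): here H* = GF(2)* is trivial, so the points of the Fano plane are exactly the seven non-zero elements of GF(8) = GF(2)[x]/(x^3 + x + 1), and repeated multiplication by the class of x walks through all of them.

```python
# Sketch: the 7 points of P_2(GF(2)) as powers of a generator of GF(8)*.
MOD = 0b1011                                  # the polynomial x^3 + x + 1

def times_x(v):
    """Multiply an element of GF(8) (stored as a 3-bit coefficient vector) by x."""
    v <<= 1
    return v ^ MOD if v & 0b1000 else v

pts = [1]                                     # x^0 = 1
for _ in range(6):
    pts.append(times_x(pts[-1]))
print(pts)                    # [1, 2, 4, 3, 6, 7, 5]: every non-zero vector exactly once
```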

Let g be a fixed point in the projective space P_d(s) and consider the mapping of P_d(s) onto itself defined by

    L_g : x → g ∘ x

where g ∘ x is the multiplication operation introduced above. It is easy to check that this operation maps lines

in the projective space onto lines, planes onto planes and, in general, any projective subspace of dimension k onto another projective subspace of the same dimension. Such a mapping is called an automorphism of the projective geometry.

Since G* and H* are cyclic, so is G*/H*. Let g be a generator of G*/H*. (By abuse of notation, we will not henceforth distinguish between elements of G*/H* and points of P_d(s).) Thus we can denote points in P_d(s) as g^i, i = 0, ..., n − 1. The mapping L_g becomes

    L_g : g^i → g^{i+1} .

This will be called a shift operation.

Any power L_g^k of the shift operation is also an automorphism of the geometry, and the collection of all powers of the shift operation forms an automorphism group, denoted henceforth by L, which is a subgroup of the group of all automorphisms of the geometry. A subgroup G of the full automorphism group is said to act transitively on subspaces of dimension k if for any pair of subspaces H_1, H_2 of dimension k, there is an element of G which maps H_1 to H_2.

For any projective space, the shift-operation subgroup L acts transitively on the points and on the subspaces of co-dimension one (hyperplanes). This property is used in the next section for generating perfect patterns and sequences for the two-dimensional geometries. Perfect patterns for the 4-dimensional case will be described in section 3.8.

3.6 Generation of perfect patterns for 2-d geometry

In the case of a two-dimensional geometry, the

hyperplanes (co-dimension = 1) are the same as lines (dimension = 1). Hence the shift operation L_g and its powers L_g^k act transitively on lines.

To generate a perfect pattern using the shift operation, take any pair of points a, b, a ≠ b, in the geometry. Let l denote the line generated by a and b,

i.e. l = (a, b). Set a_0 = a, b_0 = b and l_0 = l, and define a_k, b_k and l_k, k = 1, ..., n − 1, by successive application of the shift operation L_g as follows:

    a_k = L_g ∘ a_{k−1} ,
    b_k = L_g ∘ b_{k−1} ,

and

    l_k = L_g ∘ l_{k−1} .



Since L_g is an automorphism of the geometry, we have

    l_{k−1} = (a_{k−1}, b_{k−1})  ⟹  l_k = (a_k, b_k) .

Then P = {(a_k, b_k), k = 0, ..., n − 1} is a perfect pattern of the geometry.

Now, in order to generate a perfect sequence of such patterns, take any line l = {a_1, a_2, ..., a_{s+1}} of the geometry. Form all ordered pairs (a_i, a_j), i ≠ j, from the points on l. Generate a perfect pattern from each of the pairs. The collection of perfect patterns obtained this way (sequenced in any order) forms a perfect sequence.
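The sketch below carries out this construction in the cyclic (difference-set) model of the Fano plane, a standard representation adopted here only for illustration: points are the integers mod 7, the lines are the translates of {0, 1, 3}, and the shift L_g is simply x → x + 1 (mod 7).

```python
# Sketch: generate a 2-d perfect pattern by repeatedly shifting a starting pair.
n = 7
lines = [tuple(sorted((k + d) % n for d in (0, 1, 3))) for k in range(n)]   # Fano lines
line_of = {(a, b): l for l, pts in enumerate(lines)
           for a in pts for b in pts if a != b}

def perfect_pattern(a, b):
    """Apply the shift x -> x+1 (mod n) to the pair (a, b), n times in all."""
    return [((a + k) % n, (b + k) % n) for k in range(n)]

P = perfect_pattern(0, 1)
print(P)                                   # firsts and seconds are permutations of 0..6
print(sorted(line_of[p] for p in P))       # [0, 1, ..., 6]: every line used exactly once
```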

3.7 Example of an application of 4-dimensional geometry

In this section we illustrate the use of higher-order

subspaces of a projective space by means of an architecture suitable for performing sparse Gaussian elimination, an operation required in many scientific and engineering computations.

A typical operation in symmetric Gaussian elimination applied to a matrix A is

    A(i, k) ← A(i, k) − A(i, j) * A(j, k) / A(j, j)

where A(j, j) is the pivot element. Such an operation needs to be carried out only if A(i, j) ≠ 0 and A(j, k) ≠ 0. When the non-zero elements in the matrix are not in consecutive locations, it is very difficult to obtain uniformly high efficiency on vector machines. However, the operation shown above has an interesting property, regardless of the pattern of non-zeros in the matrix: two elements A(i, j) and A(i', j') of the matrix need to be brought together for a multiplication, division or subtraction only if the index pairs (i, j) and (i', j') have at least one of the constituent indices in common.

This property can be exploited as follows:

1. Map the row and column indices of A to points of the projective space by means of an assignment function f, which can be a hash function:

       α = f(i) ,   α ∈ Ω_0 .

2. Put the memory modules in the logical partition in one-to-one correspondence with lines, i.e. with elements of Ω_1.

3. A non-zero element A(i, j) is assigned to a memory module as follows: let α = f(i) and β = f(j), where α, β ∈ Ω_0. Then the pair of points α, β in the projective space determines a line l ∈ Ω_1 (if α = β we have some freedom in determining the line). The element A(i, j) is stored in the memory module corresponding to the line l.

4. Processors are put in one-to-one correspondence with the 2-dimensional subspaces, i.e. elements of Ω_2.

If a line corresponding to a memory module is contained in a plane corresponding to a processor, then a connection is made between the memory module and the processor. Again consider a typical operation in Gaussian elimination,

    A(i, k) ← A(i, k) − A(i, j) * A(j, k) / A(j, j) ,

and let α = f(i), β = f(j) and γ = f(k). Then all the index-pairs involved in the above operation are subsets of the triplet (α, β, γ).

Assuming that the triplet of points (α, β, γ) is in general position, it determines a plane, say δ, of the projective space, i.e. δ ∈ Ω_2. The above operation is assigned to the processor corresponding to δ. In order to carry out this operation, the processor needs to be able to communicate with the memory modules corresponding to the pairs (α, β), (β, γ) and (α, γ). Note that the lines determined by these pairs are contained in the plane determined by the triplet (α, β, γ), hence the necessary connections exist.
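A toy sketch of this placement rule over GF(2) is given below (our own illustration; the hash f and the helper names are hypothetical): points of P_4(GF(2)) are non-zero 5-bit vectors, the line through two points is their 2-dimensional span minus zero, and the plane through a triplet in general position is the 3-dimensional span minus zero.

```python
# Toy sketch: which memory lines and which processor plane are touched by the
# update A(i,k) <- A(i,k) - A(i,j)*A(j,k)/A(j,j), in P_4(GF(2)).
def line(a, b):                    # the 3 points {a, b, a^b} of the line through a, b
    return frozenset({a, b, a ^ b})

def plane(a, b, c):                # the 7 non-zero GF(2)-combinations of a, b, c
    return frozenset({x ^ y ^ z for x in (0, a) for y in (0, b) for z in (0, c)}) - {0}

f = lambda i: (i % 31) + 1         # hypothetical hash of matrix indices to points 1..31

def place(i, j, k):
    a, b, c = f(i), f(j), f(k)
    return {"memories": [line(a, b), line(b, c), line(a, c)],   # operand locations
            "processor": plane(a, b, c)}                        # where the update runs

info = place(2, 5, 9)
print([sorted(m) for m in info["memories"]], sorted(info["processor"]))
```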

1. In a projective space, the number of subspaces of dimension m equals the number of subspaces of co-dimension m + 1. Hence if we are interested in a symmetric scheme with an equal number of processors and memory modules, then we should make the co-dimension of the subspaces corresponding to processors one more than the dimension of the subspaces corresponding to memory modules. Thus in the present case we should choose a four-dimensional geometry, i.e., d = 4.

2. In the projective space P_4(F), where F = GF(s), the number of lines is

    n = φ(4, 1, s) = [(s^5 − 1)(s^4 − 1)] / [(s^2 − 1)(s − 1)] .

Therefore



    n = (s^2 + 1)(s^4 + s^3 + s^2 + s + 1) .

The number of planes is

    φ(4, 2, s) = φ(4, 1, s) = n .

Hence the number of processors is also n. The

number of planes containing a given line is

    φ(d − 2, 0, s) = φ(2, 0, s) = (s^3 − 1) / (s − 1)

    = s^2 + s + 1 .

Let m = s^2 + s + 1. Thus m = O(√n). Hence each

memory module is connected to m = O(√n) processors.

Similarly, the number of lines contained in a given plane is φ(2, 1, s) = s^2 + s + 1 = m. Hence each processor is connected to m = O(√n) memory modules.

The number of interconnections required between memories and processors can be reduced significantly if the communication between processors and memories is carried out in a disciplined and co-ordinated manner, based on the perfect sequences of conflict-free patterns. In a machine designed to support only these perfect patterns, it is possible to connect memories and processors so that the number of wires needed grows linearly with respect to the number of processors in the system.

Note that the fundamental property of Gaussian elimination exploited here is the fact that index pairs (i, j) and (i', j') need to interact only when they have an index in common. There are many examples of computations having the same property, e.g. finding the transitive closure of a directed graph or the join operation for binary relations in a relational database. Hence the scheme described is also applicable to such problems.

3.8 Perfect access patterns for 4-d geometry

In P_4(s) let n denote the number of lines, which is also equal to the number of 2-dimensional planes.

A perfect access pattern is a collection of n non-collinear triplets

    P = {(a_i, b_i, c_i) | a_i, b_i, c_i ∈ Ω_0, dim(a_i, b_i, c_i) = 2, i = 1, ..., n}

having the following properties:

1. Let u_i, i = 1, ..., n, denote the lines generated by the first two points of each triplet, i.e.

       u_i = (a_i, b_i) .

   Then the collection of lines {u_1, u_2, ..., u_n} forms a permutation of all the lines of the geometry.

2. Let v_i, i = 1, ..., n, denote the lines

       v_i = (b_i, c_i) .

   Then the collection of lines {v_1, v_2, ..., v_n} forms a permutation of all the lines of the geometry.

3. Let w_i, i = 1, ..., n, denote the lines

       w_i = (c_i, a_i) .

   Then the collection of lines {w_1, ..., w_n} forms a permutation of all the lines of the geometry.

4. Let h_i, i = 1, ..., n, denote the planes generated by the triplets (a_i, b_i, c_i),

       h_i = (a_i, b_i, c_i) .

   Then the collection of planes {h_1, h_2, ..., h_n} forms a permutation of all the planes of the geometry.

When an operation having (a_i, b_i, c_i) as the associated triplet is performed, the three memory modules accessed correspond to the three lines (a_i, b_i), (b_i, c_i) and (c_i, a_i), and the processor performing the operation corresponds to the plane (a_i, b_i, c_i). Hence it is clear that if we schedule n operations whose associated triplets form a perfect pattern to execute in parallel in the same machine cycle, then:

1. there are no read or write conflicts in memory accesses;
2. there is no conflict in the use of processors;
3. all the processors are fully utilized;
4. the memory bandwidth is fully utilized.

Hence such a collection of triplets is called a perfect pattern.
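A small sketch of a checker for this 4-d definition is given below (our own helper; line_of and plane_of are assumed lookups into the geometry, for instance the GF(2) span construction sketched in section 3.7).

```python
# Sketch: verify the four defining properties of a 4-d perfect access pattern.
from collections import Counter

def is_perfect_4d(triplets, line_of, plane_of, all_lines, all_planes):
    u = [line_of(a, b) for a, b, _ in triplets]        # property 1
    v = [line_of(b, c) for _, b, c in triplets]        # property 2
    w = [line_of(c, a) for a, _, c in triplets]        # property 3
    h = [plane_of(a, b, c) for a, b, c in triplets]    # property 4
    is_perm = lambda xs, universe: Counter(xs) == Counter(universe)
    return all(is_perm(xs, all_lines) for xs in (u, v, w)) and is_perm(h, all_planes)
```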

Automorphisms of the geometry based on cyclic shifts are not enough for generating perfect access patterns for the 4-d geometry. First we consider other types of automorphisms.

In a finite field of characteristic p, the operation of raising to the p-th power,

    i.e. x → x^p ,



forms an automorphism of the field, i.e.

    (x + y)^p = x^p + y^p

and

    (xy)^p = x^p y^p .

Since the points in the projective space P_d(s) correspond to elements of the multiplicative group G*/H*, the operation of raising to the p-th power can also be defined on the points of P_d(s). It is easy to show that this operation is an automorphism of the projective space.

The operations of cyclic shift and raising to the p-th power together are adequate for generating the perfect access patterns for the smallest 4-dimensional geometry, P_4(2), which has 155 lines and planes. For bigger 4-d geometries we need other automorphisms. The most general automorphism of P_d(s) is obtained by means of a non-singular (d + 1) × (d + 1) matrix over GF(s). A more detailed discussion of the generation of perfect patterns for 4-dimensional geometries, along with specific examples of practical interest, will be given in a subsequent paper.

4. Areas for Further Research

The ideas presented in this paper lead to a number of interesting areas for further investigation. Some of these are briefly described below.

4.1 Efficient mapping of algorithms

In this paper, we showed how matrix multiplication and inversion can be mapped onto the proposed architecture, using two- and four-dimensional geometries. How best can one map a variety of other commonly used methods in scientific computation? How can one exploit geometries of even higher dimension?

4.2 Compilation

The first version of the compiler for the proposed architecture has been designed and implemented. A description of this work can be found in [DHI 89], [DKR 91]. Results of applying the compiler to a number of large-scale real-life problems in scientific computation, arising in domains such as linear programming, electronic circuit simulation, partial differential equations, signal processing, queueing theory, etc., are reported in [DKR 90]. This work on the compilation process suggests many further research topics.

4.3 System partitioning and embedding

Since the geometry of a typical medium for building hardware (e.g. printed circuit boards) is basically two-dimensional and Euclidean, how does one embed a two- or four-dimensional finite projective geometry into such a medium? How should the system be partitioned to reduce the number of wire-crossings between different parts?

4.4 Design of custom integrated circuits

How should one define the basic building blocks of the system so that each block can be implemented as a custom integrated circuit? What should the internal architecture of such chips be in order to fully exploit the structure of the underlying geometry to achieve high performance?

Results of ongoing exploration of many of these issues will be reported in a forthcoming series of papers.

References

[ADL 89]  Adler, I., Karmarkar, N. K., Resende, M. G. C., Veiga, G., Data Structures and Programming Techniques for the Implementation of Karmarkar's Algorithm, ORSA Journal on Computing, Vol. 1, No. 2, Spring 1989.

[BN 71]   Bell, G. C., Newell, A., Computer Structures: Readings and Examples, McGraw-Hill, 1971.

[DHI 89]  Dhillon, I. S., A Parallel Architecture for Sparse Matrix Computations, B. Tech. Project Report, Indian Institute of Technology, Bombay, 1989.

[HAL 86]  Hall, Marshall, Jr., Combinatorial Theory, Wiley-Interscience Series in Discrete Mathematics, 1986.

[DKR 90]  Dhillon, I., Karmarkar, N., and Ramakrishnan, K. G., Performance Analysis of a Proposed Parallel Architecture on Matrix Vector Multiply Like Routines, Technical Report #11216-901004-13TM, AT&T Bell Laboratories, Murray Hill, N.J., October 1990.

[DKR 91]  Dhillon, I., Karmarkar, N., and Ramakrishnan, K. G., An Overview of the Compilation Process for a New Parallel Architecture, Proceedings of the Fifth Canadian Supercomputing Conference, Fredericton, N.B., Canada, June 1991.

[KAR 90A] Karmarkar, N., A New Parallel Architecture for Sparse Matrix Computations, Proceedings of the Workshop on Parallel Processing, BARC, Bombay, February 1990, pp. 1-18.

[KAR 90B] Karmarkar, N., A New Parallel Architecture for Sparse Matrix Computations Based on Finite Projective Geometries, invited talk at the SIAM Conference on Discrete Mathematics, Atlanta, June 1990.

[VDW 70]  Van der Waerden, B. L., Algebra, Vol. 1, Ungar, 1970.