
Automatic Data Structure Selection and Transformation for Sparse Matrix Computations

Aart J.C. Bik and Harry A.G. Wijshoff

Abstract-The problem of compiler optimization of sparse codes is well known and no satisfactory solutions have been found yet. One of the major obstacles is formed by the fact that sparse programs explicitly deal with particular data structures selected for storing sparse matrices. This explicit data structure handling obscures the functionality of a code to such a degree that optimization of the code is prohibited, for instance, by the introduction of indirect addressing. The method presented in this paper delays data structure selection until the compile phase, thereby allowing the compiler to combine code optimization with explicit data structure selection. This method enables the compiler to generate efficient code for sparse computations. Moreover, the task of the programmer is greatly reduced in complexity.

Index Terms-Data structure selection, data structure transformations, restructuring compilers, sparse matrix computations, program transformations.

1 INTRODUCTION

A SIGNIFICANT part of scientific codes consists of sparse matrix computations, which are difficult to develop and show notoriously bad efficiency on today's supercomputers. In most cases, only a small fraction of the computing power of these computers can be utilized. Many reasons can be given for these effects. First, the codes usually suffer from a lack of spatial locality caused by the irregular data accesses induced by sparse computations. This prohibits efficient cache utilization and reduces memory bandwidth. Another reason is that temporal locality is not as high as in most dense codes, because the amount of possible reuse of data is limited due to the elimination of many operations. Not only does this prevent data locality optimizations, but the communication overhead in distributed memory computers can be substantial. Finally, problems arise because sparse matrices need to be represented in a compact way to keep the storage requirements and computational time at reasonable levels. This causes the representation of a sparse code, either in FORTRAN, with the occurrence of indirect addressing, or in another language, with pointer structures, to be very complex. This is probably the most important problem, because it complicates both the development and maintenance of sparse codes, which are quantitatively and qualitatively more complex than their dense counterparts [20], and because it disables most compiler optimizations.

The latter problem has been recognized for a long time and many efforts have been made to overcome it.

The authors are with the High Performance Computing Division, Dept. of Computer Science, Leiden University, 2300 RA Leiden, the Netherlands. E-mail: [email protected].

Manuscript received Feb. 23, 1993; revised Jan. 4, 1994. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number D95072.

Most of the concurrentization methods proposed in the past, for instance, rely on pre-evaluation of the execution set of an indirectly addressed loop [33], possibly followed by a reordering of this set [35], [36]. These methods can be successfully employed, but only in cases where subscripts remain constant during a period of time, for instance in triangular solvers. However, if subscripts change during the course of loop execution, these methods fail due to the incurred overhead of run-time evaluation. Another problem of these methods is that the selected data structures cannot be changed. As different architectural features might favor different data structures for sparse matrices, the code will stay inefficient regardless of the optimizations performed. So, in effect, to overcome the problems of sparse matrix computations, the compiler should not only be able to perform program transformations but also transformations on the data structures themselves.

In order to tackle the sparse data structure problem, examination of the following generic definition is useful: sparse computations are computations that operate on sparse data structures, and sparse data structures (i.e., sparse matrices) are data structures that are logically contained in enveloping data structures (i.e., dense matrices). The underlying problem for sparse computations now is where to deal with the fact that only part of the enveloping data structure is computed on. The common approach is to deal with this at the programming level. However, it is also possible to deal with this issue at a lower level, i.e., at the compilation level [9], [12]. This implies that the computation is defined on enveloping data structures, which are dense matrices stored in two-dimensional arrays, and that the compiler transforms the code to operate on sparse data structures. This has the advantage that the compiler does not need to extract program knowledge from an obscured code, but is presented with a much cleaner program on which regular data dependence checking and standard optimizations can be performed [8].



Another advantage is that the compiler performs the final data structure selection, possibly in combination with standard program transformations if this selection cannot be resolved efficiently. Saving the initial program defined on the enveloping data structures enables the compiler to re-target the same code to other machines and to instances of the same problem having different underlying nonzero structures of the sparse matrices.

An annotation mechanism is used to identify which of the declared dense data structures in the program are actually sparse. This annotation only needs to occur in the declarative part of the program and as such is not intrusive. Matrices for which a dense data structure is used in the program but which are actually sparse are referred to as implicitly sparse matrices, because the programmer does not have to deal with the sparsity explicitly. This burden is placed on the compiler. After the implicitly sparse matrices are identified, the compiler may need specific information about the nonzero structure of these matrices, e.g., the density or bandwidth, to enable an efficient data structure selection. This information can be obtained by interactive inquiries or annotations. Alternatively, the compiler can automatically evaluate such characteristics using a library function which reads the sparse data structure from file and collects the necessary statistics. An implementation of the latter analysis is presented in [10]. Note that if sparse matrices are stored on file, specific input/output routines have to be inserted by the compiler that will initialize the selected data structures at run-time.
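As a small illustration of the intended programming style, the following sketch shows what an annotated input program might look like; the annotation form C_SPARSE(A) follows the example in Section 2.2, and all sizes and values are illustrative assumptions:

      PROGRAM MV
      INTEGER N, I, J
      PARAMETER (N = 100)
      REAL A(N,N), X(N), Y(N)
C_SPARSE(A)
C     Illustrative initialization: A is zero except for its diagonal.
      DO I = 1, N
        X(I) = 1.0
        DO J = 1, N
          A(I,J) = 0.0
        ENDDO
        A(I,I) = 2.0
      ENDDO
C     Ordinary dense matrix-vector product Y = A*X; the sparsity
C     of A is handled by the compiler, not by the programmer.
      DO I = 1, N
        Y(I) = 0.0
        DO J = 1, N
          Y(I) = Y(I) + A(I,J)*X(J)
        ENDDO
      ENDDO
      END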

The data structure selection and transformation method described in this paper is based on a bottom-up approach and consists of a three-phase process. In the first phase, the instructions in the code that will be affected by the sparsity of the data structures are identified, and transformations are applied to fully exploit the sparsity in these statements. In the second phase, the data structure selection is performed and conflicts are resolved, possibly in combination with the application of loop transformations [7], [13]. In the third phase, the actual data structure transformations are applied and the corresponding sparse code is generated.

The outline of this paper is as follows. Some preliminaries are discussed in Section 2. An overview of the method is given in Section 3, followed by a more detailed description in Sections 4, 5, and 6. Some preliminary experiments and examples are presented in Section 7, while conclusions and topics for further research are stated in Section 8.

2 PRELIMINARIES

In this section, some notational conventions as well as some terminology used throughout the paper are presented.

2.1 Sparse Matrices

An m × n matrix A is defined as the set of elements aij, where (i,j) ∈ [1..m] × [1..n] (the index set of A). If many elements are zero, matrix A is referred to as a sparse matrix. All other matrices are referred to as dense matrices. The nonzero structure of a sparse matrix is defined as follows:

Nonz(A) = { (i,j) ∈ [1..m] × [1..n] | aij ≠ 0 }

Reduction of the storage requirements for a sparse matrix A is obtained by storing the nonzero elements only. Storage required for the numerical values is called primary storage, while storage necessary to reconstruct the underlying matrix is referred to as overhead storage. In some cases it is practical to store some zero elements, because the use of a simpler storage scheme with less overhead storage compensates for the increase in the amount of primary storage, and results in less run-time overhead. Elements that are stored explicitly are called entries. The set E(A), satisfying the following relation, is used to indicate the indices of these entries:

Nonz(A) ⊆ E(A) ⊆ [1..m] × [1..n]

Consequently, (i,j) ∉ E(A) implies aij = 0, but the reverse implication does not necessarily hold. If a fixed set E(A) can be determined at compile-time such that Nonz(A) ⊆ E(A) will always hold at run-time, then a static data structure can be used for A. However, if Nonz(A) changes in an unpredictable manner during program execution, a dynamic data structure is required in which E(A) can be altered at run-time to handle the insertion of a new entry (creation) and the deletion of an entry that becomes a zero element (cancellation). Creation is usually referred to as fill-in in the context of LU-factorization [22], [27], [32], [43].
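As a small illustration, consider a tridiagonal n × n matrix A, for which Nonz(A) ⊆ { (i,j) | |i−j| ≤ 1 }. Choosing the fixed set E(A) = { (i,j) ∈ [1..n] × [1..n] | |i−j| ≤ 1 } at compile-time yields a static band data structure with 3n−2 locations of primary storage and virtually no overhead storage; the price paid is that elements within the band that happen to be zero are stored explicitly as zero entries.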

2.2 Framework

Throughout this paper, we use a FORTRAN-like language, extended with some mathematical notations. For example, we use the following construct to indicate a loop whose index I iterates over all values in an execution set V:

DO I ∈ V
  S(I)
ENDDO

This notation allows for loops with irregular positive strides. A statement appearing in a loop body at nesting depth d is called an indexed statement and, for the loop indices Ī = (I1, ..., Id), we denote such a statement as S(Ī).

An instance of an indexed statement is obtained by substituting a value from the corresponding execution set for each surrounding loop index. For instance, for the execution set V = {1, 5, 8} in the previous example, the instances S(1), S(5) and S(8) will be executed.

Because the input code operates on dense matrices stored in two-dimensional arrays, every occurrence of an implicitly sparse matrix A, of which the sparsity is indicated by an annotation in the declarative part, appears as a two-dimensional array A in a particular nested loop:

REAL A(M,N)
C_SPARSE(A)

DO I1 ∈ V1
  ...
  DO Id ∈ Vd
S1:     ... A(f1(Ī), f2(Ī)) ...
  ENDDO
  ...
ENDDO



For the input code we assume that the bounds of each execution set Vl are affine expressions in the indices I1, ..., Il−1, although this dependence is not denoted by the use of superscripts, for notational convenience. Moreover, in this paper we only consider subscript functions f1 and f2 that can be represented by a single affine mapping F: Z^d → Z², which has the form F(Ī) = v̄ + M·Ī, where Ī = (I1, ..., Id)^T, v̄ = (m10, m20)^T, and M = (mij) is a 2 × d integer matrix.

We assume that subscript bounds are not violated, i.e., F(Ī) ∈ [1..m] × [1..n] holds for all valid values of Ī. The set of indices of all elements operated on by the different instances of S1 in one complete execution of the innermost loop is referred to as an access pattern of the corresponding occurrence of matrix A:

P^Ī = { F(Ī) | Id ∈ Vd }    (1)

The superscript Ī, denoting the surrounding loop indices I1, ..., Id−1, is used to indicate the dependence on these indices. Each fixed value of Ī induces an access pattern. Access patterns are called scalar-wise if either d = 0 or both m1d and m2d are zero, row-wise if m1d = 0 and m2d ≠ 0, and column-wise if m1d ≠ 0 and m2d = 0. In case both elements are nonzero, an access pattern is referred to as diagonal-wise. The direction of an access pattern is

d̄A = (m1d, m2d)
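As a worked example, consider an occurrence A(I, I+J) in an I-J loop nest, so that d = 2 and F(I,J) = (I, I+J). Here m1d = 0 and m2d = 1, so the direction is d̄A = (0,1): for each fixed I, the innermost loop traverses a row-wise access pattern. Similarly, an occurrence A(I + 2*J, 1 + 2*J) has m1d = 2 and m2d = 2, so its access patterns are diagonal-wise.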

3 DATA STRUCTURE SELECTION AND TRANSFORMATION

The computation in the source program is defined on a two-dimensional array REAL A(M,N). This array is an enveloping data structure for the corresponding implicitly sparse m × n matrix A. Although there are many ways to store sparse matrices, the most efficient of which depends on the problem considered, some general sparse storage schemes exist. One possible approach to automatic dense-into-sparse conversion would be to select one of these data structures for every implicitly sparse matrix directly, followed by a corresponding transformation of the program.

This approach is too simple, however, as it does not give any control over the efficiency of the resulting code. To achieve a reasonable level of control, data structure selection must be based on the nonzero structure of the matrix and the actual operations performed by the code. Regarding the latter aspect, this control may be realized by first identifying statements where sparsity can be exploited to save computational time:

Observation: Instances of statements where a zero is assigned to a non-entry or where an arbitrary variable is updated with a zero can be eliminated.

The previous observation is usually only exploited for non-entries of sparse data structures, i.e., accidentally stored zeros (zero entries) are not exploited. Consider, for example, the following fragment in which an implicitly sparse matrix A occurs. We have tabulated the nonzero structure of A, where each 'x' indicates the position of an entry, thereby abstracting from the actual numerical values:

DO I = 1, 3
  DO J = 1, 3
S1:     ACC = ACC + A(I,J)
S2:     A(I,J) = A(I,J) * 2.0
  ENDDO
ENDDO

Nonzero structure of A:    x . x
                           . . x
                           . x .

It is clear that, independent of the value of each entry, execution of only the following statement instances still preserves the semantics of the nested loop shown above (cf. loop-free code [22]):

S1(1,1): ACC = ACC + A(1,1)     S2(1,1): A(1,1) = A(1,1) * 2.0
S1(1,3): ACC = ACC + A(1,3)     S2(1,3): A(1,3) = A(1,3) * 2.0
S1(2,3): ACC = ACC + A(2,3)     S2(2,3): A(2,3) = A(2,3) * 2.0
S1(3,2): ACC = ACC + A(3,2)     S2(3,2): A(3,2) = A(3,2) * 2.0

In some cases, the instances of statements exploiting sparsity can be executed in arbitrary order. This is certainly true if no cross-iteration dependences hold (e.g., a loop containing only S2). If cross-iteration dependences exist, the original execution order must be preserved, although dependences caused by a simple accumulation (cf. S1) can be ignored if roundoff errors, due to inexact computer arithmetic, are allowed to accumulate in a different way [41].

This identification of statements that can exploit sparsity enables the compiler to take a bottom-up approach. In a first phase, every statement in the program is checked for an occurrence of an implicitly sparse matrix A. If it contains one, this statement is conceptually an IF-statement, in which a distinction is made between code operating on entries and code that operates on zero elements. For example, a statement in which A(I,J) occurs can be thought of as the following IF-statement of which, due to the previous observation, the ELSE-branch can be eliminated in some cases:

IF (I,J) ∈ E(A) THEN
  ... A(I,J) ...        ← operation on entry
ELSE
  ... 0.0 ...           ← operation on zero element
ENDIF

Because the presence of guards, like '(I,J) ∈ E(A)' in the previous example, reflects the run-time overhead that is inherent to sparse codes, actions are taken to eliminate this overhead. An important reduction of overhead is obtained by encapsulation of a guard that controls a one-way IF-statement in the execution set of a surrounding loop, as illustrated below:


DO I ∈ V
  IF (φ) THEN
    ...
  ENDIF
ENDDO
      ↓
DO I ∈ { I ∈ V | φ }
  ...
ENDDO

In the second phase, the access patterns through each implicitly sparse matrix are examined. Constraints are imposed on the organization of the data structure of each implicitly sparse matrix to enable the reduction of computational time, while keeping the data structures compact to reduce storage requirements. Since conflicting constraints may result, a conflict-resolving method must be applied. If possible, access patterns are reshaped and execution sets are partitioned [7], [13]. After that, a suitable storage scheme is selected for each implicitly sparse matrix.

Finally, in the third phase, the data structure transformations are applied, i.e., the code is converted into a form that operates on the chosen data structures. These three phases of the data structure selection and transformation method are described in more detail in the following sections.

4 FIRST PHASE

In the first phase, conditions are associated with each statement. Techniques to eliminate the overhead of these conditions, and the resulting constraints on the organization of the data structure to make these techniques feasible, are discussed.

4.1 Abstract Data Structures

Because it is necessary to reason about compact data structures while no actual implementation has been selected yet, the following notation is used in this section.

An abstract data structure A' is used for an implicitly sparse matrix A. Every entry aij of matrix A is stored at A'[σA(i,j)], where a bijective storage function σA: E(A) → AD(A) maps indices of entries to addresses in the address set AD(A) of A'. The computation of such an address will be referred to as a σA-lookup. The opposite computation, i.e., computing the indices (i,j) ∈ E(A) of the entry stored at address AD ∈ AD(A), is denoted by σA⁻¹(AD). As will become clear later, usually only one index of this tuple is required, which gives rise to one of the computations π1·σA⁻¹(AD) or π2·σA⁻¹(AD), where πi·x̄ denotes the component xi.

If the nonzero structure is known in advance, a fixed E(A) and storage function can be chosen and used at compile-time. For such static data structures, many computations can already be performed at compile-time. For dynamic data structures, we may need to adapt σA and E(A) at run-time to account for creation and cancellation. The notion of an abstract data structure enables us to abstract from these matters.
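As a concrete illustration of these notions, the following sketch realizes a σA-lookup for a row-wise compact storage scheme in the style of the (D1) scheme of Section 5.2 (the array names are borrowed from the declarations given there), with address 0 representing ⊥. This is only a possible realization, not the code the compiler necessarily generates:

C     sigma_A-lookup in row-wise (D1)-like storage: the entries of
C     row I occupy AVAL(ALOW(I)) .. AVAL(AHIGH(I)), and AIND holds
C     the column index of each entry.  The result is the address of
C     entry (I,J), or 0 (representing "bottom") if (I,J) is not in
C     E(A).
      INTEGER FUNCTION LKPD1(I, J, M, ASZ, ALOW, AHIGH, AIND)
      INTEGER I, J, M, ASZ
      INTEGER ALOW(M), AHIGH(M), AIND(ASZ), AD
      LKPD1 = 0
      DO AD = ALOW(I), AHIGH(I)
        IF (AIND(AD) .EQ. J) THEN
          LKPD1 = AD
          RETURN
        ENDIF
      ENDDO
      END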

4.2 Conditions Associated with Statements

If an occurrence A(F(Ī)) of an implicitly sparse matrix appears in a statement, where F(Ī) represents the subscript functions (cf. Section 2.2), then conceptually this statement is a two-way IF-statement distinguishing between entries and zero elements in the abstract data structure A'.

For an implicitly sparse matrix A that occurs on the right-hand side of an assignment statement, the following code is appropriate, since either a value must be fetched from A', or a zero is used otherwise:

X = A(F(Ī))   →   IF F(Ī) ∈ E(A) THEN
                    X = A'[σA(F(Ī))]
                  ELSEIF F(Ī) ∉ E(A) THEN
                    X = 0.0
                  ENDIF

In this case, making a distinction seems useless, and we rather use the convention that σA(F(Ī)) = ⊥ holds for F(Ī) ∉ E(A) and that A'[⊥] = 0, to avoid the distinction shown above. However, as explained below, in other cases this distinction may be useful.

If matrix A occurs on the left-hand side and an arbitrary expression on the right-hand side, the two branches must handle the change in value of an entry or creation, respectively:

A(F(Ī)) = ...   →   IF F(Ī) ∈ E(A) THEN
                      A'[σA(F(Ī))] = ...
                    ELSEIF F(Ī) ∉ E(A) THEN
                      A'[new_A(F(Ī))] = ...
                    ENDIF

We have used the function new_A to create a new entry in A. This function returns the address of a new entry aij, and adapts σA, E(A) and AD(A) accordingly as a side-effect.

The '∈ E(A)'- and '∉ E(A)'-tests are referred to as positive and negative guards, respectively. Every different occurrence of an implicitly sparse matrix in a statement introduces an additional positive and negative guard. Hence, statements with k different occurrences of implicitly sparse matrices can be thought of as multi-way IF-statements with 2^k branches, one for every possible conjunction of the corresponding guards.

It should be stressed here that this notation is only used to explain the dense-to-sparse conversion and that the compiler does not necessarily generate exactly this code. However, the resulting IF-statements reflect the test and lookup overhead introduced by the use of compact data structures, and offer the possibility to examine in which branches sparsity can be exploited according to the observation in Section 3.

For example, only one branch remains for the following two assignment statements:


ACC = ACC + A(I,J)           →   IF (I,J) ∈ E(A) THEN
                                   ACC = ACC + A'[σA(I,J)]
                                 ENDIF

A(2*I,J) = A(2*I,J) * 2.0    →   IF (2*I,J) ∈ E(A) THEN
                                   A'[σA(2*I,J)] = A'[σA(2*I,J)] * 2.0
                                 ENDIF

Another, less obvious example in which a branch can be eliminated is shown below. Cancellation is indicated by a call to subroutine del_A:

A(I,J) = 0.0   →   IF (I,J) ∈ E(A) THEN
                     CALL del_A(I,J)
                   ENDIF

Only zeros that are due to constants or non-entries in sparse matrices will be exploited, i.e., we conservatively assume that the contents of dense variables and entries are always nonzero. This implies that accidentally stored zeros are not exploited. However, this decision avoids additional tests on the value of such expressions, which are likely to fail.

For example, the following IF-statement results if matrix B is also an implicitly sparse matrix; it simply ignores accidentally stored zeros of B:

A(I,J) = B(I,J)   →   IF (I,J) ∈ E(A) ∧ (I,J) ∈ E(B) THEN
                        A'[σA(I,J)] = B'[σB(I,J)]
                      ELSEIF (I,J) ∈ E(A) ∧ (I,J) ∉ E(B) THEN
                        CALL del_A(I,J)
                      ELSEIF (I,J) ∉ E(A) ∧ (I,J) ∈ E(B) THEN
                        A'[new_A(I,J)] = B'[σB(I,J)]
                      ENDIF


The condition associated with a statement is defined as the disjunction of all the conjunctions in the branches that have not been eliminated. This associated condition indicates those instances of the statement that have to be executed under the previous conservative assumption. The particular values of the guards determine which action is required in such an instance.

Consider, for example, the resulting IF-statement for the following assignment:

ACC = ACC + A(I,J) + B(I,J)
      ↓

IF (I,J) ∈ E(A) ∧ (I,J) ∈ E(B) THEN
  ACC = ACC + A'[σA(I,J)] + B'[σB(I,J)]
ELSEIF (I,J) ∈ E(A) ∧ (I,J) ∉ E(B) THEN
  ACC = ACC + A'[σA(I,J)]
ELSEIF (I,J) ∉ E(A) ∧ (I,J) ∈ E(B) THEN
  ACC = ACC + B'[σB(I,J)]
ENDIF

The associated condition is:

((I,J) ∈ E(A) ∧ (I,J) ∈ E(B)) ∨
((I,J) ∈ E(A) ∧ (I,J) ∉ E(B)) ∨
((I,J) ∉ E(A) ∧ (I,J) ∈ E(B))

This condition, which simplifies to '(I,J) ∈ E(A) ∨ (I,J) ∈ E(B)', reflects the fact that only instances referencing two non-entries can fully exploit sparsity, if we ignore the fact that some entries may accidentally be zero.
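The simplification itself is a small exercise in Boolean algebra. Writing a for '(I,J) ∈ E(A)' and b for '(I,J) ∈ E(B)':

(a ∧ b) ∨ (a ∧ ¬b) ∨ (¬a ∧ b) = (a ∧ (b ∨ ¬b)) ∨ (¬a ∧ b) = a ∨ (¬a ∧ b) = a ∨ b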

Fortunately, it is not necessary to actually construct the IF-statements in order to determine the condition of a statement. In [12], [14] we have presented a simple attribute grammar (cf. [1], [25]) based on a context-free grammar for assignment statements, directly computing the associated conditions. Some assignment statements and the associated (simplified) conditions are shown in Table 1.

TABLE 1
ASSIGNMENT STATEMENTS AND ASSOCIATED CONDITIONS

Statement                          Condition
B(I,J) = B(I,J) + A(2*I,J)         (2*I,J) ∈ E(A)
A(I,J) = -3.0 * SQRT(A(I,J))       (I,J) ∈ E(A)

Assignment statements are not the only statements for which conditions composed of guards result. Such conditions can also be associated with one-way IF-statements. Using semantic analysis, conditions are determined under which the boolean expression may hold. For example, condition '(J,I) ∈ E(A)' is associated with the following IF-statement, since the boolean expression necessarily evaluates to false for all non-entries:

IF (ABS(A(J,I)) > ABS(PV)) THEN
  ...
ENDIF
      ↓
IF (J,I) ∈ E(A) THEN
  IF (ABS(A'[σA(J,I)]) > ABS(PV)) THEN
    ...
  ENDIF
ENDIF

Again, the conditions can be determined directly by means of an attribute grammar without actually constructing more IF-statements. Some examples are shown in Table 2. Details can be found in [14].

TABLE 2
IF-STATEMENTS AND ASSOCIATED CONDITIONS

IF-statement                          Condition
IF (A(I,J) = 1.0) ...                 (I,J) ∈ E(A)
IF (A(I,I) >= 1.0 + ABS(X)) ...       (I,I) ∈ E(A)
IF (A(I,J) = X) ...                   (I,J) ∈ E(A)

4.3 Dominating Guards

For some statements, the value of the associated condition is completely dependent on the value of particular positive guards, as in the following example:

ACC = ACC + A(I,J) * B(I,J)
      ↓
IF (I,J) ∈ E(A) ∧ (I,J) ∈ E(B) THEN
  ACC = ACC + A'[σA(I,J)] * B'[σB(I,J)]
ENDIF


Such situations frequently occur in linear algebra operations that are classified as static or simply dynamic [43]. For example, multiplication of a matrix by a scalar (A ← α·A) is called a simply dynamic operation, and only a positive guard appears in the condition:

DO I = 1, M
  DO J = 1, N
    A(I,J) = ALPHA * A(I,J)
  ENDDO
ENDDO
      ↓
DO I = 1, M
  DO J = 1, N
    IF (I,J) ∈ E(A) THEN
      A'[σA(I,J)] = ALPHA * A'[σA(I,J)]
    ENDIF
  ENDDO
ENDDO

A positive guard ψ dominates a condition φ if ¬ψ ⇒ ¬φ is a tautology. In words: if ψ does not hold, φ will not hold. For example, both guards in the condition '(I,J) ∈ E(A) ∧ (I,J) ∈ E(B)' are dominating. However, neither of the guards dominates the condition '(I,J) ∈ E(A) ∨ (I,J) ∈ E(B)'.

The presence of guards and σA-lookups reflects the inherent sparse overhead of scanning a compact data structure to determine if and where an entry is stored. Additionally, skipping operations by means of conditionals does not reduce the execution time on most machines [22], [32]. Therefore, overhead-reducing techniques are required. Dominating guards give rise to a convenient overhead-reducing method, discussed in the next section, because sparse data structures usually provide efficient generation of all entries, i.e., elements for which the guard holds, along an access pattern.

4.4 Guard Encapsulation and Access Pattern Expansion

The disjunction of all conditions associated with the statements in a loop body is called the loop condition. This condition indicates the iterations in which at least one statement instance has to be executed. This implies that if a guard ψ dominates a loop condition χ, iterations in which ψ does not hold need not be executed.

Consequently, if m1d or m2d is nonzero, then we would like to encapsulate such a dominating guard 'F(Ī) ∈ E(A)' in the execution set of the innermost loop. This guard encapsulation yields the following irregular execution set:

{ Id ∈ Vd | F(Ī) ∈ E(A) }

Only operations on the entries along each access pattern, i.e., P^Ī ∩ E(A), where Ī denotes I1, ..., Id−1, are necessary. Since, in general, the previous execution set cannot be generated directly without test overhead, the following conversion is performed:


DO Id ∈ Vd
  X = X + A(F(Ī))
ENDDO
      ↓
DO AD ∈ PA^Ī
  X = X + A'[AD]
ENDDO

This conversion is only feasible if the organization of the selected sparse storage scheme enables fast generation of the addresses in the following address set:

PA^Ī = { σA(t̄) | t̄ ∈ P^Ī ∩ E(A) } ⊆ AD(A)

In this case, only iterations in which an entry is operated on, i.e., an element for which the dominating guard holds, are executed. Fewer iterations result, since the size of the execution set decreases. In addition, test overhead is eliminated completely. Because AD = σA(F(Ī)) holds and σA is invertible, we have the following equations:

π1·σA⁻¹(AD) = m10 + m11·I1 + ... + m1d·Id
π2·σA⁻¹(AD) = m20 + m21·I1 + ... + m2d·Id    (2)

Consequently, for column-wise access patterns (m1d ≠ 0), the value of Id is restored as follows:

Id = (π1·σA⁻¹(AD) − m10 − m11·I1 − ... − m1,d−1·Id−1) / m1d

For row-wise access patterns π2·σA⁻¹(AD) is required, while for diagonal-wise access patterns one of the components of the σA⁻¹(AD) tuple suffices to solve an equation in (2). This clearly illustrates the fact that, e.g., row-wise oriented data structures require the column index of every entry to be stored.

Guard encapsulation is feasible if the following constraints on the organization of the sparse data structure are satisfied:

• Fast generation of the addresses AD ∈ PA^Ī is supported for all valid Ī.
• If cross-iteration dependences that cannot be ignored are present (see Section 3), then address generation corresponds to the original execution order of this loop.
• To restore the value of Id, either π1·σA⁻¹(AD) if m1d ≠ 0, or π2·σA⁻¹(AD) if m2d ≠ 0, can be supplied efficiently together with each address AD ∈ AD(A).

For example, in the following fragment the guard '(I, I+J) ∈ E(A)' dominates the (identical) loop condition. Hence, guard encapsulation in the execution set of the J-loop is possible and requires π2·σA⁻¹(AD) to reconstruct the value of J. For 1 ≤ I ≤ M, the set PA^I contains the addresses of all entries along the access pattern P^I = { (I, I+J) | 1 ≤ J ≤ 3 }:


DO I = 1, M
  DO J = 1, 3
    X = X + A(I, I+J)*J
  ENDDO
ENDDO
      ↓
DO I = 1, M
  DO AD ∈ PA^I
    J = π2·σA⁻¹(AD) − I
    X = X + A'[AD]*J
  ENDDO
ENDDO

It is important to realize that the requirement of fast generation of entries along an access pattern imposes constraints on the organization of the data structure. The actual addresses of the entries and their positions in the matrix will, in general, only be known at run-time. Consequently, the output of the first phase consists of a number of constraints on the organization of the data structure, in order to enable the implementation of all possible guard encapsulations. In the second phase, an actual organization is selected according to these constraints. If not all constraints can be satisfied, encapsulation cannot be implemented for some occurrences.

Some complications arising in the actual implementation of guard encapsulation have not been discussed. For example, cancellation or creation of entries along the access pattern for which guard encapsulation is applied may be responsible for changes in the values of guards. Also, addresses may change due to data movement or, for some data structures, due to an occasionally required garbage collection. Additionally, in some cases a dominating guard can be hoisted out of the innermost loop, which effectively enables guard encapsulation at a higher level. Some of these issues will be further elaborated during the presentation of the third phase.

In some cases, the union of all address sets can be taken, resulting in further guard encapsulation. For example, the following nested loop can be converted into a single loop iterating over all addresses in arbitrary order, where each PA^I denotes the set of addresses of all entries in row I:

DO I = 1, M
  DO J = 1, N
    A(I,J) = A(I,J) * 3.0
  ENDDO
ENDDO
      ↓
DO AD ∈ PA¹ ∪ ... ∪ PA^M
  A'[AD] = A'[AD] * 3.0
ENDDO

However, in these cases usually both elements of the σA⁻¹(AD) tuple, i.e., the row and the column index, are required to solve the equations in (2). Because this requires coordinate scheme-like storage [22], [43] with substantial storage overhead, this kind of encapsulation is not further considered in this paper.

A related overhead-reducing technique imposing similar constraints on the organization of the data structure is formed by expansion. The compiler may decide to expand the entries along a sparse access pattern into a dense representation with a scatter operation, so that operations on this dense representation do not suffer from the inherent sparse lookup overhead [18], [22], [32].

For example, no substantial reduction in computational time can be obtained for an assignment statement like 'D(I,J) = D(I,J) * A(I,J)', where only A is an implicitly sparse matrix, because in that case the associated condition is 'true'. However, the storage requirements of A can be kept low and lookup overhead in the innermost loop is avoided if the following conversion is applied, where PA^I again denotes the set of addresses of all entries in row I:

DO I = 1, 100
  DO J = 1, 100
    D(I,J) = D(I,J) * A(I,J)
  ENDDO
ENDDO
      ↓
DO I = 1, 100
  DO AD ∈ PA^I                      ← scatter
    AP(π2·σA⁻¹(AD)) = A'[AD]
  ENDDO
  DO J = 1, 100
    D(I,J) = D(I,J) * AP(J)
  ENDDO
ENDDO

Only one temporary dense array, initialized to zeros, is required to store each row. Changes in the nonzero structure can easily be accounted for if a so-called switch technique [32] is used, where a bit array records which elements are entries. Storage is obtained or released directly to account for creation or cancellation if necessary. The actual values are stored back afterwards with a gather operation. Details of such operations can be found in, e.g., [22], [32]. Expansion is feasible under constraints that are very similar to the requirements for guard encapsulation: the addresses of entries, together with either row or column indices along an access pattern, can be generated easily, although the order in which this generation occurs is immaterial.
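A minimal sketch of such an expansion, under the assumption of (D1)-style row storage and with a logical array SW acting as the switch; AP must contain only zeros and SW only .FALSE. on entry, and both are restored by the gather so that one pair of working arrays suffices for all rows:

      SUBROUTINE EXPROW(I, N, ASZ, ALOW, AHIGH, AIND, AVAL, D, AP, SW)
      INTEGER I, N, ASZ
      INTEGER ALOW(*), AHIGH(*), AIND(ASZ), AD, J
      REAL AVAL(ASZ), D(N), AP(N)
      LOGICAL SW(N)
C     Scatter: expand the entries of row I into the dense array AP.
C     SW records which elements are entries; it would be consulted
C     on creation, which this particular operation cannot cause.
      DO AD = ALOW(I), AHIGH(I)
        AP(AIND(AD)) = AVAL(AD)
        SW(AIND(AD)) = .TRUE.
      ENDDO
C     Operate on the dense representation without lookup overhead.
      DO J = 1, N
        D(J) = D(J)*AP(J)
      ENDDO
C     Gather: store the values back and reset AP and SW to zeros.
      DO AD = ALOW(I), AHIGH(I)
        AVAL(AD) = AP(AIND(AD))
        AP(AIND(AD)) = 0.0
        SW(AIND(AD)) = .FALSE.
      ENDDO
      END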

Since some access patterns are very likely to be subsets of other access patterns, the constraint of fast generation of addresses and indices can be relaxed for both guard encapsulation and expansion, at the expense of some run-time overhead. This is possible by allowing for fast generation of the addresses of entries and indices along an enveloping access pattern P̄^Ī ⊇ P^Ī.

This relaxation is only feasible if at run-time it can be determined whether an entry belongs to the actual access pattern. If only longitudinal enveloping access patterns are considered, this can be done for each entry by a simple test for inclusion 'Id ∈ Vd' on the restored loop index (as further elaborated in Section 6).

A longitudinal enveloping access pattern of (1) may have the following form, for offsets r = m10 + a1·I1 + ... + a_{d−1}·Id−1 and c = m20 + b1·I1 + ... + b_{d−1}·Id−1, and for g = gcd(m1d, m2d):

P̄^Ī = { (r + (m1d/g)·T, c + (m2d/g)·T) | T1 ≤ T ≤ T2 }


This access pattern can be extended in both directions by choosing T1 ≤ g·min(Vd) and T2 ≥ g·max(Vd), as long as subscript bounds are not violated. Some examples of longitudinal enveloping access patterns are shown below:

P^I = { (100 − 3·J, I) | 1 ≤ J ≤ 33 }   →   P̄^I = { (100 − J, I) | 3 ≤ J ≤ 99 }
P   = { (4·J, 2·J) | 1 ≤ J ≤ 25 }       →   P̄   = { (2·J, J) | 2 ≤ J ≤ 50 }
P^I = { (I, J) | 1 ≤ J ≤ I }            →   P̄^I = { (I, J) | 1 ≤ J ≤ N }

The latter relaxation, for example, enables the compiler to select row-wise storage of the entries of an implicitly sparse matrix, although the particular occurrence to which this access pattern belongs only accesses entries in the lower triangular part. In this manner, the compiler is able to select a data structure that supports different kinds of access patterns along the same direction. Consideration of 'wider' enveloping access patterns would require more index information per entry to test inclusion in the actual access pattern.

4.5 Guard Manipulations

Certain conventional program transformations can be used to increase the potential of guard encapsulation:

• Loop Distribution: Apply loop distribution in order to isolate parts of the loop body that are dominated by a particular guard.

In the following example, the loop condition is '(I,J) ∈ E(A) ∨ (I,J) ∈ E(B)'. Hence, iterations in which an entry of A or B is referenced must be executed. However, encapsulation of a disjunction is only feasible if a restrictive relation between the nonzero structures and the addresses of the entries of A and B holds,¹ which, in general, is not true. After loop distribution, guard encapsulation becomes feasible for both resulting loops. This transformation may be preferable over a loop with an encapsulated disjunction, since the total number of statement instances executed decreases if the nonzero structures of A and B differ substantially:

DO I = 1, M
  DO J = 1, N
    ACC1 = ACC1 + A(I,J)
    ACC2 = ACC2 + B(I,J)
  ENDDO
ENDDO
      ↓
DO I = 1, M
  DO AD ∈ PA^I
    ACC1 = ACC1 + A'[AD]
  ENDDO
  DO AD ∈ PB^I
    ACC2 = ACC2 + B'[AD]
  ENDDO
ENDDO

1. Alternatively, encapsulation of conjunctions and disjunctions can be implemented by a so-called in-phase scan [22] if the entries are sorted on index, but this technique is not further considered here.

Loop distribution is valid if no lexically backward static dependences hold. Determination of the maximal strongly connected components in the dependence graph and statement reordering are required to make effective use of loop distribution. It also improves data locality (cf. fission by name [31]).

The inverse transformation, loop fusion, is usually used to reduce the loop overhead of adjacent loops and to increase data locality (cf. fusion by name [31]). It can also be applied to reduce the overhead of loops with the potential of encapsulation:

• Loop Fusion: Fuse two adjacent loops with loop conditions that have identical dominating guards.

Loop fusion of two loops is allowed if both loop bounds are identical (which can be achieved by adjusting the bounds, partially unrolling a loop, or adding some conditionals), and if no dependences hold from statement instances of the first loop to statement instances that correspond to later iterations of the second loop. Statement reordering can assist in making loops adjacent, as demonstrated in the following example, where one loop with an encapsulated guard results:

DO I = 1, M
  DO J = 1, N
    ACC = ACC + 2.0 * A(I,J)
  ENDDO
  COPY(I) = ACC
  DO J = 1, N
    A(I,J) = 3.0 * A(I,J)
  ENDDO
ENDDO
      ↓
DO I = 1, M
  DO AD ∈ PA^I
    ACC = ACC + 2.0 * A'[AD]
    A'[AD] = 3.0 * A'[AD]
  ENDDO
  COPY(I) = ACC
ENDDO

Finally, some overhead can be removed as follows:

• Update Expression Splitting: Split sparse subexpressions of the right-hand side expression in an updating statement over several updating statements, by application of the associativity, commutativity, and distributivity laws of the arithmetic operators.

In the following example, two implicitly sparse matrices are referenced in an updating statement. The condition associated with this statement is '(I,J) ∈ E(A) ∨ (I,J) ∈ E(B)'.

Y = Y + A(I,J)**3 + B(I,J)**3
      ↓

IF (I,J) ∈ E(A) ∧ (I,J) ∈ E(B) THEN
  Y = Y + A'[σA(I,J)]**3 + B'[σB(I,J)]**3
ELSEIF (I,J) ∈ E(A) ∧ (I,J) ∉ E(B) THEN
  Y = Y + A'[σA(I,J)]**3
ELSEIF (I,J) ∉ E(A) ∧ (I,J) ∈ E(B) THEN
  Y = Y + B'[σB(I,J)]**3
ENDIF

Clearly, this statement suffers from the same problem as the first example of this section: guard encapsulation of a disjunction is not possible in general, and could result in too many expensive operations on zeros.


Rewriting the statement into two statements with, respectively, the conditions '(I,J) ∈ E(A)' and '(I,J) ∈ E(B)' eliminates this problem:

Y = Y + A(I,J)**3
Y = Y + B(I,J)**3

After this transformation, loop distribution of a surrounding loop becomes possible if the dependence cycle is recognized as a coupled reduction [41]. This increases the potential of encapsulation, since both resulting loop conditions have a dominating guard.

4.6 Remaining Topics

Information about the nonzero structure, either user-supplied or obtained from on-file analysis as described in [10], can be used by the compiler in the decision to use dense storage for certain regions of the sparse matrix (e.g., a few diagonals or a band). This eliminates all overhead due to lookups, creation, cancellation, etc. for references to these regions, while the number of redundant operations performed is probably small. Usually, program transformations are required to isolate the operations on such regions before a suitable data structure can be selected [13].

Likewise, if particular regions are completely zero and will remain so at run-time, the compiler can eliminate all code performing useless operations on such regions [12], [13].


5 SECOND PHASE

Every occurrence of an implicitly sparse matrix A induces access patterns P^Ī that effectively form paths through the index set of the matrix. The elements along every path can be viewed as a vector that, depending on the region of A accessed, is either dense or sparse. In order to enable guard encapsulation or expansion for sparse vectors, the compiler must select an organization of the data structure in which the entries along these paths can be generated easily at run-time. Ideally, an organization would be selected such that the entries along each access pattern occurring in the program can be generated easily. However, since this would demand extensive overhead storage, the storage is organized according to non-overlapping access patterns.

First, a convenient characterization of the access patterns of one occurrence is constructed, referred to as an access pattern collection. Subsequently, for each implicitly sparse matrix, the compiler examines all the collections in the program to detect possible conflicts. Finally, after these conflicts have been resolved, an organization according to the access patterns in the collections is selected, and a storage scheme for the matrix as a whole is constructed.

Furthermore, constructs in which whole implicitly sparse matrices are passed as parameters to procedures (functions or subroutines) are dealt with by cloning the called procedures [16]. This is done to enforce a unique correspondence between implicitly sparse matrices and the formal arguments that may be associated with these matrices, as is illustrated below, where only A is used as an enveloping data structure:

CALL FACT(A, 100)                 CALL FACT_A0(A, 100)
CALL FACT(B, 10)                  CALL FACT(B, 10)
CALL FACT(C, 90)                  CALL FACT(C, 90)
...                         →     ...
SUBROUTINE FACT(MAT, N)           SUBROUTINE FACT(MAT, N)
  REAL MAT(N,N)                     REAL MAT(N,N)
  ...                               ...
END                               END
                                  SUBROUTINE FACT_A0(MAT, N)
                                    REAL MAT(100,100)
                                    ...            ← MAT ≡ A, N ≡ 100
                                  END

In this manner, program and data structure transformations can be applied to the occurrences of each implicitly sparse matrix without interfering with other data structures that may be associated with the same formal arguments. Details about this kind of cloning can be found in [11].

5.1 Access Pattern Collections

In order to get a better grip on all patterns P^Ī of one occurrence, these access patterns are characterized by an access pattern collection. This collection is determined by the compiler and is represented by an approximation of the index set accessed by that occurrence in terms of a two-dimensional simple section [2], [3], i.e., a region in the matrix, and a normalized direction d̄A along which this index set is traversed. For access patterns of the form (1), d̄A is defined as follows:

d̄A = s · (m1d / g, m2d / g), where g = gcd(|m1d|, |m2d|) and s = −1 if m1d·m2d < 0, s = +1 otherwise

Consider, for example, the following occurrence of an implicitly sparse matrix:

DO I = 1, 5
  DO J = 0, 2
    ... A(I + 2*J, 1 + 2*J) ...
  ENDDO
ENDDO

For this occurrence, the normalized direction is d̄A = (1,1), while the simple section approximating the index set accessed by this occurrence consists of the following discrete points:

{ (i,j) ∈ [1..9] × [1..5] | 0 ≤ i−j ≤ 4 ∧ 2 ≤ i+j ≤ 14 }    (3)

Consequently, the access pattern collection illustrated in Fig. 1 results. Due to direction normalization, additional points are included in the collection, so that, in effect, five longitudinal enveloping access patterns result, as indicated in the figure.



Fig. 1. Access pattern collection (legend: accessed points, included points, and the approximating simple section).

Access pattern collections provide a convenient and uniform characterization in which the distinction between partially overlapping or multiply traversed access patterns of one occurrence is eliminated. Pathological access patterns can also be handled by the compiler, while regular access patterns, which are likely to occur in most numerical codes, remain almost unaffected. Moreover, the representation can be manipulated efficiently and supports fast tests for overlap. Another advantage is that for a particular element, the number p of the access pattern in the collection along which this element may be stored can be computed easily. In the previous example, for instance, an element aij within the region described by the simple section appears along access pattern p = i − j + 1. This computation is not so straightforward in general for arbitrary P^Ī. More details are given in [12], [14].

5.2 Data Structures

For each implicitly sparse matrix, the compiler tries to construct a number of access pattern collections of which the simple sections are mutually disjoint, possibly assisted by program transformations as explained in the next section. Subsequently, the entries along the access patterns in one collection are stored either in full-sized arrays or as a number of sparse vectors. The selection of different storage formats for different regions of the same matrix enables the compiler to account for characteristics of the nonzero structure of this matrix.

Entries along access patterns through dense regions of the matrix are stored in full-sized arrays, which do not suffer from sparse overhead because σA- and σA⁻¹-lookups are trivial. For each access pattern collection, a separate array is used.

Entries along all other access patterns, i.e., access patterns through sparse regions of the matrix, are stored as a pool of sparse vectors in either a compact or a linked-list storage scheme (cf. [22], [32], [39], [43]), referred to as (D1) or (L1), respectively. Sparse vectors belonging to different access pattern collections are stored in the same data structure, although the stored index information and the direction along which these vectors are stored may differ per collection.

Which of the two schemes (D1) or (L1) is preferable depends on the architecture of the target machine.

Fig. 2. Sparse data structures.

In both schemes, the numerical values of the entries are stored in an array AVAL, while the corresponding column or row indices are stored in a parallel integer array AIND to support fast σA⁻¹-lookups. As illustrated in Fig. 2, the schemes differ in the way in which the entries along each pth access pattern are stored, consecutively in (D1) or as a linked list in (L1), and in the information required for each access pattern: the first and last location in (D1), or a pointer to the first element in (L1).

Enough working space must be available to account for creation in the case of dynamic data structures. An index ordering (determined by links for (L1)) between the entries along one access pattern can be maintained at the penalty of more expensive insertions and deletions. An insertion or deletion in (L1) only affects the values of a few links. For (D1), the addresses of other entries in the same access pattern may alter as a result of data movement. Extra data movement results if an insertion is performed in an access pattern with no directly surrounding space. In this case, the access pattern is moved to the end of the array, where most working space is kept. If insufficient space remains, garbage collection is required to compress all data, which might affect the addresses of all entries.
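To illustrate why insertions in (L1) are cheap, the following sketch performs an unordered insertion into sparse vector P of a linked-list scheme; the arrays correspond to the (L1) declarations given at the end of this section, the argument list is an assumption made for illustration, and exhaustion of the free list is not handled:

      INTEGER FUNCTION INSL1(P, J, VAL, ANP, ASZ,
     +                       AVAL, AIND, ALINK, AFIRST, AFREE)
      INTEGER P, J, ANP, ASZ
      INTEGER AIND(ASZ), ALINK(ASZ), AFIRST(ANP), AFREE, AD
      REAL VAL, AVAL(ASZ)
C     Take a free location from the free list (garbage collection
C     or an error would be needed on exhaustion; not shown here).
      AD = AFREE
      AFREE = ALINK(AD)
C     Fill the location and link it in front of sparse vector P;
C     only a few links change and no existing addresses move.
      AVAL(AD) = VAL
      AIND(AD) = J
      ALINK(AD) = AFIRST(P)
      AFIRST(P) = AD
      INSL1 = AD
      END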

5.3 Data Structure Selection and Conflict Resolving Methods

Consider, for example, the following code that computes the sum of the elements in the upper and strict lower triangular part of a square implicitly sparse matrix A. Both the occur- rences AI and A2 (subscripted to make a distinction) give rise to

a dominating guard '(I, J) E E(A)', which can be encapsulated in the execution set of the surrounding loop as is shown below, where PA;, and PA;, contain the addresses of all entries in the strict lower and upper t r i a n e part of row I respectively:

DO I = 1, M DO J = 1, 1-1

SUML = SUML + Ai(1, J) ENDDO DO J = I, M

ENDDO SUMU = SUMU + A2(1, J)

ENDDO

-1

Page 11: Automatic data structure selection and transformation for sparse matrix computations


DO I = 1, M
  DO AD ∈ PA1^I
    SUML = SUML + A'[AD]
  ENDDO
  DO AD ∈ PA2^I
    SUMU = SUMU + A'[AD]
  ENDDO
ENDDO

Since the simple sections computed for these occurrences are non-overlapping, the constraints of both encapsulations can be satisfied if the entries along the rows of the upper and strict lower triangular parts are stored in a pool of 2M sparse vectors in either (D1) or (L1) storage. However, if the following fragment occurs in the same program, a conflict arises, since the simple section computed for A3 overlaps with the simple sections of the other two occurrences, while column-wise accesses are induced by this third occurrence:

DO J = 1, M
  DO I = 1, M
    B(I) = B(I) + A3(I,J)
  ENDDO
ENDDO

Dealing with such conflicts in the selected data structure is infeasible in general, since this usually demands extensive overhead storage in order to support the fast generation of entries along different kinds of access patterns (maintaining the entries of a matrix by rows and by columns requires expensive linked-list implementations, while additional pointers are required to separate entries in the strict lower triangular part from entries in the upper triangular part).

In the previous example, one step towards resolving the conflict is performed by reshaping the access patterns with loop interchanging:

DO J = 1, M
  DO I = 1, M
    B(I) = B(I) + A3(I,J)
  ENDDO
ENDDO
      ↓
DO I = 1, M
  DO J = 1, M
    B(I) = B(I) + A3(I,J)
  ENDDO
ENDDO

After this transformation, all directions in the program are equal and general row-wise storage can be selected. However, the disadvantage of this selection is that if encapsulation is applied, tests are required to test inclusion within the original execution set, because longitudinal access patterns are used for the first two occurrences (see Section 4.4).

Since this disadvantage is due to the fact that the simple sections overlap but are not identical, the conflict can be completely resolved by using index set splitting [42] to fragment the simple section of the third occurrence according to the other simple sections:

DO I = 1, M
  DO J = 1, M
    B(I) = B(I) + A3(I,J)
  ENDDO
ENDDO
      ↓
DO I = 1, M
  DO J = 1, I-1
    B(I) = B(I) + A3a(I,J)
  ENDDO
  DO J = I, M
    B(I) = B(I) + A3b(I,J)
  ENDDO
ENDDO

After this transformation, no more conflicts remain and a data structure organization in which the entries along the rows of the upper and strict lower triangular parts are stored in a pool of 2M sparse vectors becomes possible again. A more detailed discussion about resolving conflicts using access pattern reshaping, based on the framework of unimodular transformations [4], [5], [29], [40], and the fragmentation of simple sections can be found in [7], [13], [14]. However, because in general not all conflicts can be resolved, strategies to control these mechanisms are required. The identification of such strategies is currently one of our research interests.

Once a data structure organization has been selected for an implicitly sparse matrix A, the generation of the corresponding declarations is straightforward. For the pool of sparse vectors, one of the following declarations results. Parameter ρ is the density of the matrix A, which is either user-supplied or obtained by on-file analysis of the matrix [10]. Furthermore, u denotes the total number of access patterns in all the collections for which sparse storage is selected, v is the total number of elements along these access patterns, and w is the maximum number of elements along one access pattern. These parameters can be determined automatically from all sparse collections:

INTEGER ANP, ASZ

(D1): PARAMETER (ANP = u, ASZ = ρ·v + 2·w)²
      REAL AVAL(ASZ)
      INTEGER AIND(ASZ), AHIGH(ANP), ALOW(ANP), ALAST

(L1): PARAMETER (ANP = u, ASZ = ρ·v)
      REAL AVAL(ASZ)
      INTEGER AIND(ASZ), ALINK(ASZ), AFIRST(ANP), AFREE

Scalar ALAST points to the last used element in (D1) to support extra data movement, while AFREE is a pointer to a linked list in (L1) that contains all free elements. For each sparse access pattern collection, an offset is recorded. This initially zero offset is increased by the number of access patterns in each collection. Consequently, if off denotes the offset associated with a particular collection, the first element belonging to the pth access pattern in this collection can be found at location ALOW(off + p) or AFIRST(off + p), respectively.

2. The size of the extra working space to prevent frequent garbage collection is inspired by [21], [24].




For every access pattern collection that requires dense storage, the following declaration is generated, where a unique label lab is used for each collection. Parameters u and w denote the number of access patterns and the maximum length of an access pattern in that collection, respectively:

REAL ADNSlab(w, u)

Because the maximum length is taken, this format is not very well suited for triangular-shaped dense access patterns. Access patterns are stored along the columns of this array, so that the elements of one access pattern are stored consecutively in memory, improving the data locality and vector-instruction performance of consistent accesses.


6 THIRD PHASE

In this section, we present how code operating on the selected data structures is generated. Compiler-supplied primitives are presented first, followed by a discussion of code generation for specific program constructs. More details about code generation can be found in [12], although the generation of effective sparse code is still one of our research topics.

6.1 Primitives

The definition of certain primitive operations on the data structures (D1) and (L1) is supplied by the compiler in a separate file sparselib.f, so that these primitives can be used in the generated code. Below, we list some basic operations with a short description, where '_' can be either D1 or L1:

LKP_( )     Address of an entry, or ⊥ for non-entries
INS_( )     Insertion of an entry
DEL_( )     Deletion of an entry at a location
OINS_( )    Ordered insertion of an entry
ODEL_( )    Ordered deletion of an entry

Finally, two operations are supplied to support expansion:

SCAT-()   Expansion into dense
GATH-()   Compression into sparse
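As an illustration of what such a primitive may look like, a minimal sketch of the lookup for (D1) follows, in which a linear scan is used and ⊥ is represented as 0 (the actual routines in sparselib.f may be implemented differently):

      INTEGER FUNCTION LKPD1(AIND, LOW, HIGH, E)
C     Scan one access pattern, stored at AIND(LOW)..AIND(HIGH),
C     for an entry with index E; return its address, or 0 if the
C     element is not an entry.
      INTEGER AIND(*), LOW, HIGH, E, AD
      LKPD1 = 0
      DO AD = LOW, HIGH
        IF (AIND(AD) .EQ. E) THEN
          LKPD1 = AD
          RETURN
        ENDIF
      ENDDO
      RETURN
      END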

6.2 Encapsulated Guards

If guard encapsulation in the execution set of the I_d-loop is feasible, and cancellation or creation along the (enveloping) access patterns cannot occur, the following conversion is implemented:

DO I_d ∈ V_d
  IF (F(ī) ∈ E(A)) THEN
    ... A(F(ī)) ...
  ENDIF
ENDDO
        ↓
DO AD ∈ P'
  ... A'[AD] ...
ENDDO

Because a certain p can be determined independently of the value of I_d, such that the pth access pattern in a collection forms an enveloping access pattern of P' [12], the first location in the data structure is found through D = off + p, where off denotes the offset associated with the collection. Depending on which sparse storage scheme is selected, one of the following frameworks is generated:

(D1): DO AD = ALOW(D), AHIGH(D)
        ...
      ENDDO

(L1): AD = AFIRST(D)
      DO WHILE (AD ≠ ⊥)
        ...
        AD = ALINK(AD)
      ENDDO

Within the loop body, the value of I_d is restored. Thereafter, inclusion in V_d is tested in case longitudinal enveloping access patterns are stored in the collection. The resulting code for column-wise access patterns is shown below, where we assume that V_d = [l_d, u_d]:

I_d = AIND(AD) − v_1 − m_l1·I_1 − ... − m_l,d−1·I_{d−1}
IF (MOD(I_d, m_ld) = 0) THEN
  I_d = I_d / m_ld
  IF ((l_d ≤ I_d) ∧ (I_d ≤ u_d)) THEN
    ...
  ENDIF
ENDIF

In this construct, the original loop body is generated, in which occurrences A(F(ī)) that correspond to the encapsulated guard are replaced by AVAL(AD). Redundant operations in the index computation are eliminated, while the whole computation may be omitted if I_d is not used in the tests or the loop body.

For example, the following code would result for the first fragment of Section 5.3 if sparse vectors along the rows in the strict lower triangular part and upper triangular part are stored at locations 1...M and M+1...2·M, respectively (the first collection has offset 0 and the offset of the second collection is M, so that D is I and M+I, respectively):

DO I = 1, M
  DO AD = ALOW(I), AHIGH(I)
    SUML = SUML + AVAL(AD)
  ENDDO
  DO AD = ALOW(M+I), AHIGH(M+I)
    SUMU = SUMU + AVAL(AD)
  ENDDO
ENDDO

If the execution order must be preserved, an ordering on the entries is maintained, and the same constructs can be used. Possible changes in the address set due to cancellation and creation can also be accounted for if entries are ordered on indices. For example, for (D1) a WHILE-loop can be used to account for such changes:

AD = ALOW(D)
DO WHILE (AD ≤ AHIGH(D))
  ...
  AD = AD + 1
ENDDO

Statement 'AD = AD − ALOW(D)' is executed before each operation that possibly affects the address set. After that operation, either 'AD = ALOW(D) + AD' or 'AD = ALOW(D) + AD + 1' is executed to restore the value of AD. For example, after ordered insertion of an entry before the entry operated on, the offset to the base address has to be incremented. This is illustrated in Fig. 3 for the ordered insertion of a_32 in row-wise storage during operation on another entry of the third row, where extra data movement occurs. Insertions in other access patterns of this collection that possibly cause garbage collection are correctly handled by executing the first statement.

Fig. 3. Extra data movement.
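A sketch of this bookkeeping around an ordered insertion is shown below (the argument list of the ordered-insertion primitive OINSD1 is assumed here to be analogous to that of INSD1):

C     Keep AD valid across an ordered insertion that may move
C     data or trigger garbage collection: save AD as an offset,
C     insert, and restore; the inserted entry precedes the entry
C     operated on, so one extra position is added.
      AD = AD - ALOW(D)
      CALL OINSD1(AVAL, AIND, ALOW, AHIGH, D, ANP, ASZ, ALST, E, VAL)
      AD = ALOW(D) + AD + 1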

6.3 Dense Storage

An occurrence A(F(ī)) that is stored in a dense collection is replaced by the corresponding dense data structure ADNSlab(E, D), where D indicates the number of its access pattern, E is a normalized index within that access pattern, and lab denotes the label of the dense collection.

For example, the following conversion may occur for dense storage of a band [12]:

DO J = -b1, b2
  DO I = MAX(1, 1-J), MIN(M, N-J)
    Y(I) = Y(I) + A(I, J+I) * X(J+I)
  ENDDO
ENDDO
        ↓
DO J = -b1, b2
  DO I = MAX(1, 1-J), MIN(M, N-J)
    Y(I) = Y(I) + ADNS1(I - MAX(1, 1-J) + 1, b2 + 1 - J) * X(J+I)
  ENDDO
ENDDO

In the resulting storage scheme, elements along the access patterns are up-justified, as is illustrated below for a 4 × 4 tridiagonal matrix A having b1 = b2 = 1.
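With the diagonals numbered as in the previous fragment (an assumption of this sketch), the pool ADNS1(4,3) holds the superdiagonal, the main diagonal, and the subdiagonal as up-justified columns, unused positions being marked '-':

ADNS1:   a12   a11   a21
         a23   a22   a32
         a34   a33   a43
          -    a44    -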

However, layouts that use either the normalized index E = I+J or E = I can also be used, yielding more sophisticated storage schemes with regular properties [22], [32].

6.4 Expansion

Expansion of the pth access pattern in a collection into a dense vector AP of appropriate length, in which all elements are initialized to zero, is performed with the scatter operation SCAT. The value of each entry in the sparse data structure is assigned to the corresponding element in AP by means of a normalized index E. During the scatter, each Eth bit in an initially reset bit-vector BIT is also set, to support a so-called switch that provides information about the nonzero structure [32].

Within the scope of an expansion, each occurrence A(F(ī)) along the expanded access pattern is replaced by AP(E). Creation or cancellation along the expanded access pattern is performed directly if required, which can be tested on BIT. If changes may have occurred, the actual values are stored back with a gather operation afterwards. Used elements of AP and BIT are reset during this gather operation to support subsequent expansions.³
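A sketch of such a scatter for (D1) is shown below (simplified: the two trailing parameters that appear in the calls to the compiler-supplied SCATD1 are omitted here):

      SUBROUTINE SCAT(AVAL, AIND, LOW, HIGH, AP, BIT)
C     Expand one access pattern, stored at locations LOW..HIGH,
C     into the dense vector AP, and set the switch bits.
      REAL AVAL(*), AP(*)
      INTEGER AIND(*), BIT(*), LOW, HIGH, AD, E
      DO AD = LOW, HIGH
        E = AIND(AD)
        AP(E)  = AVAL(AD)
        BIT(E) = 1
      ENDDO
      RETURN
      END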

An example is shown below, where encapsulation is applied for an implicitly sparse matrix B and expansion is applied to each row of another implicitly sparse matrix A. Both matrices are stored row-wise in (D1):

DO I = 1, M
  DO J = 1, N
    A(I,J) = A(I,J) + B(I,J)
  ENDDO
ENDDO
        ↓

3. Resetting BIT is not necessary if the multiple switch technique [28], [32] is used with the value of D, although this increases the storage requirements of vector BIT.



DO I = 1, M
  CALL SCATD1(AVAL, AIND, ALOW(I), AHIGH(I), AP, BIT, 1, 0)
  DO AD = BLOW(I), BHIGH(I)
    J = BIND(AD)
    IF (BIT(J) = 0) THEN
      CALL INSD1(AVAL, AIND, ALOW, AHIGH, I, ANP, ASZ, ALST, J, 0.0)
      BIT(J) = 1
    ENDIF
    AP(J) = AP(J) + BVAL(AD)
  ENDDO
  CALL GATHD1(AVAL, AIND, ALOW(I), AHIGH(I), AP, BIT, 1, 0)
ENDDO

If sparse BLAS subroutines [18] are available, the compiler can use these primitives for scatter and gather operations instead of the primitives presented above.
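For instance, using the Level 1 routines SSCTR and SGTHRZ of [18], the expansion and compression of one access pattern could be expressed as follows (a sketch that only applies when the nonzero structure cannot change, so that no insertions are needed and the switch vector BIT can be dispensed with):

C     Scatter the NZ entries of the Ith access pattern into AP,
C     and afterwards gather them back while resetting the used
C     elements of AP to zero.
      NZ = AHIGH(I) - ALOW(I) + 1
      CALL SSCTR(NZ, AVAL(ALOW(I)), AIND(ALOW(I)), AP)
C     ... operate on the dense vector AP ...
      CALL SGTHRZ(NZ, AP, AVAL(ALOW(I)), AIND(ALOW(I)))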

Code generation for all remaining occurrences of implicitly sparse matrices is based on the following observations. A guard 'F(ī) ∈ E(A)' is true for an occurrence for which dense storage has been selected. For an occurrence along an expanded access pattern, the guard is evaluated with the test 'BIT(E) = 1'. For all other cases, the guard is evaluated as follows:

AD = LKPD1(AIND, ALOW(D), AHIGH(D), E)
IF (AD ≠ ⊥) ...

If the guard can be hoisted, the corresponding test is generated at the appropriate level. The AD-lookup is obtained as a 'side effect' of guard evaluation if the test succeeds. An occurrence at the right-hand side is replaced by AVAL(AD). For a left-hand side occurrence, either AVAL(AD) or an insertion or deletion construct is generated, depending on the value of the right-hand side expression under the conservative assumptions presented in the first phase. For example, for two implicitly sparse matrices A and B, the following code results for statement 'A(I,J) = A(I,J) * B(I,K)' if guard encapsulation cannot be performed (if the entry of B is absent, the product becomes zero, so the entry of A is deleted):

AD1 = LKPD1(AIND, ALOW(D1), AHIGH(D1), E1)
IF (AD1 ≠ ⊥) THEN
  AD2 = LKPD1(BIND, BLOW(D2), BHIGH(D2), E2)
  IF (AD2 ≠ ⊥) THEN
    AVAL(AD1) = AVAL(AD1) * BVAL(AD2)
  ELSE
    CALL DELD1(AIND, ALOW(D1), AHIGH(D1), ALST, AD1)
  ENDIF
ENDIF

If, as another example, the occurrence of B is along an expanded access pattern, then the test on AD2 and the operand BVAL(AD2) are replaced by 'BIT(E2) = 1' and AP(E2), respectively, which eliminates the second lookup.

It is clear that the test and lookup overhead in this kind of code is substantial. It is therefore very important to apply guard encapsulation or expansion as much as possible. However, in cases where this is not possible for all occurrences, the compiler uses the code generation explained in this section as a last resort and, by preference, only in the least executed parts of the program.

Again, strategies to control such decisions are currently under research.

7 QUANTITATIVE AND QUALITATIVE COMPARISONS

In this section, some examples of sparse codes that can be generated automatically are given. Furthermore, we make some statements about the efficiency of these codes.

7.1 LU-Factorization

Consider the following version of LU-factorization, in which the factorization A = LU is computed such that L and U are a unit lower triangular and an upper triangular matrix, respectively. The array A, used as enveloping data structure of an implicitly sparse matrix A, initially contains the elements of A, but is overwritten with elements of L and U. For the sake of clarity and simplicity, pivoting for stability is not performed (which is possible for, e.g., diagonally dominant matrices). We distinguish between the different occurrences of A using subscripts:

      REAL A(N,N)
C-SPARSE(A)

      DO I = 1, N-1
        DO J = I+1, N
S1:       A1(J,I) = A1(J,I) / A2(I,I)
          DO K = I+1, N
S2:         A3(J,K) = A3(J,K) - A4(J,I) * A5(I,K)
          ENDDO
        ENDDO
      ENDDO

First, the conditions associated with the assignment statements S1 and S2 are computed: '(J,I) ∈ E(A)' and '(J,I) ∈ E(A) ∧ (I,K) ∈ E(A)', respectively. These conditions indicate the iterations of the loop that must be executed. From dependence analysis on the clean dense code, the compiler may decide to perform a triangular loop interchange [7], [41] on the two outermost loops to increase the number of row-wise access patterns.

Thereafter, '(J,I) ∈ E(A)' can be hoisted from the K-loop, and can be collapsed with the same guard of S1. Conceptually, code results in which the dominating guard '(J,I) ∈ E(A)' has been lifted to avoid the notational need for an additional branch, and in which we use the convention that A'[⊥] = 0.




For a general sparse matrix A, sparse row-wise storage seems to be appropriate, forming enveloping access patterns of the access patterns of A1, A3, and A5. This storage scheme favors scatter and gather operations for the third occurrence. Loop interchanging has increased the number of operations performed on each expanded access pattern. Moreover, encapsulation of the guard '(I,K) ∈ E(A)' is possible:

DO J = 2, N
  CALL SCATD1(AVAL, AIND, ALOW(J), AHIGH(J), AP, BIT, 1, 0)
  DO I = 1, J-1
    IF (BIT(I) .NE. 0) THEN
      AP(I) = AP(I) / AVAL(LKPD1(AIND, ALOW(I), AHIGH(I), I))
      AD = ALOW(I)
      DO WHILE (AD .LE. AHIGH(I))
        K = AIND(AD)
        IF (I+1 .LE. K) THEN
          IF (BIT(K) .EQ. 0) THEN
            AD = AD - ALOW(I)
            CALL INSD1(AVAL, AIND, ALOW, AHIGH, J, ANP, ASZ, ALST, K, 0.0)
            AD = ALOW(I) + AD
            BIT(K) = 1
          ENDIF
          AP(K) = AP(K) - AP(I) * AVAL(AD)
        ENDIF
        AD = AD + 1
      ENDDO
    ENDIF
  ENDDO
  CALL GATHD1(AVAL, AIND, ALOW(J), AHIGH(J), AP, BIT, 1, 0)
ENDDO

Although in principle more efficient codes can be obtained if more complex data structures are selected that also store the column-wise structure of the matrix [22], [32], [43], the compiler has automatically derived a sparse version with acceptable performance. For example, below we present the results of some experiments on an HP-UX 9000/720 with the original dense version and the automatically derived sparse version, both compiled with optimizations enabled. The experiments have been performed with the D(n, c)-class of matrices [32] for c = 5 and several values of n. The nonzero structure before and after factorization of matrix D(100, 5) is illustrated in Fig. 4. In Table 3, the CPU-time required for the dense and sparse versions (T_d and T_s), as well as the number of entries before and after factorization, are shown.

TABLE 3
LU-FACTORIZATION ON AN HP-UX 9000/720

Matrix     Number of entries   T_d     T_s
D(200,5)    855 → 2197          0.61   0.02
D(300,5)   1255 → 3297          2.35   0.03
D(400,5)   1655 → 4397         12.81   0.07
D(500,5)   2055 → 5497         34.30   0.10

A substantial improvement with respect to the dense version is already obtained for relatively small matrices. Although in the first place these savings should be interpreted as an argument for the use of sparse techniques rather than dense methods, the savings also motivate the use of compiler support for sparse matrices, because it enables programmers who are not familiar with sparse matrix techniques to develop efficient programs in a relatively easy way.

Fig. 4. Factorization of D(100, 5).

Obviously, if more peculiarities of the data operated on are accounted for, even more improvements may be expected. Consider, for instance, the following fragment, consisting of the previously discussed factorization code together with code for forward and back substitution to solve a system Ax̄ = b̄, using an array B that is initialized to the right-hand side. Now suppose that the compiler is informed about the fact that A is tridiagonal, i.e., a_ij ≠ 0 implies |i − j| ≤ 1. In this case, the compiler can eliminate parts of the code performing useless operations (see [12], [13], [14] for more details), to account for the fact that only iterations for which the loop conditions hold have to be executed:

DO J = 2, N
  DO I = 1, J-1
    A(J,I) = A(J,I) / A(I,I)
    DO K = I+1, N
      A(J,K) = A(J,K) - A(J,I) * A(I,K)
    ENDDO
  ENDDO
ENDDO
DO I = 2, N
  DO J = 1, I-1
    B(I) = B(I) - A(I,J) * B(J)
  ENDDO
ENDDO
B(N) = B(N) / A(N,N)
DO I = N-1, 1, -1
  DO J = I+1, N
    B(I) = B(I) - A(I,J) * B(J)
  ENDDO
  B(I) = B(I) / A(I,I)
ENDDO

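Because for a tridiagonal matrix the conditions can only hold for I = J−1 in the factorization (with K = I+1 = J in the innermost loop) and for J = I∓1 in the two substitution loops, and no fill-in arises outside the three diagonals, eliminating all useless iterations reduces the fragment to the following form (a reconstruction of this intermediate step, still operating on the two-dimensional enveloping array):

DO J = 2, N
  A(J,J-1) = A(J,J-1) / A(J-1,J-1)
  A(J,J) = A(J,J) - A(J,J-1) * A(J-1,J)
ENDDO
DO I = 2, N
  B(I) = B(I) - A(I,I-1) * B(I-1)
ENDDO
B(N) = B(N) / A(N,N)
DO I = N-1, 1, -1
  B(I) = B(I) - A(I,I+1) * B(I+1)
  B(I) = B(I) / A(I,I)
ENDDO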




If three full-sized arrays AD1, AD2, and AD3 are used to store the three remaining diagonals, the following code can be derived automatically. In addition, we present a hand-coded version for solving a tridiagonal system of linear equations found in [30]. In this version, three arrays A, B, and C are used to store the diagonals (where we set A(1) = C(N) = 0.0), array D contains the right-hand side, while array X is used to store the solution:

automatically generated code:
DO J = 2, N
  AD1(J-1) = AD1(J-1) / AD2(J-1)
  AD2(J) = AD2(J) - AD1(J-1) * AD3(J-1)
ENDDO
DO I = 2, N
  B(I) = B(I) - AD1(I-1) * B(I-1)
ENDDO
B(N) = B(N) / AD2(N)
DO I = N-1, 1, -1
  B(I) = B(I) - AD3(I) * B(I+1)
  B(I) = B(I) / AD2(I)
ENDDO

hand-coded:
BETA(1) = B(1)
Y(1) = D(1)
DO I = 2, N
  AMULT = A(I) / BETA(I-1)
  BETA(I) = B(I) - AMULT * C(I-1)
  Y(I) = D(I) - AMULT * Y(I-1)
ENDDO
X(N) = Y(N) / BETA(N)
DO J = 2, N
  I = N-J+1
  X(I) = (Y(I) - C(I) * X(I+1)) / BETA(I)
ENDDO

Not surprisingly, experiments on an HP-UX 9000/720 reveal that both versions have identical performance, indicating that code which has been highly optimized for a particular instance of a problem can be obtained automatically from a general implementation solving that problem.

Consider, as an illustration of the potential difference in performance that may result, the following fragment, in which the operation c̄ ← A·b̄ is followed by an accumulation of particular elements of an implicitly sparse matrix A:

    DO I = 1, M
      DO J = 1, N
S1:     C(I) = C(I) + A(I,J) * B(J)
      ENDDO
    ENDDO
    DO J = 1, N/2
      DO I = 1, M/2
S2:     ACC = ACC + A(2*I, 2*J)
      ENDDO
    ENDDO

Since S1 and S2 induce row-wise and column-wise access patterns through A, respectively, only one guard can be encapsulated after selecting either sparse row-wise or column-wise storage. Selection of row-wise storage in (D1) yields the following code:

row-wise storage of A:
    DO I = 1, M
      DO AD = ALOW(I), AHIGH(I)
        J = AIND(AD)
S1:     C(I) = C(I) + AVAL(AD) * B(J)
      ENDDO
    ENDDO
    DO J = 1, N/2
      DO I = 1, M/2
        AD = LKPD1(AIND, ALOW(2*I), AHIGH(2*I), 2*J)
S2:     IF (AD ≠ ⊥) ACC = ACC + AVAL(AD)
      ENDDO
    ENDDO

Likewise, selection of column-wise storage gives rise to the generation of the following code:

column-wise storage of A:
    DO I = 1, M
      DO J = 1, N
        AD = LKPD1(AIND, ALOW(J), AHIGH(J), I)
S1:     IF (AD ≠ ⊥) C(I) = C(I) + AVAL(AD) * B(J)
      ENDDO
    ENDDO
    DO J = 1, N/2
      DO AD = ALOW(2*J), AHIGH(2*J)
        I = AIND(AD)
S2:     IF (MOD(I, 2) = 0) ACC = ACC + AVAL(AD)
      ENDDO
    ENDDO

Unfortunately, lookups are required for the occurrences whose access patterns conflict with the storage organization. Obviously, each lookup induces substantial overhead, because in the worst case all entries in a whole row or column must be scanned in order to obtain the address of an entry, or to conclude that the element to be fetched is not an entry. Furthermore, no reduction in the number of iterations is obtained in case lookups are required. For example, S1 is executed M·N times in the second fragment, but only NZ times in the first, where NZ indicates the total number of entries in A.

However, interchanging the loops that surround S2 enables generation of the following code:

row-wise storage of A:
    DO I = 1, M
      DO AD = ALOW(I), AHIGH(I)
        J = AIND(AD)
S1:     C(I) = C(I) + AVAL(AD) * B(J)
      ENDDO
    ENDDO
    DO I = 1, M/2
      DO AD = ALOW(2*I), AHIGH(2*I)
        J = AIND(AD)
S2:     IF (MOD(J, 2) = 0) ACC = ACC + AVAL(AD)
      ENDDO
    ENDDO

In Table 4, we present some timings of the two versions of the previous section and the reshaped variant on one CPU of a CRAY C98/4256, for some matrices of the Harwell-Boeing Sparse Matrix Collection [23] in the appropriate storage format. The row-wise variant is preferable over the column-wise variant, because the lookup is executed less frequently in the first fragment than in the second, namely ¼·M·N and M·N times, respectively. However, the reshaped version is clearly superior, due to the elimination of all lookups.

TABLE 4
EXECUTION TIME IN SECONDS (CRAY C98/4256)

Matrix     Row   Column   Reshaped
gre-1107   0.4    1.5     1.7·10⁻³
jagmesh1   0.3    1.1     2.1·10⁻³
orani678   2.2    8.8     6.4·10⁻³
steam2     0.1    0.5     1.4·10⁻³

This experiment illustrates the most important objective in the generation of sparse codes, namely that the number of operations performed must be kept proportional to the number of entries in the sparse matrix [17], [22], [32]. Skipping operations on zeros by means of conditionals is useless, and scanning a sparse data structure to obtain an entry must be avoided as much as possible. Hence, the automatic data structure selection and sparse code generation method can only become feasible by the use of well-defined strategies accounting for this objective.

A qualitative comparison can be made using the following code performing forward substitution:

DO I = 2, N
  DO J = 1, I-1
    B(I) = B(I) - A(I,J) * B(J)
  ENDDO
ENDDO

For a general sparse matrix, we can select sparse row-wise storage of the strict lower triangular part of the matrix to avoid the test 'J ≤ I−1' after guard encapsulation. However, if column-wise storage of this region is more convenient in other parts of the program, the corresponding conflict can be resolved by reshaping the access patterns by loop interchanging, which automatically derives the outer-product formulation. Eventually, one of the following codes results, having row-wise and column-wise storage, respectively:

DO I = 2, N
  DO AD = ALOW(I), AHIGH(I)
    J = AIND(AD)
    X(I) = X(I) - AVAL(AD) * X(J)
  ENDDO
ENDDO

DO J = 1, N-1
  DO AD = ALOW(J), AHIGH(J)
    I = AIND(AD)
    X(I) = X(I) - AVAL(AD) * X(J)
  ENDDO
ENDDO

Comparing the resulting versions with hand-coded versions for row-wise and column-wise storage found in SPARK [34] (arrays al, jl, and il contain the numerical values, the column/row indices, and the pointers to the rows/columns, respectively) indicates that, apart from some standard optimizations, the same kind of code can be derived:

row-wise:
    do 150 k = 1, n
       t = 0.0
       do 100 j = il(k), il(k+1)-1
          t = t + al(j) * x(jl(j))
100    continue
       x(k) = x(k) - t
150 continue

column-wise:
    do 150 k = 1, n-1
       t = x(k)
       do 100 j = il(k), il(k+1)-1
          x(jl(j)) = x(jl(j)) - t * al(j)
100    continue
150 continue

8 CONCLUSIONS AND FUTURE RESEARCH

In this paper, we have explored a method that enables restructuring compilers to get a grip on sparse matrix computations. The method relies on the fact that information that is obscured in sparse matrix codes by indirect addressing and complicated branch structures can be presented in a much cleaner form to the compiler by corresponding dense programs. Combining this approach with standard program optimizations and new techniques like guard encapsulation and data structure selection results in a method which not only seems to be powerful enough to produce efficient sparse matrix code, but also allows enough flexibility to re-target the code to different architectures and problem instances.

Currently, these techniques are being incorporated in an existing prototype restructuring compiler, MT1 [6], [15]. More research is required in the development of conflict-resolving strategies. Other sparsity-related issues, such as the incorporation of existing reordering methods to reduce the amount of resulting fill-in, or the effective use of nonzero structure information, must also be dealt with in order to make the method successful.

ACKNOWLEDGMENTS

The authors would like to thank Peter Knijnenburg and Arnold Niessen for thoroughly proofreading this paper.

Support was provided by the Foundation for Computer Science (SION) of the Dutch Organization for Scientific Research (NWO) and the EC Esprit Agency DG XIII under Grant No. APPARC 6634 BRA III.

REFERENCES

[1] A.V. Aho, R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.
[2] V. Balasundaram, Interactive Parallelization of Numerical Scientific Programs. PhD thesis, Dept. of Computer Science, Rice Univ., 1989.
[3] V. Balasundaram, "A mechanism for keeping useful internal information in parallel programming tools: The data access descriptor," J. Parallel and Distributed Computing, vol. 9, pp. 154-170, 1990.
[4] U. Banerjee, "Unimodular transformations of double loops," Proc. Third Workshop on Languages and Compilers for Parallel Computing, 1990.
[5] U. Banerjee, Loop Transformations for Restructuring Compilers: The Foundations. Boston: Kluwer Academic Publishers, 1993.
[6] A.J.C. Bik, A Prototype Restructuring Compiler. Master's thesis, Utrecht Univ., INF/SCR-92-11, 1992.



[7] A.J.C. Bik, P.M.W. Knijnenburg, and H.A.G. Wijshoff, "Reshaping access patterns for generating sparse codes," Lecture Notes in Computer Science, no. 892, pp. 406-422. Springer-Verlag, 1995.
[8] A.J.C. Bik and H.A.G. Wijshoff, "Advanced compiler optimizations for sparse computations," Proc. Supercomputing '93, pp. 430-439, 1993.
[9] A.J.C. Bik and H.A.G. Wijshoff, "Compilation techniques for sparse matrix computations," Proc. Int'l Conf. Supercomputing, pp. 416-424, 1993.
[10] A.J.C. Bik and H.A.G. Wijshoff, "Nonzero structure analysis," Proc. Int'l Conf. Supercomputing, pp. 226-235, 1994.
[11] A.J.C. Bik and H.A.G. Wijshoff, "A note on dealing with subroutines and functions in the automatic generation of sparse codes," Technical Report no. 94-43, Dept. of Computer Science, Leiden Univ., 1994.
[12] A.J.C. Bik and H.A.G. Wijshoff, "On automatic data structure selection and code generation for sparse computations," U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds., Lecture Notes in Computer Science, no. 768, pp. 57-75. Springer-Verlag, 1994.
[13] A.J.C. Bik and H.A.G. Wijshoff, "Construction of representative simple sections," Proc. 1995 Int'l Conf. Parallel Processing, to appear.
[14] A.J.C. Bik and H.A.G. Wijshoff, "On strategies for generating sparse codes," Technical Report no. 95-01, Dept. of Computer Science, Leiden Univ., 1995.
[15] P. Brinkhaus, Compiler Analysis of Procedure Calls. Master's thesis, Utrecht Univ., INF/SCR-93-13, 1993.
[16] K.D. Cooper, M.W. Hall, and K. Kennedy, "Procedure cloning," Proc. 1992 IEEE Int'l Conf. Computer Languages, pp. 96-105, 1992.
[17] D.S. Dodson, R.G. Grimes, and J.G. Lewis, "Algorithm 692: Model implementation and test package for the sparse linear algebra subprograms," ACM Trans. Mathematical Software, vol. 17, pp. 264-272, 1991.
[18] D.S. Dodson, R.G. Grimes, and J.G. Lewis, "Sparse extensions to the FORTRAN basic linear algebra subprograms," ACM Trans. Mathematical Software, vol. 17, pp. 253-263, 1991.
[19] J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers. Soc. for Industrial and Applied Mathematics, 1991.
[20] I.S. Duff, "A sparse future," I.S. Duff, ed., Sparse Matrices and Their Uses, pp. 1-29. Academic Press, London, 1981.

[21] I.S. Duff, "Data structures, algorithms and software for sparse matrices," D.J. Evans, ed., Sparsity and Its Applications, pp. 1-29. Cambridge Univ. Press, 1985.

[22] I.S. Duff, A.M. Erisman, and J.K. Reid, Direct Methods for Sparse Matrices. Oxford Science Publications, 1990.

[23] I.S. Duff, R.G. Grimes, and J.G. Lewis, "Sparse matrix test problems," ACM Trans. Mathematical Software, vol. 15, pp. 1-14, 1989.

[24] I.S. Duff and J.K. Reid, "Some design features of a sparse matrix code," ACM Trans. Mathematical Software, pp. 18-35, 1979.

[25] J. Engelfriet, ”Attribute grammars: Attribute evaluation methods,” B. Lorho, ed., Methods and Tools for Compiler Construction, pp. 103-138, Cambridge University Press, 1984.

[26] K.A. Gallivan, B.A. Marsolf, and H.A.G. Wijshoff, "MCSPARSE: A parallel sparse unsymmetric linear system solver," Technical Report no. 1142, Center for Supercomputing Research and Development, Univ. of Illinois, 1991.

[27] A. George and J.W.H. Liu, Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981.

[28] F.G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Trans. Mathematical Software, vol. 4, pp. 250-269, 1978.

[29] W. Li and K. Pingali, "A singular loop transformation framework based on non-singular matrices," Proc. Fifth Workshop on Languages and Compilers for Parallel Computing, 1992.

[30] K.J. Mann, "Inversion of large sparse matrices: Direct methods," J. Noye, ed., Numerical Solutions of Partial Differential Equations. Amsterdam: North-Holland Publishing Company, 1982.

[31] D.A. Padua and M.J. Wolfe, "Advanced compiler optimizations for supercomputers," Comm. ACM, vol. 29, pp. 1184-1201, 1986.

[32] S. Pissanetsky, Sparse Matrix Technology. Academic Press, London, 1984.

[33] C.D. Polychronopoulos, Parallel Programming and Compilers. Boston, Mass.: Kluwer Academic Publishers, 1988.

[34] Y. Saad and H.A.G. Wijshoff, "SPARK: A benchmark package for sparse computations," Proc. 1990 Int'l Conf. Supercomputing, pp. 239-253, 1990.

[35] J.H. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman, "Run-time scheduling and execution of loops on message passing machines," J. Parallel and Distributed Computing, vol. 8, pp. 303-312, 1990.

[36] J.H. Saltz, R. Mirchandaney, and K. Crowley, "The doconsider loop," Proc. Third Int'l Conf. Supercomputing, pp. 29-40, 1989.

[37] J.H. Saltz, R. Mirchandaney, and K. Crowley, "Run-time parallelization and scheduling of loops," IEEE Trans. Computers, pp. 603-612, 1991.

[38] R.P. Tewarson, Sparse Matrices. Academic Press, New York, 1973.
[39] H.A.G. Wijshoff, "Implementing sparse BLAS primitives on concurrent/vector processors: A case study," Technical Report no. 843, Center for Supercomputing Research and Development, Univ. of Illinois, 1989.

[40] M.E. Wolf and M.S. Lam, "A loop transformation theory and an algorithm to maximize parallelism," IEEE Trans. Parallel and Distributed Systems, pp. 452-471, 1991.

[41] M.J. Wolfe, Optimizing Supercompilers for Supercomputers. London: Pitman, 1989.

[42] H. Zima, Supercompilers for Parallel and Vector Computers. New York: ACM Press, 1990.

[43] Z. Zlatev, Computational Methods for General Sparse Matrices. Kluwer Academic Publishers, 1991.

Aart J.C. Bik received the MS degree (cum laude) in computer science from Utrecht University, the Netherlands, in 1992. He is currently a PhD student at Leiden University, the Netherlands. His research interests include restructuring compilers and sparse matrix computations.

Harry A.G. Wijshoff received the MS degree (cum laude) in mathematics and computer science in 1983, and the PhD degree in 1987, from Utrecht University, the Netherlands. From 1987 until 1990, he was a visiting senior computer scientist at the Center for Supercomputing Research and Development at the University of Illinois. He was an associate professor at the Department of Computer Science, Utrecht University, until July 1992. Currently, he is a professor in the Department of Computer Science at Leiden University, the Netherlands. His current research interests include performance evaluation, sparse matrix algorithms, programming environments for parallel computers, and parallel numerical algorithm development.