
Format Abstraction for Sparse Tensor Algebra Compilers Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe

Sparse tensors are a natural way of representing real-world data

[Figure: a 3rd-order tensor of product reviews with dimensions Users (Jack, Peter, Paul, Mary, Bob, Sam, Billy, Lilly, Hilde, …), Products (Kindle, Monitor, Sweater, Laptop, Dubliners, The Iliad, Candide), and Words (quality, durable, poor); only a handful of entries are nonzero.]

Dense storage: 107 exabytes. Sparse storage: 13 gigabytes.

Many different formats for storing tensors exist

Vectors: Dense array vector, Sparse vector, Hash maps
Matrices: Dense array matrix, Coordinate matrix, CSR, CSC, DCSR, DCSC, BCSR, BCOO, CSB, ELLPACK, SELL, BELL, DIA, Block DIA, Skyline, LIL
Higher-order tensors: Dense array tensor, Coordinate tensor, CSF, Mode-generic tensor, HiCOO, F-COO

Example applications: thermal simulation, unstructured mesh simulation [Bell and Garland 2009], CNN with block-sparse weights [Gray et al. 2017], image processing, data analytics.

There is no universally superior tensor format

[Excerpt from Kjolstad et al. 2017 (Proc. ACM Program. Lang. 1, OOPSLA, Article 77), Fig. 18:]
Performance of matrix-vector multiplication on various matrices with distinct sparsity patterns using taco. The left half of each subfigure depicts the sparsity pattern of the matrix, while the right half shows the normalized storage costs and normalized average execution times (relative to the optimal format) of matrix-vector multiplication using the storage formats labeled on the horizontal axis to store the matrix. The storage format labels follow the scheme described in Section 3; for instance, DS is short for (dense_d1, sparse_d2), while SDᵀ is equivalent to (sparse_d2, dense_d1). The dense matrix input has a density of 0.95, the hypersparse matrix has a density of 2.5 × 10⁻⁵, the row-slicing and column-slicing matrices have densities of 9.5 × 10⁻³, and the thermal and blocked matrices have densities of 1.0 × 10⁻³. [The six panels cover dense, row-slicing, thermal, hypersparse, column-slicing, and blocked matrices; the annotated slowdowns of mismatched formats range from 88× to 26542×.]

...and tensor-times-vector multiplication with matrices and 3rd-order tensors of varying sparsities as inputs. The tensors are randomly generated with every component having some probability d of being non-zero, where d is the density of the tensor (i.e., the fraction of components that are non-zero, and the complement of sparsity). As Fig. 17 shows, while computing with sparse tensor storage formats incurs some performance penalty as compared to the same computation with dense formats when the inputs are highly dense, the performance penalty decreases and eventually turns into performance gain as the sparsity of the inputs increases. For the two computations we evaluate, we observe that input sparsity of as low as approximately 35% is actually sufficient to make sparse formats that compress out all zeros—including DCSR and CSF—perform better than dense formats, which further emphasizes the practicality of sparse tensor storage formats.

[Chart: normalized run time of y = Ax (scale 0.0–1.5) comparing CSR against DIA and CSR against BCSR; in one case DIA is 186× slower than CSR.]
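As a rough back-of-the-envelope illustration of why density matters (my own arithmetic, not a figure from the papers; assumptions: 8-byte values, 4-byte integer indices, an n × n matrix with nnz stored nonzeros):

$$\text{dense} = 8n^2 \ \text{bytes}, \qquad \text{CSR} \approx 12\,\mathrm{nnz} + 4(n+1)\ \text{bytes (vals + crd + pos)},$$

so CSR already takes less space once nnz is below roughly (2/3)n², i.e. at about 67% density, which lines up loosely with the approximately 35% sparsity (65% density) run-time crossover quoted above.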

[Kjolstad et al. 2017] vs. this work

[Figure: in Kjolstad et al. 2017, the code generator only understands two kinds of per-dimension metadata, dense dimensions and compressed dimensions described by pos/crd arrays. In this work, a format abstraction sits between the tensor formats and the code generator, so the same code generator can also emit code for levels described by other metadata, such as the offset and range arrays used by DIA and hashed coordinate arrays.]

Format abstraction

Storing sparse tensors efficiently requires additional metadata

[Figure: a 3×4 matrix with nonzeros A, B in row 0, C, D, E in row 1, and F in row 2.]

Stored as a dense array of 12 entries (positions 0–11), positions and coordinates can be converted back and forth with simple arithmetic:
  row(6) = 6 / 4 = 1
  col(6) = 6 % 4 = 2
  locate(1,2) = 1 * 4 + 2 = 6

If only the six nonzeros A B C D E F are stored (positions 0–5), a position by itself no longer determines the coordinates:
  row(3) = ???
  col(3) = ???
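A small sketch (my own code, not from the deck) of the position/coordinate arithmetic for the dense 3×4 layout above, and of why it breaks once only the nonzeros are kept:

#include <assert.h>

#define NCOLS 4   /* the example matrix has 4 columns */

int main(void) {
  /* Dense row-major storage: position p <-> coordinates (row, col). */
  int p = 6;
  assert(p / NCOLS == 1);       /* row(6) = 6 / 4 = 1 */
  assert(p % NCOLS == 2);       /* col(6) = 6 % 4 = 2 */
  assert(1 * NCOLS + 2 == 6);   /* locate(1,2) = 1 * 4 + 2 = 6 */

  /* If only the six nonzeros A..F are stored at positions 0..5, this arithmetic
     no longer applies: position 3 alone does not identify (row, col), so the
     format must carry coordinate metadata, as the next slides show. */
  return 0;
}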

Coordinates of tensor elements can be encoded in many ways

[Figure: the same 3×4 matrix with nonzeros A, B (row 0), C, D, E (row 1), and F (row 2); values stored at positions 0–5.]

Coordinate:
  rows: 0 0 1 1 1 2
  cols: 0 2 1 2 3 3
  vals: A B C D E F

CSR:
  pos:  0 2 5 6
  cols: 0 2 1 2 3 3
  vals: A B C D E F
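A minimal sketch (my own code, not from the deck) that reads back every (row, col, value) triple from the two encodings above, using the example arrays exactly as given:

#include <stdio.h>

int main(void) {
  /* Coordinate: one explicit row index and column index per stored value. */
  int  coo_rows[] = {0, 0, 1, 1, 1, 2};
  int  coo_cols[] = {0, 2, 1, 2, 3, 3};
  char vals[]     = {'A', 'B', 'C', 'D', 'E', 'F'};
  for (int p = 0; p < 6; p++)
    printf("COO (%d,%d) = %c\n", coo_rows[p], coo_cols[p], vals[p]);

  /* CSR: row indices are implicit; pos[i]..pos[i+1] delimits row i's entries. */
  int csr_pos[]  = {0, 2, 5, 6};
  int csr_cols[] = {0, 2, 1, 2, 3, 3};
  for (int i = 0; i < 3; i++)
    for (int p = csr_pos[i]; p < csr_pos[i + 1]; p++)
      printf("CSR (%d,%d) = %c\n", i, csr_cols[p], vals[p]);
  return 0;
}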

Computing with different formats can require very different code

A = B ∘ C

Coordinate ✕ Dense array:
for (int pB = B1_pos[0]; pB < B1_pos[1]; pB++) {
  int i = B1_crd[pB];
  int j = B2_crd[pB];
  int pC = i * N + j;
  int pA = i * N + j;
  A[pA] = B[pB] * C[pC];
}

CSR ✕ Dense array:
for (int i = 0; i < M; i++) {
  for (int pB = B2_pos[i]; pB < B2_pos[i + 1]; pB++) {
    int j = B2_crd[pB];
    int pC = i * N + j;
    int pA = i * N + j;
    A[pA] = B[pB] * C[pC];
  }
}

CSR ✕ Coordinate:
int pC1 = C1_pos[0];
while (pC1 < C1_pos[1]) {
  int i = C1_crd[pC1];
  int C1_segend = pC1 + 1;
  while (C1_segend < C1_pos[1] && C1_crd[C1_segend] == i) C1_segend++;
  int pB2 = B2_pos[i];
  int pC2 = pC1;
  while (pB2 < B2_pos[i + 1] && pC2 < C1_segend) {
    int jB2 = B2_crd[pB2];
    int jC2 = C2_crd[pC2];
    int j = min(jB2, jC2);
    int pA = i * N + j;
    if (jB2 == j && jC2 == j) A[pA] = B[pB2] * C[pC2];
    if (jB2 == j) pB2++;
    if (jC2 == j) pC2++;
  }
  pC1 = C1_segend;
}
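For contrast, a minimal sketch (mine, not shown in the deck) of the same component-wise product when both B and C are dense M × N row-major arrays, written in the same fragment style and with the same variable names as the kernels above:

for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    int p = i * N + j;   /* shared row-major position for A, B, and C */
    A[p] = B[p] * C[p];
  }
}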

Hand-coding support for a wide range of formats is infeasible

A = B ∘ C:
Coordinate ✕ Dense array, CSR ✕ Dense array, CSR ✕ Coordinate, Dense array ✕ Dense array, Coordinate ✕ Coordinate, CSR ✕ CSR, DIA ✕ DIA, DIA ✕ Dense array, DIA ✕ Coordinate, DIA ✕ CSR, ELLPACK ✕ ELLPACK, ELLPACK ✕ Dense array, ELLPACK ✕ Coordinate, ELLPACK ✕ CSR, ELLPACK ✕ DIA, BCSR ✕ BCSR, BCSR ✕ Dense array, BCSR ✕ Coordinate

A = B ∘ C ∘ D:
Dense array ✕ CSR ✕ CSR, Coordinate ✕ CSR ✕ CSR, CSR ✕ CSR ✕ CSR, Dense array ✕ Coordinate ✕ CSR, Dense array ✕ Dense array ✕ CSR, Coordinate ✕ Coordinate ✕ CSR, DIA ✕ Coordinate ✕ Dense array, DIA ✕ Coordinate ✕ CSR, DIA ✕ Dense array ✕ CSR, DIA ✕ CSR ✕ CSR, DIA ✕ Coordinate ✕ Coordinate, DIA ✕ Dense array ✕ Dense array, DIA ✕ DIA ✕ CSR, DIA ✕ DIA ✕ Coordinate, DIA ✕ DIA ✕ Dense array, DIA ✕ DIA ✕ DIA, ELLPACK ✕ ELLPACK ✕ DIA, ELLPACK ✕ CSR ✕ DIA, ELLPACK ✕ BCSR ✕ DIA

y = Ax + z:
Dense array ✕ Dense array ✕ Dense array, Dense array ✕ Dense array ✕ Sparse vector, Dense array ✕ Dense array ✕ Hash map, Dense array ✕ Sparse vector ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Hash map, Dense array ✕ Hash map ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Dense array, Coordinate ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Hash map, Coordinate ✕ Sparse vector ✕ Hash map, Coordinate ✕ Hash map ✕ Sparse vector, CSR ✕ Dense array ✕ Dense array, CSR ✕ Dense array ✕ Sparse vector, CSR ✕ Hash map ✕ Sparse vector, CSR ✕ Hash map ✕ Dense array, CSR ✕ Sparse vector ✕ Dense array, DIA ✕ Dense array ✕ Dense array, DIA ✕ Hash map ✕ Dense array, ELLPACK ✕ Dense array ✕ Sparse vector
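A rough count of the blow-up these lists illustrate (my own arithmetic, not a number from the slides): with F interchangeable formats per operand and an expression with k tensor operands, a hand-coded library needs on the order of

$$F^{k}\ \text{specialized kernels},\qquad\text{e.g. } F = 10,\ k = 3:\ 10^{3} = 1000$$

for that one expression alone, and every additional expression, level format, or operand multiplies the total again.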

[Talk overview: Format Abstraction & Code Generation; Evaluation]

[Slide thumbnails: level-format compositions such as the mode-generic tensor (Compressed, Singleton, Dense, Dense) and DIA (Dense, Range, Offset), together with Fig. 8 from the paper.]

[Excerpt from the paper (Chou, Kjolstad, and Amarasinghe, Proc. ACM Program. Lang. 2, OOPSLA, Article 123, 2018), p. 123:16:]

Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.

If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)), assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.

We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.

4.5 Code Generation Algorithm

Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11.

Format Abstraction & Code Generation
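To make the two merge strategies from the excerpt concrete, here is a minimal sketch (my own code, not taco's generated output) of an element-wise vector product z = x ∘ y under two assumptions about the operands: x compressed (pos/crd/vals arrays) with y a dense array that supports locate, versus both operands compressed and ordered, which requires co-iteration.

/* Case 1: y supports locate (dense array). Iterate x's nonzeros, locate into y.
   Cost: O(nnz(x)); neither x nor y needs to be enumerated in order. */
void mul_locate(const int *x_pos, const int *x_crd, const double *x_vals,
                const double *y, double *z /* dense, zero-initialized */) {
  for (int p = x_pos[0]; p < x_pos[1]; p++) {
    int i = x_crd[p];          /* coordinate of this nonzero of x */
    z[i] = x_vals[p] * y[i];   /* locate(i) in a dense array is just indexing */
  }
}

/* Case 2: neither operand supports locate; both are compressed and ordered.
   Co-iterate the two coordinate lists two-pointer style. Cost: O(nnz(x) + nnz(y)). */
void mul_coiterate(const int *x_pos, const int *x_crd, const double *x_vals,
                   const int *y_pos, const int *y_crd, const double *y_vals,
                   double *z /* dense, zero-initialized */) {
  int px = x_pos[0], py = y_pos[0];
  while (px < x_pos[1] && py < y_pos[1]) {
    int ix = x_crd[px], iy = y_crd[py];
    if (ix == iy) { z[ix] = x_vals[px] * y_vals[py]; px++; py++; }
    else if (ix < iy) { px++; } else { py++; }
  }
}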

Tensor formats can be viewed as compositions of level formats

[Figure: a 3rd-order example tensor with three slices and nine nonzeros A, B, C, D, E, F, G, H, J, stored by splitting its coordinate hierarchy into one level per dimension: slices, then rows, then columns.]

Slices:  Dense      (3 slices)
Rows:    Compressed pos: 0 3 5 9   crd: 0 1 1 0 1 0 0 1 1
Columns: Singleton  crd: 0 0 2 3 1 1 3 0 3
Values:  A B C D E F G H J   (positions 0–8)
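A minimal sketch (my own code, not from the deck) of walking the level arrays above to enumerate every stored component, assuming the level ordering shown (dense slices, then compressed rows, then singleton columns): a dense level contributes only its size, a compressed level contributes pos/crd arrays, and a singleton level contributes a single crd array.

#include <stdio.h>

int main(void) {
  int  slices    = 3;                            /* dense level: just a size */
  int  row_pos[] = {0, 3, 5, 9};                 /* compressed level: pos delimits each slice's entries */
  int  row_crd[] = {0, 1, 1, 0, 1, 0, 0, 1, 1};  /* ...and crd stores the row coordinate per entry */
  int  col_crd[] = {0, 0, 2, 3, 1, 1, 3, 0, 3};  /* singleton level: one column coordinate per entry */
  char vals[]    = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J'};

  for (int i = 0; i < slices; i++)                      /* iterate the dense level */
    for (int p = row_pos[i]; p < row_pos[i + 1]; p++)   /* iterate this slice's compressed segment */
      printf("B(%d,%d,%d) = %c\n", i, row_crd[p], col_crd[p], vals[p]);
  return 0;
}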

The same level formats can be composed in many ways

Level formats: Dense, Compressed, Singleton

[Figure: the 3×4 example matrix with nonzeros A, B (row 0), C, D, E (row 1), and F (row 2).]

CSR = { Dense, Compressed }
  Dense:      3 rows
  Compressed: pos: 0 2 5 6   crd (cols): 0 2 1 2 3 3
  Values:     A B C D E F

Coordinate = { Compressed, Singleton }
  Compressed: pos: 0 6   crd (rows): 0 0 1 1 1 2
  Singleton:  crd (cols): 0 2 1 2 3 3
  Values:     A B C D E F

The same level formats can be composed in many ways

Level formats: Dense, Compressed, Singleton, Hashed, Range, Offset

Tensor formats as compositions of level formats:
  CSR                  (Dense, Compressed)                    [Tinney and Walker 1967]
  Coordinate matrix    (Compressed, Singleton)
  Coordinate tensor    (Compressed, Singleton, Singleton)
  Dense array tensor   (Dense, Dense, Dense)
  BCSR                 (Dense, Compressed, Dense, Dense)      [Im and Yelick 1998]
  ELLPACK              (Dense, Dense, Singleton)              [Kincaid et al. 1989]
  CSB                  (Dense, Dense, Compressed, Singleton)  [Buluç et al. 2009]
  Mode-generic tensor  (Compressed, Singleton, Dense, Dense)  [Baskaran et al. 2012]
  DIA                  (Dense, Range, Offset)                 [Saad 2003]
  Block DIA            (Dense, Range, Offset, Dense, Dense)
  Hash map vector      (Hashed)
  Hash map matrix      (Hashed, Hashed)                       [Patwary et al. 2015]

Tensor Algebra Compiler (taco) [Kjolstad et al. 2017]

A(i,j) = Σ_k B(i,j,k) · c(k)

  A: (dense, dense)
  B: (dense, compressed, compressed)
  c: (compressed)

taco generates code dimension by dimension:

for (int i = 0; i < m; i++) {
  int pB1 = i;   /* dense first level: the position equals the coordinate */
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_crd[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_crd[pB3];
      int kc = c1_crd[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        A[pA2] += B[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit sha1_base64="9QtCk7M63G3IchN+Z5Vs/Le7raY=">AAACinicbZHfatswFMZld1u7rNvS9nI3YmGwiy1IdhPblELX7WKXLSxtITZGVpRUi/wHSS4EoYfpK+1ubzM5zWBLe0Dw4zvn0znSKRrBlUbot+fvPHv+YnfvZe/V/us3b/sHh1eqbiVlE1qLWt4URDHBKzbRXAt200hGykKw62L5tctf3zGpeF390KuGZSVZVHzOKdFOyvv3Jl1fMpWLIjNoiOIRCo8/oeE4TBCOHSAchXFgv+SG/7QWnsJUtWVultZsWfEoPE4C5wijMQ6RgzhOUIDteWddOm9KZ7WGW7YIjYM46Rqtw0GEwygeWUNzZ7J5f/A3Bx8D3sAAbOIi7/9KZzVtS1ZpKohSU4wanRkiNaeC2V7aKtYQuiQLNnVYkZKpzKxnsvCDU2ZwXkt3Kg3X6r8OQ0qlVmXhKkuib9V2rhOfyk1bPY8zw6um1ayiD43mrYC6ht1e4IxLRrVYOSBUcjcrpLdEEqrd9nruE/D2kx/DVTDEji/R4Ox88x174B14Dz4CDCJwBr6DCzAB1Nv1PntjL/L3/cBP/JOHUt/beI7Af+F/+wOGoLrq</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit 
sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit>

taco generates code dimension by dimension

18

for (int i = 0; i < m; i++) { for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) { int j = B2_idx[pB2]; int pA2 = (i * n) + j; int pB3 = B3_pos[pB2]; int pc1 = c1_pos[0]; while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) { int kB = B3_crd[pB3]; int kc = c1_crd[pc1]; int k = min(kB, kc); if (kB == k && kc == k) { A[pA2] += B[pB3] * c[pc1]; } if (kB == k) pB3++; if (kc == k) pc1++; } } }

Aij =X

k

Bijk · ck<latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="ck8pdC+ekZH4nUmSP+ZG7r8lEyk=">AAAB2XicbZDNSgMxFIXv1L86Vq1rN8EiuCozbnQpuHFZwbZCO5RM5k4bmskMyR2hDH0BF25EfC93vo3pz0JbDwQ+zknIvSculLQUBN9ebWd3b/+gfugfNfzjk9Nmo2fz0gjsilzl5jnmFpXU2CVJCp8LgzyLFfbj6f0i77+gsTLXTzQrMMr4WMtUCk7O6oyaraAdLMW2IVxDC9YaNb+GSS7KDDUJxa0dhEFBUcUNSaFw7g9LiwUXUz7GgUPNM7RRtRxzzi6dk7A0N+5oYkv394uKZ9bOstjdzDhN7Ga2MP/LBiWlt1EldVESarH6KC0Vo5wtdmaJNChIzRxwYaSblYkJN1yQa8Z3HYSbG29D77odOn4MoA7ncAFXEMIN3MEDdKALAhJ4hXdv4r15H6uuat66tDP4I+/zBzjGijg=</latexit><latexit sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit 
sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit sha1_base64="9QtCk7M63G3IchN+Z5Vs/Le7raY=">AAACinicbZHfatswFMZld1u7rNvS9nI3YmGwiy1IdhPblELX7WKXLSxtITZGVpRUi/wHSS4EoYfpK+1ubzM5zWBLe0Dw4zvn0znSKRrBlUbot+fvPHv+YnfvZe/V/us3b/sHh1eqbiVlE1qLWt4URDHBKzbRXAt200hGykKw62L5tctf3zGpeF390KuGZSVZVHzOKdFOyvv3Jl1fMpWLIjNoiOIRCo8/oeE4TBCOHSAchXFgv+SG/7QWnsJUtWVultZsWfEoPE4C5wijMQ6RgzhOUIDteWddOm9KZ7WGW7YIjYM46Rqtw0GEwygeWUNzZ7J5f/A3Bx8D3sAAbOIi7/9KZzVtS1ZpKohSU4wanRkiNaeC2V7aKtYQuiQLNnVYkZKpzKxnsvCDU2ZwXkt3Kg3X6r8OQ0qlVmXhKkuib9V2rhOfyk1bPY8zw6um1ayiD43mrYC6ht1e4IxLRrVYOSBUcjcrpLdEEqrd9nruE/D2kx/DVTDEji/R4Ox88x174B14Dz4CDCJwBr6DCzAB1Nv1PntjL/L3/cBP/JOHUt/beI7Af+F/+wOGoLrq</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit 
sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit>

taco generates code dimension by dimension

18

for (int i = 0; i < m; i++) { for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) { int j = B2_idx[pB2]; int pA2 = (i * n) + j; int pB3 = B3_pos[pB2]; int pc1 = c1_pos[0]; while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) { int kB = B3_crd[pB3]; int kc = c1_crd[pc1]; int k = min(kB, kc); if (kB == k && kc == k) { A[pA2] += B[pB3] * c[pc1]; } if (kB == k) pB3++; if (kc == k) pc1++; } } }

Aij =X

k

Bijk · ck<latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="ck8pdC+ekZH4nUmSP+ZG7r8lEyk=">AAAB2XicbZDNSgMxFIXv1L86Vq1rN8EiuCozbnQpuHFZwbZCO5RM5k4bmskMyR2hDH0BF25EfC93vo3pz0JbDwQ+zknIvSculLQUBN9ebWd3b/+gfugfNfzjk9Nmo2fz0gjsilzl5jnmFpXU2CVJCp8LgzyLFfbj6f0i77+gsTLXTzQrMMr4WMtUCk7O6oyaraAdLMW2IVxDC9YaNb+GSS7KDDUJxa0dhEFBUcUNSaFw7g9LiwUXUz7GgUPNM7RRtRxzzi6dk7A0N+5oYkv394uKZ9bOstjdzDhN7Ga2MP/LBiWlt1EldVESarH6KC0Vo5wtdmaJNChIzRxwYaSblYkJN1yQa8Z3HYSbG29D77odOn4MoA7ncAFXEMIN3MEDdKALAhJ4hXdv4r15H6uuat66tDP4I+/zBzjGijg=</latexit><latexit sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit 
sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit sha1_base64="9QtCk7M63G3IchN+Z5Vs/Le7raY=">AAACinicbZHfatswFMZld1u7rNvS9nI3YmGwiy1IdhPblELX7WKXLSxtITZGVpRUi/wHSS4EoYfpK+1ubzM5zWBLe0Dw4zvn0znSKRrBlUbot+fvPHv+YnfvZe/V/us3b/sHh1eqbiVlE1qLWt4URDHBKzbRXAt200hGykKw62L5tctf3zGpeF390KuGZSVZVHzOKdFOyvv3Jl1fMpWLIjNoiOIRCo8/oeE4TBCOHSAchXFgv+SG/7QWnsJUtWVultZsWfEoPE4C5wijMQ6RgzhOUIDteWddOm9KZ7WGW7YIjYM46Rqtw0GEwygeWUNzZ7J5f/A3Bx8D3sAAbOIi7/9KZzVtS1ZpKohSU4wanRkiNaeC2V7aKtYQuiQLNnVYkZKpzKxnsvCDU2ZwXkt3Kg3X6r8OQ0qlVmXhKkuib9V2rhOfyk1bPY8zw6um1ayiD43mrYC6ht1e4IxLRrVYOSBUcjcrpLdEEqrd9nruE/D2kx/DVTDEji/R4Ox88x174B14Dz4CDCJwBr6DCzAB1Nv1PntjL/L3/cBP/JOHUt/beI7Af+F/+wOGoLrq</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit 
sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit>

taco generates code dimension by dimension

19

Hand-coding support for a wide range of level formats is also infeasible

[Slide figure: a sampling of kernel variants that would each need hand-written code for different level-format combinations, e.g. Dense Compressed × Dense Compressed (+), Compressed × Hashed, Compressed + Singleton, Dense + Range, Compressed + Offset, Dense Compressed × Hashed, Compressed × Hashed Singleton, Compressed Singleton + Dense, ...]

20

Code generation is performed in two stages: a high-level algorithm is constructed first and then specialized into runnable code

Running example: Compressed × Hashed

The two stages together address how to compute with multiple operands and how to compute with the operands' different data structures

21

Tensor algebra computations can be expressed in terms of high-level operations on tensor operands

[Slide figure: a collection of tensor operands (A, B, C, D, ...) combined by high-level operations such as element-wise multiplication (∘)]

22

Level formats declare whether they support various high-level operations

[Slide table: level formats (Dense, Compressed, Hashed, Singleton, Range, Offset) against the high-level operations they support (random access and iteration); per the following slides, Dense and Hashed support random access, while Compressed and Singleton support iteration but not random access]
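To make this concrete, the following is a minimal illustrative sketch of a level format declaring its capabilities. It is not the actual taco interface; the struct and field names are assumptions made purely for illustration, with the values for Dense, Compressed, and Hashed following the slides (hashed levels enumerate their coordinates out of order).

/* Illustrative sketch, not the taco API: a level format advertises which
 * high-level operations it supports, so the compiler can reason about it
 * without knowing the concrete data structure. */
#include <stdbool.h>

typedef struct {
    const char *name;
    bool supports_random_access;  /* can locate a given coordinate directly */
    bool supports_iteration;      /* can enumerate its stored coordinates   */
    bool ordered;                 /* coordinates are enumerated in order    */
} LevelFormat;

static const LevelFormat DenseLevel      = {"dense",      true,  true, true };
static const LevelFormat CompressedLevel = {"compressed", false, true, true };
static const LevelFormat HashedLevel     = {"hashed",     true,  true, false};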

23

Compiler constructs efficient algorithm by reasoning about whether operands support required high-level operations

Example: B ∘ C

Compressed × Hashed: iterate over B and random access C
Compressed × Singleton: simultaneously iterate over B and C

[Slide figure: the level formats (Dense, Compressed, Hashed, Singleton, Range, Offset), annotated with which of them support random access]
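The co-iteration case can be made concrete with a minimal sketch. The helper below is hypothetical (the function and array names are assumptions, not verbatim taco output); it shows the two-pointer intersection merge over two sorted coordinate lists that the compiler falls back to when neither operand supports random access, mirroring the merge loops in the generated code shown earlier.

/* Sketch only: intersection merge of row i of two operands stored with
 * sorted coordinate lists (pos/crd arrays), writing into a dense output. */
void coiterate_row(const int *B_pos, const int *B_crd, const double *B,
                   const int *C_pos, const int *C_crd, const double *C,
                   int i, int n, double *A) {
  int pB = B_pos[i];
  int pC = C_pos[i];
  while (pB < B_pos[i + 1] && pC < C_pos[i + 1]) {
    int jB = B_crd[pB];
    int jC = C_crd[pC];
    int j = jB < jC ? jB : jC;         /* next candidate coordinate       */
    if (jB == j && jC == j) {
      A[i * n + j] = B[pB] * C[pC];    /* both operands are nonzero at j  */
    }
    if (jB == j) pB++;                 /* advance whichever operands match */
    if (jC == j) pC++;
  }
}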

24

Level formats also specify how they support high-level operations

Random access

Dense:
int pB2 = pB1 * N + j;

Hashed:
int pB2 = j % W + pB1 * W;
if (crd[pB2] != j && crd[pB2] != -1) {
  int end = pB2;
  do {
    pB2 = (pB2 + 1) % W;
  } while (crd[pB2] != j && crd[pB2] != -1 && pB2 != end);
}
if (crd[pB2] == j) {

Iteration

Dense:
for (int j = 0; j < N; j++) {
  int pB2 = pB1 * N + j;

Compressed:
for (int pB2 = pos[pB1]; pB2 < pos[pB1+1]; pB2++) {
  int j = crd[pB2];

Hashed:
for (int pB2 = pB1 * W; pB2 < (pB1 + 1) * W; pB2++) {
  int j = crd[pB2];
  if (j != -1) {

...

25

Compiler specializes constructed algorithm to operand formats by inlining code that implements required high-level operations

B ∘ C (Compressed × Hashed)

High-level algorithm:
for every element b in B:
  find corresponding element c in C
  A[i][j] = b * c;

After inlining the compressed iteration over B:
for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
  int j = B2_crd[pB2];
  find corresponding element c in C
  A[i][j] = B[pB2] * c;
}

After also inlining the hashed random access into C:
for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
  int j = B2_crd[pB2];
  int pC2 = j % W + pC1 * W;
  if (C2_crd[pC2] != j && C2_crd[pC2] != -1) {
    int end = pC2;
    do {
      pC2 = (pC2 + 1) % W;
    } while (C2_crd[pC2] != j && C2_crd[pC2] != -1 && pC2 != end);
  }
  if (C2_crd[pC2] == j) {
    A[i][j] = B[pB2] * C[pC2];
  }
}

26

The same process can be repeated dimension by dimension

Aijk = Bijk + Cijk

int iB = 0; int C0_pos = C0_pos_arr[0]; while (C0_pos < C0_pos_arr[1]) { int iC = C0_idx_arr[C0_pos]; int C0_end = C0_pos + 1; if (iC == iB) while ((C0_end < C0_pos_arr[1]) && (C0_idx_arr[C0_end] == iB)) { C0_end++; } if (iC == iB) { int B1_pos = B1_pos_arr[iB]; int C1_pos = C0_pos; while ((B1_pos < B1_pos_arr[iB + 1]) && (C1_pos < C0_end)) { int jB = B1_idx_arr[B1_pos]; int jC = C1_idx_arr[C1_pos]; int j = min(jB, jC); int A1_pos = (iB * A1_size) + j; int C1_end = C1_pos + 1; if (jC == j) while ((C1_end < C0_end) && (C1_idx_arr[C1_end] == j)) { C1_end++; } if ((jB == j) && (jC == j)) { int B2_pos = B2_pos_arr[B1_pos]; int C2_pos = C1_pos; while ((B2_pos < B2_pos_arr[B1_pos + 1]) && (C2_pos < C1_end)) { int kB = B2_idx_arr[B2_pos]; int kC = C2_idx_arr[C2_pos]; int k = min(kB, kC); int A2_pos = (A1_pos * A2_size) + k; if ((kB == k) && (kC == k)) { A_val_arr[A2_pos] = B_val_arr[B2_pos] + C_val_arr[C2_pos]; } else if (kB == k) { A_val_arr[A2_pos] = B_val_arr[B2_pos]; } else { A_val_arr[A2_pos] = C_val_arr[C2_pos]; } if (kB == k) B2_pos++; if (kC == k) C2_pos++; } while (B2_pos < B2_pos_arr[B1_pos + 1]) { int kB0 = B2_idx_arr[B2_pos]; int A2_pos0 = (A1_pos * A2_size) + kB0; A_val_arr[A2_pos0] = B_val_arr[B2_pos]; B2_pos++; } while (C2_pos < C1_end) { int kC0 = C2_idx_arr[C2_pos]; int A2_pos1 = (A1_pos * A2_size) + kC0; A_val_arr[A2_pos1] = C_val_arr[C2_pos]; C2_pos++; } } else if (jB == j) { for (int B2_pos0 = B2_pos_arr[B1_pos]; B2_pos0 < B2_pos_arr[B1_pos + 1]; B2_pos0++) { int kB1 = B2_idx_arr[B2_pos0]; int A2_pos2 = (A1_pos * A2_size) + kB1; A_val_arr[A2_pos2] = B_val_arr[B2_pos0]; } } else { for (int C2_pos0 = C1_pos; C2_pos0 < C1_end; C2_pos0++) { int kC1 = C2_idx_arr[C2_pos0]; int A2_pos3 = (A1_pos * A2_size) + kC1; A_val_arr[A2_pos3] = C_val_arr[C2_pos0]; } } if (jB == j) B1_pos++; if (jC == j) C1_pos = C1_end; }

while (B1_pos < B1_pos_arr[iB + 1]) { int jB0 = B1_idx_arr[B1_pos]; int A1_pos0 = (iB * A1_size) + jB0; for (int B2_pos1 = B2_pos_arr[B1_pos]; B2_pos1 < B2_pos_arr[B1_pos + 1]; B2_pos1++) { int kB2 = B2_idx_arr[B2_pos1]; int A2_pos4 = (A1_pos0 * A2_size) + kB2; A_val_arr[A2_pos4] = B_val_arr[B2_pos1]; } B1_pos++; } while (C1_pos < C0_end) { int jC0 = C1_idx_arr[C1_pos]; int A1_pos1 = (iB * A1_size) + jC0; int C1_end0 = C1_pos + 1; while ((C1_end0 < C0_end) && (C1_idx_arr[C1_end0] == jC0)) { C1_end0++; } for (int C2_pos1 = C1_pos; C2_pos1 < C1_end0; C2_pos1++) { int kC2 = C2_idx_arr[C2_pos1]; int A2_pos5 = (A1_pos1 * A2_size) + kC2; A_val_arr[A2_pos5] = C_val_arr[C2_pos1]; } C1_pos = C1_end0; } } else { for (int B1_pos0 = B1_pos_arr[iB]; B1_pos0 < B1_pos_arr[iB + 1]; B1_pos0++) { int jB1 = B1_idx_arr[B1_pos0]; int A1_pos2 = (iB * A1_size) + jB1; for (int B2_pos2 = B2_pos_arr[B1_pos0]; B2_pos2 < B2_pos_arr[B1_pos0 + 1]; B2_pos2++) { int kB3 = B2_idx_arr[B2_pos2]; int A2_pos6 = (A1_pos2 * A2_size) + kB3; A_val_arr[A2_pos6] = B_val_arr[B2_pos2]; } } } if (iC == iB) C0_pos = C0_end; iB++; } while (iB < B0_size) { for (int B1_pos1 = B1_pos_arr[iB]; B1_pos1 < B1_pos_arr[iB + 1]; B1_pos1++) { int jB2 = B1_idx_arr[B1_pos1]; int A1_pos3 = (iB * A1_size) + jB2; for (int B2_pos3 = B2_pos_arr[B1_pos1]; B2_pos3 < B2_pos_arr[B1_pos1 + 1]; B2_pos3++) { int kB4 = B2_idx_arr[B2_pos3]; int A2_pos7 = (A1_pos3 * A2_size) + kB4; A_val_arr[A2_pos7] = B_val_arr[B2_pos3]; } } iB++; }


27

Evaluation

Mode-generic tensor: Compressed, Singleton, Dense, Dense
DIA: Dense, Range, Offset
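The compositions above can be read as ordered lists of per-level formats. The sketch below illustrates that reading; it is not the actual taco API, and the type and constant names are assumptions made for illustration only.

/* Illustrative sketch, not the taco API: a tensor format is an ordered
 * list of level formats, outermost level first. */
typedef enum { DENSE, COMPRESSED, SINGLETON, HASHED, RANGE, OFFSET } LevelKind;

typedef struct {
    int num_levels;        /* number of stored levels              */
    LevelKind levels[4];   /* level format of each level, in order */
} TensorFormat;

/* The two formats named on this slide, expressed as compositions. */
static const TensorFormat ModeGenericTensor = {4, {COMPRESSED, SINGLETON, DENSE, DENSE}};
static const TensorFormat DiaMatrix         = {3, {DENSE, RANGE, OFFSET}};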


[Figure 8: two flowcharts. The left flowchart selects a merge strategy based on the questions "x supports locate?", "y supports locate?", and "x unordered and y ordered?", leading to one of three outcomes: co-iterate over x and y, iterate over x and locate into y, or iterate over y and locate into x. The right flowchart gives, for each operand, the iterator conversions needed at runtime: reorder x/y or aggregate duplicates in x/y, depending on whether x/y are ordered or accessed with locate, whether they are unique, whether we are co-iterating over x and y, and whether the output is unordered or supports insert.]

Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.

If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)) assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.
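A minimal sketch of this iterate-and-locate method, assuming x is a compressed sparse vector and y is a dense array, is shown below. The function and array names are illustrative assumptions, not code from the paper; locate into the dense array is a plain index computation, so the loop runs in O(nnz(x)) time and never enumerates y.

/* Sketch only: intersection merge z = x .* y, where x is a compressed
 * sparse vector (pos/crd/val arrays) and y is a dense array that
 * supports locate. The result keeps the sparsity structure of x. */
void sparse_times_dense(const int *x_pos, const int *x_crd, const double *x_val,
                        const double *y, double *z_val) {
  for (int px = x_pos[0]; px < x_pos[1]; px++) {
    int i = x_crd[px];             /* coordinate of this nonzero of x     */
    z_val[px] = x_val[px] * y[i];  /* locate y[i] directly, constant time */
  }
}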

We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.
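As a worked example of these rules (an illustration with assumed formats: B and C stored with compressed levels, and D stored dense so that it supports locate), consider e = (B + C) · D over a single shared index. The addition gives Coiter(B + C) = Coiter(B) ∪ Coiter(C) = {B, C}, and the multiplication then gives Coiter(e) = Coiter(B + C) ∪ (Coiter(D) \ LocateCapable(D)) = {B, C} ∪ ({D} \ {D}) = {B, C}, so the generated code co-iterates over B and C and accesses D with locate.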

4.5 Code Generation Algorithm

Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11; throughout the


Format Abstraction & Code Generation

28

Our technique supports a wide range of disparate tensor formats

[Slide table comparing which formats are supported by taco (this work vs. [Kjolstad et al. 2017]), Intel MKL, SciPy, MTL4, Tensor Toolbox, and TensorFlow; the formats listed are sparse vector, hash map vector, coordinate matrix, CSR, DCSR, ELL, DIA, BCSR, CSB, DOK, LIL, skyline, banded, coordinate tensor, CSF, and mode-generic]

29

Our technique generates efficient code

[Slide charts: normalized execution time for Coordinate SpMV (this work vs. SciPy, Intel MKL, MTL4, TensorFlow), DIA SpMV (this work vs. SciPy, Intel MKL), CSR addition (this work vs. SciPy, Intel MKL, MTL4), and Coordinate MTTKRP (this work vs. Tensor Toolbox)]

30

In conclusion…

Supporting many disparate tensor formats is essential for performance

We can automatically generate kernels that compute with disparate tensor formats

Adding support for even more tensor formats is straightforward

This work was supported by: [sponsor logos]

tensor-compiler.org