
Format Abstraction for Sparse Tensor Algebra Compilers Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe

Sparse tensors are a natural way of representing real-world data

[Figure: a 3rd-order tensor of product reviews with dimensions Users (Jack, Peter, Paul, Mary, Bob, Sam, Billy, Lilly, Hilde, …), Products (Kindle, Monitor, Sweater, Laptop, Dubliners, The Iliad, Candide), and Words (quality, durable, poor); only a handful of entries are nonzero.]

Dense storage: 107 exabytes. Sparse storage: 13 gigabytes.

Many different formats for storing tensors exist

Vectors: Dense array vector, Sparse vector, Hash maps
Matrices: Dense array matrix, Coordinate matrix, CSR, CSC, DCSR, DCSC, BCSR, BCOO, CSB, ELLPACK, SELL, BELL, DIA, Block DIA, Skyline, LIL
Higher-order tensors: Dense array tensor, Coordinate tensor, CSF, Mode-generic tensor, HiCOO, F-COO

Example applications: thermal simulation, unstructured mesh simulation [Bell and Garland 2009], CNN with block-sparse weights [Gray et al. 2017], image processing, data analytics.

There is no universally superior tensor format

[Excerpt from Kjolstad et al. 2017 (Proc. ACM Program. Lang. 1, OOPSLA, Article 77), Fig. 18:]
Performance of matrix-vector multiplication on various matrices with distinct sparsity patterns using taco. The left half of each subfigure depicts the sparsity pattern of the matrix, while the right half shows the normalized storage costs and normalized average execution times (relative to the optimal format) of matrix-vector multiplication using the storage formats labeled on the horizontal axis to store the matrix. The storage format labels follow the scheme described in Section 3; for instance, DS is short for (dense_d1, sparse_d2), while SDᵀ is equivalent to (sparse_d2, dense_d1). The dense matrix input has a density of 0.95, the hypersparse matrix has a density of 2.5 × 10⁻⁵, the row-slicing and column-slicing matrices have densities of 9.5 × 10⁻³, and the thermal and blocked matrices have densities of 1.0 × 10⁻³. [The six panels cover dense, row-slicing, thermal, hypersparse, column-slicing, and blocked matrices; the annotated slowdowns of mismatched formats range from 88× to 26542×.]

...and tensor-times-vector multiplication with matrices and 3rd-order tensors of varying sparsities as inputs. The tensors are randomly generated with every component having some probability d of being non-zero, where d is the density of the tensor (i.e., the fraction of components that are non-zero, and the complement of sparsity). As Fig. 17 shows, while computing with sparse tensor storage formats incurs some performance penalty as compared to the same computation with dense formats when the inputs are highly dense, the performance penalty decreases and eventually turns into performance gain as the sparsity of the inputs increases. For the two computations we evaluate, we observe that input sparsity of as low as approximately 35% is actually sufficient to make sparse formats that compress out all zeros—including DCSR and CSF—perform better than dense formats, which further emphasizes the practicality of sparse tensor storage formats.

[Chart: normalized run time of y = Ax (scale 0.0–1.5) comparing CSR against DIA and CSR against BCSR; in one case DIA is 186× slower than CSR.]
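As a rough back-of-the-envelope illustration of why density matters (my own arithmetic, not a figure from the papers; assumptions: 8-byte values, 4-byte integer indices, an n × n matrix with nnz stored nonzeros):

$$\text{dense} = 8n^2 \ \text{bytes}, \qquad \text{CSR} \approx 12\,\mathrm{nnz} + 4(n+1)\ \text{bytes (vals + crd + pos)},$$

so CSR already takes less space once nnz is below roughly (2/3)n², i.e. at about 67% density, which lines up loosely with the approximately 35% sparsity (65% density) run-time crossover quoted above.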

[Kjolstad et al. 2017] vs. this work

[Figure: in Kjolstad et al. 2017, the code generator only understands two kinds of per-dimension metadata, dense dimensions and compressed dimensions described by pos/crd arrays. In this work, a format abstraction sits between the tensor formats and the code generator, so the same code generator can also emit code for levels described by other metadata, such as the offset and range arrays used by DIA and hashed coordinate arrays.]

Format abstraction

Storing sparse tensors efficiently requires additional metadata

[Figure: a 3×4 matrix with nonzeros A, B in row 0, C, D, E in row 1, and F in row 2.]

Stored as a dense array of 12 entries (positions 0–11), positions and coordinates can be converted back and forth with simple arithmetic:
  row(6) = 6 / 4 = 1
  col(6) = 6 % 4 = 2
  locate(1,2) = 1 * 4 + 2 = 6

If only the six nonzeros A B C D E F are stored (positions 0–5), a position by itself no longer determines the coordinates:
  row(3) = ???
  col(3) = ???
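A small sketch (my own code, not from the deck) of the position/coordinate arithmetic for the dense 3×4 layout above, and of why it breaks once only the nonzeros are kept:

#include <assert.h>

#define NCOLS 4   /* the example matrix has 4 columns */

int main(void) {
  /* Dense row-major storage: position p <-> coordinates (row, col). */
  int p = 6;
  assert(p / NCOLS == 1);       /* row(6) = 6 / 4 = 1 */
  assert(p % NCOLS == 2);       /* col(6) = 6 % 4 = 2 */
  assert(1 * NCOLS + 2 == 6);   /* locate(1,2) = 1 * 4 + 2 = 6 */

  /* If only the six nonzeros A..F are stored at positions 0..5, this arithmetic
     no longer applies: position 3 alone does not identify (row, col), so the
     format must carry coordinate metadata, as the next slides show. */
  return 0;
}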

Coordinates of tensor elements can be encoded in many ways

[Figure: the same 3×4 matrix with nonzeros A, B (row 0), C, D, E (row 1), and F (row 2); values stored at positions 0–5.]

Coordinate:
  rows: 0 0 1 1 1 2
  cols: 0 2 1 2 3 3
  vals: A B C D E F

CSR:
  pos:  0 2 5 6
  cols: 0 2 1 2 3 3
  vals: A B C D E F
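A minimal sketch (my own code, not from the deck) that reads back every (row, col, value) triple from the two encodings above, using the example arrays exactly as given:

#include <stdio.h>

int main(void) {
  /* Coordinate: one explicit row index and column index per stored value. */
  int  coo_rows[] = {0, 0, 1, 1, 1, 2};
  int  coo_cols[] = {0, 2, 1, 2, 3, 3};
  char vals[]     = {'A', 'B', 'C', 'D', 'E', 'F'};
  for (int p = 0; p < 6; p++)
    printf("COO (%d,%d) = %c\n", coo_rows[p], coo_cols[p], vals[p]);

  /* CSR: row indices are implicit; pos[i]..pos[i+1] delimits row i's entries. */
  int csr_pos[]  = {0, 2, 5, 6};
  int csr_cols[] = {0, 2, 1, 2, 3, 3};
  for (int i = 0; i < 3; i++)
    for (int p = csr_pos[i]; p < csr_pos[i + 1]; p++)
      printf("CSR (%d,%d) = %c\n", i, csr_cols[p], vals[p]);
  return 0;
}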

Computing with different formats can require very different code

A = B ∘ C

Coordinate ✕ Dense array:
for (int pB = B1_pos[0]; pB < B1_pos[1]; pB++) {
  int i = B1_crd[pB];
  int j = B2_crd[pB];
  int pC = i * N + j;
  int pA = i * N + j;
  A[pA] = B[pB] * C[pC];
}

CSR ✕ Dense array:
for (int i = 0; i < M; i++) {
  for (int pB = B2_pos[i]; pB < B2_pos[i + 1]; pB++) {
    int j = B2_crd[pB];
    int pC = i * N + j;
    int pA = i * N + j;
    A[pA] = B[pB] * C[pC];
  }
}

CSR ✕ Coordinate:
int pC1 = C1_pos[0];
while (pC1 < C1_pos[1]) {
  int i = C1_crd[pC1];
  int C1_segend = pC1 + 1;
  while (C1_segend < C1_pos[1] && C1_crd[C1_segend] == i) C1_segend++;
  int pB2 = B2_pos[i];
  int pC2 = pC1;
  while (pB2 < B2_pos[i + 1] && pC2 < C1_segend) {
    int jB2 = B2_crd[pB2];
    int jC2 = C2_crd[pC2];
    int j = min(jB2, jC2);
    int pA = i * N + j;
    if (jB2 == j && jC2 == j) A[pA] = B[pB2] * C[pC2];
    if (jB2 == j) pB2++;
    if (jC2 == j) pC2++;
  }
  pC1 = C1_segend;
}
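For contrast, a minimal sketch (mine, not shown in the deck) of the same component-wise product when both B and C are dense M × N row-major arrays, written in the same fragment style and with the same variable names as the kernels above:

for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    int p = i * N + j;   /* shared row-major position for A, B, and C */
    A[p] = B[p] * C[p];
  }
}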

Hand-coding support for a wide range of formats is infeasible

A = B ∘ C:
Coordinate ✕ Dense array, CSR ✕ Dense array, CSR ✕ Coordinate, Dense array ✕ Dense array, Coordinate ✕ Coordinate, CSR ✕ CSR, DIA ✕ DIA, DIA ✕ Dense array, DIA ✕ Coordinate, DIA ✕ CSR, ELLPACK ✕ ELLPACK, ELLPACK ✕ Dense array, ELLPACK ✕ Coordinate, ELLPACK ✕ CSR, ELLPACK ✕ DIA, BCSR ✕ BCSR, BCSR ✕ Dense array, BCSR ✕ Coordinate

A = B ∘ C ∘ D:
Dense array ✕ CSR ✕ CSR, Coordinate ✕ CSR ✕ CSR, CSR ✕ CSR ✕ CSR, Dense array ✕ Coordinate ✕ CSR, Dense array ✕ Dense array ✕ CSR, Coordinate ✕ Coordinate ✕ CSR, DIA ✕ Coordinate ✕ Dense array, DIA ✕ Coordinate ✕ CSR, DIA ✕ Dense array ✕ CSR, DIA ✕ CSR ✕ CSR, DIA ✕ Coordinate ✕ Coordinate, DIA ✕ Dense array ✕ Dense array, DIA ✕ DIA ✕ CSR, DIA ✕ DIA ✕ Coordinate, DIA ✕ DIA ✕ Dense array, DIA ✕ DIA ✕ DIA, ELLPACK ✕ ELLPACK ✕ DIA, ELLPACK ✕ CSR ✕ DIA, ELLPACK ✕ BCSR ✕ DIA

y = Ax + z:
Dense array ✕ Dense array ✕ Dense array, Dense array ✕ Dense array ✕ Sparse vector, Dense array ✕ Dense array ✕ Hash map, Dense array ✕ Sparse vector ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Hash map, Dense array ✕ Hash map ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Dense array, Coordinate ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Hash map, Coordinate ✕ Sparse vector ✕ Hash map, Coordinate ✕ Hash map ✕ Sparse vector, CSR ✕ Dense array ✕ Dense array, CSR ✕ Dense array ✕ Sparse vector, CSR ✕ Hash map ✕ Sparse vector, CSR ✕ Hash map ✕ Dense array, CSR ✕ Sparse vector ✕ Dense array, DIA ✕ Dense array ✕ Dense array, DIA ✕ Hash map ✕ Dense array, ELLPACK ✕ Dense array ✕ Sparse vector
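A rough count of the blow-up these lists illustrate (my own arithmetic, not a number from the slides): with F interchangeable formats per operand and an expression with k tensor operands, a hand-coded library needs on the order of

$$F^{k}\ \text{specialized kernels},\qquad\text{e.g. } F = 10,\ k = 3:\ 10^{3} = 1000$$

for that one expression alone, and every additional expression, level format, or operand multiplies the total again.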

[Talk overview: Format Abstraction & Code Generation; Evaluation]

[Slide thumbnails: level-format compositions such as the mode-generic tensor (Compressed, Singleton, Dense, Dense) and DIA (Dense, Range, Offset), together with Fig. 8 from the paper.]

[Excerpt from the paper (Chou, Kjolstad, and Amarasinghe, Proc. ACM Program. Lang. 2, OOPSLA, Article 123, 2018), p. 123:16:]

Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.

If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)), assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.

We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.

4.5 Code Generation Algorithm

Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11.

Format Abstraction & Code Generation
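To make the two merge strategies from the excerpt concrete, here is a minimal sketch (my own code, not taco's generated output) of an element-wise vector product z = x ∘ y under two assumptions about the operands: x compressed (pos/crd/vals arrays) with y a dense array that supports locate, versus both operands compressed and ordered, which requires co-iteration.

/* Case 1: y supports locate (dense array). Iterate x's nonzeros, locate into y.
   Cost: O(nnz(x)); neither x nor y needs to be enumerated in order. */
void mul_locate(const int *x_pos, const int *x_crd, const double *x_vals,
                const double *y, double *z /* dense, zero-initialized */) {
  for (int p = x_pos[0]; p < x_pos[1]; p++) {
    int i = x_crd[p];          /* coordinate of this nonzero of x */
    z[i] = x_vals[p] * y[i];   /* locate(i) in a dense array is just indexing */
  }
}

/* Case 2: neither operand supports locate; both are compressed and ordered.
   Co-iterate the two coordinate lists two-pointer style. Cost: O(nnz(x) + nnz(y)). */
void mul_coiterate(const int *x_pos, const int *x_crd, const double *x_vals,
                   const int *y_pos, const int *y_crd, const double *y_vals,
                   double *z /* dense, zero-initialized */) {
  int px = x_pos[0], py = y_pos[0];
  while (px < x_pos[1] && py < y_pos[1]) {
    int ix = x_crd[px], iy = y_crd[py];
    if (ix == iy) { z[ix] = x_vals[px] * y_vals[py]; px++; py++; }
    else if (ix < iy) { px++; } else { py++; }
  }
}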

Tensor formats can be viewed as compositions of level formats

[Figure: a 3rd-order example tensor with three slices and nine nonzeros A, B, C, D, E, F, G, H, J, stored by splitting its coordinate hierarchy into one level per dimension: slices, then rows, then columns.]

Slices:  Dense      (3 slices)
Rows:    Compressed pos: 0 3 5 9   crd: 0 1 1 0 1 0 0 1 1
Columns: Singleton  crd: 0 0 2 3 1 1 3 0 3
Values:  A B C D E F G H J   (positions 0–8)
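A minimal sketch (my own code, not from the deck) of walking the level arrays above to enumerate every stored component, assuming the level ordering shown (dense slices, then compressed rows, then singleton columns): a dense level contributes only its size, a compressed level contributes pos/crd arrays, and a singleton level contributes a single crd array.

#include <stdio.h>

int main(void) {
  int  slices    = 3;                            /* dense level: just a size */
  int  row_pos[] = {0, 3, 5, 9};                 /* compressed level: pos delimits each slice's entries */
  int  row_crd[] = {0, 1, 1, 0, 1, 0, 0, 1, 1};  /* ...and crd stores the row coordinate per entry */
  int  col_crd[] = {0, 0, 2, 3, 1, 1, 3, 0, 3};  /* singleton level: one column coordinate per entry */
  char vals[]    = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J'};

  for (int i = 0; i < slices; i++)                      /* iterate the dense level */
    for (int p = row_pos[i]; p < row_pos[i + 1]; p++)   /* iterate this slice's compressed segment */
      printf("B(%d,%d,%d) = %c\n", i, row_crd[p], col_crd[p], vals[p]);
  return 0;
}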

The same level formats can be composed in many ways

Level formats: Dense, Compressed, Singleton

[Figure: the 3×4 example matrix with nonzeros A, B (row 0), C, D, E (row 1), and F (row 2).]

CSR = { Dense, Compressed }
  Dense:      3 rows
  Compressed: pos: 0 2 5 6   crd (cols): 0 2 1 2 3 3
  Values:     A B C D E F

Coordinate = { Compressed, Singleton }
  Compressed: pos: 0 6   crd (rows): 0 0 1 1 1 2
  Singleton:  crd (cols): 0 2 1 2 3 3
  Values:     A B C D E F

The same level formats can be composed in many ways

Level formats: Dense, Compressed, Singleton, Hashed, Range, Offset

Tensor formats as compositions of level formats:
  CSR                  (Dense, Compressed)                    [Tinney and Walker 1967]
  Coordinate matrix    (Compressed, Singleton)
  Coordinate tensor    (Compressed, Singleton, Singleton)
  Dense array tensor   (Dense, Dense, Dense)
  BCSR                 (Dense, Compressed, Dense, Dense)      [Im and Yelick 1998]
  ELLPACK              (Dense, Dense, Singleton)              [Kincaid et al. 1989]
  CSB                  (Dense, Dense, Compressed, Singleton)  [Buluç et al. 2009]
  Mode-generic tensor  (Compressed, Singleton, Dense, Dense)  [Baskaran et al. 2012]
  DIA                  (Dense, Range, Offset)                 [Saad 2003]
  Block DIA            (Dense, Range, Offset, Dense, Dense)
  Hash map vector      (Hashed)
  Hash map matrix      (Hashed, Hashed)                       [Patwary et al. 2015]

Tensor Algebra Compiler (taco) [Kjolstad et al. 2017]

A(i,j) = Σ_k B(i,j,k) · c(k)

  A: (dense, dense)
  B: (dense, compressed, compressed)
  c: (compressed)

taco generates code dimension by dimension:

for (int i = 0; i < m; i++) {
  int pB1 = i;   /* dense first level: the position equals the coordinate */
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_crd[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_crd[pB3];
      int kc = c1_crd[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        A[pA2] += B[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit sha1_base64="9QtCk7M63G3IchN+Z5Vs/Le7raY=">AAACinicbZHfatswFMZld1u7rNvS9nI3YmGwiy1IdhPblELX7WKXLSxtITZGVpRUi/wHSS4EoYfpK+1ubzM5zWBLe0Dw4zvn0znSKRrBlUbot+fvPHv+YnfvZe/V/us3b/sHh1eqbiVlE1qLWt4URDHBKzbRXAt200hGykKw62L5tctf3zGpeF390KuGZSVZVHzOKdFOyvv3Jl1fMpWLIjNoiOIRCo8/oeE4TBCOHSAchXFgv+SG/7QWnsJUtWVultZsWfEoPE4C5wijMQ6RgzhOUIDteWddOm9KZ7WGW7YIjYM46Rqtw0GEwygeWUNzZ7J5f/A3Bx8D3sAAbOIi7/9KZzVtS1ZpKohSU4wanRkiNaeC2V7aKtYQuiQLNnVYkZKpzKxnsvCDU2ZwXkt3Kg3X6r8OQ0qlVmXhKkuib9V2rhOfyk1bPY8zw6um1ayiD43mrYC6ht1e4IxLRrVYOSBUcjcrpLdEEqrd9nruE/D2kx/DVTDEji/R4Ox88x174B14Dz4CDCJwBr6DCzAB1Nv1PntjL/L3/cBP/JOHUt/beI7Af+F/+wOGoLrq</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit 
sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit>

taco generates code dimension by dimension

18

for (int i = 0; i < m; i++) { for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) { int j = B2_idx[pB2]; int pA2 = (i * n) + j; int pB3 = B3_pos[pB2]; int pc1 = c1_pos[0]; while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) { int kB = B3_crd[pB3]; int kc = c1_crd[pc1]; int k = min(kB, kc); if (kB == k && kc == k) { A[pA2] += B[pB3] * c[pc1]; } if (kB == k) pB3++; if (kc == k) pc1++; } } }

Aij =X

k

Bijk · ck<latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="ck8pdC+ekZH4nUmSP+ZG7r8lEyk=">AAAB2XicbZDNSgMxFIXv1L86Vq1rN8EiuCozbnQpuHFZwbZCO5RM5k4bmskMyR2hDH0BF25EfC93vo3pz0JbDwQ+zknIvSculLQUBN9ebWd3b/+gfugfNfzjk9Nmo2fz0gjsilzl5jnmFpXU2CVJCp8LgzyLFfbj6f0i77+gsTLXTzQrMMr4WMtUCk7O6oyaraAdLMW2IVxDC9YaNb+GSS7KDDUJxa0dhEFBUcUNSaFw7g9LiwUXUz7GgUPNM7RRtRxzzi6dk7A0N+5oYkv394uKZ9bOstjdzDhN7Ga2MP/LBiWlt1EldVESarH6KC0Vo5wtdmaJNChIzRxwYaSblYkJN1yQa8Z3HYSbG29D77odOn4MoA7ncAFXEMIN3MEDdKALAhJ4hXdv4r15H6uuat66tDP4I+/zBzjGijg=</latexit><latexit sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit 
sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit sha1_base64="9QtCk7M63G3IchN+Z5Vs/Le7raY=">AAACinicbZHfatswFMZld1u7rNvS9nI3YmGwiy1IdhPblELX7WKXLSxtITZGVpRUi/wHSS4EoYfpK+1ubzM5zWBLe0Dw4zvn0znSKRrBlUbot+fvPHv+YnfvZe/V/us3b/sHh1eqbiVlE1qLWt4URDHBKzbRXAt200hGykKw62L5tctf3zGpeF390KuGZSVZVHzOKdFOyvv3Jl1fMpWLIjNoiOIRCo8/oeE4TBCOHSAchXFgv+SG/7QWnsJUtWVultZsWfEoPE4C5wijMQ6RgzhOUIDteWddOm9KZ7WGW7YIjYM46Rqtw0GEwygeWUNzZ7J5f/A3Bx8D3sAAbOIi7/9KZzVtS1ZpKohSU4wanRkiNaeC2V7aKtYQuiQLNnVYkZKpzKxnsvCDU2ZwXkt3Kg3X6r8OQ0qlVmXhKkuib9V2rhOfyk1bPY8zw6um1ayiD43mrYC6ht1e4IxLRrVYOSBUcjcrpLdEEqrd9nruE/D2kx/DVTDEji/R4Ox88x174B14Dz4CDCJwBr6DCzAB1Nv1PntjL/L3/cBP/JOHUt/beI7Af+F/+wOGoLrq</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit 
sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit>

taco generates code dimension by dimension

18

for (int i = 0; i < m; i++) { for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) { int j = B2_idx[pB2]; int pA2 = (i * n) + j; int pB3 = B3_pos[pB2]; int pc1 = c1_pos[0]; while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) { int kB = B3_crd[pB3]; int kc = c1_crd[pc1]; int k = min(kB, kc); if (kB == k && kc == k) { A[pA2] += B[pB3] * c[pc1]; } if (kB == k) pB3++; if (kc == k) pc1++; } } }

Aij =X

k

Bijk · ck<latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="ck8pdC+ekZH4nUmSP+ZG7r8lEyk=">AAAB2XicbZDNSgMxFIXv1L86Vq1rN8EiuCozbnQpuHFZwbZCO5RM5k4bmskMyR2hDH0BF25EfC93vo3pz0JbDwQ+zknIvSculLQUBN9ebWd3b/+gfugfNfzjk9Nmo2fz0gjsilzl5jnmFpXU2CVJCp8LgzyLFfbj6f0i77+gsTLXTzQrMMr4WMtUCk7O6oyaraAdLMW2IVxDC9YaNb+GSS7KDDUJxa0dhEFBUcUNSaFw7g9LiwUXUz7GgUPNM7RRtRxzzi6dk7A0N+5oYkv394uKZ9bOstjdzDhN7Ga2MP/LBiWlt1EldVESarH6KC0Vo5wtdmaJNChIzRxwYaSblYkJN1yQa8Z3HYSbG29D77odOn4MoA7ncAFXEMIN3MEDdKALAhJ4hXdv4r15H6uuat66tDP4I+/zBzjGijg=</latexit><latexit sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit 
sha1_base64="n5e1ptEuTL43cmV9t1ImHPSoNvM=">AAACf3icbZFda9swFIZl76Ndlm1pb3sjWga7aINkN7FNGXTbTS87aNpCbIysKKkW+QNJLgShH7O/tLv+m8ppClu6A4KH9+jVOTqnaARXGqEHz3/1+s3bnd13vff9Dx8/Dfb616puJWUTWota3hZEMcErNtFcC3bbSEbKQrCbYvmjy9/cM6l4XV3pVcOykiwqPueUaCflg98mXT8ylYsiM2iI4hEKT4/RcBwmCMcOEI7COLDfcsN/WQu/wlS1ZW6W1mxZ8Sg8TQLnCKMxDpGDOE5QgO33zrp03pTOag23bBEaB3HSFVqHgwiHUTyyhubOZPPB0XMOvgS8gSOwict88Ced1bQtWaWpIEpNMWp0ZojUnApme2mrWEPokizY1GFFSqYys+7Jws9OmcF5Ld2pNFyrfzsMKZValYW7WRJ9p7Zznfi/3LTV8zgzvGpazSr6VGjeCqhr2O0FzrhkVIuVA0Ild71Cekckodptr+eGgLe//BKugyF2/BOBXXAADsEXgEEEzsEFuAQTQL0d78Qbe5Hf9wM/eRqX723mtg/+Cf/sEWMFufg=</latexit><latexit sha1_base64="9QtCk7M63G3IchN+Z5Vs/Le7raY=">AAACinicbZHfatswFMZld1u7rNvS9nI3YmGwiy1IdhPblELX7WKXLSxtITZGVpRUi/wHSS4EoYfpK+1ubzM5zWBLe0Dw4zvn0znSKRrBlUbot+fvPHv+YnfvZe/V/us3b/sHh1eqbiVlE1qLWt4URDHBKzbRXAt200hGykKw62L5tctf3zGpeF390KuGZSVZVHzOKdFOyvv3Jl1fMpWLIjNoiOIRCo8/oeE4TBCOHSAchXFgv+SG/7QWnsJUtWVultZsWfEoPE4C5wijMQ6RgzhOUIDteWddOm9KZ7WGW7YIjYM46Rqtw0GEwygeWUNzZ7J5f/A3Bx8D3sAAbOIi7/9KZzVtS1ZpKohSU4wanRkiNaeC2V7aKtYQuiQLNnVYkZKpzKxnsvCDU2ZwXkt3Kg3X6r8OQ0qlVmXhKkuib9V2rhOfyk1bPY8zw6um1ayiD43mrYC6ht1e4IxLRrVYOSBUcjcrpLdEEqrd9nruE/D2kx/DVTDEji/R4Ox88x174B14Dz4CDCJwBr6DCzAB1Nv1PntjL/L3/cBP/JOHUt/beI7Af+F/+wOGoLrq</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit 
sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit><latexit sha1_base64="N+Gz0mErqtYlWzSRJfkrAyiSckE=">AAACinicbZHfbtMwFMadMNgoAwq75MZahcQFVHbSNonQpLHtYpdDotukJooc1+1MnT+yHaTK8sPwStzxNjhdkVjHkSz99J3z+Rz7FI3gSiP02/Of7D19tn/wvPfi8OWr1/03b69V3UrKprQWtbwtiGKCV2yquRbstpGMlIVgN8XqvMvf/GBS8br6ptcNy0qyrPiCU6KdlPd/mnRzyUwui8ygIYrHKBx9RMNJmCAcO0A4CuPAfskN/24tPIGpasvcrKzZseJxOEoC5wijCQ6RgzhOUIDtWWddOW9K57WGO7YITYI46RptwkGEwygeW0NzZ7J5f/A3Bx8D3sIAbOMq7/9K5zVtS1ZpKohSM4wanRkiNaeC2V7aKtYQuiJLNnNYkZKpzGxmsvC9U+ZwUUt3Kg036r8OQ0ql1mXhKkui79RurhP/l5u1ehFnhldNq1lF7xstWgF1Dbu9wDmXjGqxdkCo5G5WSO+IJFS77fXcJ+DdJz+G62CIHX8dDU7Ptt9xAN6BY/ABYBCBU3AJrsAUUG/f++RNvMg/9AM/8T/fl/re1nMEHoR/8QeH4Lru</latexit>

taco generates code dimension by dimension

19

Hand-coding support for a wide range of level formats is also infeasible

[Slide figure: a sampling of kernel variants that would each need hand-written code for different level-format combinations, e.g. Dense Compressed × Dense Compressed (+), Compressed × Hashed, Compressed + Singleton, Dense + Range, Compressed + Offset, Dense Compressed × Hashed, Compressed × Hashed Singleton, Compressed Singleton + Dense, ...]

20

Code generation is performed in two stages: a high-level algorithm is constructed first and then specialized into runnable code

Running example: Compressed × Hashed

The two stages together address how to compute with multiple operands and how to compute with the operands' different data structures

21

Tensor algebra computations can be expressed in terms of high-level operations on tensor operands

[Slide figure: a collection of tensor operands (A, B, C, D, ...) combined by high-level operations such as element-wise multiplication (∘)]

22

Level formats declare whether they support various high-level operations

[Slide table: level formats (Dense, Compressed, Hashed, Singleton, Range, Offset) against the high-level operations they support (random access and iteration); per the following slides, Dense and Hashed support random access, while Compressed and Singleton support iteration but not random access]
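To make this concrete, the following is a minimal illustrative sketch of a level format declaring its capabilities. It is not the actual taco interface; the struct and field names are assumptions made purely for illustration, with the values for Dense, Compressed, and Hashed following the slides (hashed levels enumerate their coordinates out of order).

/* Illustrative sketch, not the taco API: a level format advertises which
 * high-level operations it supports, so the compiler can reason about it
 * without knowing the concrete data structure. */
#include <stdbool.h>

typedef struct {
    const char *name;
    bool supports_random_access;  /* can locate a given coordinate directly */
    bool supports_iteration;      /* can enumerate its stored coordinates   */
    bool ordered;                 /* coordinates are enumerated in order    */
} LevelFormat;

static const LevelFormat DenseLevel      = {"dense",      true,  true, true };
static const LevelFormat CompressedLevel = {"compressed", false, true, true };
static const LevelFormat HashedLevel     = {"hashed",     true,  true, false};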

23

Compiler constructs efficient algorithm by reasoning about whether operands support required high-level operations

Example: B ∘ C

Compressed × Hashed: iterate over B and random access C
Compressed × Singleton: simultaneously iterate over B and C

[Slide figure: the level formats (Dense, Compressed, Hashed, Singleton, Range, Offset), annotated with which of them support random access]
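The co-iteration case can be made concrete with a minimal sketch. The helper below is hypothetical (the function and array names are assumptions, not verbatim taco output); it shows the two-pointer intersection merge over two sorted coordinate lists that the compiler falls back to when neither operand supports random access, mirroring the merge loops in the generated code shown earlier.

/* Sketch only: intersection merge of row i of two operands stored with
 * sorted coordinate lists (pos/crd arrays), writing into a dense output. */
void coiterate_row(const int *B_pos, const int *B_crd, const double *B,
                   const int *C_pos, const int *C_crd, const double *C,
                   int i, int n, double *A) {
  int pB = B_pos[i];
  int pC = C_pos[i];
  while (pB < B_pos[i + 1] && pC < C_pos[i + 1]) {
    int jB = B_crd[pB];
    int jC = C_crd[pC];
    int j = jB < jC ? jB : jC;         /* next candidate coordinate       */
    if (jB == j && jC == j) {
      A[i * n + j] = B[pB] * C[pC];    /* both operands are nonzero at j  */
    }
    if (jB == j) pB++;                 /* advance whichever operands match */
    if (jC == j) pC++;
  }
}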

24

Level formats also specify how they support high-level operations

Random access

Dense:
int pB2 = pB1 * N + j;

Hashed:
int pB2 = j % W + pB1 * W;
if (crd[pB2] != j && crd[pB2] != -1) {
  int end = pB2;
  do {
    pB2 = (pB2 + 1) % W;
  } while (crd[pB2] != j && crd[pB2] != -1 && pB2 != end);
}
if (crd[pB2] == j) {

Iteration

Dense:
for (int j = 0; j < N; j++) {
  int pB2 = pB1 * N + j;

Compressed:
for (int pB2 = pos[pB1]; pB2 < pos[pB1+1]; pB2++) {
  int j = crd[pB2];

Hashed:
for (int pB2 = pB1 * W; pB2 < (pB1 + 1) * W; pB2++) {
  int j = crd[pB2];
  if (j != -1) {

...

25

Compiler specializes constructed algorithm to operand formats by inlining code that implements required high-level operations

B ∘ C (Compressed × Hashed)

High-level algorithm:
for every element b in B:
  find corresponding element c in C
  A[i][j] = b * c;

After inlining the compressed iteration over B:
for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
  int j = B2_crd[pB2];
  find corresponding element c in C
  A[i][j] = B[pB2] * c;
}

After also inlining the hashed random access into C:
for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
  int j = B2_crd[pB2];
  int pC2 = j % W + pC1 * W;
  if (C2_crd[pC2] != j && C2_crd[pC2] != -1) {
    int end = pC2;
    do {
      pC2 = (pC2 + 1) % W;
    } while (C2_crd[pC2] != j && C2_crd[pC2] != -1 && pC2 != end);
  }
  if (C2_crd[pC2] == j) {
    A[i][j] = B[pB2] * C[pC2];
  }
}

26

The same process can be repeated dimension by dimension

Aijk = Bijk + Cijk

int iB = 0; int C0_pos = C0_pos_arr[0]; while (C0_pos < C0_pos_arr[1]) { int iC = C0_idx_arr[C0_pos]; int C0_end = C0_pos + 1; if (iC == iB) while ((C0_end < C0_pos_arr[1]) && (C0_idx_arr[C0_end] == iB)) { C0_end++; } if (iC == iB) { int B1_pos = B1_pos_arr[iB]; int C1_pos = C0_pos; while ((B1_pos < B1_pos_arr[iB + 1]) && (C1_pos < C0_end)) { int jB = B1_idx_arr[B1_pos]; int jC = C1_idx_arr[C1_pos]; int j = min(jB, jC); int A1_pos = (iB * A1_size) + j; int C1_end = C1_pos + 1; if (jC == j) while ((C1_end < C0_end) && (C1_idx_arr[C1_end] == j)) { C1_end++; } if ((jB == j) && (jC == j)) { int B2_pos = B2_pos_arr[B1_pos]; int C2_pos = C1_pos; while ((B2_pos < B2_pos_arr[B1_pos + 1]) && (C2_pos < C1_end)) { int kB = B2_idx_arr[B2_pos]; int kC = C2_idx_arr[C2_pos]; int k = min(kB, kC); int A2_pos = (A1_pos * A2_size) + k; if ((kB == k) && (kC == k)) { A_val_arr[A2_pos] = B_val_arr[B2_pos] + C_val_arr[C2_pos]; } else if (kB == k) { A_val_arr[A2_pos] = B_val_arr[B2_pos]; } else { A_val_arr[A2_pos] = C_val_arr[C2_pos]; } if (kB == k) B2_pos++; if (kC == k) C2_pos++; } while (B2_pos < B2_pos_arr[B1_pos + 1]) { int kB0 = B2_idx_arr[B2_pos]; int A2_pos0 = (A1_pos * A2_size) + kB0; A_val_arr[A2_pos0] = B_val_arr[B2_pos]; B2_pos++; } while (C2_pos < C1_end) { int kC0 = C2_idx_arr[C2_pos]; int A2_pos1 = (A1_pos * A2_size) + kC0; A_val_arr[A2_pos1] = C_val_arr[C2_pos]; C2_pos++; } } else if (jB == j) { for (int B2_pos0 = B2_pos_arr[B1_pos]; B2_pos0 < B2_pos_arr[B1_pos + 1]; B2_pos0++) { int kB1 = B2_idx_arr[B2_pos0]; int A2_pos2 = (A1_pos * A2_size) + kB1; A_val_arr[A2_pos2] = B_val_arr[B2_pos0]; } } else { for (int C2_pos0 = C1_pos; C2_pos0 < C1_end; C2_pos0++) { int kC1 = C2_idx_arr[C2_pos0]; int A2_pos3 = (A1_pos * A2_size) + kC1; A_val_arr[A2_pos3] = C_val_arr[C2_pos0]; } } if (jB == j) B1_pos++; if (jC == j) C1_pos = C1_end; }

while (B1_pos < B1_pos_arr[iB + 1]) { int jB0 = B1_idx_arr[B1_pos]; int A1_pos0 = (iB * A1_size) + jB0; for (int B2_pos1 = B2_pos_arr[B1_pos]; B2_pos1 < B2_pos_arr[B1_pos + 1]; B2_pos1++) { int kB2 = B2_idx_arr[B2_pos1]; int A2_pos4 = (A1_pos0 * A2_size) + kB2; A_val_arr[A2_pos4] = B_val_arr[B2_pos1]; } B1_pos++; } while (C1_pos < C0_end) { int jC0 = C1_idx_arr[C1_pos]; int A1_pos1 = (iB * A1_size) + jC0; int C1_end0 = C1_pos + 1; while ((C1_end0 < C0_end) && (C1_idx_arr[C1_end0] == jC0)) { C1_end0++; } for (int C2_pos1 = C1_pos; C2_pos1 < C1_end0; C2_pos1++) { int kC2 = C2_idx_arr[C2_pos1]; int A2_pos5 = (A1_pos1 * A2_size) + kC2; A_val_arr[A2_pos5] = C_val_arr[C2_pos1]; } C1_pos = C1_end0; } } else { for (int B1_pos0 = B1_pos_arr[iB]; B1_pos0 < B1_pos_arr[iB + 1]; B1_pos0++) { int jB1 = B1_idx_arr[B1_pos0]; int A1_pos2 = (iB * A1_size) + jB1; for (int B2_pos2 = B2_pos_arr[B1_pos0]; B2_pos2 < B2_pos_arr[B1_pos0 + 1]; B2_pos2++) { int kB3 = B2_idx_arr[B2_pos2]; int A2_pos6 = (A1_pos2 * A2_size) + kB3; A_val_arr[A2_pos6] = B_val_arr[B2_pos2]; } } } if (iC == iB) C0_pos = C0_end; iB++; } while (iB < B0_size) { for (int B1_pos1 = B1_pos_arr[iB]; B1_pos1 < B1_pos_arr[iB + 1]; B1_pos1++) { int jB2 = B1_idx_arr[B1_pos1]; int A1_pos3 = (iB * A1_size) + jB2; for (int B2_pos3 = B2_pos_arr[B1_pos1]; B2_pos3 < B2_pos_arr[B1_pos1 + 1]; B2_pos3++) { int kB4 = B2_idx_arr[B2_pos3]; int A2_pos7 = (A1_pos3 * A2_size) + kB4; A_val_arr[A2_pos7] = B_val_arr[B2_pos3]; } } iB++; }


27

Evaluation

Mode-generic tensor: Compressed, Singleton, Dense, Dense
DIA: Dense, Range, Offset
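The compositions above can be read as ordered lists of per-level formats. The sketch below illustrates that reading; it is not the actual taco API, and the type and constant names are assumptions made for illustration only.

/* Illustrative sketch, not the taco API: a tensor format is an ordered
 * list of level formats, outermost level first. */
typedef enum { DENSE, COMPRESSED, SINGLETON, HASHED, RANGE, OFFSET } LevelKind;

typedef struct {
    int num_levels;        /* number of stored levels              */
    LevelKind levels[4];   /* level format of each level, in order */
} TensorFormat;

/* The two formats named on this slide, expressed as compositions. */
static const TensorFormat ModeGenericTensor = {4, {COMPRESSED, SINGLETON, DENSE, DENSE}};
static const TensorFormat DiaMatrix         = {3, {DENSE, RANGE, OFFSET}};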


[Figure 8: two flowcharts. The left flowchart selects a merge strategy based on the questions "x supports locate?", "y supports locate?", and "x unordered and y ordered?", leading to one of three outcomes: co-iterate over x and y, iterate over x and locate into y, or iterate over y and locate into x. The right flowchart gives, for each operand, the iterator conversions needed at runtime: reorder x/y or aggregate duplicates in x/y, depending on whether x/y are ordered or accessed with locate, whether they are unique, whether we are co-iterating over x and y, and whether the output is unordered or supports insert.]

Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.

If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)) assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.
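A minimal sketch of this iterate-and-locate method, assuming x is a compressed sparse vector and y is a dense array, is shown below. The function and array names are illustrative assumptions, not code from the paper; locate into the dense array is a plain index computation, so the loop runs in O(nnz(x)) time and never enumerates y.

/* Sketch only: intersection merge z = x .* y, where x is a compressed
 * sparse vector (pos/crd/val arrays) and y is a dense array that
 * supports locate. The result keeps the sparsity structure of x. */
void sparse_times_dense(const int *x_pos, const int *x_crd, const double *x_val,
                        const double *y, double *z_val) {
  for (int px = x_pos[0]; px < x_pos[1]; px++) {
    int i = x_crd[px];             /* coordinate of this nonzero of x     */
    z_val[px] = x_val[px] * y[i];  /* locate y[i] directly, constant time */
  }
}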

We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.
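As a worked example of these rules (an illustration with assumed formats: B and C stored with compressed levels, and D stored dense so that it supports locate), consider e = (B + C) · D over a single shared index. The addition gives Coiter(B + C) = Coiter(B) ∪ Coiter(C) = {B, C}, and the multiplication then gives Coiter(e) = Coiter(B + C) ∪ (Coiter(D) \ LocateCapable(D)) = {B, C} ∪ ({D} \ {D}) = {B, C}, so the generated code co-iterates over B and C and accesses D with locate.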

4.5 Code Generation Algorithm

Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11; throughout the


Format Abstraction & Code Generation

28

Our technique supports a wide range of disparate tensor formats

[Slide table comparing which formats are supported by taco (this work vs. [Kjolstad et al. 2017]), Intel MKL, SciPy, MTL4, Tensor Toolbox, and TensorFlow; the formats listed are sparse vector, hash map vector, coordinate matrix, CSR, DCSR, ELL, DIA, BCSR, CSB, DOK, LIL, skyline, banded, coordinate tensor, CSF, and mode-generic]

29

Our technique generates efficient code

[Slide charts: normalized execution time for Coordinate SpMV (this work vs. SciPy, Intel MKL, MTL4, TensorFlow), DIA SpMV (this work vs. SciPy, Intel MKL), CSR addition (this work vs. SciPy, Intel MKL, MTL4), and Coordinate MTTKRP (this work vs. Tensor Toolbox)]

30

In conclusion…

Supporting many disparate tensor formats is essential for performance

We can automatically generate kernels that compute with disparate tensor formats

Adding support for even more tensor formats is straightforward

This work was supported by: [sponsor logos]

tensor-compiler.org