Linear algebra over dense matrices over GF(2) and small extensions
Martin R. Albrecht
SIAM AG11, October 7, 2011
F2 with SSE2
With SSE2 we can perform 128 finite field operations in one instruction, i.e., memory access is the expensive operation, not arithmetic.
Hence, optimising memory access is crucial.
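Word-level arithmetic makes this concrete. A toy sketch in Python (plain integers stand in for machine words; SSE2 registers hold 128 bits, but the principle is identical):

```python
# Rows over GF(2) are packed bitstrings: bit c holds the entry in column c.
# Adding two such rows is a single XOR per machine word; one SSE2
# instruction XORs 128 bits, i.e., performs 128 GF(2) additions at once.
row_a = 0b1011001
row_b = 0b0110101

row_sum = row_a ^ row_b   # entrywise addition over GF(2)
row_prod = row_a & row_b  # entrywise multiplication over GF(2)
```

The arithmetic is one instruction per word; the dominant cost is moving the rows through the memory hierarchy.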
Strassen-Winograd [Str69] Multiplication
All multiplications in this work use Strassen-Winograd matrix multiplication, but with different base case implementations.

• fastest known practical algorithm
• complexity: O(n^(log2 7)) → linear algebra constant: ω = log2 7
• for efficiency: crossover to the base case at some dimension
In our work we focus on optimising the base case.
M4RM [ADKF70] I
Consider C = A · B, where A is m × ℓ and B is ℓ × n.

A can be divided into ℓ/k vertical “stripes” A0, …, A_(ℓ−1)/k of k columns each. B can be divided into ℓ/k horizontal “stripes” B0, …, B_(ℓ−1)/k of k rows each. We have:

  C = A · B = ∑_{i=0}^{(ℓ−1)/k} Ai · Bi.
M4RM [ADKF70] II
A = [1 1 0 1]    B = [1 0 1 1]
    [0 0 0 0]        [0 1 1 0]
    [1 1 1 1]        [0 1 1 0]
    [0 1 1 1]        [0 1 0 1]

A0 = [1 1]    A1 = [0 1]    B0 = [1 0 1 1]    B1 = [0 1 1 0]
     [0 0]         [0 0]         [0 1 1 0]         [0 1 0 1]
     [1 1]         [1 1]
     [0 1]         [1 1]

A0 · B0 = [1 1 0 1]    A1 · B1 = [0 1 0 1]
          [0 0 0 0]              [0 0 0 0]
          [1 1 0 1]              [0 0 1 1]
          [0 1 1 0]              [0 0 1 1]
M4RM: Algorithm O(n^3 / log n)

begin
    C ← create an m × n matrix with all entries 0;
    k ← log n;
    for 0 ≤ i < ℓ/k do
        // create table of 2^k − 1 linear combinations
        T ← MakeTable(B, i × k, 0, k);
        for 0 ≤ j < m do
            // read index for table T
            id ← ReadBits(A, j, i × k, k);
            add row id from T to row j of C;
    return C;
end

Algorithm 1: M4RM
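The algorithm can be sketched in Python with rows packed into integers (bit c of a row is column c). This is a toy version of the M4RM idea, not the M4RI implementation, and the table is filled by single row additions rather than in Gray-code order:

```python
def m4rm(A, B, m, l, n, k):
    """Compute C = A*B over GF(2): A is m x l, B is l x n, rows as ints.

    For each stripe of k columns of A (= k rows of B), build a table T of
    all 2^k linear combinations of those rows of B, then add one table
    row per row of C, so one lookup handles k columns at once."""
    C = [0] * m
    for i in range(0, l, k):
        kk = min(k, l - i)  # the last stripe may be narrower
        # T[x] = sum of rows i+c of B over all set bits c of x,
        # built with one row addition (XOR) per table entry
        T = [0] * (1 << kk)
        for x in range(1, 1 << kk):
            lsb = x & -x
            T[x] = T[x ^ lsb] ^ B[i + lsb.bit_length() - 1]
        for j in range(m):
            idx = (A[j] >> i) & ((1 << kk) - 1)  # ReadBits(A, j, i, kk)
            C[j] ^= T[idx]                       # add row idx of T to row j
    return C

# The 4x4 example from the earlier slide, column c packed at bit c:
A = [0b1011, 0b0000, 0b1111, 0b1110]  # rows 1101, 0000, 1111, 0111
B = [0b1101, 0b0110, 0b0110, 0b1010]  # rows 1011, 0110, 0110, 0101
C = m4rm(A, B, 4, 4, 4, 2)            # rows 1000, 0000, 1110, 0101
```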
M4RM: Cache friendliness I
begin
    C ← create an m × n matrix with all entries 0;
    k ← log n;
    for 0 ≤ i < ℓ/k do
        // this is cheap in terms of memory access
        T ← MakeTable(B, i × k, 0, k);
        for 0 ≤ j < m do
            // we load each row to take care of only k bits
            id ← ReadBits(A, j, i × k, k);
            add row id from T to row j of C;
    return C;
end

Algorithm 2: Memory access pattern of M4RM
M4RM: Cache friendliness II
begin
    C ← create an m × n matrix with all entries 0;
    k ← ⌊log n⌋;
    for 0 ≤ start < m/bs do
        for 0 ≤ i < ℓ/k do
            // we regenerate T for each block
            T ← MakeTable(B, i × k, 0, k);
            for 0 ≤ s < bs do
                j ← start × bs + s;
                id ← ReadBits(A, j, i × k, k);
                add row id from T to row j of C;
    return C;
end

Algorithm 3: Cache friendly M4RM
t > 1 Gray Code Tables
• arithmetic is quite cheap compared to memory access
• the cost of memory access depends on the location in memory
• → try to fill all of the L1 cache with Gray code tables
• Example: k = 10 and 1 table → 10 bits at a time; k = 9 and 2 tables → same memory, but 18 bits at once
• price: one extra row addition
Matrix Dimensions     t = 1     t = 2     t = 8
10,000 × 10,000       4.141     1.982     1.599
16,384 × 16,384      16.434     7.258     6.034
20,000 × 20,000      29.520    14.655    11.655

Table: Strassen with different base cases on 2.33 GHz Core 2 Duo
Results: Multiplication
[Plot: execution times for multiplication in Magma and Sage, and log2 of the Magma/Sage ratio, for dimensions 1,000 to 29,000]

Figure: 2.66 GHz Intel i7, 4 GB RAM
PLE Decomposition
Definition (PLE)
Let A be an m × n matrix over a field K. A PLE decomposition of A is a triple of matrices P, L and E such that P is an m × m permutation matrix, L is a unit lower triangular matrix, and E is an m × n matrix in row-echelon form, and

  A = PLE.

PLE decomposition can be done in-place, that is, L and E are stored in A and P is stored as an m-vector.
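The definition can be checked with a small sketch: iterative Gaussian elimination over GF(2) on dense 0/1 lists in Python. This is hypothetical toy code, not the in-place M4RI routine (which stores L and E in A and the permutation as a vector):

```python
def ple(A):
    """PLE decomposition over GF(2) by iterative Gaussian elimination.

    A is a list of rows (lists of 0/1).  Returns (perm, L, E) such that
    row i of L*E (over GF(2)) equals row perm[i] of A, L is unit lower
    triangular and E is in row-echelon form."""
    m, n = len(A), len(A[0])
    E = [list(r) for r in A]
    L = [[int(i == j) for j in range(m)] for i in range(m)]
    perm = list(range(m))
    row = 0
    for col in range(n):
        if row == m:
            break
        piv = next((r for r in range(row, m) if E[r][col]), None)
        if piv is None:
            continue                      # no pivot in this column
        if piv != row:
            E[row], E[piv] = E[piv], E[row]
            perm[row], perm[piv] = perm[piv], perm[row]
            for c in range(row):          # swap the computed part of L too
                L[row][c], L[piv][c] = L[piv][c], L[row][c]
        for r in range(row + 1, m):
            if E[r][col]:
                L[r][row] = 1             # record the elimination in L
                E[r] = [a ^ b for a, b in zip(E[r], E[row])]
        row += 1
    return perm, L, E
```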
Algorithms
Iterative: Gaussian elimination where we write L below and on the main diagonal.
Block iterative: variant of PLE-style Gaussian elimination, inspired by the M4RI [Bar07] algorithm.
Block recursive: variant of the asymptotically fast PLUQ factorisation [IMH82]; reduces to matrix multiplication.

We implemented the block recursive algorithm with the block iterative one as base case; we focus on optimising the latter.
Results: Reduced Row Echelon Form
[Plot: execution times for elimination in Magma and Sage, and log2 of the Magma/Sage ratio, for dimensions 1,000 to 29,000]

Figure: 2.66 GHz Intel i7, 4 GB RAM
Motivation
System             Time
Sage 4.7.1         3267.27s
NTL 5.4.2          1005.73s
GAP 4.4.12           16.94s
Magma 2.15            3.40s
LinBox over F13       4.67s
this work             0.71s

Table: RREF of a 4,000 × 4,000 matrix over F_{2^4}.
Representation of Elements I
Elements in F_{2^e} ≅ F_2[x]/f can be written as

  a0·α^0 + a1·α^1 + ··· + a_(e−1)·α^(e−1).

We identify the bitstring a0, …, a_(e−1) with

• the element ∑_{i=0}^{e−1} ai·α^i ∈ F_{2^e} and
• the integer ∑_{i=0}^{e−1} ai·2^i.

We pack several of those bitstrings into one machine word:

  a_{0,0,0}, …, a_{0,0,e−1}, padding, …, a_{0,n−1,0}, …, a_{0,n−1,e−1}, padding.
This representation is used in the matrix type mzed_t.
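The packing can be illustrated in Python. This is a simplified sketch with exactly e bits per element; the actual mzed_t layout additionally pads each slot, as noted above:

```python
def pack(elements, e):
    """Pack GF(2^e) elements (integers < 2^e, bit i = coefficient a_i)
    into one word, e bits per element (simplified: no padding)."""
    word = 0
    for i, a in enumerate(elements):
        word |= a << (i * e)
    return word

def unpack(word, e, n):
    """Recover n packed GF(2^e) elements from a word."""
    mask = (1 << e) - 1
    return [(word >> (i * e)) & mask for i in range(n)]

# With this layout, adding two packed rows is one XOR per word:
#   pack(u, e) ^ pack(v, e) == pack([a ^ b for a, b in zip(u, v)], e)
```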
Representation of Elements II
• We can rewrite matrices over F_{2^e} as an e-tuple of matrix “slices” over F_2.
• One slice for each degree of α.
• Instead of considering matrices of polynomials, we can consider polynomials of matrices.
• This representation is used in the matrix type mzd_slice_t.
Tomas J. Boothby and Robert Bradshaw. Bitslicing and the Method of Four Russians over larger finite fields. CoRR, abs/0901.1413, 2009.
Representation of Elements III
Example:
  ( α^2 + α + 1    α + 1 )
  (     α^2          1   )

mzed_t:
  |-111-011...|
  |-100-001...|

mzd_slice_t:
  0: |11...|  1: |11...|  2: |10...|
     |01...|     |00...|     |10...|
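The slicing can be written out in Python, a sketch using nested 0/1 lists where M4RIE uses packed mzd_t matrices; entries are integers with bit i holding the coefficient of α^i:

```python
def slice_matrix(M, e):
    """Rewrite a matrix over GF(2^e) as an e-tuple of GF(2) matrices,
    one slice per degree of alpha: slice i collects bit i of each entry."""
    return [[[(a >> i) & 1 for a in row] for row in M] for i in range(e)]

# The 2x2 example above over GF(2^3): alpha^2 + alpha + 1 -> 0b111 = 7, etc.
M = [[7, 3], [4, 1]]
slices = slice_matrix(M, 3)
# slices[0] is the constant-coefficient matrix [[1, 1], [0, 1]], and so on.
```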
Travolta tables: The idea I
In both representations:
• Scaling/multiplying is expensive: either table lookups per element or lots of bit operations on words.
• Adding is cheap: one XOR per word.
Thus, we really, really prefer additions over multiplications.
Travolta tables: The idea II
Input: A – m × n matrix
Input: B – n × k matrix
begin
    for 0 ≤ i < n do
        for 0 ≤ j < m do
            Cj ← Cj + Aj,i × Bi;
    return C;
end
Travolta tables: The idea III
Input: A – m × n matrix
Input: B – n × k matrix
begin
    for 0 ≤ i < n do
        for 0 ≤ j < m do
            Cj ← Cj + Aj,i × Bi;   // cheap
    return C;
end
Travolta tables: The idea IV
Input: A – m × n matrix
Input: B – n × k matrix
begin
    for 0 ≤ i < n do
        for 0 ≤ j < m do
            Cj ← Cj + Aj,i × Bi;   // expensive
    return C;
end
Travolta tables: The idea V
. . . but there are only 2^e possible multiples of Bi.
Travolta tables: The idea VI
Input: A – m × n matrix
Input: B – n × k matrix
begin
    for 0 ≤ i < n do
        // use Gray codes here
        for 0 ≤ j < 2^e do
            Tj ← j × Bi;
        for 0 ≤ j < m do
            x ← Aj,i;
            Cj ← Cj + Tx;
    return C;
end
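A Python sketch of the idea, with rows as lists of GF(2^e) elements encoded as integers. This is a toy version: a generic polynomial-basis multiplier stands in for M4RIE's packed-word arithmetic, and the table is filled directly rather than in Gray-code order:

```python
def gf_mul(a, b, e=2, f=0b111):
    """Multiply in GF(2^e) = GF(2)[x]/(f); defaults give GF(4) with
    f = x^2 + x + 1.  Shift-and-add with reduction by f."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a >> e:          # degree overflowed: reduce modulo f
            a ^= f
    return p

def travolta_mul(A, B, e=2, f=0b111):
    """C = A*B over GF(2^e) using, for each row of B, a table of all
    2^e scalar multiples, so the inner loop does only XORs."""
    m, n, k = len(A), len(B), len(B[0])
    C = [[0] * k for _ in range(m)]
    for i in range(n):
        # table of all 2^e multiples of row i of B
        T = [[gf_mul(x, b, e, f) for b in B[i]] for x in range(1 << e)]
        for j in range(m):
            # one lookup and one row addition per entry of column i of A
            C[j] = [c ^ t for c, t in zip(C[j], T[A[j][i]])]
    return C
```

In the inner loop every element multiplication has become a table lookup plus an XOR.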
Matrix multiplication:
• m · (2^e + n) · k additions
• m · e · k multiplications
Gaussian elimination:
• r · (n + 2^e) · n additions
• r · (e + 1) · n multiplications
Karatsuba multiplication: the idea
• Consider F_{2^2} with the primitive polynomial f = x^2 + x + 1.
• We want to compute C = AB.
• Rewrite A as A0·x + A1 and B as B0·x + B1.
• The product is

  C = A0·B0·x^2 + (A0·B1 + A1·B0)·x + A1·B1.

• Reduction modulo f gives

  C = (A0·B0 + A0·B1 + A1·B0)·x + A1·B1 + A0·B0.

• This last expression can be rewritten as

  C = ((A0 + A1)(B0 + B1) + A1·B1)·x + A1·B1 + A0·B0.

Thus the cost is 3 multiplications and 4 additions over F_2.
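The three-multiplication formula can be checked with GF(2) matrices in Python. A sketch with hypothetical helper names, using schoolbook GF(2) products where the implementation uses Strassen-based multiplication; matrices are 0/1 lists and x^2 = x + 1:

```python
def add2(X, Y):
    """Entrywise addition of GF(2) matrices (lists of 0/1 rows)."""
    return [[a ^ b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mul2(X, Y):
    """Schoolbook product of GF(2) matrices."""
    n = len(Y)
    return [[sum(rx[i] & Y[i][j] for i in range(n)) & 1
             for j in range(len(Y[0]))] for rx in X]

def karatsuba_gf4(A0, A1, B0, B1):
    """C = A*B over GF(4) with A = A0*x + A1 and B = B0*x + B1:
    three GF(2) matrix products instead of four."""
    P0 = mul2(A0, B0)                          # A0*B0
    P1 = mul2(A1, B1)                          # A1*B1
    P2 = mul2(add2(A0, A1), add2(B0, B1))      # (A0+A1)*(B0+B1)
    C0 = add2(P2, P1)   # coefficient of x: A0*B0 + A0*B1 + A1*B0
    C1 = add2(P1, P0)   # constant term:    A1*B1 + A0*B0
    return C0, C1
```

With 1 × 1 matrices this reduces to scalar GF(4) arithmetic, e.g. x · x = x + 1.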
Is it worth it?
e    Trav+Stras   # of M4RI mults   schoolbook   [Mon05]   M4RIE
2      0.624s           6.47             4          3       3.09
3      1.324s          13.73             9          6       6.12
4      1.480s          15.35            16          9       9.58
5      2.776s          28.79            25         13        –
6      3.180s          32.98            36         17        –
7      3.992s          41.41            49         22        –
8      6.304s          65.39            64         27        –
9     26.737s         277.35            81         34        –
10    34.526s         358.15           100         39        –

Table: Multiplication of 4,000 × 4,000 matrices over F_{2^e}.
Peter L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Trans. on Computers, 53(3):362–369, 2005.
Results: Reduced Row Echelon Forms I
Putting it all together:
• PLE and TRSM are reduced to asymptotically fast matrix multiplication (Karatsuba or Travolta+Strassen).
• Both the PLE and the TRSM base case use Travolta tables, though these are not fully optimised yet.
• Note: there is some overhead because of representation switching (mzed_t vs. mzd_slice_t).
Results: Reduced Row Echelon Forms II
[Six plots: wall time and normalised cycle count cc/(2·m·n·r^0.807) for elimination with e = 2, 3, 4, comparing the block iterative (travolta) and block recursive (ple) implementations, for dimensions 1,000 to 9,000]

Figure: Travolta O(n^3) vs. PLE O(n^ω) on 2.66 GHz Intel i7, 4 GB RAM
Results: Reduced Row Echelon Forms III
[Plot: Elimination, Magma vs. Sage: execution times and log2 of their ratio, for dimensions 100 to 2,500]

Figure: 2.66 GHz Intel i7, 4 GB RAM
Results: Reduced Row Echelon Forms IV
e   Magma 2.15-10   GAP 4.4.12   M4RIE 0x6b24b839a46f
2       6.040        162.658          3.310
3      14.470        442.522          5.332
4      60.370        502.672          6.330

Table: Elimination of 10,000 × 10,000 matrices on 2.66 GHz Intel i7
Thank you
• website: http://m4ri.sagemath.org
• code:
  • http://bitbucket.org/malb/m4ri
  • http://bitbucket.org/malb/m4rie
• papers:
  • http://arxiv.org/abs/0811.1714
  • https://bitbucket.org/cpernet/pluqm4ri
  • https://bitbucket.org/malb/m4rie-paper
V. Arlazarov, E. Dinic, M. Kronrod, and I. Faradzev. On economical construction of the transitive closure of a directed graph. Dokl. Akad. Nauk., 194(11), 1970. (In Russian; English translation in Soviet Math. Dokl.)

Gregory V. Bard. Algorithms for Solving Linear and Polynomial Systems of Equations over Finite Fields with Applications to Cryptanalysis. PhD thesis, University of Maryland, 2007.

O. Ibarra, S. Moran, and R. Hui. A generalization of the fast LUP matrix decomposition algorithm and applications. Journal of Algorithms, 3:45–56, 1982.

Peter L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Trans. on Computers, 53(3):362–369, 2005.

Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.