
Fast Bilinear Algorithms for Convolution

Caleb Ju

CS598EVS

March 5, 2020

Convolution

The discrete convolution between vectors $f \in \mathbb{R}^r$ and $g \in \mathbb{R}^n$ is

$$y_k = \sum_i f_i g_{k-i}.$$

View it as a matrix–vector product between a matrix $T$ and the vector $f$,

$$y_k = \sum_i g_{k-i} f_i = \sum_j T_{k,j} f_j, \quad \text{i.e. } y = Tf.$$

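What does the matrix $T$ look like?

Denote it as $T_{\langle g,r\rangle}$, which is a Toeplitz matrix, where $T_{\langle g,r\rangle} \in \mathbb{R}^{(n+r-1)\times r}$:

$$T_{\langle g,r\rangle} = \begin{bmatrix} g_0 & & 0 \\ \vdots & \ddots & \\ g_{n-1} & & g_0 \\ & \ddots & \vdots \\ 0 & & g_{n-1} \end{bmatrix}.$$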
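As a small illustration of the Toeplitz view, the sketch below builds $T_{\langle g,r\rangle}$ entry by entry from the rule $T_{k,j} = g_{k-j}$ and applies it to $f$. The helper names `toeplitz_conv_matrix` and `matvec` are hypothetical, and plain Python lists stand in for vectors and matrices.

```python
def toeplitz_conv_matrix(g, r):
    """Return T_<g,r>: an (n + r - 1) x r matrix with T[k][j] = g[k - j]."""
    n = len(g)
    return [[g[k - j] if 0 <= k - j < n else 0 for j in range(r)]
            for k in range(n + r - 1)]

def matvec(T, f):
    """Dense matrix-vector product T f."""
    return [sum(t * x for t, x in zip(row, f)) for row in T]

g = [1, 2, 3]   # n = 3
f = [4, 5]      # r = 2
T = toeplitz_conv_matrix(g, len(f))
y = matvec(T, f)   # coefficients of (1 + 2x + 3x^2)(4 + 5x)
```

Each column of `T` is a shifted copy of `g`, which is why the product `T f` sums shifted, scaled copies of `g`.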

Convolution and its Variants

Linear convolution is

$$y_k = \sum_{i=\max(0,\,k-n+1)}^{\min(k,\,r-1)} f_i g_{k-i}.$$

The bounds ensure that we never index past either end of the vector $g$.

We also have cyclic convolution,

$$y_k = \sum_{i=0}^{r-1} f_i g_{(k-i) \bmod n}.$$

We can also derive correlation,

$$y_k = \sum_{i=0}^{r-1} f_i g_{k+i}.$$
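The three variants can be written directly from their index formulas. This is a minimal sketch with hypothetical function names; the correlation helper assumes only the outputs whose indices stay inside $g$ are wanted, which gives $n - r + 1$ values.

```python
def linear_convolution(f, g):
    """y_k = sum_i f_i g_{k-i}, with bounds that keep k - i inside g."""
    r, n = len(f), len(g)
    y = [0] * (n + r - 1)
    for k in range(n + r - 1):
        for i in range(max(0, k - n + 1), min(k, r - 1) + 1):
            y[k] += f[i] * g[k - i]
    return y

def cyclic_convolution(f, g):
    """y_k = sum_i f_i g_{(k-i) mod n}; indices wrap around g."""
    n = len(g)
    return [sum(f[i] * g[(k - i) % n] for i in range(len(f))) for k in range(n)]

def correlation(f, g):
    """y_k = sum_i f_i g_{k+i}, for every k with k + r - 1 < n."""
    r, n = len(f), len(g)
    return [sum(f[i] * g[k + i] for i in range(r)) for k in range(n - r + 1)]

f, g = [4, 5], [1, 2, 3]
lin = linear_convolution(f, g)
cyc = cyclic_convolution(f, g)
cor = correlation(f, g)
```

The cyclic result equals the linear one with the overflow entry folded back onto index 0.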

Applications of Convolution

String matching (Clifford and Clifford, 2007)

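Let the pattern be $p \in \Sigma^m$ and the text be $t \in \Sigma^n$. The pattern matches at position $i$ exactly when the following sum is zero,

$$\sum_{j=0}^{m-1} (p_j - t_{i+j})^2 = \sum_{j=0}^{m-1} \left(p_j^2 - 2 p_j t_{i+j} + t_{i+j}^2\right), \quad \forall\, 0 \le i \le n - m.$$

The cross term $\sum_j p_j t_{i+j}$ is a correlation of the pattern with the text.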
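The expansion above can be checked on a toy input. The sketch below evaluates the three terms separately for each offset and reports the offsets where the total is zero; the cross term is computed with a direct loop here, whereas a fast method would obtain it from a correlation algorithm. Mapping characters to integers with `ord` is an assumption made for illustration.

```python
def match_positions(pattern, text):
    """Return all offsets i where sum_j (p_j - t_{i+j})^2 == 0."""
    p = [ord(ch) for ch in pattern]
    t = [ord(ch) for ch in text]
    m, n = len(p), len(t)
    p_sq = sum(x * x for x in p)                          # sum_j p_j^2
    hits = []
    for i in range(n - m + 1):
        cross = sum(p[j] * t[i + j] for j in range(m))    # correlation term
        t_sq = sum(t[i + j] ** 2 for j in range(m))       # sum_j t_{i+j}^2
        if p_sq - 2 * cross + t_sq == 0:
            hits.append(i)
    return hits

positions = match_positions("aba", "cababa")
```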

Image Processing (Convolutional Neural Network)

Given $K$ filters of size $r \times r$ in tensor $F$ and $N$ input images of size $n \times n$ in tensor $G$, we seek to sum over all $H$ channels,

$$y_{ikxy} = \sum_{c=1}^{H} \sum_{v=1}^{r} \sum_{u=1}^{r} f_{kcuv} \cdot g_{i,c,x+u,y+v}.$$

Other applications: cosmological simulation, solutions to partial differential equations, signal processing, integer multiplication, . . .

Fast Algorithms for Computing Convolution

A direct computation has $O(n^2)$ cost.

Consider complex multiplication,

$$x \times y = (a + bi)\times(c + di) = (ac - bd) + (ad + bc)i = (ac - bd) + \big(ac + bd - (a-b)(c-d)\big)i,$$

which needs three real multiplications instead of four.

Karatsuba's algorithm applies this recursively for $O(n^{\log_2 3})$ cost.

Convolution can also be computed with the discrete Fourier transform,

$$a * b = \mathrm{IDFT}\big(\mathrm{DFT}(a) \odot \mathrm{DFT}(b)\big).$$

Using the fast Fourier transform (FFT), we can compute linear convolution in $O(n \log n)$ time.

Other algorithms: Winograd's minimal filtering method, matrix multiplication, fast symmetric multiplication
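To make the recursive use of the three-multiplication identity concrete, here is a minimal sketch of Karatsuba's algorithm on coefficient lists. It assumes both inputs have the same power-of-two length, and it forms the cross term as $ac + bd - (a-b)(c-d)$ to mirror the identity above; the function name is hypothetical.

```python
def karatsuba(a, b):
    """Multiply two coefficient lists of equal length n (n a power of two)."""
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a_lo, a_hi = a[:h], a[h:]
    b_lo, b_hi = b[:h], b[h:]
    lo = karatsuba(a_lo, b_lo)                                # a_lo * b_lo
    hi = karatsuba(a_hi, b_hi)                                # a_hi * b_hi
    diff = karatsuba([x - y for x, y in zip(a_lo, a_hi)],
                     [x - y for x, y in zip(b_lo, b_hi)])     # (a_lo - a_hi)(b_lo - b_hi)
    cross = [l + u - d for l, u, d in zip(lo, hi, diff)]      # a_lo*b_hi + a_hi*b_lo
    out = [0] * (2 * n - 1)
    for i, v in enumerate(lo):
        out[i] += v
    for i, v in enumerate(cross):
        out[i + h] += v
    for i, v in enumerate(hi):
        out[i + 2 * h] += v
    return out

product = karatsuba([1, 2, 3, 4], [5, 6, 7, 8])
```

Each level makes three recursive calls on half-size inputs, which is the source of the $O(n^{\log_2 3})$ cost.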

Derivation of Bilinear Algorithms

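Recall that a bilinear algorithm is

$$c = F^{(C)}\big((F^{(A)T}a) \odot (F^{(B)T}b)\big), \qquad c_k = \sum_i \sum_j t_{ijk}\, a_i b_j.$$

The discrete linear convolution of $f$ and $g$ is given by

$$y_k = \sum_{i=\max(0,\,k-n+1)}^{\min(k,\,r-1)} f_i \cdot g_{k-i} = \sum_{i,j} t_{ijk}\, f_i g_j.$$

The tensor $T$ is defined by

$$t_{ijk} = \begin{cases} 1 & : i + j - k = 0 \\ 0 & : \text{otherwise.} \end{cases}$$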
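The tensor definition can be exercised directly: the sketch below materializes $t_{ijk}$ as a nested list and contracts it with $f$ and $g$. The function names are hypothetical, and the dense tensor is only for illustration since it is mostly zeros.

```python
def convolution_tensor(r, n):
    """t[i][j][k] = 1 if i + j - k == 0 else 0, for an r-by-n linear convolution."""
    return [[[1 if i + j == k else 0 for k in range(r + n - 1)]
             for j in range(n)]
            for i in range(r)]

def contract(t, f, g):
    """y_k = sum_{i,j} t_ijk f_i g_j."""
    r, n = len(f), len(g)
    return [sum(t[i][j][k] * f[i] * g[j] for i in range(r) for j in range(n))
            for k in range(r + n - 1)]

t = convolution_tensor(2, 3)
y = contract(t, [4, 5], [1, 2, 3])
```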

Convolution is Multiplication

How can we derive fast bilinear algorithms for convolution?

Define polynomials $a(x) = a_0 + a_1 x + \cdots + a_{r-1}x^{r-1}$ and $b(x) = b_0 + b_1 x + \cdots + b_{n-1}x^{n-1}$. Their product is

$$c(x) = a(x)b(x) = \sum_{k=0}^{r+n-2}\ \sum_{i=\max(0,\,k-n+1)}^{\min(k,\,r-1)} (a_i \cdot b_{k-i})\, x^k.$$

The coefficients of $c(x) = c_0 + c_1 x + \cdots + c_{r+n-2}x^{r+n-2}$ are determined by linear convolution.

Convolution as Multiplication

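How can we compute $c(x)$? Suppose we know the values $c(x_i)$ at nodes $x_0, \ldots, x_{R-1}$, where $R = \deg c(x) + 1$. Let the coefficients of $c(x)$ be the vector $c$. Each value is an inner product,

$$c(x_i) = \sum_{k=0}^{R-1} x_i^k c_k = V_{i,:}\,c, \quad \text{where } V = \begin{bmatrix} x_0^0 & \cdots & x_0^{R-1} \\ \vdots & & \vdots \\ x_{R-1}^0 & \cdots & x_{R-1}^{R-1} \end{bmatrix} \in \mathbb{C}^{R\times R},$$

so we can get $c$ by solving the linear system with $V$.

How can we compute $c(x_i)$? Recall $c(x) = a(x)b(x)$. Therefore,

$$c(x_i) = a(x_i)\,b(x_i).$$

How can we compute $a(x_i)$? Let $a$ be the coefficients of polynomial $a(x)$ (and $b$ for $b(x)$). Then computing $a(x_i)$ is an inner product,

$$a(x_i) = \sum_{k=0}^{r-1} x_i^k a_k = \tilde V_{i,:}\,a, \quad \text{where } \tilde V \text{ is the first } r \text{ columns of } V.$$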

Toom-Cook Algorithm

1. Evaluate $\alpha = Va$ and $\beta = Vb$
2. Compute the products $\nu = \alpha \odot \beta$
3. Interpolate by solving the linear system $Vc = \nu$

We can express this three-step computation as the following bilinear algorithm,

$$c = V^{-1}_{(2n-1\times 2n-1)}\big(V_{(2n-1\times n)}\,a \odot V_{(2n-1\times n)}\,b\big),$$

where $V$ is a Vandermonde matrix,

$$V = \begin{bmatrix} x_0^0 & \cdots & x_0^{R-1} \\ \vdots & & \vdots \\ x_{R-1}^0 & \cdots & x_{R-1}^{R-1} \end{bmatrix}.$$
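The three steps can be sketched with exact rational arithmetic so that conditioning does not obscure the structure. This is an illustrative version, not an optimized one: the interpolation step solves $Vc = \nu$ by Gauss-Jordan elimination, and the helper names and the node choice `[0, 1, -1, 2]` are assumptions.

```python
from fractions import Fraction

def vandermonde(nodes, cols):
    """Rows [x^0, x^1, ..., x^(cols-1)] for each node x."""
    return [[Fraction(x) ** k for k in range(cols)] for x in nodes]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def solve(M, rhs):
    """Solve M c = rhs by Gauss-Jordan elimination over exact fractions."""
    size = len(M)
    aug = [list(row) + [Fraction(b)] for row, b in zip(M, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def toom_cook(a, b, nodes):
    R = len(a) + len(b) - 1                  # number of product coefficients
    assert len(set(nodes)) == len(nodes) == R, "need R distinct nodes"
    alpha = matvec(vandermonde(nodes, len(a)), a)   # 1. evaluate a
    beta = matvec(vandermonde(nodes, len(b)), b)    #    and b at the nodes
    nu = [x * y for x, y in zip(alpha, beta)]       # 2. pointwise products
    return solve(vandermonde(nodes, R), nu)         # 3. interpolate

c = toom_cook([4, 5], [1, 2, 3], nodes=[0, 1, -1, 2])
```

Only step 2 multiplies data by data, so the number of nodes $R$ is the bilinear rank of the algorithm.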

Discrete Fourier Transform

Figure: (a) Chebyshev nodes; (b) equispaced nodes on the unit circle.

Let $\omega_{(n)} = \exp(-2\pi i/n)$, a primitive $n$th root of unity. Set the nodes of $V$ as $[\omega_{(n)}^0, \omega_{(n)}, \ldots, \omega_{(n)}^{n-1}]$. Then $V$ is the Fourier matrix (and $V^{-1}$ is the inverse Fourier matrix), leading to the bilinear algorithm,

$$c = F^{-1}_{(2n-1\times 2n-1)}\big(F_{(2n-1\times n)}\,a \odot F_{(2n-1\times n)}\,b\big).$$
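The same evaluate, multiply, interpolate pattern with roots of unity as nodes can be sketched as follows. The transform here is a dense $O(R^2)$ DFT written from the definition, not an FFT, and the inputs are zero-padded to length $R = \text{len}(a) + \text{len}(b) - 1$ so that the cyclic result coincides with the linear convolution; the function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Dense DFT from the definition, with omega = exp(-2*pi*i/R)."""
    R = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / R) for m in range(R))
           for k in range(R)]
    return [v / R for v in out] if inverse else out

def dft_convolution(a, b):
    """Linear convolution as IDFT(DFT(a) * DFT(b)) on zero-padded inputs."""
    R = len(a) + len(b) - 1
    a_pad = list(a) + [0] * (R - len(a))
    b_pad = list(b) + [0] * (R - len(b))
    prod = [x * y for x, y in zip(dft(a_pad), dft(b_pad))]
    return [v.real for v in dft(prod, inverse=True)]

y = dft_convolution([4, 5], [1, 2, 3])
```

The output is only approximately `[4, 13, 22, 15]` because of floating-point rounding.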

Alternative Bilinear Algorithms

The Toom-Cook method and the fast Fourier transform work well for small and large convolution problems, respectively.

• The Toom-Cook method is numerically inaccurate for convolutions of size greater than four

• The FFT has significant hidden constants

Now we examine alternative algorithms that offer trade-offs between computational efficiency and numerical accuracy.

Modular Polynomial Multiplication

Let's revisit convolution as a polynomial multiplication problem,

$$c(x) = a(x)b(x) = \sum_{k=0}^{2n-2}\ \sum_{i=\max(0,\,k-n+1)}^{\min(k,\,n-1)} (a_i \cdot b_{k-i})\, x^k.$$

What is the remainder of $c(x)$ divided by a polynomial $M$ where $\deg M > \deg c(x)$?

$$c(x) = r(x) \equiv c(x) \pmod{M}.$$

What if we use a polynomial $m$ where $\deg m \le \deg c(x)$?

$$c(x) \ne r(x) \equiv c(x) \pmod{m}.$$

Why use modular polynomial multiplication? Reducing modulo $m$ decreases the size of the inputs.

$$c(x) \equiv a(x)b(x) \equiv \big(a(x) \bmod m\big)\big(b(x) \bmod m\big) \pmod{m}.$$

However, this leads to an answer that is only congruent to the actual product, i.e. not the solution we actually want.

Can we compute the polynomial multiplication using modular polynomial multiplication?

Yes, using the Chinese Remainder Theorem.

Chinese Remainder Theorem

Theorem. Let $m^{(1)}, \ldots, m^{(k)}$ be pairwise coprime integers and $M = \prod_{i=1}^{k} m^{(i)}$. Given remainders $r^{(1)}, \ldots, r^{(k)}$ where $0 \le r^{(i)} < m^{(i)}$, the Chinese Remainder Theorem (CRT) asserts that there exists a unique integer $x$ (modulo $M$) such that

$$x \equiv r^{(i)} \pmod{m^{(i)}} \quad \forall i \in [k].$$

Further, this mapping between an integer and its remainders is a ring isomorphism (structure preserving).

Example

Let $m^{(1)} = 3$, $m^{(2)} = 4$, and $M = 12$. Let $x \equiv 7 \pmod{M}$, with remainders

$$x \equiv r^{(1)} \equiv 1 \pmod{3} \quad\text{and}\quad x \equiv r^{(2)} \equiv 3 \pmod{4}.$$

Chinese Remainder Theorem: Example

Let $x \equiv 7 \pmod{12}$. We seek to compute $(7 \times 4) \pmod{12}$.

Figure: Ring Isomorphism

$$x \equiv r^{(1)} \equiv 1 \pmod{3} \quad\text{and}\quad x \equiv r^{(2)} \equiv 3 \pmod{4}.$$

$$r'^{(1)} \equiv r^{(1)} \times 4 \equiv 4 \equiv 1 \pmod{3} \quad\text{and}\quad r'^{(2)} \equiv r^{(2)} \times 4 \equiv 0 \pmod{4}.$$

$y \equiv 28 \equiv 4 \pmod{12}$ satisfies $r'^{(1)} \equiv 1 \pmod{3}$ and $r'^{(2)} \equiv 0 \pmod{4}$.

Akin to interpolation, modular polynomial multiplication can be computed via the following steps:

• Compute the remainders of $a(x)$ and $b(x)$ for a series of coprime divisors $m^{(i)}$

• Multiply the corresponding remainders (can use normal polynomial multiplication)

• Map the remainders back to their (unique) polynomial

How do we recover the polynomial from its remainders? The Chinese Remainder Theorem also tells us how to do so.

Chinese Remainder Theorem (part 2)

Theorem. Recall the polynomial divisors $m^{(i)}$ are coprime, $M = \prod_i m^{(i)}$, and we have a set of remainders $r^{(i)}$. To solve for $x$, we compute

$$x = \Big(\sum_{i=1}^{k} r^{(i)} M^{(i)} N^{(i)}\Big) \bmod M,$$

where $M^{(i)} = M/m^{(i)}$, and $N^{(i)}$ and $n^{(i)}$ are any polynomials satisfying Bézout's identity,

$$M^{(i)}N^{(i)} + m^{(i)}n^{(i)} = 1.$$

Chinese Remainder Theorem (part 2): Example

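Setup: coprime polynomial divisors $m^{(i)}$, where $M = \prod_i m^{(i)}$ and $M^{(i)} = M/m^{(i)}$. Let $N^{(i)}, n^{(i)}$ be such that $M^{(i)}N^{(i)} + m^{(i)}n^{(i)} = 1$ for all $i$. The solution is

$$x = \Big(\sum_{i=1}^{k} r^{(i)} M^{(i)} N^{(i)}\Big) \bmod M.$$

Compute the product $y = (4 \times 7) \pmod{12}$.

We have $M^{(1)} = 4$, $m^{(1)} = 3$, $M^{(2)} = 3$, $m^{(2)} = 4$, and $M = 12$, with remainders $r'^{(1)} \equiv 1 \pmod{3}$ and $r'^{(2)} \equiv 0 \pmod{4}$.

The identities $4N^{(1)} + 3n^{(1)} = 1$ and $3N^{(2)} + 4n^{(2)} = 1$ are satisfied with $N^{(1)} = 1$, $n^{(1)} = -1$, $N^{(2)} = -1$, and $n^{(2)} = 1$.

So we have

$$\sum_i r'^{(i)} M^{(i)} N^{(i)} = 1(4)(1) + 0(3)(-1) = 4 \equiv 28 \pmod{12}.$$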
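The integer example can be reproduced in a few lines. This sketch uses the extended Euclidean algorithm to obtain the Bézout coefficients $N^{(i)}$; the function names `extended_gcd` and `crt` are hypothetical.

```python
def extended_gcd(a, b):
    """Return (g, u, v) with a*u + b*v == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, u, v = extended_gcd(b, a % b)
    return g, v, u - (a // b) * v

def crt(remainders, moduli):
    """Recover x mod M from its remainders modulo pairwise coprime moduli."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r_i, m_i in zip(remainders, moduli):
        M_i = M // m_i
        g, N_i, _ = extended_gcd(M_i, m_i)   # M_i*N_i + m_i*n_i == 1
        assert g == 1, "moduli must be pairwise coprime"
        x += r_i * M_i * N_i
    return x % M

moduli = [3, 4]
r7 = [7 % m for m in moduli]                              # remainders of 7
r4 = [4 % m for m in moduli]                              # remainders of 4
r_prod = [(u * v) % m for u, v, m in zip(r7, r4, moduli)]  # multiply per modulus
y = crt(r_prod, moduli)
```

Multiplying remainder-wise and then reconstructing gives the same value as reducing the full product $28$ modulo $12$.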

Chinese Remainder Theorem (part 2)

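$$x = \Big(\sum_{i=1}^{k} r^{(i)} M^{(i)} N^{(i)}\Big) \bmod M$$

Why does this work?

For a fixed $i$, every $M^{(j)}$ with $j \ne i$ is divisible by $m^{(i)}$, so those terms vanish modulo $m^{(i)}$. Since $M^{(i)}N^{(i)} = 1 - m^{(i)}n^{(i)}$,

$$x = \sum_j r^{(j)} M^{(j)} N^{(j)} \equiv r^{(i)} M^{(i)} N^{(i)} = r^{(i)}\big(1 - m^{(i)}n^{(i)}\big) \equiv r^{(i)} \pmod{m^{(i)}}.$$

The Chinese Remainder Theorem tells us there is a bijection between the remainders and the original polynomial. Therefore, any polynomial satisfying the remainder equivalences is equivalent to the original polynomial (modulo $M$)!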


The Chinese Remainder Theorem required that $M^{(i)}N^{(i)} + m^{(i)}n^{(i)} = 1$ for all $i$. Do such $N^{(i)}, n^{(i)}$ even exist?

Theorem (Bézout's identity)

Let $p$ and $q$ be coprime polynomials (they do not share any roots). Then there exist polynomials $u$ and $v$ such that $pu + qv = 1$.

Since $M^{(i)}$ and $m^{(i)}$ are coprime, there exist polynomials $N^{(i)}$ and $n^{(i)}$ such that

$$M^{(i)}N^{(i)} + m^{(i)}n^{(i)} = 1.$$

Winograd Convolution Algorithm

Let $f \in \mathbb{R}^r$ and $g \in \mathbb{R}^n$ be the vectors we seek to convolve. Recall that we first compute the remainders,

$$r^{(i)}_{(f)} \equiv f \pmod{m^{(i)}} \quad\text{and}\quad r^{(i)}_{(g)} \equiv g \pmod{m^{(i)}}.$$

Next, we compute the product of the remainders using a convolution algorithm,

$$r^{(i)} = \big(r^{(i)}_{(f)} * r^{(i)}_{(g)}\big) \pmod{m^{(i)}}.$$

We use the Chinese Remainder Theorem to recover the unique solution,

$$y = \Big(\sum_i r^{(i)} * M^{(i)} * N^{(i)}\Big) \pmod{M},$$

where $M^{(i)} = M/m^{(i)}$ and $M^{(i)}N^{(i)} + m^{(i)}n^{(i)} = 1$.
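The steps above can be sketched end to end with exact rational polynomial arithmetic. This is an illustrative reference version, not the bilinear-matrix form of the algorithm: polynomials are coefficient lists (lowest degree first), Bézout coefficients come from the extended Euclidean algorithm, and all helper names and the divisor choice $x$, $x - 1$, $x^2 + 1$ are assumptions.

```python
from fractions import Fraction

def trim(p):
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p

def add(p, q):
    size = max(len(p), len(q))
    p = list(p) + [0] * (size - len(p))
    q = list(q) + [0] * (size - len(q))
    return trim([x + y for x, y in zip(p, q)])

def mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, x in enumerate(p):
        for j, y in enumerate(q):
            out[i + j] += x * y
    return trim(out)

def divmod_poly(p, m):
    """Return (quotient, remainder) of p divided by m."""
    p, m = trim([Fraction(x) for x in p]), trim([Fraction(x) for x in m])
    q = [Fraction(0)] * max(len(p) - len(m) + 1, 1)
    while len(p) >= len(m) and p != [0]:
        shift = len(p) - len(m)
        coef = p[-1] / m[-1]
        q[shift] = coef
        p = trim([x - coef * m[i - shift] if i >= shift else x
                  for i, x in enumerate(p)])
    return trim(q), p

def bezout(p, q):
    """Return (u, v) with p*u + q*v == 1 for coprime polynomials p, q."""
    r0, r1 = trim([Fraction(x) for x in p]), trim([Fraction(x) for x in q])
    u0, u1 = [Fraction(1)], [Fraction(0)]
    v0, v1 = [Fraction(0)], [Fraction(1)]
    while r1 != [0]:
        quo, rem = divmod_poly(r0, r1)
        neg = [-x for x in quo]
        r0, r1 = r1, rem
        u0, u1 = u1, add(u0, mul(neg, u1))
        v0, v1 = v1, add(v0, mul(neg, v1))
    assert len(r0) == 1, "inputs are not coprime"
    return [x / r0[0] for x in u0], [x / r0[0] for x in v0]

def winograd_convolution(f, g, divisors):
    M = [Fraction(1)]
    for m in divisors:
        M = mul(M, m)
    assert len(M) - 1 >= len(f) + len(g) - 1, "deg M must exceed deg(f*g)"
    y = [Fraction(0)]
    for m in divisors:
        rf = divmod_poly(f, m)[1]                 # f mod m
        rg = divmod_poly(g, m)[1]                 # g mod m
        r = divmod_poly(mul(rf, rg), m)[1]        # small product, reduced mod m
        M_i = divmod_poly(M, m)[0]                # M / m
        N_i, _ = bezout(M_i, m)                   # M_i*N_i + m*n_i == 1
        y = add(y, mul(mul(r, M_i), N_i))
    y = divmod_poly(y, M)[1]
    return y + [Fraction(0)] * (len(f) + len(g) - 1 - len(y))

divisors = [[0, 1], [-1, 1], [1, 0, 1]]           # x, x - 1, x^2 + 1
y = winograd_convolution([4, 5], [1, 2, 3], divisors)
```

With only degree-one divisors $x - x_i$, each remainder is the value at $x_i$ and the procedure reduces to evaluation and interpolation as in Toom-Cook.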

Toom-Cook vs. Winograd Convolution Algorithm

Toom-Cook

1. Evaluate at a set of unique integer points

2. Compute the element-wise multiplication (these are evaluated points of the product)

3. Interpolate to recover the product polynomial

Winograd Convolution Algorithm

1. Evaluate the remainder with the set of coprime polynomial divisors $m^{(i)}$

2. Compute the element-wise polynomial multiplication (via convolution)

3. Use the CRT to recover the product polynomial modulo $M$

Evaluate the Remainder of a Polynomial Division

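Denote the coefficients of an arbitrary polynomial $p$ as the vector $\mathbf{p}$, e.g. $p = 3x^2 - 1$ is represented as $\mathbf{p} = \begin{bmatrix} -1 & 0 & 3 \end{bmatrix}$. Let $p$ and $m$ be polynomials where $\deg(m) \le \deg(p)$.

Modulo Operation

Lemma. Let $r = p \pmod{m}$, with $d = \deg p$. There exists a matrix $X_{\langle m,d\rangle}$ such that $\mathbf{r} = X_{\langle m,d\rangle}\,\mathbf{p}$.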


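Proof. We know $p = mq + r$ for some polynomial $q$. In matrix form,

$$T_{\langle m,\,\deg q+1\rangle}\,q + r = \begin{bmatrix} m_0 & & \\ \vdots & \ddots & \\ m_{\deg m} & & m_0 \\ & \ddots & \vdots \\ & & m_{\deg m} \end{bmatrix} q + r = \begin{bmatrix} U \\ L \end{bmatrix} q + \begin{bmatrix} r \\ 0 \end{bmatrix} = \begin{bmatrix} p^{(A)} \\ p^{(B)} \end{bmatrix},$$

where $U$ holds the first $\deg m$ rows of the Toeplitz matrix and $L$ the remaining rows. The bottom block gives $q = L^{-1}p^{(B)}$, and substituting into the top block gives $r = -UL^{-1}p^{(B)} + p^{(A)}$.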
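One way to see that such a matrix exists is to build it column by column: since reduction modulo $m$ is linear, column $j$ of $X_{\langle m,d\rangle}$ holds the coefficients of $x^j \bmod m$. The sketch below takes this route rather than forming $-UL^{-1}$ explicitly; the helper names are hypothetical.

```python
from fractions import Fraction

def poly_mod(p, m):
    """Coefficients (length deg m) of p mod m, lowest degree first."""
    dm = len(m) - 1
    p = [Fraction(x) for x in p] + [Fraction(0)] * max(0, dm - len(p))
    for top in range(len(p) - 1, dm - 1, -1):
        coef = p[top] / m[-1]
        for i in range(dm + 1):
            p[top - dm + i] -= coef * m[i]   # cancel the leading term
    return p[:dm]

def remainder_matrix(m, d):
    """X_<m,d>: column j is x^j mod m, for j = 0..d."""
    dm = len(m) - 1
    cols = [poly_mod([0] * j + [1], m) for j in range(d + 1)]
    return [[cols[j][i] for j in range(d + 1)] for i in range(dm)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

m = [1, 0, 1]            # m(x) = x^2 + 1
p = [-1, 0, 3]           # p(x) = 3x^2 - 1
X = remainder_matrix(m, d=2)
r = matvec(X, p)         # 3x^2 - 1 = 3(x^2 + 1) - 4
```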

Solve Bezout’s identity

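Lemma. Write $MN + mn = 1$ as

$$\underbrace{\begin{bmatrix} T_{\langle M,\deg m-1\rangle} & T_{\langle m,\deg M-1\rangle} \end{bmatrix}}_{A} \begin{bmatrix} N \\ n \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots \end{bmatrix}^T.$$

Proof. Show that the matrix $A$ is invertible.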

Winograd Convolution Algorithm

Theorem (Winograd Convolution Algorithm)

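Given coprime polynomials $m^{(1)}, m^{(2)}$ such that $M = m^{(1)}m^{(2)}$ and $\deg M = n + r - 1$, and bilinear algorithms $(A^{(i)}, B^{(i)}, C^{(i)})$ for a convolution of dimension $\deg m^{(i)}$ for $i \in \{1, 2\}$, then $(A, B, C)$ is a bilinear algorithm for the convolution of vectors of dimension $r$ and $n$, where

$$A = \begin{bmatrix} X^T_{\langle m^{(1)},r-1\rangle}A^{(1)}, & X^T_{\langle m^{(2)},r-1\rangle}A^{(2)} \end{bmatrix},$$

$$B = \begin{bmatrix} X^T_{\langle m^{(1)},n-1\rangle}B^{(1)}, & X^T_{\langle m^{(2)},n-1\rangle}B^{(2)} \end{bmatrix}, \text{ and}$$

$$C = \begin{bmatrix} \tilde C^{(1)}, & \tilde C^{(2)} \end{bmatrix},$$

with $\tilde C^{(i)} = X_{\langle M,\deg M+\deg m^{(i)}-2\rangle}\, T_{\langle e^{(i)},\deg m^{(i)}\rangle}\, X_{\langle m^{(i)},2\deg m^{(i)}-1\rangle}\, C^{(i)}$, and the polynomial $e^{(i)} = M^{(i)}N^{(i)} \bmod M$ is defined from Bézout's identity.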

Rank of Winograd Convolution Algorithm

Given $f \in \mathbb{R}^r$ and $g \in \mathbb{R}^n$, the solution is $y \in \mathbb{R}^{r+n-1}$. Therefore, select $M$ to be a polynomial of degree $n + r - 1$.

Remark. The bilinear rank $R$ of the Winograd convolution algorithm with polynomial divisors $m^{(1)}, \ldots, m^{(k)}$ is

$$\sum_{i=1}^{k} \big(2\deg m^{(i)} - 1\big).$$

Observation. Increasing the bilinear rank of the Winograd convolution by using at least one superlinear polynomial divisor (degree greater than one) improves the numerical accuracy of convolution.

Nested and Multidimensional Convolution

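Given $F, G \in \mathbb{R}^{n\times n}$, a 2D convolution is defined as

$$y_{xy} = \sum_{i=0}^{r}\sum_{j=0}^{r} f_{ij}\, g_{x+i,\,y+j} = \sum_i \sum_j f_{ij}\, g_{uv}, \quad u = x + i,\ v = y + j.$$

We can nest the tensors,

$$y_{xy} = \sum_{i=0}^{r}\sum_{j=0}^{r}\sum_{u=0}^{n}\sum_{v=0}^{n} t^{(A)}_{ixu}\, t^{(B)}_{jyv}\, f_{ij}\, g_{uv}.$$

Equivalently, we have the following nested bilinear algorithm,

$$\mathrm{vec}(Y) = (C \otimes C)\Big[\big((A \otimes A)^T \mathrm{vec}(F)\big) \odot \big((B \otimes B)^T \mathrm{vec}(G)\big)\Big],$$

or otherwise,

$$Y = C\Big[(A^T F A) \odot (B^T G B)\Big]C^T.$$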
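The matrix form $Y = C[(A^TFA) \odot (B^TGB)]C^T$ can be checked numerically on a tiny case. The sketch below assumes a rank-3, Karatsuba-style 1D algorithm $(A, B, C)$ for a 2-by-2 linear convolution, chosen here for illustration, and compares the nested result against a direct 2D linear convolution; all names are hypothetical.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def hadamard(X, Y):
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Assumed 1D algorithm: y = C((A^T f) * (B^T g)) for f, g of length 2.
A = [[1, 1, 0], [0, 1, 1]]                 # r x R
B = [[1, 1, 0], [0, 1, 1]]                 # n x R
C = [[1, 0, 0], [-1, 1, -1], [0, 0, 1]]    # (r + n - 1) x R

def nested_convolution(F, G):
    FA = matmul(matmul(transpose(A), F), A)      # encode F along both modes
    GB = matmul(matmul(transpose(B), G), B)      # encode G along both modes
    return matmul(matmul(C, hadamard(FA, GB)), transpose(C))

def direct_2d(F, G):
    """Direct 2D linear convolution, for comparison."""
    r, n = len(F), len(G)
    Y = [[0] * (r + n - 1) for _ in range(r + n - 1)]
    for i in range(r):
        for j in range(r):
            for u in range(n):
                for v in range(n):
                    Y[i + u][j + v] += F[i][j] * G[u][v]
    return Y

F = [[1, 2], [3, 4]]
G = [[5, 6], [7, 8]]
Y_nested = nested_convolution(F, G)
Y_direct = direct_2d(F, G)
```

The nested algorithm uses $3 \times 3 = 9$ element-wise multiplications, the square of the 1D rank, versus 16 products in the direct sum.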

Overlap Add

We can use multidimensional convolution to solve 1D convolution problems. Let the recomposition matrix be

Q(γ,η) =

Iη−11

Iη−1 Iη−11

. . .

Iη−1 Iη−11

Iη−1

.

Lemma. Let $Y = F * G$, where $F, G \in \mathbb{R}^{\gamma\times\eta}$. Then if $f = \mathrm{vec}(F)$ and $g = \mathrm{vec}(G)$, we have $f * g = \mathrm{vec}(Q^{(\gamma,\eta)}Y)$.

Numerical Accuracy

Figure: 1D convolution error



Properties of Bilinear Algorithms

Matrix Interchange

• How can we build new algorithms with the same encoding/decoding matrices?

• Can we design new algorithms with the same complexity as similar bilinear algorithms?

Asymptotic Complexity

• The role of bilinear rank.

• How can we nest bilinear algorithms?

Lower Bounds

• What are lower bounds for bilinear algorithms?

Matrix Interchange

Recall the definitions of discrete convolution and correlation,

$$y_k = \sum_{i=0}^{r-1} f_i g_{k-i} \quad\text{and}\quad y_k = \sum_{i=0}^{r-1} f_i g_{k+i}.$$

Theorem (Matrix Interchange)

Let the bilinear algorithm for the discrete convolution of $f$ and $g$ be defined as $C\big((A^T f) \odot (B^T g)\big)$. The correlation algorithm with output size $m = n$ is

$$B\big((A^T f) \odot (C^T g)\big).$$


Proof. The tensor $T$ in $y_k = \sum_{ij} t_{ijk} f_i g_j$ is 1 if and only if $i + j - k = 0$.

Moreover, the tensor $T^{\mathrm{corr}}$ in $y_k = \sum_{ij} t^{\mathrm{corr}}_{ijk} f_i g_j$ is 1 if and only if $i - j + k = 0$.

We see that the roles of index $j$ (belonging to encoding matrix $B$) and index $k$ (belonging to decoding matrix $C$) are swapped.
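The interchange can be verified on a small instance. The sketch assumes a rank-3, Karatsuba-style algorithm $(A, B, C)$ for convolving two length-2 vectors, chosen for illustration, and reuses the same three matrices for a correlation with a filter of size 2, an input of size 3, and an output of size 2; the helper names are hypothetical.

```python
A = [[1, 1, 0], [0, 1, 1]]                 # r x R encoding of f
B = [[1, 1, 0], [0, 1, 1]]                 # n x R encoding of g
C = [[1, 0, 0], [-1, 1, -1], [0, 0, 1]]    # (r + n - 1) x R decoding

def apply(M, v):
    """M v"""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def apply_T(M, v):
    """M^T v"""
    return [sum(M[i][l] * v[i] for i in range(len(M))) for l in range(len(M[0]))]

def convolve(f, g):
    # y = C((A^T f) * (B^T g))
    p = [x * y for x, y in zip(apply_T(A, f), apply_T(B, g))]
    return apply(C, p)

def correlate(f, g):
    # y = B((A^T f) * (C^T g)): B and C swap roles
    p = [x * y for x, y in zip(apply_T(A, f), apply_T(C, g))]
    return apply(B, p)

conv = convolve([4, 5], [1, 2])        # (4 + 5x)(1 + 2x)
corr = correlate([4, 5], [1, 2, 3])    # y_k = 4*g_k + 5*g_{k+1}
```

Both routines use the same three element-wise multiplications; only the matrices applied before and after them change places.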

Bilinear Rank

We will denote the bilinear algorithm,

$$y_k = \sum_{l=0}^{R-1} c_{kl}\Big(\sum_{i=0}^{r-1} a_{il} f_i\Big)\Big(\sum_{j=0}^{n-1} b_{jl} g_j\Big), \quad\text{i.e. } y = C\big[(A^T f) \odot (B^T g)\big],$$

with the triplet $(A, B, C)$. The variable $R$ determines the number of element-wise multiplications.

Theorem (Correlation Rank Lower Bound (Winograd, 1980))

Given a filter of size $r$ and output of size $m$, the minimum rank of a correlation algorithm is $m + r - 1$.

Corollary

Given a filter of size $r$ and input of size $n$, the minimum rank of a linear convolution algorithm is $n + r - 1$.

Asymptotic Complexity

As in matrix multiplication, we can recursively compute a larger convolution using a smaller one.

Given a convolution algorithm that divides the problem size by $b$ and has bilinear rank $R$, the cost of the algorithm is

$$T(n) = R \cdot T(n/b) + (c \cdot b)\cdot n/b = O\big(n^{\log_b R}\big) \quad\text{when } R > b.$$

Error Bounds

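Convolution is an ill-posed problem

Consider the cyclic convolution of

$$\begin{bmatrix} 1 \\ -1 \\ 1 \\ -1 \\ \vdots \end{bmatrix} *_{\text{cyclic}} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ \vdots \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}.$$

The exact output can be zero for nonzero inputs, so relative error is undefined. Therefore, we will use absolute error rather than relative error.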

Error Bounds

Theorem (1D bilinear algorithm convolution error)

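Given inputs $f \in \mathbb{R}^r$ and $g \in \mathbb{R}^n$, the absolute error of the bilinear algorithm satisfies

$$\|\delta y\| \le 2\big(\|C\| \cdot \|A\| \cdot \|B\| \cdot \|f\| \cdot \|g\|\big)\varepsilon + O(\varepsilon^2),$$

where $\|\cdot\|$ is the 2-norm.

Corollary

A $d$-nested convolution with $\mathcal{F} \in \mathbb{R}^{r\times\cdots\times r}$ and $\mathcal{G} \in \mathbb{R}^{n\times\cdots\times n}$ has an error of

$$\|\delta \mathcal{Y}\| \le 2\big(\|C\|^d \cdot \|A\|^d \cdot \|B\|^d \cdot \|\mathrm{vec}(\mathcal{F})\| \cdot \|\mathrm{vec}(\mathcal{G})\|\big)\varepsilon + O(\varepsilon^2).$$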

Error Bounds

Proof. We can use the fact that $\|Ax\| \le \|A\| \cdot \|x\|$ for the encoding and decoding steps. To bound the error from the element-wise product, we use the fact that

$$\|x \odot y\|^2 = \sum_i |x_i y_i|^2 \le \Big(\sum_i |x_i|^2\Big)\Big(\sum_i |y_i|^2\Big) = \|x\|^2 \cdot \|y\|^2,$$

which leads to $\|x \odot y\| \le \|x\| \cdot \|y\|$.

Error Mitigation

Theorem (Pan 2016)

For a Vandermonde matrix $V$ with $s$ as the largest-magnitude node, the condition number is bounded below as

$$\kappa(V) = \Omega\!\left(\frac{s^{n-1}}{\sqrt{n}}\right).$$

We need to find ways to either decrease $\kappa(V)$ or use a different matrix.

Error Mitigation

Better node choice

The numerical accuracy of interpolation improves with better node choices:

• Chebyshev nodes

• Brute force search

We can also combine small convolution algorithms into larger convolution algorithms. Given matrices $A, B$ where $C = A \otimes B$, we have

$$\kappa(C) = \kappa(A)\,\kappa(B).$$

Instead of having $\|A\| = \Omega(n^n)$, we have $\|A\| = \Omega\big(n^{n^{1/d}}\big)$.



Arithmetic Complexity

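Let $\mathrm{nnz}(A)$ be the number of nonzeros in $A$, $a(A)$ the number of additions needed to apply it, and $m(A)$ the number of multiplications. We have

$$a(A) \le \mathrm{nnz}(A) - \#\mathrm{row}(A) \quad\text{and}\quad m(A) \le \mathrm{nnz}(A).$$

Therefore, the overall cost of a convolution algorithm $F = (A, B, C)$ of rank $R$ is

$$a(F) \le a(A) + a(B) + a(C) \quad\text{and}\quad m(F) \le m(A) + m(B) + m(C) + R.$$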
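These bounds are easy to tabulate for a concrete algorithm. The sketch below counts nonzeros for an assumed rank-3, Karatsuba-style algorithm, with the encoding matrices stored in the transposed form in which they are applied ($A^T$, $B^T$); the function names are hypothetical and the results are upper bounds, not exact operation counts.

```python
def nnz(M):
    """Number of nonzero entries."""
    return sum(1 for row in M for v in row if v != 0)

def addition_bound(M):
    """a(M) <= nnz(M) - #row(M): each output row needs one fewer add than terms."""
    return nnz(M) - len(M)

def multiplication_bound(M):
    """m(M) <= nnz(M): at most one scaling per nonzero entry."""
    return nnz(M)

# Assumed example matrices, as applied to the inputs and to the products.
At = [[1, 0], [1, 1], [0, 1]]              # A^T, R x r
Bt = [[1, 0], [1, 1], [0, 1]]              # B^T, R x n
C = [[1, 0, 0], [-1, 1, -1], [0, 0, 1]]    # (r + n - 1) x R
R = 3

total_additions = sum(addition_bound(M) for M in (At, Bt, C))
total_multiplications = sum(multiplication_bound(M) for M in (At, Bt, C)) + R
```

The multiplication bound is loose here because entries equal to $\pm 1$ need no actual multiplication; only the $R$ element-wise products are unavoidable.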

Final Thoughts

Can also use bilinear algorithms to

• Find communication lower bounds

• Discover alternative bilinear algorithms

Concluding Thoughts

We have derived a family of fast bilinear algorithms.

We analyzed the error bounds and arithmetic costs for the different algorithms, especially bounded vs. unbounded algorithms.

Thanks!

Remaining Questions

• Communication lower bounds for nested convolution algorithms

• Error lower bounds with node and polynomial divisor choices

• Do polynomial and interpolation-based algorithms cover the entire class of fast bilinear algorithms?

More information is covered in the paper:

Caleb Ju and Edgar Solomonik. Derivation and analysis of fast bilinear algorithms for convolution, arXiv:1910.13367 [math.NA], October 2019.
