Introduction to Convolution circuits synthesis image processing, speech processing, DSP, polynomial...
-
Upload
jeffery-hodge -
Category
Documents
-
view
229 -
download
1
Transcript of Introduction to Convolution circuits synthesis image processing, speech processing, DSP, polynomial...
Introduction to Convolution Introduction to Convolution circuits synthesiscircuits synthesis
• image processing, • speech processing, • DSP, • polynomial multiplication in robot control.
convolution
FIR-filter like structureFIR-filter like structure
b4 b3 b2 b1
++ +
a4 0 0 0
a4*b4
• Separate input and output• Input and output move synchronized• Weights stay in space
b4 b3 b2 b1
++ +
a4 0 0
a4*b4
a3
a3*b4+a4b3
b4 b3 b2 b1
++ +
a3 a4 0
a4*b4
a2
a3*b4+a4b3 a4*b2+a3*b3+a2*b4
b4 b3 b2 b1
++ +
a2 a3 a4
a4*b4
a1
a3*b4+a4b3 a4*b2+a3*b3+a2*b4
a1*b4+a2*b3+a3*b2+a4*b1
b4 b3 b2 b1
++ +
a1 a2 a3
a4*b4
0
a3*b4+a4b3 a4*b2+a3*b3+a2*b4
a1*b4+a2*b3+a3*b2+a4*b1 a1*b3+a2*b2+a3*b1
We insert Dffs to avoid many levels of logicWe insert Dffs to avoid many levels of logic
b4 b3 b2 b1
++ +
a4a2 a3
a4*b4a4*b3 a4*b2 a4*b1
b4 b3 b2 b1
++ +
a3a1 a2
a4*b4 a4*b3+a3b4 a4*b2+a3b3a4*b1+a3b2 a3b1
b4 b3 b2 b1
++ +
a20 a1
a4*b4 a4*b3+a3b4 a4*b2+a3b3+a2b4 a4*b1+a3b2+a2b3
a3b1+a2b2 a2b1
The disadvantage of this circuit is broadcasting
We insert more Dffs to avoid broadcastingWe insert more Dffs to avoid broadcasting
b4 b3 b2 b1
++ +
a4a2 a3
a4*b40 0 0
0 0 0
b4 b3 b2 b1
++ +
a3a1 a2
a4*b4 a3b4 a4b30
a4 0 0
0
Does not work correctly like this, try something new….
b4 b3 b2 b1
a3a1 a2
a4*b4
a3b4 a4b3
0
a4 0 0
0
a2b4
a1b4
a3b3
a2b3
a1b3
00
0
0
a4b2
a3b2
a2b2
a1b2
0
0
0
a4b1
a3b1
a2b1
First sum
Second sum
FIR-filter like structure, FIR-filter like structure, assume two delaysassume two delays
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
b4 b3 b2 b1
++ +
• Convolution Algorithm
Two loops
Patterns of operations on vectors:
1. vector product is not dot product [a0*b0, a1*b1, … an*bn][a0*b0, a1*b1, … an*bn]2. Dot product (scalar product) = vector product with
accumulation a0*b0 + a1*b1 + … + an*bna0*b0 + a1*b1 + … + an*bn3. Polynomial multiplication
Example 3:Example 3:FIR Filter or FIR Filter or ConvolutionConvolution
Example 3: ConvolutionExample 3: Convolution• There are many ways to implement convolution using systolic arrays, one of them is
shown: – u(n) : The input of sequence from left.
– w(n) : The weights preloaded in n PEs.
– y(n) : The sequence from right (Initial value: 0) and having the same speed as u(n).
• In this operation each cell’s function is: – 1. Multiply the inputs coming from left with weights and output the input received to the
next cell.
– 2. Add the final value to the inputs from right.
W0 W1 W2 W3
ui……u0
yi……y00
Wi
ain
bout
aout
bin
aout = ain
bout = bin + ain * wi
• Each cell operation.
W0 W1 W2 W3
ui……u0
yi……y00
Wi
ain
bout
aout
bin
aout = ain
bout = bin + ain * wi
Convolution (cont)Convolution (cont)
• Systolic array.The input of sequence from left.
This is just one solution to this problem
• Weights in space• Inputs and outputs in each
cell
Convolution Convolution • Can be 1D, 2D, 3D, etc.• Is very important in many applications.• Can be implemented efficiently in various
architectures.• Is an excellent example to compare various
computer architectures: – SIMD,– MIMD, – CA, – pipelined, – Systolic.
Various Possible Various Possible ImplementationsImplementations
Convolution is very important, we use it in several Convolution is very important, we use it in several applications. So let us think what are applications. So let us think what are all the possible ways to implement itto implement it
• Convolution Algorithm
Bag of Tricks that Bag of Tricks that can be usedcan be used
• Preload-repeated-value
• Replace-feedback-with-register
• Internalize-data-flow
• Broadcast-common-input
• Propagate-common-input
• Retime-to-eliminate-broadcasting
How to invent such circuits?How to invent such circuits?
1. Let us learn from existing designs
2. Let us learn from our own mistakes
3. Let us check all possibilities of moving every piece of data
Bogus Attempt at Systolic FIRBogus Attempt at Systolic FIRfor i=1 to n in parallel
for j=1 to k in place
yi += wj * x i+j-1
feedback from sequential implementation
Replace with register
Inner loop realized in placeStage 1: directly from equation
Stage 2: feedback = yi = yi
Stage 3:
Internal loop Internal loop in spacein space
Bogus Attempt continued: Bogus Attempt continued: Outer LoopOuter Loopfor i=1 to n in parallel
for j=1 to k in place
yi += wj * x i+j-1
Factorize wjThis could work but it has
broadcast
Bogus Attempt continued: Outer Loop - 2Bogus Attempt continued: Outer Loop - 2
for i=1 to n in parallel
for j=1 to k in place
yi += wj * x i+j-1
Because we do not want to have broadcast, we retime the signal w, this requires also retiming of X j
• Another possibility of retiming
for i=1 to n in parallel
for j=1 to k in place
yi += wj * x i+j-1
Bogus Attempt continued: Outer Loop - 2aBogus Attempt continued: Outer Loop - 2a
• Yet another approach is to broadcast common input x i-1
Bogus Attempt continued: Outer Loop - 3Bogus Attempt continued: Outer Loop - 3for i=1 to n in parallel
for j=1 to k in place
yi += wj * x i+j-1
What we achieved?What we achieved?
• We showed several possible beginnigs of creating architectures.
• They were not successful, but show the principles.
• We will continue to create architectures, but these attempts were not complete waste of time, they can be used in similar problems successfully.
• You have to experiment with ideas!!
Attempt at Systolic FIR: Attempt at Systolic FIR:
now internal loop is in parallelnow internal loop is in parallel
Internal loop Internal loop in parallelin parallel
Attempt at Systolic FIR: now internal loop is Attempt at Systolic FIR: now internal loop is in parallelin parallel
1
2
3
Outer Loop continuation for FIR filterOuter Loop continuation for FIR filter
Continue: Optimize Outer LoopContinue: Optimize Outer LoopPreload-repeated ValuePreload-repeated Value
Based on previous slide we can
preload weights Wi
Continue: Optimize Outer LoopContinue: Optimize Outer LoopBroadcast Common ValueBroadcast Common Value
This design has broadcast.
Some purists tell this is not systolic as systolic should have all short wires.
Continue: Optimize Outer LoopContinue: Optimize Outer LoopRetime to Eliminate BroadcastRetime to Eliminate Broadcast
We delay these signals yi
The design becomes not intuitive. Therefore, we The design becomes not intuitive. Therefore, we have to explain in detail “How it works”have to explain in detail “How it works”
y1=x1w1
y1=x1w1
x1
x2
inputs
outputs
Was it a good idea to combine input and output streams to cells?
Polynomial Polynomial Multiplication of Multiplication of
1-D convolution 1-D convolution problemproblem
Types of systolic structureTypes of systolic structure• Convolution problem
weight : {w1, w2, ..., wk}
inputs : {x1, x2, ..., xn}
results : {y1, y2, ..., yn+k-1}
yi = w1xi + w2xi+1 + ...... + wkxi+k-1
(combining two data streams)
H. T. Kung’s grouping work
assume k = 3
Polynomial Multiplication Polynomial Multiplication of 1-D convolution problemof 1-D convolution problem
A A well-knownwell-known family of systolic family of systolic designs fordesigns for
convolution computationconvolution computation•Given the sequence of weights
{w1 , w2 , . . . , wk}•And the input sequence
{x1 , x2 , . . . , xk} ,•Compute the result sequence
{y1 , y2 , . . . , yn+1-k}
• Defined by
yi = w1 xi + w2 xi+1 + . . . + wk xi+k-1
Design B1Design B1- Broadcast input , - move results systolically, - weights stay- (Semi-systolic convolution arrays with global data communication
-
Design B1Design B1
- Broadcast input , - move results systolically, - weights stay- (Semi-systolic convolution arrays with global data communication
• Previously proposed for
circuits to implement a
pattern matching processor
and for circuit to implement
polynomial multiplication.
-
Types of systolic structure: Types of systolic structure: design design B1B1
• wider systolic path (partial result yi move)
x3 x2 x1
y3 y2 y1 W1 W2 W3
yin
xin
yout
yout = yin + Wxin
W
Please analyze this circuit drawing snapshots like in an animated movie of data in subsequent moments of time
broadcast
Discuss disadvantages of broadcast
Results move out
Design B2Design B2Inputs broadcast
Weights move
Results stay
Types of systolic structure: Types of systolic structure: Design B2Design B2
Inputs broadcast
Weights move
Results stay
• wi circulate
• use multiplier-accumulator hardware
• wi has a tag bit (signals accumulator to output results)
• needs separate bus (or other global network for collecting output)
Win
xin
Wout y = y + Winxin
Wout = Win
y
x3 x2 x1
y1 y2 y3
W2W3W1
Design B2
Broadcast input , move weights , results stay[(Semi-) systolic convolution arrays with
global data communication]
• The path for moving yi’s is wider then wi’s because of yi’s carry more bits then wi’s in numerical accuracy.
• The use of multiplier-accumulators may also help increase precision of the result , since extra bit can be kept in these accumulators with modest cost.
Semisystolic because of broadcast
Design FDesign F
Input move
Weights stay
Partial results fan-in
• needs adder
Types of systolic structure: Types of systolic structure: design Fdesign F
Input move
Weights stay
Partial results fan-in
• needs adder
• applications : signal processing, pattern matching
y1’s
Zout = Wxin
xout = xin
Zout
xoutxin W
x3 x2 x1W3 W2 W1
ADDER
Design F
- Fan-in results, move inputs, weights stay
- Semi-systolic convolution arrays with global data communication
• When number of cell is large , the adder can be implemented as a pipelined adder tree to avoid large delay.
• Design of this type using unbounded fan-in.
Design R1Design R1
Inputs and weights move in the opposite directions
Results stay
• can use tag bit
• no bus (systolic output path is sufficient)
• one-half the cells work at any time
Types of systolic structure:Types of systolic structure: Design R1 Design R1Inputs and weights move in the opposite directions
Results stay
• can use tag bit
• no bus (systolic output path is sufficient)
• one-half the cells are work at any time
• applications : pattern matching
y = y + Winxin
xout = xin
Wout = Win
x1x3 x2
W1 W2
y3 y2 y1
Win
xin
Wout
y
xout
Design R1
- Results stay, inputs and weights move in opposite directions- Pure-systolic convolution arrays with global data communication
• Design R1 has the advan-tage that it dose not require a bus , or any other global net-work , for collecting output from cells.
• The basic ideal of this de-sign has been used to imple-ment a pattern matching chip.
Design R2Design R2
Inputs and weights move in the same direction at different speeds
Results stay
• xj’s move twice as fast as the wj’s
• all cells work at any time
• need additional registers (to hold w value)
Types of systolic structure: Types of systolic structure: design R2design R2
Inputs and weights move in the same direction at different speeds
Results stay
• xj’s move twice as fast as the wj’s
• all cells work at any time
• need additional registers (to hold w value)
• applications : pipeline multiplier
W1
W2
W3
W4
W5
x3 x2 x1 y1 y2 y3
W W W
W
y
Win Wout
xin xout
y = y + Winxin
W = Win
Wout = W xout = xin
Design R2
- Results stay , inputs and weights move in the same direction but at different speeds- Pure-systolic convolution arrays with global data communication
• Multiplier-accumulator can be used effectively and so can tag bit method to signal the output of each cell.
• Compared with R1 , all cells work all the time when additional register in each cell to hold a w value.
Design W1Design W1
Inputs and results move in the opposite direction
Weights stay
• one-half the cells are work
• constant response time
Types of systolic structure: Types of systolic structure: design W1design W1Inputs and results move in the opposite direction
Weights stay
• one-half the cells are work
• constant response time
• applications : polynomial division
yout = yin + Wxin
xout = xin
yin
xin
yout
W
xout
x1x3 x2 W1W2
y
W3
Design W1
-Weights stay, inputs and results move in opposite direction- Pure-systolic convolution arrays with global data communication
• This design is fundamental in the sense that it can be naturally extend to perform recursive filtering.
• This design suffers the same drawback as R1 , only appro-ximately 1/2 cells work at any given time unless two inde-pendent computation are in-terleaved in the same array.
Overlapping the executions of multiply-and-add in design W1
Design W2Design W2
Inputs and results move in the same direction at different speeds
Weights stay• all cells work (high throughputshigh throughputs rather than fast response)
Types of systolic structure: Types of systolic structure: design W2design W2
Inputs and results move in the same direction at different speeds
Weights stay
• all cells work (high throughputshigh throughputs rather than fast response)
x
W
xin xout
yin yout
yout = yin + Winxin
x = xin
xout = x
W1W2
x5
W3
x7 x3 x2x1
y1y2y3
W W Wx4x6
Design W2
-Weights stay, inputs and results move in thesame direction but at different speeds- Pure-systolic convolution arrays with global data communication
• This design lose one advan-tage of W1 , the constant response time.
• This design has been extended to implement 2-D 2-D convolution ,convolution , where high throughputs rather than fast response are of concern.
Remarks on Linear Arrays• Above designs are all possible systolic designs for the convolution problem. (some are semi-)
• Using a systolic control path , weight can be selected on- the-fly to implement interpolation or adaptive filtering.
• We need to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment.
• For improving throughput, it may be worthwhile to implement multiplier and adder separately to allow overlapping of their execution. (Such as next page show)
• When chip pin is considered:• pure-systolic requires four I/O ports; • semi-systolic requires three I/O ports.
Retiming of Retiming of filtersfilters
FIR circuit: initial designFIR circuit: initial design
delays
Pipelining of xi
• We insert various numbers of unified delays
FIR circuit: registers added below FIR circuit: registers added below weight multipliersweight multipliers
Notice changed timing here
We insert delays here
FIR Summary: comparison of FIR Summary: comparison of sequential and systolicsequential and systolic
Conclusions on 1D and 1.5D Systolic Arrays Conclusions on 1D and 1.5D Systolic Arrays
Systolic arrays are more than processor arrays which execute systolic algorithms.
– A systolic cell takes on one of the followingone of the following forms:
1. A special purpose cell with hardwired functions,
2. A vector-computer-like cell with instruction decoding and a processing element,
3. A systolic processor complete with a control unit and a processing unit.
Smarter processor for SAT, Petrick, etc.
Large Large Systolic Arrays Systolic Arrays as as
general purpose general purpose computerscomputers
Large Large Systolic Arrays as general Systolic Arrays as general purpose computerspurpose computers
• Originally, systolic architectures were motivated for high performance special purpose computational systems that meet the constraints of VLSI,
• However, it is possible to design systolic systems which: – have high throughputs – yet are not constrained to a single VLSI chip.
Problems with systolic array Problems with systolic array designdesign
1. Hard to design - hard to understand
low level realization may be hard to realize
2. Hard to explain
remote from the algorithm
function can’t readily be deduced from the structure
3. Hard to verify
Key Key architectural issuesarchitectural issues in designing in designing special-purpose systemsspecial-purpose systems
•Simple and regular design
Simple, regular design yields cost-effective special
systems.
•Concurrency and communication
Design algorithm to support high concurrency and
meantime to employ only simple blocks.
•Balancing computation with I/O
A special-purpose system should be a match to a variety
of I/O bandwidths.
Two Dimensional Two Dimensional Systolic Systolic ArraysArrays
• In 1978, the first systolic arrays were introduced as a feasible design for special purpose devices which meet the VLSI constraints.
• These special purpose devices were able to perform four types of matrix operations at high processing speeds:
– matrix-vector multiplication,
– matrix-matrix multiplication,
– LU-decomposition of a matrix,
– Solution of triangular linear systems.
General General Systolic OrganizationSystolic Organization
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
SystolicElement
Example 2:Example 2: Matrix-Matrix Multiplication
All previously showntricks can be applied
• Seth Copen Goldstein, CMU Seth Copen Goldstein, CMU A.R. HursonA.R. Hurson2. David E. Culler, UC. Berkeley,2. David E. Culler, UC. Berkeley,3. 3. [email protected]. Syeda Mohsina Afroze4. Syeda Mohsina Afrozeand other students of Advanced Logic and other students of Advanced Logic Synthesis, ECE 572, 1999 and 2000.Synthesis, ECE 572, 1999 and 2000.
SourcesSources