A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms
Santanu Dutta and Wayne Wolf
IEEE Trans. on CSVT, vol. 6, no. 1, Feb. 1996
Introduction
  VLSI design phases
  Generic processor vs. ASIC
  Programmable architectures
Architecture design
  PE architecture
  Parallel architecture
  Memory bandwidth
Data-flow design
  Pipeline flow
Control circuit
  H/W or Prog.
[Figure: VLSI design phases: specification, behavior, register-transfer, logic, circuit, layout]
[Figure: block diagram with controller unit, PE array architecture, and memory; data paths labeled "Data"]
The basic ideas
  A general-purpose interconnect network whose topology supports arbitrary paths from ME's to PE's.
  A memory partitioning scheme that allows the required memory accesses.
  Programmable interconnect and PE's controlled by a stored-program controller.
Interconnection Networks
  Multistage network: Benes, Crossbar, Omega, etc.
  A simple combination of multiplexers, or a direct connection between the memory and the processing elements.
  Each frame memory can be implemented as either an interleaved set of multiple banks or a single block of dual-port RAM.
Data-flow design for TSS (three-step search)
  Eight processors will be needed for each step.
  Each step of the TSS takes 256 cycles.
  The size and the cost of a memory increase considerably with the number of ports.
  Computer architects and circuit designers usually restrict the number of ports to two or three.
  The use of a 9-port memory for implementing the TSS is highly impractical.
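The data flow above can be illustrated with a minimal software sketch of the three-step search. It assumes a 16x16 block (so one candidate costs 16*16 = 256 pixel comparisons, matching the 256 cycles per step), step sizes of 4, 2, and 1, and a sum-of-absolute-differences criterion; none of these are stated on the slide, and the function names are hypothetical. Each of the eight candidates visited in a step corresponds to the work of one PE.

```python
def sad(cur, prev, top, left, dy, dx, n):
    """Sum of absolute differences between the current-frame block at
    (top, left) and the previous-frame block displaced by (dy, dx)."""
    return sum(
        abs(cur[top + i][left + j] - prev[top + dy + i][left + dx + j])
        for i in range(n)
        for j in range(n)
    )  # n*n pixel pairs: 256 comparisons when n == 16


def three_step_search(cur, prev, top, left, n=16, steps=(4, 2, 1)):
    """Return ((dy, dx), cost) for the block at (top, left).

    Assumed step sizes (4, 2, 1) give a +/-7 pixel search range.
    """
    height, width = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = sad(cur, prev, top, left, 0, 0, n)
    for s in steps:
        cy, cx = best  # centre is fixed for the whole step
        # Eight neighbours of the centre: one candidate per PE.
        for dy in (-s, 0, s):
            for dx in (-s, 0, s):
                if dy == 0 and dx == 0:
                    continue  # centre cost is already known
                vy, vx = cy + dy, cx + dx
                inside = (0 <= top + vy <= height - n
                          and 0 <= left + vx <= width - n)
                if not inside:
                    continue  # candidate block would leave the frame
                cost = sad(cur, prev, top, left, vy, vx, n)
                if cost < best_cost:
                    best, best_cost = (vy, vx), cost
    return best, best_cost
```

In hardware the eight candidate sums of a step are accumulated in parallel, which is why every cycle needs one a(.) value and eight b(.) values.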
Two solutions with different memory partitioning schemes
  Broadcasting the Previous-Frame Data
  Broadcasting the Current-Frame Data
Broadcasting the Previous-Frame Data
  b(4,12) is required by PE8 in cycle 0, by PE5 in cycle 8, by PE1 in cycle 4, and by other PE's in some other cycles.
  Solve the memory-bandwidth problem by aligning the b(.) data carefully.
  At most two different b(.) values are needed in a cycle.
  Problems
    The TSS could not be completed in 768 cycles.
    The a(.) data are now misaligned and therefore cause memory-access conflicts.
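The reuse of one b(.) pixel by several PE's can be sketched as follows. This is only an illustration under stated assumptions: the a(.) pixels are scanned in row-major order over a 16x16 block, PE k compares a(i, j) with b(i + dy_k, j + dx_k), and the PE-to-displacement table is hypothetical, chosen so that the three cases quoted above come out. The slide does not give the actual PE numbering.

```python
N = 16  # assumed block width, so one step spans 16 * 16 = 256 cycles

# Hypothetical PE -> (dy, dx) displacement table (not from the slide).
PE_OFFSET = {1: (4, 8), 5: (4, 4), 8: (4, 12)}


def uses_of(b_row, b_col, pe_offset=PE_OFFSET, n=N):
    """Return {pe: cycle} for every PE that reads b(b_row, b_col)."""
    uses = {}
    for pe, (dy, dx) in pe_offset.items():
        i, j = b_row - dy, b_col - dx  # a(i, j) is paired with this b pixel
        if 0 <= i < n and 0 <= j < n:
            uses[pe] = i * n + j  # row-major scan: cycle in which a(i, j) is read
    return uses


print(uses_of(4, 12))  # {1: 4, 5: 8, 8: 0}
```

Delaying each PE's data stream so that these reads fall in the same cycle is what "aligning the b(.) data" amounts to in this model: one memory read can then be broadcast to every PE that needs it.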
Broadcasting the Previous-Frame Data
  16 smaller memory banks
  A multistage, 16-port interconnection network
  Supplying appropriate memory bandwidth is critical to maintaining the throughput of a BM architecture.
Two different conflicts
  The memory conflicts
    Arise when two different a(.) values that reside in the same memory bank are needed in the same cycle.
  The path conflicts
    Arise in an interconnection network when one path (a connection from a source to a destination through switches) is blocked by another existing path.
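Both kinds of conflict can be checked mechanically. The sketch below is an assumption-laden illustration, not the authors' simulator: the bank mapping is passed in by the caller, and the path check uses an Omega network with standard destination-tag routing as a stand-in for whatever multistage network is chosen. All function names are hypothetical.

```python
from collections import defaultdict


def memory_conflicts(requests, bank_of):
    """requests maps PE -> address for one cycle.

    Returns {bank: addresses} for banks asked for two or more DIFFERENT
    addresses; several PE's reading the same address is not a conflict.
    """
    by_bank = defaultdict(set)
    for addr in requests.values():
        by_bank[bank_of(addr)].add(addr)
    return {bank: addrs for bank, addrs in by_bank.items() if len(addrs) > 1}


def omega_path(src, dst, n_ports):
    """Links (stage, line) used from src to dst in an n_ports Omega network."""
    stages = n_ports.bit_length() - 1  # log2(n_ports), n_ports a power of two
    pos, links = src, []
    for s in range(stages):
        # Perfect shuffle: rotate the address left by one bit.
        pos = ((pos << 1) | (pos >> (stages - 1))) & (n_ports - 1)
        # The 2x2 switch picks its output from the next destination bit.
        bit = (dst >> (stages - 1 - s)) & 1
        pos = (pos & ~1) | bit
        links.append((s, pos))
    return links


def path_conflicts(routes, n_ports):
    """routes maps source port -> destination port.

    Returns (first_src, second_src, link) for every link two paths share.
    """
    owner, clashes = {}, []
    for src, dst in routes.items():
        for link in omega_path(src, dst, n_ports):
            if link in owner and owner[link] != src:
                clashes.append((owner[link], src, link))
            else:
                owner[link] = src
    return clashes
```

For example, with a hypothetical low-order interleave `bank_of = lambda a: a % 16`, addresses 5 and 21 requested in the same cycle collide in bank 5, and in an 8-port Omega network the routes 0 -> 0 and 4 -> 1 collide on the first-stage link.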
Derivation of a conflict-free schedule
  A memory partitioning scheme and a processor assignment scheme are first chosen, through simulation of different memory-partitioning and processor-assignment schemes.
  The number of conflicts is not prohibitively large.
  Cycles which do not have conflicts are left unchanged, and the ones that have conflicts are recursively broken into sub-cycles.
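The recursive splitting step can be sketched as below. The slide does not say how a conflicting cycle is divided, so the halving rule here is an assumption, as are the function names and the modulo bank mapping; a real scheduler would likely choose the split to minimize the number of sub-cycles.

```python
def bank_clash(requests, n_banks=16):
    """True if two different addresses in one cycle map to the same bank.

    The modulo bank mapping is a placeholder assumption.
    """
    seen = {}
    for addr in requests.values():
        bank = addr % n_banks
        if seen.setdefault(bank, addr) != addr:
            return True
    return False


def split_cycle(requests, has_conflict=bank_clash):
    """Recursively break one cycle (PE -> address) into conflict-free sub-cycles."""
    if len(requests) <= 1 or not has_conflict(requests):
        return [requests]  # conflict-free cycles are left unchanged
    items = list(requests.items())
    mid = len(items) // 2  # assumed rule: split the requests in half
    return (split_cycle(dict(items[:mid]), has_conflict)
            + split_cycle(dict(items[mid:]), has_conflict))


def conflict_free_schedule(cycles, has_conflict=bank_clash):
    """Expand a list of ideal cycles into a list of conflict-free sub-cycles."""
    schedule = []
    for requests in cycles:
        schedule.extend(split_cycle(requests, has_conflict))
    return schedule


ideal = [{1: 0, 2: 1}, {1: 2, 2: 3, 3: 18, 4: 19}]
print(conflict_free_schedule(ideal))
# [{1: 0, 2: 1}, {1: 2, 2: 3}, {3: 18, 4: 19}]
```

The second ideal cycle costs two sub-cycles instead of one, which is the throughput penalty paid for each conflict.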
Broadcasting the Current-Frame Data
  Implements the original TSS data-flow.
  a(.) is broadcast.
  b(.) is partitioned into 16 memory banks.
Performance of the motion estimator
  The simulator takes as input:
    A data-flow description of a BMA, specifying the number of PE's and the ideal flow of the pixel data.
    A memory configuration, specifying the number of ME's and the number of memory ports.
    A network characterization, specifying the topology of the interconnection network between the PE's and the ME's.
    The pipelining information, specifying the number of pipeline stages in each PE and the network.
  The simulator determines the network-path and memory-access conflicts.
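One way to picture the four simulator inputs is as a single configuration record. The sketch below is purely illustrative: the class, its field names, and the helper method are hypothetical and do not describe the authors' tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatorInput:
    # Data-flow description of the BMA
    n_pes: int                    # number of PE's
    data_flow: tuple              # ideal flow: one {pe: address} dict per cycle
    # Memory configuration
    n_mes: int                    # number of ME's
    ports_per_me: int             # memory ports on each ME
    # Network characterization
    topology: str                 # e.g. "omega", "benes", "crossbar"
    # Pipelining information
    pe_pipeline_stages: int
    network_pipeline_stages: int

    def total_memory_ports(self):
        """Upper bound on distinct memory reads per cycle."""
        return self.n_mes * self.ports_per_me


config = SimulatorInput(
    n_pes=8,
    data_flow=({1: 0, 2: 1},),
    n_mes=16,
    ports_per_me=1,
    topology="omega",
    pe_pipeline_stages=1,
    network_pipeline_stages=4,
)
print(config.total_memory_ports())  # 16
```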
Interconnection networks
  Completely connected network
    N^2 crosspoint switches are needed in a single-stage crossbar.
  N-port (N in, N out) multistage network
    It may not be possible to eliminate all path conflicts.
    Generalized Cube and Omega: N-port network, log2(N) stages with N/2 switches in each stage.
    Benes: 2 log2(N) - 1 switch stages.
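The hardware cost of the three options can be compared with a few lines of arithmetic. The sketch below uses the counts listed above; the N/2 switches per stage for the Benes network is the standard figure for a network built from 2x2 switches and is not stated on the slide. Function names are hypothetical.

```python
def log2_exact(n):
    """log2 of a power of two, raising on anything else."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    return n.bit_length() - 1


def crossbar_crosspoints(n):
    """Single-stage crossbar: N^2 crosspoint switches."""
    return n * n


def omega_switches(n):
    """Omega / Generalized Cube: log2(N) stages of N/2 two-by-two switches."""
    return log2_exact(n) * (n // 2)


def benes_switches(n):
    """Benes: 2*log2(N) - 1 stages, assuming N/2 two-by-two switches per stage."""
    return (2 * log2_exact(n) - 1) * (n // 2)


n = 16  # the 16-port network used with 16 memory banks
print(crossbar_crosspoints(n), omega_switches(n), benes_switches(n))
# 256 32 56
```

Note that a crosspoint and a 2x2 switch are different units, so the numbers indicate growth rates rather than a like-for-like area comparison.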