A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms

Santanu Dutta and Wayne Wolf

IEEE Trans. on CSVT, vol. 6, no. 1, Feb. 1996

Introduction

VLSI design phases

Generic processor vs. ASIC

Programmable architectures

Architecture design: PE architecture, parallel architecture, memory bandwidth

Data-flow design: pipeline flow

Control circuit: hardware or programmable

VLSI design phases (figure): specification, behavior, register-transfer, logic, circuit, layout

Figure: controller unit, PE array architecture, memory, data

Architecture of PE

SAD PE element

A data-flow design for a full-search block-matching motion estimator
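As a rough illustration of what each SAD processing element accumulates and what a full search iterates over, here is a minimal software sketch. It assumes frames stored as lists of rows and follows the slides' notation loosely (a(.) for current-frame pixels, b(.) for previous-frame pixels); the function names, the block size `n`, the search range `p`, and the toy frames are hypothetical and not taken from the paper.

```python
def sad(cur, prev, bx, by, dx, dy, n):
    """Sum of absolute differences between the n-by-n current-frame block at
    (bx, by) and the previous-frame block displaced by (dx, dy)."""
    total = 0
    for i in range(n):
        for j in range(n):
            a = cur[by + i][bx + j]             # a(.): current-frame pixel
            b = prev[by + dy + i][bx + dx + j]  # b(.): previous-frame pixel
            total += abs(a - b)
    return total


def full_search(cur, prev, bx, by, n, p):
    """Evaluate every displacement in [-p, p] x [-p, p] whose candidate block
    lies inside the previous frame; return (best_sad, dx, dy)."""
    height, width = len(prev), len(prev[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue  # candidate block falls outside the frame
            cost = sad(cur, prev, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Toy 8x8 frames: the current frame is the previous frame moved so that the
# block at (2, 2) matches the previous-frame block displaced by (+1, -1).
prev = [[x * x + 3 * y * y + x * y for x in range(8)] for y in range(8)]
cur = [[prev[y - 1][x + 1] if y >= 1 and x + 1 < 8 else 0 for x in range(8)]
       for y in range(8)]
result = full_search(cur, prev, bx=2, by=2, n=2, p=2)
```

In hardware the inner double loop is what a PE performs one pixel pair per cycle, while the outer displacement loops are what the PE array evaluates in parallel.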

The basic ideas

A general-purpose interconnect network whose topology supports arbitrary paths from ME’s to PE’s.

A memory-partitioning scheme that allows the required memory accesses.

A programmable interconnect and PE’s controlled by a stored-program controller.

An abstract architectural model for the proposed motion-estimator

Interconnection Networks

Multistage network: Benes, Crossbar, Omega, etc.

A simple combination of multiplexers or a direct connection between the memory and the processing elements.

Each frame memory can be implemented as either an interleaved set of multiple banks or a single block of dual-port RAM.

Data-flow design for TSS

Eight processors will be needed for each step.

Each step of the TSS takes 256 cycles.

The size and the cost of a memory increase considerably with the number of ports.

Computer architects and circuit designers usually restrict the number of ports to two or three.

The use of a 9-port memory for implementing the TSS is highly impractical.

Nine shifts tested in step 1 of a three-step search
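The nine shifts of one step, and the way the three steps chain together, can be sketched as follows. This assumes the conventional TSS step sizes of 4, 2, and 1 pixels, which the slides do not state explicitly; `step_shifts`, `three_step_search`, and the toy cost function are hypothetical names used only for illustration.

```python
def step_shifts(center, step):
    """The nine candidate shifts of one TSS step: the center plus its eight
    neighbours at distance `step` along each axis."""
    cx, cy = center
    return [(cx + dx * step, cy + dy * step)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def three_step_search(cost, steps=(4, 2, 1)):
    """Run the three steps; `cost` maps a shift (x, y) to its matching error,
    for example the SAD of the block at that shift."""
    center = (0, 0)
    for step in steps:
        # Each step re-centers on the best of its nine candidates.
        center = min(step_shifts(center, step), key=cost)
    return center


# Toy cost with a single minimum at shift (5, -3).
best = three_step_search(lambda s: abs(s[0] - 5) + abs(s[1] + 3))
```

With 256 cycles per step, as stated on the slide, the three steps account for the 768-cycle total that the later slides refer to.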

Data-flow for step 1 of a three-step search procedure

Two solutions with different memory-partitioning schemes

Broadcasting the previous-frame data

Broadcasting the current-frame data

Broadcasting the Previous-Frame Data

b(4,12) is required by PE8 in cycle 0, by PE5 in cycle 8, by PE1 in cycle 4, and by other PE’s in other cycles.

Solve the memory-bandwidth problem by aligning the b(.) data carefully.

At most two different b(.) values in a cycle.

Problems

The TSS cannot be completed in 768 cycles.

The a(.) data are now misaligned and therefore cause memory-access conflicts.

Revised data-flow for step 1 of a three-step search procedure (1)

Revised data-flow for step 1 of a three-step search procedure (2)

Broadcasting the Previous-Frame Data

16 smaller memory banks

A multistage, 16-port interconnection network

Supplying appropriate memory bandwidth is critical to maintaining the throughput of a block-matching (BM) architecture.

Two different conflicts

Memory conflicts arise when two different a(.) values that reside in the same memory bank are needed in the same cycle.

Path conflicts arise in an interconnection network when one path (a connection from a source to a destination through switches) is blocked by another existing path.
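The memory-conflict condition can be checked mechanically for one cycle. The sketch below assumes a hypothetical 4x4 interleaving of pixels over 16 banks (the slides do not give the actual partitioning), and the names `bank_of` and `memory_conflicts` are illustrative only.

```python
from collections import defaultdict


def bank_of(pixel, banks_per_side=4):
    """Hypothetical interleaving: 16 banks indexed by pixel coordinates
    modulo 4 in each dimension. The real partitioning may differ."""
    x, y = pixel
    return (y % banks_per_side) * banks_per_side + (x % banks_per_side)


def memory_conflicts(requests, bank=bank_of):
    """Given the pixels requested in one cycle, return the banks that are
    asked for more than one distinct pixel in that cycle."""
    by_bank = defaultdict(set)
    for pixel in requests:
        by_bank[bank(pixel)].add(pixel)
    return {b: sorted(pixels) for b, pixels in by_bank.items()
            if len(pixels) > 1}


# (0, 0) and (4, 0) fall in the same bank under this interleaving.
conflicting = memory_conflicts([(0, 0), (4, 0), (1, 0)])
# The same pixel requested by several PE's is a broadcast, not a conflict.
broadcast = memory_conflicts([(0, 0), (0, 0), (1, 1)])
```

Collecting the requests into sets per bank is what distinguishes a broadcast (one value, many readers) from a true conflict (two different values from one bank).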

Derivation of a conflict-free schedule

A memory-partitioning scheme and a processor-assignment scheme are first chosen through simulation of different memory-partitioning and processor-assignment schemes.

The number of conflicts is not prohibitively large.

Cycles that do not have conflicts are left unchanged, and the ones that have conflicts are recursively broken into sub-cycles.
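The recursive breaking of conflicting cycles can be sketched as below. The slides do not say how a cycle is divided, so this assumes the simplest rule, halving the request list until every sub-cycle is conflict-free; the request format `(pe, bank, address)` and all function names are hypothetical.

```python
def bank_conflict(requests):
    """True if some bank is asked for two different addresses.
    Each request is a (pe, bank, address) tuple."""
    seen = {}
    for _pe, bank, addr in requests:
        if seen.setdefault(bank, addr) != addr:
            return True
    return False


def split_cycle(requests, has_conflict):
    """Leave a conflict-free cycle unchanged; otherwise halve it and
    recurse on each half (an assumed splitting rule)."""
    if len(requests) <= 1 or not has_conflict(requests):
        return [requests]
    mid = len(requests) // 2
    return (split_cycle(requests[:mid], has_conflict)
            + split_cycle(requests[mid:], has_conflict))


def conflict_free_schedule(cycles, has_conflict=bank_conflict):
    """Expand each ideal cycle into one or more conflict-free sub-cycles."""
    schedule = []
    for requests in cycles:
        schedule.extend(split_cycle(requests, has_conflict))
    return schedule


ideal = [
    [(0, 0, 10), (1, 1, 11)],                          # no conflict
    [(0, 0, 10), (1, 0, 12), (2, 1, 13), (3, 1, 13)],  # bank 0 conflict
]
schedule = conflict_free_schedule(ideal)
```

Halving is not optimal: in the second cycle, one of the bank-0 requests could share a sub-cycle with the bank-1 requests. A real scheduler would likely group requests more carefully to minimise the number of extra cycles.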

Motion estimator architecture: broadcasting previous-frame data

Broadcasting the Current-Frame Data

To implement the original TSS data-flow:

a(.) is broadcast.

b(.) is partitioned into 16 memory banks.

Motion estimator architecture: broadcasting current-frame data

Performance of the motion estimator

The simulator takes as input:

A data-flow description of a BMA, specifying the number of PE’s and the ideal flow of the pixel data.

A memory configuration, specifying the number of ME’s and the number of memory ports.

A network characterization, specifying the topology of the interconnection network between the PE’s and the ME’s.

The pipelining information, specifying the number of pipeline stages in each PE and the network.

The simulator determines the network-path and memory-access conflicts.

Interconnection networks

Completely connected network: N^2 crosspoint switches are needed in a single-stage crossbar.

N-port (N in, N out) multistage network: it may not be possible to eliminate all path conflicts.

Generalized Cube and Omega: N-port network, log2(N) stages with N/2 switches in each stage.

Benes: 2 log2(N) - 1 switch stages.
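To make these counts concrete, and to show how a path conflict can arise in a multistage network, the sketch below evaluates the formulas for N = 16 and traces paths through an Omega network using standard destination-tag routing (perfect shuffle before each stage, switch output selected by one destination bit). N is assumed to be a power of two; the function names are hypothetical, and this is not the simulator described in the slides.

```python
def crossbar_crosspoints(n):
    """Single-stage crossbar: N^2 crosspoint switches."""
    return n * n


def omega_switches(n):
    """Omega / Generalized Cube: log2(N) stages of N/2 switches each."""
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return stages * (n // 2)


def benes_stages(n):
    """Benes network: 2*log2(N) - 1 switch stages."""
    return 2 * (n.bit_length() - 1) - 1


def omega_path(src, dst, n):
    """Output-link index used after each stage when routing src -> dst."""
    bits = n.bit_length() - 1
    pos, links = src, []
    for stage in range(bits):
        # Perfect shuffle: rotate the position left by one bit.
        pos = ((pos << 1) | (pos >> (bits - 1))) & (n - 1)
        # The switch sets the low bit to the next destination bit (MSB first).
        pos = (pos & ~1) | ((dst >> (bits - 1 - stage)) & 1)
        links.append(pos)
    return links


def paths_conflict(a, b, n):
    """Two (src, dst) connections conflict if they need the same output
    link at the same stage."""
    return any(x == y for x, y in zip(omega_path(*a, n), omega_path(*b, n)))


blocked = paths_conflict((0, 0), (4, 1), 8)  # both need link 0 after stage 0
clear = paths_conflict((0, 0), (1, 1), 8)
```

For N = 16 the crossbar needs 256 crosspoints against 32 two-by-two switches for an Omega network, which is the cost argument for multistage networks; the price is that some pairs of connections, such as the blocked pair above, cannot be set up in the same cycle.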

Memory-partitioning scheme without data duplication

Data duplicated memory-partitioning

Simulation results for different networking and memory-partitioning schemes

Simulation results for different pixel distributions

Data-flow for step 1 of the conjugate-direction search

Data-flow for step 2 of the conjugate-direction search

Conclusions

An engine that can be adapted to multiple motion-estimation algorithms.
