COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION
03/26/2012
OUTLINE
• Introduction
• Motivation
• Network-on-Chip (NoC)
• ASIC based approaches
• Coarse grain architectures
• Proposed Architecture
• Results
INTRODUCTION
Goal
• Application specific hybrid coarse grained reconfigurable architecture using NoC
Purpose
• Support Variable Block Size Motion Estimation (VBSME)
First approach
• No ASIC and other coarse grained reconfigurable architectures
Difference
• Use of intelligent NoC routers
• Support full and fast search algorithms
MOTIVATION
H.264
Motion Estimation
Ө(f)=
MOTION ESTIMATION
Previous Frame
Current Frame
Current 16x16 Block
Motion Vector
Search Window
Sum of Absolute Difference (SAD)
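The SAD criterion and the search-window scan labeled on this slide can be sketched in a few lines. This is only an illustrative software model, not the hardware described later; the names `sad` and `full_search`, the frames stored as lists of rows, the random test frame, and the search range of 4 are all assumptions made for the example.

```python
import random


def sad(cur, ref, cx, cy, rx, ry, n=16):
    """Sum of absolute differences between the n x n block of `cur` at
    (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, cx, cy, n=16, search_range=4):
    """Evaluate every candidate in the search window and return
    (best_sad, (dx, dy)), where (dx, dy) is the motion vector."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the previous frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Hypothetical 32 x 32 frames: the current frame is the previous frame
# displaced so that the block at (8, 8) matches the previous frame at (10, 9).
rng = random.Random(0)
prev_frame = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
cur_frame = [[prev_frame[min(y + 1, 31)][min(x + 2, 31)] for x in range(32)]
             for y in range(32)]
best_sad, mv = full_search(cur_frame, prev_frame, 8, 8)
```

The search keeps the first candidate with the lowest SAD, so the motion vector is the displacement of the best-matching block inside the search window.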
SYSTEM-ON-CHIP (SOC)
Single chip systems
Common components
• Microprocessor
• Memory
• Co-processor
• Other blocks
Increased processing power and data intensive applications
Facilitating communication between individual blocks has become a challenge
TECHNOLOGY ADVANCEMENT
DELAY VS. PROCESS TECHNOLOGY
NETWORK-ON-CHIP (NOC)
Efficient communication via use of transfer protocols
Need to take into consideration the strict constraints of SoC environment
Types of communication structure
• Bus
• Point-to-point
• Network
COMMUNICATION STRUCTURES
BUS VS. NETWORK
| Bus Pros & Cons | Bus | Network | Network Pros & Cons |
|---|---|---|---|
| Every unit attached adds parasitic capacitance | x | ✓ | Local performance not degraded with scaling |
| Bus timing is difficult | x | ✓ | Network wires can be pipelined |
| Bus arbitration can become a bottleneck | x | ✓ | Routing decisions are distributed |
| Bus testability problematic and slow | x | ✓ | Locally placed BIST is fast and easy |
| Bandwidth is limited and shared by all | x | ✓ | Bandwidth scales with network size |
| Bus latency is wire speed once granted | ✓ | x | Network contention may cause latency |
| Very compatible | ✓ | x | IPs need smart wrappers |
| Simple to understand | ✓ | x | Relatively complicated |
EXAMPLE
EXAMPLE OF NOC
ROUTER ARCHITECTURE
BACKGROUND
ME
• General purpose processors, ASIC, FPGA and coarse grain
• Only FBSME
• VBSME with redundant hardware
General purpose processors
• Can exploit parallelism
• Limited by the inherent sequential nature and data access via registers
CONTINUED…
ASIC
• No support for all block sizes of H.264
• Support provided at the cost of high area overhead
Coarse grained
• Overcome the drawbacks of LUT based FPGAs
• Elements with coarser granularity
• Fewer configuration bits
• Underutilization of resources
ASIC Approaches
Topology
• 1D systolic array
• 2D systolic array
SAD accumulation
• Partial Sum
• Parallel Sum

•Large number of registers
•Store partial SADs
•Area overhead
•High latency

•Mesh based architecture
•Store partial SADs
•Area overhead
•High latency
•No VBSME

Partial Sum
•Reference pixels broadcast
•SAD computation for each 4x4 block pipelined
•Each processing element computes pixel difference, accumulates it to the previous partial SAD and sends the computed partial SAD to the next processing element
•Large number of registers

Parallel Sum
•All pixel differences of a 4x4 block computed in parallel
•Reference pixels are reused
•Direction of data transfer depends on search pattern
OU’S APPROACH
16 SAD modules to process 16 4x4 motion vectors
VBSME processor
• Chain of adders and comparators to compute larger SADs
PE array
• Basic computational element of SAD module
• Cascade of 4 1D arrays
1D array
• 1D systolic array of 4 PEs
• Each PE computes a 1 pixel SAD
[Figure: SAD Modules. Module 0 to Module 15; module i takes current_block_data_i and search_block_data_i and produces SAD_i and MV_i. Control signals: strip_sel, read_addr_A, read_addr_B, write_addr.]
[Figure: PE Array. Four 1D Arrays (0 to 3) fed by block_strip_A and block_strip_B, delay elements (D) on current_block_data_i, a MUX for SAD, and outputs SAD_i and MV_i. Labeled widths: 4 bits, 1 bit, 32 bits.]
[Figure: 1D Array. Four PEs connected through delay elements (D) to an accumulator (ACCM); 32 bit inputs.]
PUTTING IT TOGETHER
Clock cycle
• Columns of current 4x4 sub-block scheduled using a delay line
• Two sets of search block columns broadcast
• 4 block matching operations executed concurrently per SAD module
4x4 SADs -> 4x4 motion vectors
• Chain of adders and comparators
4x4 SADs -> 4x8 SADs -> … 16x16 SADs
• Chain of adders and comparators
Drawbacks
• No reuse of search data between modules
• Resource wastage
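The "4x4 SADs -> 4x8 SADs -> … 16x16 SADs" chain can be illustrated in software: once the sixteen 4x4 SADs of a macroblock are known for one candidate position, the SADs of the larger H.264 partitions follow by addition alone. The sketch below is an assumption-laden model, not Ou's circuit; `merge_sads` is a hypothetical name and the block labels are read as width x height.

```python
def merge_sads(sad4x4):
    """Combine the 4x4 SADs of one 16x16 macroblock into larger block SADs.

    sad4x4[r][c] is the SAD of the 4x4 sub-block in row r, column c.
    Returns a dict mapping "WxH" (width x height, an assumed convention)
    to the list of SADs for that partition size.
    """
    s = sad4x4
    out = {"4x4": [s[r][c] for r in range(4) for c in range(4)]}
    # Two horizontally adjacent 4x4 blocks form an 8-wide, 4-high block.
    out["8x4"] = [s[r][c] + s[r][c + 1] for r in range(4) for c in (0, 2)]
    # Two vertically adjacent 4x4 blocks form a 4-wide, 8-high block.
    out["4x8"] = [s[r][c] + s[r + 1][c] for r in (0, 2) for c in range(4)]
    # Four 4x4 blocks form an 8x8 block; order: top-left, top-right,
    # bottom-left, bottom-right.
    out["8x8"] = [s[r][c] + s[r][c + 1] + s[r + 1][c] + s[r + 1][c + 1]
                  for r in (0, 2) for c in (0, 2)]
    e = out["8x8"]
    out["16x8"] = [e[0] + e[1], e[2] + e[3]]   # top half, bottom half
    out["8x16"] = [e[0] + e[2], e[1] + e[3]]   # left half, right half
    out["16x16"] = [sum(e)]
    return out


# Hypothetical 4x4 SAD values 0..15, laid out row by row.
example = merge_sads([[4 * r + c for c in range(4)] for r in range(4)])
total_blocks = sum(len(v) for v in example.values())
```

Counting the entries gives the 41 partitions of an H.264 macroblock (16 + 8 + 8 + 4 + 2 + 2 + 1), each of which then needs its own comparison to keep the minimum SAD.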
ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES
| Architecture | Performance* |
|---|---|
| ChESS | (M x 0.8M)/256 x 17 x 17 |
| MATRIX | (M x 0.8M)/256 x 17 x 17 |
| RaPiD | 272 + 32M + 14.45M^2 |

* Performance (clock cycles) [Frame Size: M x 0.8M]
• Resource utilization
• Generic interconnect
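To get a feel for the cycle counts listed for the three architectures, the expressions can be evaluated for a concrete frame width. This is plain arithmetic on the formulas as printed; reading "14.45M2" as 14.45·M², the function names, and the choice of M = 320 (a 320 x 256 frame) are assumptions made for illustration.

```python
def cycles_chess_or_matrix(m):
    """(M x 0.8M) / 256 x 17 x 17, as listed for ChESS and MATRIX."""
    return (m * 0.8 * m) / 256 * 17 * 17


def cycles_rapid(m):
    """272 + 32M + 14.45M^2, as listed for RaPiD (M^2 is an assumed reading)."""
    return 272 + 32 * m + 14.45 * m ** 2


M = 320  # hypothetical frame width; the frame height is then 0.8 * M = 256
chess_cycles = cycles_chess_or_matrix(M)
rapid_cycles = cycles_rapid(M)
ratio = rapid_cycles / chess_cycles
```

Under these assumptions the first expression grows as roughly 0.9·M² and the RaPiD expression as 14.45·M², so the ratio between them stays near 16 for realistic frame sizes.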
PROPOSED ARCHITECTURE
2D architecture
• 16 CPEs
• 4 PE2s
• 1 PE3
• Main Memory
• Memory Interface
CPE (Configurable Processing Element)
• PE1
• NoC router
• Network Interface
• Current and reference block from main memory
[Figure: Proposed architecture. A 4x4 array of CPEs, CPE(1,1) to CPE(4,4), each receiving current block data c_d_(x,y) (32 bits) and reference block data r_d_(x,y) (32 bits); four PE2s, PE 2(1) to PE 2(4); one PE 3; Main Memory connected through the Memory Interface (MI). Control signals to the MI: data_load_control (16 bits) and reference_block_id (5 bits). Other links are labeled 32 bits, 14 bits and 12 bits.]
[Figure: PE1. Sixteen 8 bit subtractors (numbered 1 to 16), each fed by a CPR and an RPR register; 10 bit adders and a 12 bit adder accumulate the differences; a comparator (COMP) and register (REG) produce the 4x4 mv. Inputs r_d and c_d; links To/From NI, To/From East and To/From South.]
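The bit widths labeled in the PE1 figure can be checked with a little arithmetic: an 8 bit absolute difference is at most 255, so four of them sum to at most 1020 (10 bits) and sixteen to at most 4080 (12 bits). The model below is a sketch under assumptions, since the slides do not spell out the adder grouping; `pe1_sad` and `bits_needed` are hypothetical names and the pairwise reduction is only one possible tree.

```python
def pe1_sad(cur_pixels, ref_pixels):
    """Software model of a 16-pixel SAD: 16 absolute differences
    reduced pairwise by an adder tree (the tree shape is an assumption)."""
    assert len(cur_pixels) == len(ref_pixels) == 16
    level = [abs(c - r) for c, r in zip(cur_pixels, ref_pixels)]
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def bits_needed(value):
    """Number of bits required to hold a non-negative integer."""
    return max(1, value.bit_length())


# Worst case: every current pixel is 255 and every reference pixel is 0.
worst_sad = pe1_sad([255] * 16, [0] * 16)
width_after_4_pixels = bits_needed(4 * 255)
width_after_16_pixels = bits_needed(worst_sad)
```

The worst-case values fit exactly within the 10 bit and 12 bit adder widths shown in the figure, which suggests the labels refer to sums over 4 and 16 pixels respectively.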
NETWORK INTERFACE
[Figure: Network Interface. Control unit, packetization unit and depacketization unit; outputs reference_block_id and data_load_control to the MI.]
NOC ROUTER
Ports: PE 1, East, West, North, South
Input Controller
• Receives packets from NI/adjacent router
Ring Buffer (slots 0 to 5, First Index, Last Index)
• Stores packets
Header Decoder
• XY routing protocol
• Extracts direction of data transfer from header packet
• Updates number of hops
Output Controller
• Sends packets to NI or adjacent router
Input/Output Control Signals
• request, ack
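The header decoder's XY routing rule can be sketched as follows: a packet first travels along the X dimension until the column matches, then along Y, and the hop count is updated at every router. The coordinate convention (x grows toward East, y grows toward South), the port name "PE1" for local delivery, and the function names are assumptions for this example; the slides do not give the header layout.

```python
# Unit steps for each output port under the assumed coordinate convention.
STEP = {"East": (1, 0), "West": (-1, 0), "South": (0, 1), "North": (0, -1)}


def xy_next_port(cur, dst):
    """Pick the output port for a packet at router `cur` heading to `dst`.
    X is resolved first, then Y; "PE1" means deliver to the local PE."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "South"
    if dy < cy:
        return "North"
    return "PE1"


def route(src, dst):
    """Follow XY routing from `src` to `dst`; return (path, hops)."""
    path, hops, cur = [src], 0, src
    while True:
        port = xy_next_port(cur, dst)
        if port == "PE1":
            return path, hops
        step = STEP[port]
        cur = (cur[0] + step[0], cur[1] + step[1])
        path.append(cur)
        hops += 1  # mirrors the "updates number of hops" step


path, hops = route((1, 1), (3, 2))
```

Because every router applies the same deterministic rule, no global routing table is needed, which matches the "routing decisions are distributed" point made earlier for networks.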
[Figure: Router 1 and Router 2, each with an Input Controller and an Output Controller, connected by req (1 bit), ack (1 bit) and packet (32 bit) signals.]
Step 1: Send a message from Router 1 to Router 2
Step 2: Send a 1 bit request signal to Router 2
Step 3: Router 2 first checks if it is busy. If not, it checks for available buffer space
Step 4: Send ack if space available
Step 5: Send the packet
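The request/ack steps above can be modeled in a few lines. This is a behavioral sketch only: the `Router` class, the `send` helper, the buffer depth, and the use of a `deque` in place of the indexed ring buffer are assumptions, and real hardware would exchange these signals over clock cycles rather than function calls.

```python
from collections import deque


class Router:
    """Receiving side of the handshake (hypothetical model)."""

    def __init__(self, name, buffer_slots=6):
        self.name = name
        self.busy = False
        self.buffer_slots = buffer_slots
        self.buffer = deque()  # stands in for the ring buffer

    def request(self):
        """Steps 3 and 4: check busy, then buffer space; return the 1 bit ack."""
        if self.busy:
            return 0
        return 1 if len(self.buffer) < self.buffer_slots else 0

    def receive(self, packet):
        """Step 5 on the receiving side: store the 32 bit packet."""
        self.buffer.append(packet & 0xFFFFFFFF)


def send(receiver, packet):
    """Step 2: raise the request; send the packet only if ack comes back."""
    ack = receiver.request()
    if ack:
        receiver.receive(packet)
    return bool(ack)


router2 = Router("Router 2", buffer_slots=1)
first = send(router2, 0xDEADBEEF)    # accepted: buffer has space
second = send(router2, 0x12345678)   # refused: buffer is full
```

A sender that gets no ack simply retries later, so packets are never dropped for lack of buffer space in this model.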
PE2 AND PE3
• Adders
• Muxes
• De-muxes
• Comparators
• Registers
FAST SEARCH ALGORITHM
Diamond Search
•9 candidate search points
•Numbers represent order of processing the reference frames
•Directed edges labeled with data transmission equations derived based on data dependencies
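The 9 candidate search points correspond to the large diamond pattern of the classic diamond search algorithm. The sketch below follows the textbook algorithm (large diamond until the minimum sits at the center, then one small diamond refinement), which is an assumption about the variant used here; the `cost` callback, `diamond_search` name, and `max_steps` limit are hypothetical.

```python
# Large diamond: center plus 8 surrounding points (9 candidates in total).
LDSP = [(0, 0), (0, -2), (1, -1), (2, 0), (1, 1),
        (0, 2), (-1, 1), (-2, 0), (-1, -1)]
# Small diamond used for the final refinement step.
SDSP = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]


def diamond_search(cost, max_steps=32):
    """Return the (dx, dy) motion vector found by diamond search.

    cost(dx, dy) must return the matching cost (for example a SAD)
    of the candidate displaced by (dx, dy).
    """
    cx, cy = 0, 0
    for _ in range(max_steps):
        # The center is listed first, so it wins ties and stops the search.
        best = min(LDSP, key=lambda o: cost(cx + o[0], cy + o[1]))
        if best == (0, 0):
            break
        cx, cy = cx + best[0], cy + best[1]
    best = min(SDSP, key=lambda o: cost(cx + o[0], cy + o[1]))
    return cx + best[0], cy + best[1]


# Hypothetical cost surface with its minimum at displacement (3, -1).
found = diamond_search(lambda dx, dy: abs(dx - 3) + abs(dy + 1))
```

Compared with a full search, only a handful of candidates are evaluated per step, and neighboring candidates overlap heavily, which is what makes reuse of reference data between processing elements attractive.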
EXAMPLE
Frame
Macro-block
SAD
CONTINUED…
DATA TRANSFER
Data Transfer between PE1(1,1) and PE1(1,3)
Individual Points
Intersecting Points
DATA LOAD SCHEDULE
OTHER FAST SEARCH ALGORITHMS
Hexagon
Big Hexagon
Spiral
FULL SEARCH
CONTINUED…
RESULTS
CONTINUED…