

INTEGRATION, the VLSI journal 46 (2013) 404–412

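journal homepage: www.elsevier.com/locate/vlsi

0167-9260/$ - see front matter © 2012 Elsevier B.V. All rights reserved.

http://dx.doi.org/10.1016/j.vlsi.2012.09.001

* Corresponding author at: Laboratory of Electronic and Micro-Electronic (LAB-IT06), University of Monastir, Tunisia.

E-mail address: [email protected] (M. Elhaji).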

A low-power oriented architecture for H.264 variable block size motion estimation based on a resource sharing scheme

Majdi Elhaji a,b,*, Abdelkrim Zitouni a, Samy Meftali b, Jean-Luc Dekeyser b, Rached Tourki a

a Laboratory of Electronic and Micro-Electronic (LAB-IT06), University of Monastir, Tunisia

b Univ. Lille 1, LIFL, F-59650, CNRS, UMR 8022, INRIA Lille - Nord Europe, Villeneuve d’Ascq, France

Article info

Article history:

Received 4 December 2011

Received in revised form 17 September 2012

Accepted 17 September 2012

Available online 3 October 2012

Keywords:

Motion estimation (ME)

Motion vector (MV)

SAD

FSVBSME


Abstract

In the Advanced Video Coding (AVC) standard, motion estimation (ME) adopts many new features to increase coding performance, such as the block matching algorithm (BMA), motion vector prediction (MVP) and variable block size motion estimation (VBSME). However, the VBSME used in the MPEG4-AVC/H.264 standard leads to high computational complexity and data dependency, which make the hardware implementation very complex.

This paper proposes a flexible VLSI architecture for full-search VBSME (FSVBSME), allowing the partitioning of the source frames into sixteen 4×4 sub-blocks and using an MVP scheme. A clock gating technique based on a distributed control unit is used for power saving. The proposed architecture was synthesized with Synopsys Design Compiler using a 0.13 μm CMOS standard cell library. Under a clock frequency of 500 MHz, it consumes about 131 mW. Compared with contemporary designs, our VLSI architecture can offer higher processing speed, lower power consumption, lower latency and lower gate count.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The Advanced Video Coding (AVC) standard [1] has been applied to a large number of electronic products such as high-definition televisions (HDTVs), set-top boxes, and personal digital assistants (PDAs). MPEG4-AVC/H.264 is a new compression standard that can be used in such products. Motion estimation (ME) is the most critical component of the H.264 standard. In this coder, many new techniques that involve the block matching algorithm (BMA) are adopted. In addition, the full-search block matching algorithm (FSBMA) gives the best results in terms of quality and bit-rate [2,3]. In H.264/AVC, variable block size motion estimation (VBSME) is used, which requires seven block sizes: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4 (see Fig. 1). For each macro block (MB), the SADs (sums of absolute differences) of all the block sizes are calculated. VBSME offers better estimation of small motions in a video sequence, higher compression efficiency and good video quality. In H.264, a video frame is segmented into 16×16 macro blocks. Each 16×16 MB is segmented into sixteen 4×4 sub-blocks as shown in Fig. 2.

However, VBSME requires much more computation. A hardware accelerator is therefore necessary for real-time coding, which calls for a dedicated VLSI architecture for VBSME.

In the H.264 standard, a 16×16 macro block is divided into 16×8, 8×16 and 8×8 sub-blocks. An 8×8 sub-block is also segmented into 8×4, 4×8 and 4×4 sub-blocks as illustrated in Fig. 1.

The VBSME algorithm has to handle 41 different sub-blocks of seven different sizes. Furthermore, a single motion vector (MV) is associated with each sub-block. Therefore, the total number of MVs to be processed is 41. In the current frame, each block is compared with a candidate block within the search area of the reference frame, as shown in Fig. 2. The best displacement, which gives the estimated motion vector, is obtained according to Eqs. (1) and (2), where C(x, y) and R(x + k, y + l) represent the pixels of the current picture and of the candidate block in the reference frame respectively, and (k, l) is the candidate displacement:

SAD(k, l) = \sum_{x} \sum_{y} | C(x, y) - R(x + k, y + l) |    (1)

SAD_{Min} = \min_{(k, l)} SAD(k, l)    (2)

The 16×16 block of the current frame and its smallest blocks are at the same location, so the calculated SADs can be reused to compute different blocks: the SADs of the sixteen 4×4 blocks can be reused to compute those of the larger blocks. For this work the displacement range is [−p, p] (p = 7) in both directions, x and y. Therefore, the size of the search area is (N + 2p − 1) × (N + 2p − 1), with N = 16. This search area is divided into two zones, Ref 0 and Ref 1, in order to favor parallelism.
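As an illustration of the full-search principle expressed by Eqs. (1) and (2), the following minimal software sketch evaluates the SAD at every candidate position and keeps the minimum. It is not the hardware data path of the paper: the function names are hypothetical, frames are plain lists of rows, and the displacement (k, l) is counted from the top-left corner of the search area rather than from its center.

```python
def sad(cur, ref, k, l, n):
    """SAD between the n x n current block and the candidate block whose
    top-left corner is at offset (k, l) inside the search area."""
    return sum(
        abs(cur[x][y] - ref[x + k][y + l])
        for x in range(n)
        for y in range(n)
    )


def full_search(cur, ref, n):
    """Try every candidate offset; return (minimum SAD, (k, l))."""
    positions = len(ref) - n + 1  # candidate offsets per direction (square area assumed)
    best = None
    for k in range(positions):
        for l in range(positions):
            s = sad(cur, ref, k, l, n)
            if best is None or s < best[0]:  # strict '<' keeps the first minimum
                best = (s, (k, l))
    return best
```

With an exhaustive scan the result is the global minimum over the search area, which is why full search is costly and motivates the SAD reuse described next.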

Fig. 1. The different modes in H.264.

Fig. 2. Principle of the ME algorithm (a) and the smallest blocks in a 16×16 MB (b).

The objective of this paper is to present a hardware-oriented approach for FSVBSME and its VLSI implementation. The proposed architecture is based on a distributed control unit for power saving, which is integrated in the PE array and the comparator network. Currently, it can compute all 41 motion vectors of a MB. An important aspect of this architecture is that it is able to reuse the results of the previously calculated SADs in order to compute the current SAD, deduce the motion vector from neighboring macro blocks, and compare the SADs of all block sizes in parallel and with minimal resources. The top level of this architecture contains an array of 16 PEs to compute the SADs for the seven modes, 16 comparators to determine the minimum SAD, a motion vector predictor and a distributed control unit to manage power consumption.

This paper is organized as follows. In Section 2, related work is reviewed. The proposed 1-D VBSME architecture is discussed in Section 3. A distributed control for power saving is proposed in Section 4. The CMOS-based ASIC design and the experimental results are explained and discussed in Section 5. Finally, Section 6 concludes the paper.

2. Related work

Although VBSME is effective, the computing complexity and power consumption of VBSME with FSBMA are very high. It is necessary to design dedicated VLSI architectures for VBSME with FSBMA. In the last few years many works have focused on motion estimation for fixed and variable block sizes [4–17]. Ref. [4] presents a flexible 32×32 PE (processing element) array for VBSME in HDTV applications. Ref. [5] describes 64 PEs for low power applications, but it supports only the blocks that are larger than 8×8. These implementations do not fully support the H.264 standard because they can only handle a few block patterns. Ref. [6] describes an ME algorithm that exploits the temporal redundancy of segmented video sequences. The objective of this algorithm is to find the best displacement vector. In [7,8], the authors propose a hardware architecture for VBSME. The partial SAD is adopted to reduce computation cost and hardware complexity. In [7], a real-time VBSME with FSBMA is described; a single-instruction multiple-data (SIMD) architecture that utilizes the block matching algorithm is presented. The basic idea is to use the smallest blocks to compute the SAD values of the largest block by adding them together. To deduce the motion vectors, the authors estimate the horizontal and the vertical motion vectors, which lengthens the motion vector computation. In addition, the reuse of previous MVs is not taken into account.

The VBSME architectures proposed in [8,9] are based on a systolic array of processing elements for SAD computation. The proposed hardware implementations are based on the choice of partial SAD and on the separation between the PE array and the VBS processor respectively. The top level of these architectures can contain 16 PEs. Therefore, these architectures are expensive in terms of hardware resources and power consumption. For video devices, a 1-D VBSME architecture is presented in [10]. This architecture uses a comparator unit (CU) that integrates a set of parallel comparators followed by many registers in each PE. Thus, such an architecture is highly complex to design, since each PE has complex control logic and an irregular data flow. In order to reduce power consumption, Ref. [11] proposes a predictive VBSME algorithm for H.264/AVC that takes advantage of three effective predictive schemes: stationary block prediction, predictive search for non-stationary blocks and predictive multi-pattern refinement search in the merging process. In that work, the architectural point of view of the proposed algorithm is not described, and only a sequential software description is given. The parallel data dependency would have to be studied in order to estimate the hardware implementation of the algorithm.

Ref. [12] describes a reconfigurable VLSI architecture for VBSME with FSBMA in MPEG-4 AVC/H.264. This architecture can be reconfigured to support a "meander"-like scan format of the search area through a reconfigurable computing array and a memory for the search area. In [13], a fast VBSME algorithm and its very large scale integration (VLSI) design are presented. This algorithm is proposed with a hardware-oriented concept for regular VLSI design. However, in these works [12,13] MV prediction is not addressed. In [14], an approach for VBSME using an FPGA is proposed. This approach deals with the reduction of accesses to the off-chip memory banks. This technique may reduce the objective video quality, measured as PSNR (peak signal-to-noise ratio), since it is based on the optimization of the direction of the macroblock in the current frame and of the scan direction of the matching in the search area. Also, the low-power VLSI VBSME architecture presented in [15] reduces the PSNR since it is based on the elimination of unnecessary SAD computations using a conservative lower bound. Refs. [16,17] present methods to reduce the computation and memory access for variable block size motion estimation (ME) using pixel truncation. Thus, these techniques [14–17] are not well suited to the H.264 high profiles.

Unlike previous work, we propose in this paper a novel fast variable-block-size ME algorithm based on an MVP scheme. The proposed architecture contains a processing element array for SAD computation and an acceleration array to compare the computed SADs. This scheme has the advantages of providing low latency, reusing the computation results of smaller sub-blocks, sharing sub-block comparators and offering low power consumption and hardware cost. The seven predictive modes of VBSME are performed in parallel. In addition, up to 41 motion vector (MV) sub-blocks can be processed in the same number of clock cycles.

3. Proposed design

3.1. Top level of VBSME with FSBMA

In H.264/AVC, each MB has seven block modes. Therefore, the SAD of the smallest block mode should be reused to save computation time and resources. This means that the SAD of the 4×4 block is usually reused to calculate the SADs of larger blocks.

Fig. 3 illustrates the scenario of SAD computations for the 41 sub-blocks. In level 0, the estimated SAD values of the 4×4 blocks are stored in registers, in order to be used in a following level. For instance, in level 1, the SAD values of the 4×4 blocks are added together to deduce the SAD values of eight 8×4 blocks and eight 4×8 blocks. As an example, the SADs of the 4×4_0 and 4×4_1 blocks are added to obtain the SAD value of 4×8_0, and so on. These SAD values are also stored in order to be reused in a following level. The SADs of the 4×8 and 8×4 blocks are then added in a well-defined manner to obtain the SAD values of the four 8×8 blocks. As shown in Fig. 3, by applying the same method we can deduce the SAD value of the whole block using the calculated SADs of the 4×4 blocks.

In the proposed design, this SAD reuse method is utilized to reduce intensive computation. The proposed architecture is based on a 1-D processing element array. In our case, it contains 16 PEs.
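The reuse scheme can be sketched in software as a small adder tree that derives all 41 SADs of one candidate position from the sixteen 4×4 SADs. This is only an illustration under stated assumptions: the function name and the dictionary keys are hypothetical, the 4×4 SADs are assumed to be laid out as a 4×4 grid in raster order, and the keys describe the geometry of each merged block rather than the paper's mode labels.

```python
def merge_sads(sad4):
    """sad4: 4x4 grid (list of rows) with the sixteen 4x4 SADs of one candidate.
    Returns the 41 SADs grouped by block shape."""
    flat = lambda grid: [v for row in grid for v in row]
    # Two horizontally adjacent 4x4 blocks (8 pixels wide, 4 high): 8 results.
    wide = [[sad4[r][c] + sad4[r][c + 1] for c in (0, 2)] for r in range(4)]
    # Two vertically adjacent 4x4 blocks (4 pixels wide, 8 high): 8 results.
    tall = [[sad4[r][c] + sad4[r + 1][c] for c in range(4)] for r in (0, 2)]
    # 8x8 blocks: two stacked "wide" pairs, 4 results.
    s8 = [[wide[r][c] + wide[r + 1][c] for c in range(2)] for r in (0, 2)]
    # Halves of the macroblock built from the 8x8 SADs: 2 + 2 results.
    half_wide = [s8[r][0] + s8[r][1] for r in range(2)]  # 16 wide, 8 high
    half_tall = [s8[0][c] + s8[1][c] for c in range(2)]  # 8 wide, 16 high
    return {
        "4x4": flat(sad4),
        "pair_wide": flat(wide),
        "pair_tall": flat(tall),
        "8x8": flat(s8),
        "half_wide": half_wide,
        "half_tall": half_tall,
        "16x16": [half_wide[0] + half_wide[1]],
    }
```

Only additions of already stored values are needed above level 0, so no pixel is read twice; the counts 16 + 8 + 8 + 4 + 2 + 2 + 1 give the 41 sub-blocks.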

The pixels of the reference frame and of the current MB are stored in three memories. The address generator module gives access to the pixels in the order specified in Table 1. They are transferred to the PEs via a D-latch line, as illustrated in Fig. 4.

In each PE, the absolute differences between the data values of the current MB and the search area are calculated at every clock cycle. Applying this approach to the whole macro block leads to 16 SADs that should be computed, each of size 4×4. These SADs are stored in the corresponding registers so that they can be reused to calculate the SAD values of larger block sizes. Based on the pixel position, the stored SADs of the 4×4 blocks are added to compute the SADs for the 4×8 and 8×4 block sizes. The SADs of the 4×8 and 8×4 block sizes are combined to deduce the SAD of the 8×8 block size, and so on. This method reduces the computational requirements of the VBSME. These SADs are sent via 16 buses to a comparator array.

Fig. 3. Scenario of SAD computations for the 41 sub-blocks.

The SAD generated by each PE is shuffled in a defined order so that it can be compared with the minimum SAD. There are 41 SADs and hence 41 associated MVs, so normally there should be 41 comparator elements (CEs). However, a method based on a shared comparator is adopted to improve the reuse of resources. Therefore, 16 comparators are utilized. Inside each comparator, there is a component that allows the reuse of the calculated MVs. This is done by saving the MV of the minimum SAD of the smallest block and reusing it at the appropriate time to calculate the MV of larger block sizes.

3.2. PE unit

The processing element unit computes the SAD in the VBSME approach. The PE used in this work is inspired by [8].


Table 1. Data flow.

Clock cycle | Pixel sequences | PE0 | PE1 | … | PE15
--- | --- | --- | --- | --- | ---
0 | CP(0,0), …, RP(0,0) | CP(0,0)–RP(0,0) | | |
1 | CP(0,1), …, RP(0,1) | CP(0,1)–RP(0,1) | CP(0,0)–RP(0,1) | |
2 | CP(0,2), …, RP(0,2) | CP(0,2)–RP(0,2) | CP(0,1)–RP(0,2) | |
… | … | … | … | … | …
14 | CP(0,14), …, RP(0,14) | CP(0,14)–RP(0,14) | CP(0,13)–RP(0,14) | |
15 | CP(0,15), …, RP(0,15) | CP(0,15)–RP(0,15) | CP(0,14)–RP(0,15) | … | CP(0,0)–RP(0,15)
16+0 | CP(1,0), RP(1,0), RP(0,16) | CP(1,0)–RP(1,0) | CP(0,15)–RP(0,16) | … | CP(0,1)–RP(0,16)
16+1 | CP(1,1), RP(1,1), RP(0,17) | CP(1,1)–RP(1,1) | CP(1,0)–RP(1,1) | … | CP(0,2)–RP(0,17)
… | … | … | … | … | …
16+15 | CP(1,15), RP(1,15), RP(0,31) | CP(1,15)–RP(1,15) | CP(1,14)–RP(1,15) | … | CP(1,0)–RP(1,15)
2×16+0 | CP(2,0), RP(2,0), RP(1,16) | CP(2,0)–RP(2,0) | CP(1,15)–RP(1,16) | … | CP(1,1)–RP(1,16)
2×16+1 | CP(2,1), RP(2,1), RP(1,17) | CP(2,1)–RP(2,1) | CP(2,0)–RP(2,1) | … | CP(1,2)–RP(1,17)
… | … | … | … | … | …
2×16+15 | CP(2,15), RP(2,15), RP(1,31) | CP(2,15)–RP(2,15) | CP(2,14)–RP(2,15) | … | CP(2,0)–RP(2,15)
… | … | … | … | … | …
15×16+0 | CP(15,0), RP(15,0), RP(14,16) | CP(15,0)–RP(15,0) | CP(14,15)–RP(14,16) | … | CP(14,1)–RP(14,16)
… | … | … | … | … | …
15×16+15 | CP(15,15), RP(15,15), RP(14,31) | CP(15,15)–RP(15,15) | CP(15,14)–RP(15,15) | … | CP(15,0)–RP(15,15)
16×16+0 | RP(15,16) | | CP(15,15)–RP(15,16) | … | CP(15,1)–RP(15,16)
… | … | | … | … | …
16×16+15 | RP(15,31) | | | … | CP(15,15)–RP(15,30)

Fig. 4. The top level of the VBSME.


The PE logic stores the absolute difference (AD) between the current MB and the search area at every clock cycle. The operations performed in this component are based on Table 1. The 16×16 MB is segmented into sixteen 4×4 sub-blocks. The pixels of the 16×16 MB are labeled CP0 (current pixel) to CP255. The SAD of the 4×4_0 block includes four pixel rows. The pixels of the rows of the 4×4 block, presented in Fig. 3, are labeled CP0 to CP3, CP4 to CP7, CP8 to CP11 and CP12 to CP15, and so on for the other 4×4 blocks.

The PEs in the proposed architecture work in parallel as a single-instruction multiple-data (SIMD) architecture. They compute the SAD values for the current MB. Each PE has two inputs, one from the current MB and one from the reference block.

These pixels are broadcast to the PEs after access to the memories. As shown in Table 1, the search window memories are accessed row by row. Hence, the reference block is compared with the current MB in the first row simultaneously, and so on.


The design of this PE involves three parts. The first part performs the AD computation in accordance with the data flow presented in Table 1. The second part is composed of registers that store the AD values. It should be multiplexed in a defined order to compute the appropriate SADs. Finally, the third part allows the reuse of the calculated SADs to build the SADs of larger block sizes. This PE is adopted for several reasons, such as the simplicity of the design and the regularity of the work flow. Its architecture is presented in Fig. 5.

In an inter-level architecture, the PE computes the absolute difference (AD) and accumulates the SAD of a searched MB. The first stage contains hardware to derive the absolute difference values between the CP and the RP. These ADs are broadcast to a second stage where they are multiplexed in the correct order and stored in a register. The multiplexer "C" and the eight registers control the computations so that they are performed appropriately. For instance, the AD value that involves CP0 and the corresponding pixel in the search area is returned after the first cycle and accumulated with the AD value that is calculated between CP1 and the appropriate RP. This is repeated for CP2 and CP3, and the sum of the ADs is stored in register 0. Next, CP4, which belongs to the 4×4_1 block, and the corresponding RP are used to compute the AD value, and so on. However, this value must be accumulated in a different register, namely register 1. This method is repeated for the next four pixels and the derived AD value is stored in register 2, and so on up to register 7. This process is repeated for the second row of the MB: the AD values of the first four pixels are stored in register 0, those of the next four pixels in register 1, and so on. This is repeated for blocks 4×4_0 to 4×4_7 and the AD values are stored respectively in register 0 to register 7. The SAD values of 4×4_0 to 4×4_7 are ready respectively at clock cycles 51, 55, 59, 63, 115, 119, 123 and 127.

Each available SAD is latched to the third stage. This stage first passes on the basic SADs (SADs of 4×4 blocks) or the combined SADs (SADs of large blocks). Second, it reuses the computed SADs of small blocks in order to deduce the SADs of large blocks. For instance, the SAD value of 4×4_0 is ready at clock cycle 51 and that of 4×4_1 at clock cycle 55; these can then be combined to derive the SAD value of 4×8_0, and so on. This process is done by using multiplexers and an adder circuit. For instance, at cycle 51, the SAD of 4×4_0 is transferred from register 0 to the adder, where it is added to '0'. Then, it is stored in register 13 and output via bus0. The SADs of the 4×4_0 and 4×4_1 blocks, stored respectively in register 0 and register 1, are broadcast via MuxA and MuxB and are added together in order to derive the SAD value of the 4×8_0 block. This stage has an identical function to the previous stage. It allows a reuse of stored SADs via MuxD and MuxF to deduce the SADs of other modes.
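The ready cycles quoted above follow from the raster-order data flow of Table 1. The short sketch below reproduces them under two assumptions that are read off the text rather than taken from the authors' RTL: one current pixel is consumed per clock cycle starting at cycle 0, and PE n runs n cycles behind PE0. The function and constant names are hypothetical.

```python
MB_WIDTH = 16  # pixels per macroblock row
BLOCK = 4      # side of the smallest sub-block


def ready_cycle(block_index, pe_index=0):
    """Cycle at which a PE has accumulated the last pixel of a 4x4 block.

    block_index counts the 4x4 blocks in raster order (0..15);
    pe_index adds the one-cycle-per-PE skew of the pipeline.
    """
    block_row, block_col = divmod(block_index, MB_WIDTH // BLOCK)
    last_row = block_row * BLOCK + BLOCK - 1  # bottom pixel row of the block
    last_col = block_col * BLOCK + BLOCK - 1  # rightmost pixel column
    return last_row * MB_WIDTH + last_col + pe_index


print([ready_cycle(b) for b in range(8)])
# → [51, 55, 59, 63, 115, 119, 123, 127]
```

The same formula gives cycle 52 for the 4×4_0 block in PE1, which is the one-cycle skew that the comparator sharing relies on.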

Fig. 5. Architecture of a processing element (PE).

3.3. Comparator array

Based on a sharing method, the comparator array is composed of 16 comparator elements (CEs), although there are 112 SADs that should be compared. A CE is composed of an encoder, multiplexers, a simple comparator and registers, as illustrated in Fig. 6. The inputs to the comparison process are the SAD values of the group of PEs, the associated MVs, the "Cmpt" signal and the minimum SAD value found by the previous comparator (SAD(N−1)).

The encoder takes advantage of the "Cmpt" signal, which is associated with a clock counter, in order to control the broadcasting of SAD values. The main role of this encoder is to generate addresses that correspond to the exact time at which the SADs should be compared. SAD(N−1) represents the previous SAD, which has been compared by another comparator. The multiplexer Mux2:1 is used to transfer either the previously stored SAD or the minimal SAD coming from another comparator. The "valid" signal indicates that the comparison is done and that the current SAD is smaller than the previously stored SAD. This signal activates the registers, and the new SAD value is stored with the associated MV. Table 2 presents the data flow of the processing of this component.

The idea of sharing hardware resources is based on some observations drawn from this work flow. The SAD values of the first block (4×4_0 block) are obtained with a delay of one clock cycle from one PE to the next. For example, the SAD value of the first block is calculated in cycle 51 by PE0, in cycle 52 by PE1, in cycle 53 by PE2 and in cycle 54 by PE3. Therefore, the SAD values of the 4×4 blocks can be compared four by four. Hence, in order to save hardware resources, these PEs can share one comparator. The same principle is applied to the other PEs. Therefore, four comparators are allocated to the 4×4 mode. Based on this philosophy, the 8×4 mode can use two comparators, where the first comparator is shared by the processing elements PE0 to PE7 and the second one is shared by the processing elements PE8 to PE15. For the 4×8 mode, four comparators are associated with the processing elements PE0 to PE3, PE4 to PE7, PE8 to PE11 and PE12 to PE15. Two comparators are used here, one for the 8×8 mode and the other for the 8×16 mode. Finally, a comparator is allocated to the 16×8 mode and another one to the 16×16 mode.

For instance, at cycle 51 the SAD of 4×4_0 is generated by PE0, and it is then compared, by CE0, with the SADs generated by PE1, PE2, and PE3 in cycles 52, 53, and 54 respectively. The minimum SAD is therefore generated by CE0 and is then broadcast to CE1 via the SAD(N−1) signal.

In cycle 55, the SAD of 4×4_0 is calculated by PE4. CE1 is in charge of the minimum SAD produced by CE0.


Fig. 6. Architecture of a comparator element.

Table 2. Work flow of the CE array.

Clk | CE0 | CE1 (t-55) | CE2 (t-59) | CE3 (t-63) | CE8 (t-63) | CE9 (t-64) | … | CE15 (t-261)
--- | --- | --- | --- | --- | --- | --- | --- | ---
51 | 4×4_0 PE0 | 4×4_0 PE4 | 4×4_0 PE8 | 4×4_0 PE12 | 8×4_0 PE0 | 8×4_0 PE8 | … | 16×16_0 PE0
52 | 4×4_0 PE1 | 4×4_0 PE5 | 4×4_0 PE9 | 4×4_0 PE13 | 8×4_0 PE1 | 8×4_0 PE9 | … | 16×16_0 PE1
53 | 4×4_0 PE2 | 4×4_0 PE6 | 4×4_0 PE10 | 4×4_0 PE14 | 8×4_0 PE2 | 8×4_0 PE10 | … | 16×16_0 PE2
54 | 4×4_0 PE3 | 4×4_0 PE7 | 4×4_0 PE11 | 4×4_0 PE15 | 8×4_0 PE3 | 8×4_0 PE11 | … | 16×16_0 PE3
55 | 4×4_1 PE0 | 4×4_1 PE4 | 4×4_1 PE8 | 4×4_1 PE12 | 8×4_0 PE4 | 8×4_0 PE12 | … | 16×16_0 PE4
56 | 4×4_1 PE1 | 4×4_1 PE5 | 4×4_1 PE9 | 4×4_1 PE13 | 8×4_0 PE5 | 8×4_0 PE13 | … | 16×16_0 PE5
57 | 4×4_1 PE2 | 4×4_1 PE6 | 4×4_1 PE10 | 4×4_1 PE14 | 8×4_0 PE6 | 8×4_0 PE14 | … | 16×16_0 PE6
58 | 4×4_1 PE3 | 4×4_1 PE7 | 4×4_1 PE11 | 4×4_1 PE15 | 8×4_0 PE7 | 8×4_0 PE15 | … | 16×16_0 PE7
59 | 4×4_2 PE0 | 4×4_2 PE4 | 4×4_2 PE8 | 4×4_2 PE12 | 8×4_1 PE0 | 8×4_1 PE8 | … | 16×16_1 PE8
60 | 4×4_2 PE1 | 4×4_2 PE5 | 4×4_2 PE9 | 4×4_2 PE13 | 8×4_1 PE1 | 8×4_1 PE9 | … | 16×16_1 PE9
61 | 4×4_2 PE2 | 4×4_2 PE6 | 4×4_2 PE10 | 4×4_2 PE14 | 8×4_1 PE2 | 8×4_1 PE10 | … | 16×16_1 PE10
62 | 4×4_2 PE3 | 4×4_2 PE7 | 4×4_2 PE11 | 4×4_2 PE15 | 8×4_1 PE3 | 8×4_1 PE11 | … | 16×16_1 PE11
63 | 4×4_3 PE0 | 4×4_3 PE4 | 4×4_3 PE8 | 4×4_3 PE12 | 8×4_1 PE4 | 8×4_1 PE12 | … | 16×16_1 PE12
64 | 4×4_3 PE1 | 4×4_3 PE5 | 4×4_3 PE9 | 4×4_3 PE13 | 8×4_1 PE5 | 8×4_1 PE13 | … | 16×16_1 PE13
65 | 4×4_3 PE2 | 4×4_3 PE6 | 4×4_3 PE10 | 4×4_3 PE14 | 8×4_1 PE6 | 8×4_1 PE14 | … | 16×16_1 PE14
66 | 4×4_3 PE3 | 4×4_3 PE7 | 4×4_3 PE11 | 4×4_3 PE15 | 8×4_1 PE7 | 8×4_1 PE15 | … | 16×16_1 PE15

Fig. 7. Motion vector prediction.


It compares this SAD with the ones generated by PE4, PE5, PE6, and PE7; the new minimum SAD is passed to CE2, and so on. The same idea is applied to the other block modes.
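The sharing idea can be illustrated with a behavioral sketch in which one comparator serves a group of four PEs whose SADs arrive on consecutive cycles, and the running minimum is handed to the next comparator. This is a software analogy under stated assumptions, not the authors' circuit: the function names are hypothetical, each candidate is a (SAD, MV) tuple, and the list order stands in for the arrival order in time.

```python
def comparator_element(candidates, previous_min=None):
    """Sequentially compare (sad, mv) pairs, as one shared comparator would.

    previous_min plays the role of the SAD(N-1) input: the running minimum
    handed over by the preceding comparator, or None for the first one.
    """
    best = previous_min
    for sad, mv in candidates:  # one candidate per clock cycle
        valid = best is None or sad < best[0]
        if valid:  # 'valid' enables the SAD and MV registers
            best = (sad, mv)
    return best


def comparator_chain(pe_outputs, group_size=4):
    """Chain comparators so that each one serves group_size PEs."""
    best = None
    for start in range(0, len(pe_outputs), group_size):
        group = pe_outputs[start:start + group_size]
        best = comparator_element(group, best)
    return best
```

Because the SADs of a group never arrive in the same cycle, a single comparator per group is enough; four such elements cover the sixteen PEs for one block mode.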

3.4. Motion vector predictor

In parallel with the SAD prediction, the motion vector predictor predicts the MV of bigger blocks using the MV that is generated for the smallest block. According to [11], the spatio-temporal correlation can be exploited to estimate the MV of the current mode from the motion vectors already calculated. This method can therefore be used to reduce computation. Fig. 7 presents an algorithm that allows this prediction. The MVs corresponding to the Min_SAD of the 4×4 block mode are reused to calculate the MV of the Min_SAD of the 8×4 mode. In cycle 66, the minimum SAD of 4×4_0 and the corresponding MV are generated and stored. In cycle 70, the minimum SAD of 4×4_1 and the corresponding MV are generated. In parallel, the comparator of the 8×4 mode begins the process and the minimum SAD of this group is generated in cycle 71. Finally, the associated MV is computed as the average of the MV of the 4×4_0 block and the MV of the 4×4_1 block. The MVs of the other modes are processed in the same way. The system below describes the proposed predictor method, where PMV denotes the predicted MV:

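\left\{
\begin{aligned}
PMV_{BLK0} &= \tfrac{1}{2} \left( PMV_{BLK\_H0,\,BLK\_H1} + PMV_{BLK\_V0,\,BLK\_V1} \right) \\
PMV_{BLK\_H0} &= \tfrac{1}{2} \left( PMV_{BLK00} + PMV_{BLK10} \right) \\
PMV_{BLK\_H1} &= \tfrac{1}{2} \left( PMV_{BLK01} + PMV_{BLK11} \right) \\
PMV_{BLK\_V0} &= \tfrac{1}{2} \left( PMV_{BLK00} + PMV_{BLK01} \right) \\
PMV_{BLK\_V1} &= \tfrac{1}{2} \left( PMV_{BLK10} + PMV_{BLK11} \right)
\end{aligned}
\right.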

The architecture of this fast predictive VBSME can be described by the following algorithm.

Fig. 8. Model of power estimator.

If SAD (4×4 or 8×8) < Min_SAD (4×4 or 8×8) then
    Min_SAD = SAD (4×4 or 8×8)
    PMV (4×8 or 8×16) = 1/2 (MVBLKi0 + MVBLKi1), i ∈ {0, 1}
    PMV (8×4 or 16×8) = 1/2 (MVBLK0i + MVBLK1i), i ∈ {0, 1}
    PMV (8×8 or 16×16) = 1/4 Σ (MVBLK_Hi + MVBLK_Vi)
Else
    Return (Min_SAD, MV (4×8) or (8×16), MV (8×4) or (16×8), MV (8×8) or (16×16))
End if;
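The averaging used by the predictor can be sketched as follows. The code follows the five relations of the system given for PMV, with one interpretation stated as an assumption: the term written PMV with indices BLK_H0, BLK_H1 is read as the mean of the two H predictors (and likewise for V), which makes the top-level predictor equal to one quarter of the sum of the four pair predictors, as in the algorithm. The function names are hypothetical, motion vectors are (x, y) tuples, and floating-point division is used where hardware would more likely use shifts.

```python
def average(*mvs):
    """Component-wise mean of motion vectors given as (x, y) tuples."""
    n = len(mvs)
    return (sum(v[0] for v in mvs) / n, sum(v[1] for v in mvs) / n)


def predict_mvs(mv):
    """mv[i][j] is the motion vector of sub-block BLKij, with i, j in {0, 1}.

    Returns the predictors of the two H pairs, the two V pairs and the
    enclosing block.
    """
    # PMV_BLK_H0 = 1/2 (BLK00 + BLK10), PMV_BLK_H1 = 1/2 (BLK01 + BLK11)
    h = [average(mv[0][j], mv[1][j]) for j in (0, 1)]
    # PMV_BLK_V0 = 1/2 (BLK00 + BLK01), PMV_BLK_V1 = 1/2 (BLK10 + BLK11)
    v = [average(mv[i][0], mv[i][1]) for i in (0, 1)]
    # Enclosing block: 1/4 of the sum of the four pair predictors.
    whole = average(h[0], h[1], v[0], v[1])
    return {"H": h, "V": v, "whole": whole}
```

For example, with the four sub-block vectors (0, 0), (2, 0), (0, 2) and (2, 2), the predictor of the enclosing block is (1.0, 1.0), the mean of the four inputs.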

Fig. 9. The CE estimator’s FSM (left) and the PE estimator’s FSM (right).

4. Distributed control for power saving

The power consumed by logic in the proposed design iscorrelated to its input and mainly to the continuous activity ofcells, although there is no processing to do. Furthermore, itdepends on other parameters such as supplied voltage, frequency,capacity and activity. Eq. (3) shows the principal parameters ofdynamic power consumption

\[
P_d = \alpha C V_{DD}^{2} F \tag{3}
\]
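As a rough numerical illustration of Eq. (3), the sketch below evaluates the dynamic power for two activity factors. All numerical values (activity, capacitance, supply voltage) are hypothetical and are not measurements of the proposed design; the function name is also an assumption.

```python
def dynamic_power(alpha: float, c_farad: float, vdd_volt: float, f_hz: float) -> float:
    """Dynamic power in watts: P_d = alpha * C * V_DD^2 * F (Eq. (3))."""
    return alpha * c_farad * vdd_volt ** 2 * f_hz


# Hypothetical operating point, chosen only to show the trend.
C = 1e-9      # switched capacitance in farads (assumed)
VDD = 1.2     # supply voltage in volts (assumed)
F = 500e6     # clock frequency in hertz

ungated = dynamic_power(0.50, C, VDD, F)
gated = dynamic_power(0.25, C, VDD, F)   # clock gating lowers the activity factor
print(f"ungated: {ungated:.3f} W, gated: {gated:.3f} W")
```

Since P_d is linear in the activity factor, halving the activity of a block through clock gating halves its dynamic power at constant voltage and frequency.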

The power consumption of ME can reach 70% of the overall power consumption of H.264. Therefore, it is mandatory to integrate a control scheme in order to reduce power consumption and to obtain accurate power values. In this scheme, a power-saving model is included.

Clock gating is among the techniques used for reducing power consumption. The power model proposed in this paper is based on FSMs that gate the logic. We therefore integrate FSMs in the PE and CE arrays, taking advantage of the work flow of each array. Hence, we have developed hardware based on an FSM model. This model is integrated in the VBSME architecture and evaluated by the experimental results. Our proposal is based on two observations:

(1) The data flow of pixels is pipelined, and hence there are clock cycles in which some PEs have no pixels to process. Therefore, gating these PEs is mandatory in order to reduce activity and interrupt voltage.

(2) The comparator array is inactive during 51 clock cycles for each MB of the current frame to be processed. Moreover, the functioning of the CE is pipelined; hence the developed model can gate these elements and improve power saving.

The developed model is presented in Fig. 8.

The two finite state machines presented in Fig. 9 describe the transitions between the states of the CE array and between the relevant power states of the PE array, respectively. The transitions are conditioned by a control signal that is associated with a clock signal. The FSM of the CE estimator contains three states, namely “active”, “Nactive”, and “idle”. The FSM intercepts the control signal that indicates whether the comparator must be “active” or “Nactive”. Then, according to the value of the control signal, there is a transition to one of these two states. If the state is “Nactive”, the comparator is paused by the “s” signal, which gates the comparator clock.
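The behavior of the CE estimator's FSM can be sketched in software as follows. This is only a behavioral illustration, not the VHDL of the design: the class and function names are hypothetical, “idle” is assumed to be the reset state, and both the control signal and “s” are assumed to be active-high.

```python
from enum import Enum


class CeState(Enum):
    IDLE = "idle"
    ACTIVE = "active"
    NACTIVE = "Nactive"


class CeGateFsm:
    """Behavioral sketch of the CE estimator FSM (names are hypothetical)."""

    def __init__(self) -> None:
        # Assumption: the FSM rests in "idle" until the control signal is sampled.
        self.state = CeState.IDLE

    def step(self, ctrl: bool) -> bool:
        """Sample the control signal on a clock edge and return the gating signal s."""
        self.state = CeState.ACTIVE if ctrl else CeState.NACTIVE
        return self.state is CeState.NACTIVE


def gated_clock(clk: int, s: bool) -> int:
    """The comparator clock is held low while s is asserted."""
    return 0 if s else clk


fsm = CeGateFsm()
trace = [fsm.step(ctrl) for ctrl in (True, False, False, True)]
print(trace)  # [False, True, True, False]
```

While “s” is asserted, the comparator registers receive no clock edges and therefore contribute no switching activity.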

By referring to Table 1, we notice that not all the PEs operate throughout the functioning of the system. In fact, during the first 16 clock cycles only PE0 begins the processing at t=0 and ends at t=15. The other PEs begin the processing after a number of clock cycles: PE2 begins the processing at t+2, PE3 at t+3, and so on. In addition, we notice that PE0 finishes the processing at the 255th clock cycle. One clock cycle later, PE1 finishes the processing, and so on. The description of the FSM of the PE estimator is also based on this idea. It is composed of 32 states, and in each state one PE is enabled or disabled. Within a given clock cycle, many PEs of the PE array may be inactive; hence clock gating is required. The architecture associated with this state machine is developed in a hardware description language. It is simple to design and inexpensive in terms of power and area.
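The staggered activity of the PEs can be illustrated with a small Python sketch of the enable signals. It rests on assumptions that are not stated in this form in the text: 16 PEs (the PE count of the proposed architecture), PE i starting at cycle i, and each PE staying busy for 256 consecutive cycles so that PE0 is busy during cycles 0 to 255. The names are hypothetical.

```python
from typing import List

NUM_PES = 16          # PE count of the proposed architecture
CYCLES_PER_PE = 256   # assumption: PE0 is busy during cycles 0..255


def pe_enables(cycle: int) -> List[bool]:
    """Clock-enable of each PE in a given cycle (PE i starts at cycle i)."""
    return [i <= cycle < i + CYCLES_PER_PE for i in range(NUM_PES)]


def gated_slots() -> int:
    """Number of (cycle, PE) pairs in which a PE clock can be gated."""
    total_cycles = CYCLES_PER_PE + NUM_PES - 1
    return sum(pe_enables(c).count(False) for c in range(total_cycles))


print(sum(pe_enables(0)))    # 1: only PE0 is enabled
print(sum(pe_enables(15)))   # 16: all PEs are enabled
print(sum(pe_enables(256)))  # 15: PE0 has finished
```

Under these assumptions there are 16 enabling transitions at the start and 16 disabling transitions at the end of the pipeline, which matches the 32 states of the PE estimator's FSM.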

5. Implementation and results analysis

The proposed architecture was described in VHDL and synthesized with Synopsys Design Compiler using a 0.13 μm CMOS standard cell library. The backend tool is Cadence SOC Encounter. The resulting design contains about 23 K gates, excluding SRAM. The maximum running frequency of the circuit is about 500 MHz. At this maximum running frequency, the dissipated power of the proposed architecture is about 131 mW. For a typical video application, requiring QCIF at 30 fps, the circuit will dissipate 85 mW. Conversely, at maximum clock speed, up to 45 fps can be processed in a system with 4-CIF video resolution.


Table 3. Comparison of seven architectures for VBSME in terms of eight criteria.

| Architecture | [8] | [9] | [16] | [12] | [13] | Our | [15] | [17] |
|---|---|---|---|---|---|---|---|---|
| Search range | 32×32 | 33×33 | 32×32 | 33×33 | 57×57 | 30×30 | 32×32 | 31×31 |
| No. of PEs | 16 | 256 | 256 | 256 | 96 | 16 | 256 | 256 |
| Latency (cycles/MB) | 4096 | 256 | 500 | 2788 | 240 | 196 | 1614 | 3328 |
| Block size | 16×16 | 16×16 | 16×16 | 16×16 | 16×16 | 16×16 | 16×16 | 16×16 |
| Process (μm) | 0.13 | 0.18 | 0.13 | 0.18 | 0.18 | 0.13 | 0.18 | 0.13 |
| Gate count (K) | 61 | 597 | 283.5 | 160 | 110 | 23 | 350 | 54 |
| Max frequency (MHz) | 294 | 200 | 1.4 | 200 | 100 | 500 | 150 | – |
| Power (mW) | 570 | 503 | 1.33 | 423 | – | 131 | 101.5 | – |


A comparison between the proposed architecture and previous typical VBSME architectures for H.264 is presented in Table 3. Among these architectures, the proposed one has the lowest gate count, which is 57.4% lower than that of Ref. [17], the architecture with the most optimized gate count among the references.

Let T = 4p², where p represents the displacement in the search area (p = 7 in the case of our architecture), be the latency of the FSBMA hardware architecture. It is defined as the number of clock cycles required to identify the MVs for all the 41 sub-blocks in the MB. Among the architectures, the proposed one achieves the lowest latency, which is 18.3% lower than that of Ref. [13], the lowest-latency architecture among the references.
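The latency and the two relative reductions quoted above can be re-derived from the entries of Table 3 with a few lines of Python. The function names are hypothetical; the numbers are those of the table.

```python
def latency_cycles(p: int) -> int:
    """Latency of the FSBMA architecture in clock cycles per MB: T = 4 * p^2."""
    return 4 * p ** 2


def reduction_pct(ours: float, reference: float) -> float:
    """Relative reduction of `ours` with respect to `reference`, in percent."""
    return 100.0 * (reference - ours) / reference


t_ours = latency_cycles(7)
print(t_ours)                                # 196
print(round(reduction_pct(t_ours, 240), 1))  # 18.3 (latency vs. Ref. [13])
print(round(reduction_pct(23, 54), 1))       # 57.4 (gate count in K vs. Ref. [17])
```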

\[
\mathrm{Power}(X) \times \left(\frac{500}{\mathrm{Freq}(X)}\right) \tag{4}
\]

The proposed design operates at the highest operating frequency and consumes the lowest power compared with all the references. From the table, it might appear that the architectures of Refs. [15,16] consume less power than our architecture. However, if we scale the power consumption of these architectures to 500 MHz (the maximal running frequency of our architecture) using expression (4), where Power(X) and Freq(X) represent the dissipated power and the frequency of the architectures proposed in [15] or [16], we conclude that these architectures consume about 338 mW and 475 mW, respectively. Indeed, as shown in expression (4), power is proportional to frequency. Thus, the proposed architecture consumes about 61.2% less power than the architecture of [15], the lowest-power architecture among the references.

Among the major reasons that explain the reduced gate count and dissipated power of our architecture relative to the other architectures are its optimized PE and CE modules. In fact, the proposed VBSME is decomposed in a structural manner and all its blocks are optimized by hand. This design approach was not followed for the PEs of the architectures presented in the related works. For example, the PE architecture proposed in [13] is based on a behavioral description, and its VBSME is constructed via an LPE (large PE) that consists of three PE arrays, each containing 32 PEs. This PE only computes the SADs of the 4×4 blocks separately, and the combined values of these SADs are calculated in another module called ‘Merged for variable Block size computing’. This module consists of an adder tree that makes its architecture very complex and expensive in terms of resources and power consumption. As another example, in [17] the SAD computation is based on the absolute difference in signed digit using an online arithmetic (OLA) module that contains 16 adders. This proposal integrates a complex behavioral algorithm to compute the SAD of each mode. When the SAD is generated, it is sent to the OLA module. To compute the SADs of the 8×8 sub-blocks, four 16-OLA adder trees are used in a module called VBSME-N8. This module is used as a base to design a global adder called VBSME-16. Thus, this architecture is very complex and uses many hardware resources.
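The frequency scaling of expression (4) can be checked numerically with the short sketch below. The function name and the constant are hypothetical, and the power and frequency values are those listed in Table 3 for Refs. [15] and [16].

```python
TARGET_FREQ_MHZ = 500.0  # maximal running frequency of the proposed architecture


def scaled_power(power_mw: float, freq_mhz: float) -> float:
    """Expression (4): power rescaled linearly to the target frequency."""
    return power_mw * TARGET_FREQ_MHZ / freq_mhz


p15 = scaled_power(101.5, 150.0)  # Ref. [15]
p16 = scaled_power(1.33, 1.4)     # Ref. [16]
saving_vs_15 = 100.0 * (p15 - 131.0) / p15

print(f"[15]: {p15:.1f} mW, [16]: {p16:.1f} mW")
print(f"saving of the proposed design vs. [15]: {saving_vs_15:.1f} %")
```

The scaled values are close to 338 mW and 475 mW, and the resulting saving is about 61%. The comparison assumes that power scales linearly with frequency for every design, as expression (4) does; differences in process and supply voltage are not taken into account.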

6. Conclusion

This paper presents a VLSI architecture for VBSME, which is one of the key features of the H.264 standard. The proposed architecture provides full-search VBS and produces 41 motion vectors for one 16×16 MB. A shared resource and a reuse method are proposed in order to reduce intensive computation. Owing to the use of a power-saving model, a motion estimation prediction method and a structural optimization technique, a significant reduction in area and power consumption was obtained.

This approach can be used in redesigning current VBSME architectures to improve their scalability and reduce their design costs.

References

[1] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), document JVT-G050d35.doc, Seventh Meeting, Pattaya, Thailand, March 2003.

[2] P.M. Kuhn, G. Diebel, S. Hermann, A. Keil, H. Mooshofer, A. Kaup, R. Mayer, W. Stechele, Complexity and PSNR comparison of several fast motion estimation algorithm for MPEG4, in: Proceedings of the SPIE Applications of Digital Image Processing, July 1998, pp. 486–489.

[3] S. Yang, W. Wolf, N. Vijaykrishnan, Power and performance analysis of motion estimation based on hardware and software realizations, IEEE Transactions on Computers 54 (6) (June 2005) 714–726.

[4] J.P. Berns, T.G. Noll, A flexible motion estimation chip for variable size block matching, in: Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, August 1996, pp. 112–121.

[5] J.F. Shen, T.C. Wang, L.G. Chen, A novel low-power full-search block-matching motion-estimation design for H.263+, IEEE Transactions on Circuits and Systems for Video Technology 11 (7) (July 2001) 890–897.

[6] K.M. Yang, M.T. Sun, L. Wu, A family of VLSI designs for the motion compensation block-matching algorithm, IEEE Transactions on Circuits and Systems 36 (10) (October 1989) 1317–1325.

[7] M. Sayed, I. Amer, W. Badawy, Towards an H.264/AVC full encoder on chip: an efficient real-time VBSME ASIC chip, in: Proceedings of the IEEE International Symposium on Circuits and Systems, July 2006, pp. 22–24.

[8] S.Y. Yap, J.V. McCanny, A VLSI architecture for variable block size video motion estimation, IEEE Transactions on Circuits and Systems 51 (7) (July 2004) 384–389.

[9] C.M. Ou, C.F. Le, W.J. Hwang, An efficient VLSI architecture for H.264 variable block size motion estimation, IEEE Transactions on Consumer Electronics 51 (4) (November 2005) 1291–1299.

[10] H.P. Afshar, P. Brisk, P. Ienne, Scalable and low cost design approach for variable block size motion estimation (VBSME), in: Proceedings of the International Symposium on VLSI Design, Automation and Test, April 2009, pp. 271–274.

[11] Z. Yang, J. Bu, C. Chen, X. Li, Fast predictive variable-block-size motion estimation for H.264/AVC, in: Proceedings of the International Conference on Multimedia and Expo, July 2005, pp. 354–357.

[12] C. Wei, H. Hui, T. Jiarong, L. Jinmei, M. Hao, A high-performance reconfigurable VLSI architecture for VBSME in H.264, IEEE Transactions on Consumer Electronics 54 (3) (August 2008) 1338–1345.

[13] S.C. Hsia, P.Y. Hong, Very large scale integration (VLSI) implementation of low-complexity variable block size motion estimation for H.264/AVC coding, IET Circuits, Devices & Systems 4 (5) (September 2010) 414–424.

[14] S. Asano, S.Z. Zheng, T. Maruyama, An FPGA implementation of full-search variable block size motion estimation, in: Proceedings of the IEEE International Conference on Field-Programmable Technology, December 2010, pp. 399–402.


[15] P. Li, H. Tang, A low-power VLSI implementation for variable block size motion estimation in H.264/AVC, in: Proceedings of the IEEE International Symposium on Circuits and Systems, June 2010, pp. 2972–2975.

[16] A. Bahari, T. Arslan, A.T. Erdogan, Low-power H.264 video compression architectures for mobile communication, IEEE Transactions on Circuits and Systems for Video Technology 19 (9) (September 2009) 1251–1261.

[17] J. Olivares, VBSME Reconfigurable, Architecture using RBSAD, Journal of Universal Computer Science 18 (2) (January 2012) 264–285.

Majdi Elhaji received his PhD degree in Computer Science and Physics from the University of Lille and the Faculty of Sciences of Monastir in 2012. His research interests include video coding, network on chip, system on chip co-design, synthesis and simulation, performance evaluation and model driven engineering.

Abdelkrim Zitouni was born in Gabès, Tunisia, on 6 October 1970. He received the DEA and the PhD degree in Physics (Electronics option) from the Faculty of Sciences of Monastir, Tunisia, in 1996 and 2001 respectively. He received the HDR degree in Physics (Electronics option) in 2009. Since then he has been a professor in Electronics and Microelectronics with the Physics Department of the Faculty of Sciences of Monastir. His research interests include communication synthesis for SoC, video coding and asynchronous system design.

Samy Meftali received his Master of Science (1999) and his PhD in Computer Science (2002) from the University Joseph Fourier of Grenoble (Grenoble 1), France. In 2002, he was a research assistant (ATER) with the LIFL Laboratory, University of Lille 1, where he worked on heterogeneous multi-language simulation platforms for embedded SoCs. He became an associate professor at the University of Lille 1 in 2004. He then worked in the LIFL laboratory and the INRIA DaRT project. His research interests are mainly modelling, simulating and implementing massively parallel systems on partially and dynamically reconfigurable FPGAs. He obtained the HDR degree in 2010. He has published about 60 research articles in high level international journals and conferences.

Jean-Luc Dekeyser received his PhD degree in Computer Science from the University of Lille in 1986. He was a fellow at CERN, Geneva. After a few years at the Supercomputing Computation Research Institute at Florida State University, where he worked on high performance computing for Monte-Carlo methods in High Energy Physics, he joined the University of Lille in France in 1988 as an assistant professor. There he worked on the data parallel paradigm and vector processing. He created a research group working on High Performance Computing in the CNRS lab in Lille. He is currently a professor in Computer Science at the University of Lille. He headed the DaRT Inria project from 2005 to 2011. His research interests include embedded systems, system on chip co-design, synthesis and simulation, performance evaluation, high performance computing, model driven engineering, dynamic reconfiguration, chip-3D.

Rached Tourki received the BS degree in Physics (Electronics option) from the Tunis University in 1970, and the MS and doctorate in Electronics from the University of Orsay, Paris-South University, in 1971 and 1973 respectively. From 1973 to 1974 he served as a microelectronics engineer at Thomson-CSF. He received the Doctoral thesis in Physics from the Nice University in 1979. Since then he has been a professor in Microelectronics and Microprocessors with the Physics Department, Faculty of Monastir. Currently, he is the dean of this institution. He is also the head of the Electronics and Microelectronics Lab, which is involved in system and network on chip hardware-software codesign for ASIC and FPGA prototyping.