IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 61, NO. 8, AUGUST 2014 4093

An FPGA-Based Fully Synchronized Design of a Bilateral Filter for Real-Time Image Denoising

Anna Gabiger-Rose, Student Member, IEEE, Matthias Kube, Robert Weigel, Fellow, IEEE, and Richard Rose, Student Member, IEEE

Abstract—In this paper, a detailed description of a synchronous field-programmable gate array implementation of a bilateral filter for image processing is given. The bilateral filter is chosen for one unique reason: It reduces noise while preserving details. The design is described on register-transfer level. The distinctive feature of our design concept consists of changing the clock domain in such a manner that kernel-based processing is possible, which means the processing of the entire filter window in one pixel clock cycle. This feature of the kernel-based design is supported by the arrangement of the input data into groups so that the internal clock of the design is a multiple of the pixel clock given by a targeted system. Additionally, by exploiting the separability and the symmetry of one filter component, the complexity of the design is considerably reduced. Combining these features, the bilateral filter is implemented as a highly parallelized pipeline structure with very economical and effective utilization of dedicated resources. Due to the modularity of the filter design, kernels of different sizes can be implemented with low effort using our design and the given instructions for scaling. As the original form of the bilateral filter with no approximations or modifications is implemented, the resulting image quality depends on the chosen filter parameters only. The quantization of the filter coefficients introduces only a negligible quality loss.

Index Terms—Bilateral filter, field-programmable gate array (FPGA), image processing, noise reduction, real-time processing.

I. INTRODUCTION

BILATERAL filtering has gained great popularity in image processing due to its capability of reducing noise while preserving the structural information of an image. The bilateral filter [1] consists of two components. The detail-preserving property of the filter is mainly caused by the nonlinear filter component, also called the photometric filter. It selects the pixels of similar intensity, which are afterward averaged by the linear component. Very often, the linear component is formulated as a low-pass filter. The amount of noise reduction via selective averaging and the amount of blurring via low-pass filtering are both adjusted by two parameters. The understanding of these parameters is very intuitive, which makes the bilateral filter an almost all-purpose solution in image processing.

Manuscript received March 5, 2012; revised August 6, 2012 and October 24, 2012; accepted December 6, 2012. Date of publication October 25, 2013; date of current version February 7, 2014.

A. Gabiger-Rose, R. Weigel, and R. Rose are with the Institute for Electronics Engineering, Friedrich-Alexander University of Erlangen-Nuremberg, 91058 Erlangen, Germany (e-mail: [email protected]; robert.weigel@fau.de; [email protected]).

M. Kube is with the Department of Contactless Test and Measuring Systems, Fraunhofer Institute for Integrated Circuits, 91058 Erlangen, Germany (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIE.2013.2284133

The authors of [2] and [3] show that noise filtering, despite the prevailing view, does not always imply resolution reduction but can even be used to sharpen the edges [2] or to enhance flowlike structures [3]. In [4], the motion-adaptive bilateral filter is used for quality improvement in low bit rate video coding. Also, in [5], the bilateral filter is applied for noise reduction in a method for local tone mapping which maps a high dynamic range image to a low dynamic range image.

Recently, bilateral filtering has attracted considerable attention in medical image processing and nondestructive testing. The authors of [6] studied the impact of noise reduction by the bilateral filter applied to the reconstructed images. They concluded that the images processed with this filter show a significant improvement in image quality compared to their unfiltered counterparts. In [7], the authors discuss the results of noise reduction by the bilateral filter in projection space. This means that the noise filtering takes place prior to computing the reconstructed volume. It has been concluded that noise reduction of this kind can be translated into a dose reduction in X-ray computed tomography. Considering industrial applications, the dose reduction permits a reduction of the scanning time and thus allows a higher throughput of test items.

Our own experiments and studies shown in [8] and [9] confirm the possible dose reduction. As the reduction of the exposure time due to filtering is feasible, we are interested in a real-time filtering of projections. Moreover, the filter is not supposed to reduce the spatial resolution of projections, in order to maintain the visibility of defects in a reconstruction. Since we achieve very satisfying results regarding detail preservation with our field-programmable gate array (FPGA) implementation presented in [10], we intend to give a deeper insight into our work.

The major contribution of this paper is the detailed description of a novel FPGA design architecture of the bilateral filter on register-transfer level (RTL). This abstraction level is chosen because it permits direct specification of the clocking scheme [11]. The main advantages of this design are the capability of real-time processing and the economical and effective utilization of resources through the following.

1) Sorting the data into equal groups to which separate pipelines are assigned.

2) Raising the internal clock frequency according to the data flow.

3) Avoiding the need for an external image buffer.

0278-0046 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Moreover, due to the modularity of the design, it can be extended to implement an arbitrary kernel size with low effort. The instructions required for this can be found later in this paper.

The remainder of this paper is organized as follows. In Section II, we consider the related work. After a short description of the bilateral filter in Section III, we give a detailed description of our FPGA design in Section IV. Section IV is the main part of this paper, presenting the filter design stage by stage. In Section V, the criteria applied to the evaluation of the image quality before and after the noise filtering are detailed. After that, in Section VI, the results are discussed, and the performance potential of our filter design is analyzed.

II. RELATED WORK

Since the bilateral filter is in widespread use, a lot of effort has been put into its acceleration for use in practical applications. Among the publications concerning the speedup of bilateral filtering, two main trends can be identified. One stream focuses on the modification of the filtering components, resulting in an efficient algorithm. The other trend is to accelerate the filtering through parallelizing the algorithm or through hardware acceleration, including modifications of the filter at the same time.

In [12], a fast approximation of the original bilateral filter is proposed. Here, the 2-D filtering is separated into two 1-D operations, performing 1-D bilateral filtering in one arbitrary dimension and filtering the intermediate result in the same manner in the subsequent dimension. The authors report that the dependence of the execution time on the number of filter dimensions decreases from exponential to linear. This approach requires a small memory overhead but results in a filter which is fast enough to be used for preprocessing in video compression systems. However, as the photometric component of the bilateral filter is not separable, the image resulting from the modified filter is documented to be slightly different from the image produced by the original filter.

Another acceleration approach proposed in [13] has provided a basis for numerous extensive works. This approach provides a numerical scheme for speeding up the filtering via a piecewise-linear approximation of the bilateral filter in the intensity domain and substituting the low-pass filtering by downsampling. In [14], this technique is extended by transposing the computation to a 3-D space, presenting the image intensity as a third dimension over the 2-D image coordinate space. After that, the authors of [15] formulated the concept of the bilateral grid and implemented the bilateral filter using the proposed data structure on three different graphics processing units (GPUs). Only with this hardware acceleration does processing at 30 fps become possible, which the authors regard as real-time performance. Later, the technique proposed in [13] was also implemented on a GPU by the authors of [16] and is also capable of real-time processing at the same frame rate. More recently, the lazy sliding window implementation of the approach in [13] was proposed in [17]. This method is suitable for single-instruction-multiple-data-type processors like DSPs. In this case, the speedup also allows applications requiring real-time performance. The main drawback of the filter acceleration approach discussed so far is the high amount of memory required for the implementation.

Instead of a piecewise-linear approximation and subsampling, the idea of utilizing a histogram-based approach for accelerating the filter is presented in [18] and [19]. The main difference between these two works is that, in [18], a hierarchy of partial distributed histograms on multiple tiers is computed and adjusted for each output pixel, while the author of [19] calculates the integral histogram of the image and extracts the histogram for each target filter window to obtain one output pixel. Both methods are fast, but real-time performance of the histogram-based approach in [19] can only be achieved by the very-large-scale-integration design of the filter shown in [20]. The memory demand of the histogram-based acceleration method is also high but is lower than that of the piecewise-linear approximation and subsampling approach.

The aforementioned examples show that a filter modification technique reaches real-time performance only if its implementation utilizes hardware acceleration. Most of the works referred to rely on GPUs for acceleration. However, in fields of application in which high power efficiency is crucial, an FPGA solution is preferable. In [21], an algorithm for the denoising of medical images is implemented on an FPGA and four different GPUs. The authors show that the power consumption of their FPGA implementation is always significantly lower. Furthermore, the authors of [21] point out that an FPGA implementation allows latency to be counted in image lines, resulting in delays lower than one frame, while the latency on a GPU is always one frame. This is relevant for many medical applications which demand fast image output to support interactive operations.

The authors of [22] also choose an FPGA implementation for their image processing system because moving time-critical functionalities, like the edge detection in an image, to hardware platforms makes it possible to keep delays in the control loop to a minimum. The authors of [23] and [24] report excellent experience of using FPGAs for motion control of robots based on real-time image processing. The main reason for using FPGAs for real-time robotics tasks is the ability of FPGAs to satisfy the requirement for high computational power and data throughput [24]. Moreover, FPGA solutions offer additional advantages, such as reconfigurability and portability.

However, considering the complexity and timing constraints of the algorithm to be implemented, the suitability of the chosen hardware platform has to be checked [25]. A DSP implementation has been regarded as more appropriate for complex algorithms with high data dependence. For algorithms with low data dependence and tight timing constraints, an FPGA solution is more suitable. The authors of [25] discuss in detail the advantages of using FPGAs even if the algorithm shows both high complexity and tight timing constraints. At the same time, the authors of [26] emphasize in their conclusion that FPGA-based digital processing systems achieve better performance, at a lower cost, than traditional solutions based on DSPs.

Furthermore, the parallel architecture of the FPGA provides an excellent platform for the implementation of parallelized and pipelined structures. This conclusion is drawn by many authors. For instance, having implemented an algorithm for color image segmentation for object detection in full parallelism on an FPGA, the authors of [27] report a drastic improvement of the segmentation speed compared with the sequential-code-based segmentation. In [28], a design of a fully pipelined data path for real-time face detection using an FPGA is described which supports high-speed detection irrespective of the number of faces in an image. The authors of [29] implement their parallelized and fully pipelined hardware for real-time electromagnetic transient simulation on an FPGA and thereby solve the challenging problem of implementing the complex simulation models.

There are several publications dealing with FPGA implementations of the bilateral filter. In [30], one of these designs is presented. The VHSIC hardware description language (VHDL) code of this design is generated automatically from the models for FPGA synthesis using System Generator from Xilinx. Although the optimization setting for the code generation was for maximum clock frequency, the authors admit that the speed of their implementation for a 15 × 15 pixel filter kernel is insufficient for a real-time application. The authors of [31] compared a VHDL and a high-level synthesis (HLS) description, created by System Generator, of an adaptive impulse noise filter and concluded that a higher system clock speed can be achieved using the VHDL description. Thus, these publications show by way of example that the handcrafted optimization of an FPGA design regarding both the operating frequency and the resource utilization is still irreplaceable.

A different approach for the FPGA implementation of a real-time bilateral filter has been proposed in [32]. The modified filter is based on the calculation of the filter coefficients from the photometric filter only. The spatial filtering is eliminated by processing the minimal window of 3 × 3 and raising the derived photometric coefficients to the power of 8. According to the authors, for a moderate noise level, their modified bilateral filter can achieve slightly better results compared to the traditional bilateral filter shown in [1]. However, the original bilateral filter can be tuned by two parameters which largely determine the filtering performance. Unfortunately, no description of the parameters used for this comparison is given in [32].

The work published in [33] is most closely related to our work. The major parallel to our design consists in implementing the bilateral filter on an FPGA without any modification. This approach is sometimes called the brute-force method. However, the main difference from our work is that the authors developed their design using an HLS tool. The resulting architecture presents a 3 × 3 filter kernel. In contrast, our design is based on an RTL description and presents a 5 × 5 filter kernel. Our design allows a high clock frequency and high data throughput and shows only a slight increase of resource demand considering the larger kernel. It follows that our architecture utilizes hardware resources more efficiently and more economically.

III. BILATERAL FILTER

The bilateral filter [1] embodies the idea of a combination of domain and range filtering. The domain filter averages the nearby pixel values and thereby acts as a low-pass filter. The range filter stands for the nonlinear component and plays an important part in edge preserving. This component allows averaging of similar pixel values only, regardless of their position in the filter window. If the value of a pixel in the filter window diverges from the value of the pixel being filtered by a certain amount, the pixel is skipped.

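Taking Gaussian noise into account, the shift-variant filtering operation of the bilateral filter is given by

$$\bar{\phi}(\bar{\mathbf{m}}_0) = \frac{1}{k(\mathbf{m}_0)} \sum_{\mathbf{m} \in F} \phi(\mathbf{m}) \cdot s\big(\phi(\mathbf{m}_0), \phi(\mathbf{m})\big) \cdot c(\mathbf{m}_0, \mathbf{m}). \tag{1}$$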

The term $\mathbf{m} = (m, n)$ denotes the pixel coordinates in the image to be filtered, and $\mathbf{m}_0 = (m_0, n_0)$ and $\bar{\mathbf{m}}_0 = (\bar{m}_0, \bar{n}_0)$ represent the coordinates of the centered pixel in the noisy and in the filtered images, respectively. With these notations, $\bar{\phi}(\bar{\mathbf{m}}_0)$ means the gray value of the pixel being filtered, and $\phi(\mathbf{m})$ identifies the gray value of the spatially neighboring pixels to $\phi(\mathbf{m}_0)$ in the filter window $F$.

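The following expressions (2) and (3) describe the photometric and the geometric components $s(\phi(\mathbf{m}_0), \phi(\mathbf{m}))$ and $c(\mathbf{m}_0, \mathbf{m})$, respectively:

$$s\big(\phi(\mathbf{m}_0), \phi(\mathbf{m})\big) = \exp\left(-\frac{1}{2}\left(\frac{\|\phi(\mathbf{m}_0) - \phi(\mathbf{m})\|}{\sigma_{ph}}\right)^2\right) \tag{2}$$

$$c(\mathbf{m}_0, \mathbf{m}) = \exp\left(-\frac{1}{2}\left(\frac{\|\mathbf{m}_0 - \mathbf{m}\|}{\sigma_c}\right)^2\right) \tag{3}$$

where parameters $\sigma_{ph}$ and $\sigma_c$ regulate the width of the Gaussian curve assigned to $s(\phi(\mathbf{m}_0), \phi(\mathbf{m}))$ and $c(\mathbf{m}_0, \mathbf{m})$, respectively.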

The photometric component compares the gray value of the centered pixel with the gray values of the spatial neighborhood and computes the corresponding weight coefficients depending on the factor $\sigma_{ph}$. The more the absolute difference of the gray values exceeds $\sigma_{ph}$, the lower the corresponding filter coefficient is, and vice versa. The domain filter $c(\mathbf{m}_0, \mathbf{m})$ acts as a standard low-pass filter, the weights of which decrease with the spatial distance between the centered pixel and the pixels in the neighborhood, following the Gaussian curve in (3).

Normalization with

$$k(\mathbf{m}_0) = \sum_{\mathbf{m} \in F} s\big(\phi(\mathbf{m}_0), \phi(\mathbf{m})\big) \cdot c(\mathbf{m}_0, \mathbf{m}) \tag{4}$$

guarantees that the range of the filtered images does not change significantly due to the filtering. Owing to the fact that the coefficients of the photometric component cannot be computed in advance, the division by the normalization factor cannot be avoided by means of prescaling of the filter coefficients.
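As a software reference for (1) to (4), the filter can be sketched in a few lines of floating-point Python. This is only an illustrative model and not the hardware design described in Section IV: the function name `bilateral_filter`, the `radius` parameter, and the cropping of the window at the image border are assumptions made for the example.

```python
import math

def bilateral_filter(img, sigma_ph, sigma_c, radius=2):
    """Brute-force bilateral filter on a 2-D list of gray values.

    radius=2 corresponds to a 5 x 5 window (illustrative default).
    """
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for m0 in range(rows):
        for n0 in range(cols):
            center = img[m0][n0]
            acc = 0.0  # weighted sum, numerator of (1)
            k = 0.0    # normalization factor (4)
            for dm in range(-radius, radius + 1):
                for dn in range(-radius, radius + 1):
                    m, n = m0 + dm, n0 + dn
                    if not (0 <= m < rows and 0 <= n < cols):
                        continue  # assumption: window is cropped at the border
                    # photometric weight (2)
                    s = math.exp(-0.5 * ((center - img[m][n]) / sigma_ph) ** 2)
                    # geometric weight (3), squared distance is dm^2 + dn^2
                    c = math.exp(-0.5 * (dm * dm + dn * dn) / sigma_c ** 2)
                    acc += img[m][n] * s * c
                    k += s * c
            out[m0][n0] = acc / k  # k > 0 because the center pixel has weight 1
    return out
```

Because the photometric weight `s` depends on the image content, `k` differs from pixel to pixel, which is why the division cannot be folded into precomputed coefficients.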

IV. DESIGN CONCEPT

The image data, as well as all constants and coefficients used in the following design concept, are integer numbers. As discussed in Section VI, there is no need to implement floating-point computation. With the aid of the presented design concept, the bilateral filter can be realized as a highly parallelized pipeline structure, with great importance attached to effective resource utilization. In this paper, the data paths are detailed. The description of the control signals is not addressed here.


Fig. 1. Order of the functional units of the bilateral filter.

Fig. 2. Principle of the input data retrieval for the image filtering.

For the design description, a window size of 5 × 5 is chosen. This window size is a tradeoff between high noise reduction and low blurring effect.

The design concept for the implementation of the bilateral filter is subdivided into three functional blocks. The block-based design approach reduces design complexity and simplifies validation [34]. Fig. 1 presents these units and their order in the concept. The input data marked by “Data_in” are read line by line and arranged for further processing in the register matrix. The second unit is the photometric filter which weights the input data according to the intensity of the processed pixels. The filtering is completed by the geometric filter, and the filtered data are marked by “Data_out.”

A. Register Matrix

The photometric filter component, also often referred to as a range filter in the related literature, is a nonlinear filter. This means that the filter coefficients change for every filter position. Thus, the pixel weights for the photometric component have to be calculated separately for every pixel in the filter window. The number of weights depends on the filter window size. Here, 24 weights have to be computed for the filtering of one image pixel.

The filter window is shifted first along the input lines representing the image rows, moving one row down every time the preceding row has been filtered. Consequently, this filtering technique demands that at least five lines be stored for the period of time during which a line is filtered. An external image buffer is undesired because of the additional resources required by the memory controller and because of the additional latency caused by the memory accesses. Therefore, the five input lines are stored in the line storages, which are implemented as block RAMs for data with N bits. The five input lines are called image rows or rows in the following. These five rows include the row to be filtered, two preceding rows, and two succeeding rows.

This arrangement is depicted in Fig. 2. The pixel being filtered is marked by “mid_pix.” This pixel and its neighborhood in the solid box represent the kernel of the bilateral filter. After the middle row has been filtered, the outer preceding row “line storage n-2” moves out of the register matrix. As the input data are read into the register matrix pixel by pixel, the content of the line storages and of the filter kernel is shifted by one pixel at each clock event. This shift emulates the shift of the filter kernel. In this way, at the end of an image line, all remaining rows have been shifted one row down. The former succeeding row “line storage n + 1” can now be processed. The output lines form the output image which is stored externally.

Fig. 3. Register matrix of the kernel-based design concept.
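The shifting of line storages and window registers can be modeled in software as follows. This is a minimal behavioral sketch, not the RTL of the paper: it uses four row delays plus the incoming pixel instead of five block-RAM line storages, and the name `stream_windows` and its parameters are hypothetical.

```python
from collections import deque

def stream_windows(pixels, width, size=5):
    """Yield (row, col, window) for every fully valid size x size window.

    pixels: iterable of gray values in raster order, one per pixel clock.
    (row, col) are the coordinates of the center pixel of the window.
    """
    # size-1 row delays; each one delays the stream by exactly one image row
    lines = [deque([0] * width, maxlen=width) for _ in range(size - 1)]
    # window[r] holds the last `size` pixels of one row; r = 0 is the oldest row
    window = [deque([0] * size, maxlen=size) for _ in range(size)]
    for idx, px in enumerate(pixels):
        # incoming column: newest pixel followed by the taps of the row delays
        column = [px]
        for line in lines:
            delayed = line[0]        # value pushed `width` clocks ago
            line.append(column[-1])  # push current value, oldest one drops out
            column.append(delayed)
        # column[0] belongs to the newest row, column[-1] to the oldest row
        for r, value in enumerate(reversed(column)):
            window[r].append(value)  # shift the register matrix by one pixel
        row, col = divmod(idx, width)
        if row >= size - 1 and col >= size - 1:
            yield row - size // 2, col - size // 2, [list(w) for w in window]
```

Each call of the loop body corresponds to one pixel clock event: one pixel enters, every row delay shifts by one position, and the complete kernel is available at once.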

The parallel calculation of 24 weights in the photometric filter component and the subsequent weighting in the geometric component, combined with the final normalization at the filter output, require a large amount of resources considering the short time of just one pixel cycle. Due to the flexibility of the clock management in FPGAs, this challenge can be met.

The solution is offered by our kernel-based design concept in Fig. 3. The single registers are interconnected in such a manner that, aside from the shift of the filter window by one pixel, the entire kernel is provided to the next filter stage simultaneously. This is an important advantage of the presented kernel-based design concept as no extra data buffer is required. On the other hand, it is necessary to process all 25 pixels in one pixel cycle in order to keep up with the reading of the input lines into the register matrix.

The output of the register matrix is sorted into groups, in this case into six groups, and fed synchronously into the photometric filter component at four times the pixel clock frequency. The number of groups is explained by the symmetry of the geometric filter component, which is discussed later in Section IV-C. The sorting is done by means of multiplexing the pixels in the manner shown in Fig. 3. The quadruplication of the filter processing clock is implemented by setting the select signal of the multiplexers four times per pixel clock cycle. Here, the clock domain changes to four times the input pixel clock. The counter at the top of Fig. 3 generates the select signal and thus controls the readout of the register matrix. This counter is clocked with the quadruple pixel clock as well. The counter is first enabled after the whole register matrix is filled.

Fig. 4. Abstract illustration of the photometric filter component.

The pixels in each group are processed in parallel while each group is pipelined through to the register matrix output stage. The pixel in the center of the filter window is not a part of any group and is forwarded to a latch belonging to the input stage of the photometric filter component. The sorting of the pixels into groups and the quadruplication of the pixel clock are the key to the presented synchronous FPGA design concept using a parallelized pipeline architecture.
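One way to arrive at six groups of four pixels for a 5 × 5 window is to group the 24 neighbors by their distance from the center pixel, which matches the symmetry of the geometric weights. The exact assignment of Fig. 3 is not reproduced in the text, so the grouping below is an assumption made for illustration only; `group_neighbors` and `mux_select` are hypothetical names.

```python
def group_neighbors(radius=2, group_size=4):
    """Partition the neighbor offsets of the center pixel into equal-distance groups."""
    by_dist = {}
    for dm in range(-radius, radius + 1):
        for dn in range(-radius, radius + 1):
            if (dm, dn) != (0, 0):  # the center pixel belongs to no group
                by_dist.setdefault(dm * dm + dn * dn, []).append((dm, dn))
    groups = []
    for dist in sorted(by_dist):
        members = by_dist[dist]
        # assumption: a distance class with more than group_size pixels is split
        for i in range(0, len(members), group_size):
            groups.append(members[i:i + group_size])
    return groups

def mux_select(groups, tick):
    """Offsets presented to the pipelines at internal clock tick 0..3.

    Models the multiplexer select signal driven by the counter.
    """
    return [group[tick] for group in groups]
```

With `radius=2`, the squared distances 1, 2, 4, and 8 each occur four times and the squared distance 5 occurs eight times, so splitting the latter class yields six groups of four. Stepping `tick` from 0 to 3 then presents all 24 neighbors within one pixel cycle.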

B. Photometric Component

After the register matrix has been filled, the grouped image data are provided to the photometric filter component which is pictured in Fig. 4. At the output of the photometric filter, the weighted pixels appear, still sorted into groups, accompanied by the “weighted mid_pix.” Additionally, the photometric coefficients have to be forwarded for the required normalization at the last stage of the filtering according to (4). Thus, in parallel to the pixels, the photometric coefficients also have to be processed by the geometric filter in order to obtain the normalization factor defined in (4). For this reason, the output of the photometric filter consists of the following:

1) weighted pixels sorted into groups 0 . . . 5;

2) the weighted pixel being filtered, marked by “mid_pix”;

3) photometric coefficients corresponding to groups 0 . . . 5.

In further stages of the design, the weighted pixel values, i.e., the outputs of the multipliers, are named by their groups 0 . . . 5.

A detailed functional flow block diagram of the photometric filter is shown in Fig. 5. The pixel in the center of the filter window has to be available during the calculation of the required 24 pixel weights. Latching the centered pixel allows the computation of the gray value differences between the centered pixel and the remaining pixels inside of the filter window. Each group contains four pixels. A separate pipeline belonging to each group makes it possible to process the entire neighborhood of “mid_pix” within one pixel clock cycle. All six pipelines are designed identically.

Fig. 5. Photometric filter component.

Fig. 6. Processing order of input data in the photometric filter component.

The way of arranging and the processing order of the input data of the photometric component are shown in Fig. 6. At the first internal clock event t0, the first pixels of each group are provided to the respective pipeline. At the second internal clock event t1, the second pixels of each group enter the component. This organization of groups allows the processing of the whole filter window in four internal clock cycles corresponding to one pixel cycle. In the upper part of Fig. 5, the processing path for the group 0 is shown; in the lower part, there is the processing path for the group 5.


Fig. 7. Limitation of the number of coefficients.

The combinatory blocks “comb.0 ... 5” compute the absolute gray value difference required by (2). In order to keep the design synchronous, the gray values of each pipeline are registered during the difference calculation. The upper path in Fig. 5 shows the required registers labeled “group 0” to make sure that the gray value appears at the input of the multiplier at the same time as the corresponding photometric coefficient. Throughout the following, we use registers to keep our design synchronous, which makes any delay control inside of our architecture redundant.

To avoid the calculation of the expensive exponential, all possible values of the function (2) are precalculated and stored in the lookup table (LUT). The absolute difference of the gray values itself is directly interpreted as the address of the corresponding weight coefficient in the LUT.
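The precalculation of the photometric LUT can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian form `exp(-d^2 / (2 * sigma_ph^2))` stands in for the function (2), which is not reproduced in this section, and the scaling to the largest W-bit value with round-to-nearest is a hypothetical quantization rule, so the number of nonzero entries need not match the count reported for Fig. 7. The function name `build_photometric_lut` is ours.

```python
import math


def build_photometric_lut(n_bits=8, w_bits=8, sigma_ph=60.0):
    """Precompute one quantized weight per possible gray value difference.

    Assumptions (not taken from the paper): Gaussian weight
    exp(-d^2 / (2 * sigma_ph^2)), scaled to 2^W - 1, rounded to nearest.
    """
    max_coeff = (1 << w_bits) - 1            # largest W-bit coefficient
    lut = []
    for diff in range(1 << n_bits):          # the difference is used as address
        weight = math.exp(-(diff * diff) / (2.0 * sigma_ph * sigma_ph))
        lut.append(int(round(max_coeff * weight)))
    return lut


lut = build_photometric_lut()
# Entries far from address 0 quantize to zero, which limits how many
# coefficients actually have to be stored.
nonzero_count = sum(1 for coeff in lut if coeff != 0)
```

The weight is read with a single table access, `lut[abs(pixel - mid_pix)]`, so no exponential is evaluated while filtering.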

Due to the quantization, the number of the weight coefficients is limited. This limit depends on three parameters:

1) the word length N of the input data;
2) the parameter σph;
3) the word length W of the coefficients.

The first point means that increasing the color depth of an image causes a larger amount of intensity differences that have to be stored in the LUT. Depending on the parameter σph, the slope of the Gaussian curve is steeper or flatter, which influences the number of coefficients different from zero after the quantization. The word length W itself determines which coefficients are actually different from zero after the quantization.

In Fig. 7, the coefficients are plotted for N = 8 b, W = 8 b, and σph = 60. As the negative exponential converges toward zero for increasing gray value differences, there are only a limited number of quantized coefficients that are different from zero. Considering the example in Fig. 7, there are only 188 coefficients to be stored. For simplification of the internal control, the number of coefficients is extended to the next power of 2, resulting in the highest address $2^P - 1$. In the example, the highest address is 255. The coefficients are stored in the LUT of each pipeline in the initialization phase of the filtering.

Fig. 8. Abstract illustration of the geometric filter component.

If N is greater than P, via logical disjunction of the leftmost (N − P) bits, it is checked whether the gray value difference is greater than the chosen limit $2^P - 1$. The result of the disjunction selects the coefficient address. If the gray value difference is greater than the limit, the weight coefficient is set to zero, which is stored at the address $2^P - 1$. In the opposite case, the corresponding coefficient is read out of the LUT. This coefficient may also be zero as the coefficient addresses are extended up to $2^P - 1$. During the readout of the coefficient, the related gray value is registered for synchronicity. At the next internal clock event, the gray values of each group are multiplied by the corresponding coefficients while registering the coefficients in “coeff. group 0 ... 5” for the final normalization.
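The address selection described above can be illustrated with a small software model. It is a sketch, not the VHDL of the design: the function name `select_lut_address` and the example values N = 12 and P = 8 are chosen here for illustration only.

```python
def select_lut_address(diff, n_bits, p_bits):
    """Return the LUT address for an absolute gray value difference.

    The leftmost (N - P) bits are OR-ed together. If any of them is set,
    the difference exceeds 2^P - 1 and the address saturates at 2^P - 1,
    where a zero coefficient is stored.
    """
    limit = (1 << p_bits) - 1
    if n_bits > p_bits:
        overflow = (diff >> p_bits) != 0     # disjunction of the upper bits
        return limit if overflow else diff
    return diff                              # N <= P: every difference fits


# Hypothetical example with 12-bit input data and a 256-entry LUT.
small = select_lut_address(100, n_bits=12, p_bits=8)    # read out directly
large = select_lut_address(300, n_bits=12, p_bits=8)    # saturated address
```

In hardware, the comparison reduces to an OR gate over the upper bits that drives the select input of an address multiplexer, so no magnitude comparator is needed.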

The pixel in the center of the filter window does not belong to any group and is processed separately. This pixel is multiplied by the highest coefficient $2^W - 1$ and delayed by registers “photo_k middle” and “geom_in middle” for synchronicity.

C. Geometric Component

For the design of the geometric filter component, advantage is taken of its separability and its symmetry. Because of the separability, the geometric filter is split into the vertical and horizontal parts. Therefore, 2-D filtering is replaced by successive 1-D filtering in vertical and horizontal directions. This solution is preferred in the design of the geometric filter because 1-D filtering can be implemented more efficiently. Both parts are implemented twice to filter the weighted image data and the photometric weights simultaneously, which is shown in Fig. 8. The input of the vertical component parts is the 2-D array of the filter window and the 2-D array of the corresponding coefficients. Each output is a 1-D vector in which each entry represents one filtered and cumulated column. The coefficients of the geometric component are labeled “C_0, C_1, C_2.” The output of the geometric filter consists of the filtered unnormalized gray value (kernel result) and the normalization factor (norm result).

Fig. 9. Vertical part of the geometric filter component.

Due to the symmetry of the weight coefficients of the geometric component, the order of multiplication and addition is swapped in both filter parts. This fact plays an important role in pixel group formation. At first, the weighted gray values which are located at the same distance from the centered pixel in the filter window are summed up [35]. Because of the equal distance, these gray values should be weighted with the same coefficient anyway. For a 5 × 5 window, there are always 4 or 8 pixels at the same distance from the centered pixel. For the simplicity of the design, it makes sense to assemble the pixels into equally large groups. Smaller groups allow for better handling of the design. For this reason, the pixels are divided into groups of four with regard to the subsequent processing explained in the following sections. After the accumulation of the pixels according to their symmetry, the sum is multiplied by the corresponding coefficient. The horizontal processing is done in the same way.
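The effect of swapping addition and multiplication can be shown for one 5-tap column. The sketch below assumes that C_0 weights the center sample, C_1 the two direct neighbors, and C_2 the two outer samples; this mapping and the function names are assumptions made for illustration.

```python
def fir5_direct(x, c0, c1, c2):
    """Direct form: one multiplication per sample (five in total)."""
    return c2 * x[0] + c1 * x[1] + c0 * x[2] + c1 * x[3] + c2 * x[4]


def fir5_preadd(x, c0, c1, c2):
    """Symmetric form: equidistant samples are summed first, so only
    three multiplications remain."""
    return c0 * x[2] + c1 * (x[1] + x[3]) + c2 * (x[0] + x[4])


column = [12, 40, 200, 38, 15]               # arbitrary example data
direct = fir5_direct(column, 52, 31, 7)
preadd = fir5_preadd(column, 52, 31, 7)      # same value, fewer multipliers
```

Both forms return the same value; the pre-adder form is what saves multipliers when samples at equal distance share a coefficient.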

The coefficients for the geometric component are scaled in such a manner that the sum of the vertical coefficients (and the horizontal ones, respectively) is equivalent to the so-called normalized one [35]. For the signed coefficients with the word length W, the normalized one is equal to $2^{W-1}$. This means that the division of the weighted gray values and photometric coefficients after geometric filtering can be realized as a simple shift operation. In the last stage, the normalized filtered gray value has to be divided by the normalized product of the photometric coefficients. The geometric coefficients are calculated in advance and stored in a block RAM.
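A possible way to obtain such scaled coefficients is sketched below. The Gaussian shape of the geometric weights and the rule of assigning the rounding remainder to the center coefficient are assumptions made here; the paper only requires that the coefficients of one dimension sum to the normalized one $2^{W-1}$.

```python
import math


def scaled_geometric_coeffs(sigma_c=1.0, w_bits=8):
    """Return (c0, c1, c2) whose symmetric sum c0 + 2*c1 + 2*c2 equals
    the normalized one, 2^(W-1). The Gaussian weights are an assumption."""
    one = 1 << (w_bits - 1)
    raw = [math.exp(-(d * d) / (2.0 * sigma_c * sigma_c)) for d in (0, 1, 2)]
    total = raw[0] + 2.0 * raw[1] + 2.0 * raw[2]
    c1 = int(round(one * raw[1] / total))
    c2 = int(round(one * raw[2] / total))
    c0 = one - 2 * c1 - 2 * c2               # center absorbs rounding error
    return c0, c1, c2


def filter_column(column, coeffs, w_bits=8):
    """Symmetric 5-tap filtering followed by the division by 2^(W-1),
    realized as a right shift."""
    c0, c1, c2 = coeffs
    acc = c0 * column[2] + c1 * (column[1] + column[3]) + c2 * (column[0] + column[4])
    return acc >> (w_bits - 1)


coeffs = scaled_geometric_coeffs()
flat = filter_column([200, 200, 200, 200, 200], coeffs)   # constant input
```

Because the coefficients sum to exactly $2^{W-1}$, a constant input is reproduced unchanged after the shift, which is a quick sanity check for any coefficient set.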

1) Vertical Component Part: The first stage of the geometric component is the vertical part which is pictured in Fig. 9. With the aid of Fig. 6, it can be seen that the pixels of the first column numbered 1, 2, 3, 4, 5 and the first pixel of the middle column numbered 11 enter the vertical component part simultaneously. For the corresponding photometric coefficients, the same order of processing is valid.

The groups 0, 1, 2, 3, 4, which means all columns with the exception of the centered column, are processed as shown in the upper part of Fig. 9. The geometrically symmetrical pixels are cumulated at first and then multiplied by the geometric weight coefficient. All coefficients for the geometric filter are constant for the chosen filter window size. Due to the scaling of the geometric coefficients, it is assured that the accumulation does not result in a carry. The registers “REGcol 0,1,2” in this part of the design are used to delay weighted data to maintain synchronicity. After the multiplication, the weighted values are summed up by the adder tree to one value at each internal clock event.

Fig. 10. Horizontal part of the geometric filter component.

The processing of the centered column is detailed in the lower part of Fig. 9. The centered pixel is weighted and delayed by “REGcen” so that this pixel and the remaining pixels in the centered column can be fed to the input of the adder tree simultaneously. The remaining pixels enter the dedicated processing path one by one. They were multiplexed in the register matrix in the way that they can be combined pairwise and multiplied by the same coefficient in the geometric component. In order to weight the pixels in a proper way, every incoming pixel is stored in the register “REGcol mid” so that the subsequently calculated sum is valid every second internal clock event. The multiplexing of the filter coefficients with zeros assures that invalid sums vanish due to the multiplying by zero and do not falsify the result.

As shown in Fig. 8, the vertical part of the geometric filter for the weighting of the photometric coefficients is designed identically.

2) Horizontal Component Part: In Fig. 10, the horizontal part of the geometric component is displayed. After processing in the vertical dimension, the filter window is reduced to one row, and its elements are computed at one internal clock event each. In order to be able to reuse the symmetrical design, the values of the filtered columns 0, 1, 3, 4 are stored in the shift registers according to the order of their reception. The filtered photometrical coefficients are stored in the same way. Since the content of the shift register in the left part of Fig. 10 is valid at every fourth internal clock event, the time domain changes here to the domain of the pixel clock. This domain change is indicated by the dashed line in Fig. 10. All operations on the right-hand side of the dashed line are executed according to the pixel clock.

Fig. 11. Final normalization of the filtered data.

At every pixel clock signal, the valid column values are written to the registers which perform the division of the weighted gray values by the normalized one. The division is implemented through a shift operation. The remaining processing is similar to the processing described in the previous paragraph. The geometrically symmetrical pixels are cumulated at first and multiplied afterward by the geometric weight coefficient. For the geometric filtering in the horizontal direction, the same geometric coefficients are used as for the vertical filtering. The final division by the normalized one is performed in the next stage.

D. Normalization

At the final stage, the kernel result has to be normalized by the norm result as shown in Fig. 11. After the final accumulation of these values, they are both divided by the normalized one again. In this manner, the word lengths of the weighted gray values and of the norm are both (W − 1) bits shorter. Finally, after the division, N bits of the final result are forwarded to the output of the bilateral filter.

E. Design Scalability

In the previous paragraphs, we detailed the filter design for the 5 × 5 kernel. However, depending on the application, another kernel size might be required. For small images, a 3 × 3 window size is more suitable to prevent blurring. Some authors choose to work with a larger kernel of the size of 11 × 11 pixels [36]. Our design can be scaled for different kernel sizes. Starting at the register matrix, it has to be dimensioned according to the required kernel size. The kernel size in one dimension is denoted by K in the following:

$$N_{\mathrm{groups}} = K + 1 \tag{5}$$

where $N_{\mathrm{groups}}$ means the number of the pixel groups. The quantity of the line storages equals K. The number of required multiplexers equals $N_{\mathrm{groups}}$. The multiplexing pattern of the pixels remains unchanged for every kernel size. According to the symmetry of the kernel, the pixels have to be grouped into $N_{\mathrm{groups}}$ groups containing $n_{\mathrm{group\_member}}$ pixels each

$$n_{\mathrm{group\_member}} = K - 1. \tag{6}$$

The groups are always built up in the manner that each row except for the middle pixel forms a pixel group. The middle column represents the last pixel group in which particular attention has to be paid to the arrangement of the pixels in order to keep the weighting in the geometric component valid.

Furthermore, the number of pipelines, including combinatory blocks and coefficient LUTs in the photometric component, equals $N_{\mathrm{groups}}$. The design of the pipelines remains the same. The number of the pipelines in the vertical part of the geometric component changes according to the kernel size. For the structure in the upper part of Fig. 9, (K + 1)/2 pipelines are required because the geometrical symmetry of the pixels has to be taken into account. The lower part of the vertical geometric component remains unchanged except for the multiplexer which has $n_{\mathrm{group\_member}}$ inputs according to the required filter window size. The shift register of the horizontal part of the geometric component has to be dimensioned for (K − 1) values. The number of the connected pipelines has to be adjusted to the length of the shift register, taking the geometrical symmetry into account again. The processing of the centered column remains unchanged. The same holds for the normalization coefficients as well.

Finally, if the maximal operating frequency $f_{\mathrm{operating}}$ is known, the internal clock frequency $f_{\mathrm{internal}}$ can be determined as follows:

$$f_{\mathrm{internal}} = f_{\mathrm{operating}} \cdot n_{\mathrm{group\_member}}. \tag{7}$$

According to the internal clock frequency $f_{\mathrm{internal}}$, the counter has to be adjusted, which generates the “select” signal for the multiplexers and the enable signal “EnREG” for the horizontal part of the geometric component.
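The scaling rules of this section can be collected in a small helper. The sketch below is an illustration only: the function and key names are ours, and the relation between the two clocks is taken as internal clock = operating (pixel) clock × group size, which matches the 40 MHz and 160 MHz figures reported for the 5 × 5 kernel in Section VI.

```python
def scaling_parameters(kernel_size, f_operating_mhz):
    """Derive the structural parameters for a K x K kernel (K odd, K >= 3)."""
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel size must be odd and at least 3")
    n_groups = kernel_size + 1                    # (5)
    n_group_member = kernel_size - 1              # (6)
    return {
        "n_groups": n_groups,                     # also multiplexers, photometric pipelines
        "n_group_member": n_group_member,
        "line_storages": kernel_size,
        "vertical_pipelines": (kernel_size + 1) // 2,
        "shift_register_length": kernel_size - 1,
        "f_internal_mhz": f_operating_mhz * n_group_member,   # (7)
    }


params_5x5 = scaling_parameters(5, 40)    # 6 groups of 4 pixels, 160 MHz internal
params_3x3 = scaling_parameters(3, 110)   # 4 groups of 2 pixels, 220 MHz internal
```

As a consistency check, the groups together with the center pixel always cover the whole window: (K + 1)(K − 1) + 1 = K².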

V. IMAGE QUALITY ASSESSMENT

To evaluate the performance of the noise reduction and the accuracy of the detail preservation, criteria for the image quality assessment are required. The criteria chosen in this work are PSNRdB and MSSIM.

1) PSNRdB: The well-known peak-signal-to-noise ratio PSNRdB in decibels is defined as follows:

$$\mathrm{PSNR}_{\mathrm{dB}} = 20 \cdot \log_{10}\left(\frac{GV_{\max}}{\sqrt{\mathrm{MSE}}}\right) \tag{8}$$

$$\mathrm{MSE} = \frac{1}{MN} \sum_{M} \sum_{N} \left[\phi_{\mathrm{ref}}(m) - \tilde{\phi}(m)\right]^{2} \tag{9}$$

where MSE denotes the mean squared error between the image to be compared and the reference image. GVmax represents the maximum gray value depending on the word length after the digitization of the images. The noiseless M × N image with gray values φref(m) provides the reference for the measurement of the MSE. The gray values φ̃(m) originate from the image to be compared. Considering the quality of the noise filter, PSNRdB describes the capability of the filter to suppress noise regardless of the perceived visual quality of the filtered image.


2) MSSIM: The mean structural similarity index MSSIM is a method for the assessment of the image quality that takes advantage of the characteristics of the human visual system [37]. First, the local structural similarity SSIM of the 11 × 11 image blocks v(φref) and v(φ̃) is calculated

$$\mathrm{SSIM}\big(v(\phi_{\mathrm{ref}}), v(\tilde{\phi})\big) = l\big(v(\phi_{\mathrm{ref}}), v(\tilde{\phi})\big) \cdot c\big(v(\phi_{\mathrm{ref}}), v(\tilde{\phi})\big) \cdot s\big(v(\phi_{\mathrm{ref}}), v(\tilde{\phi})\big) \tag{10}$$

where l(v(φref), v(φ̃)) is the luminance comparison function, c(v(φref), v(φ̃)) compares the contrast of the image blocks after luminance subtraction, and s(v(φref), v(φ̃)) conducts the structure comparison after contrast normalization. After averaging the SSIM of J blocks over the whole image, the mean value MSSIM

$$\mathrm{MSSIM}(\Phi_{\mathrm{ref}}, \tilde{\Phi}) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{SSIM}\big(v_j(\phi_{\mathrm{ref}}), v_j(\tilde{\phi})\big) \tag{11}$$

of an entire image represented by Φ̃ is identified. The value MSSIM = 1 means that two images are completely identical. The smaller the MSSIM, the less the structural similarity that the two images show. The detailed description of MSSIM can be found in [37].
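The PSNRdB criterion of (8) and (9) is simple enough to state as a short reference computation. The sketch below uses plain nested lists as images and hypothetical function names; it is meant as an illustration of the definitions, not as the evaluation code used for the results.

```python
import math


def mse(reference, test):
    """Mean squared error between two equally sized gray value images."""
    rows, cols = len(reference), len(reference[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            diff = reference[r][c] - test[r][c]
            total += diff * diff
    return total / (rows * cols)


def psnr_db(reference, test, gv_max=255):
    """Peak signal-to-noise ratio in decibels, following (8)."""
    error = mse(reference, test)
    if error == 0:
        return math.inf                      # identical images
    return 20.0 * math.log10(gv_max / math.sqrt(error))
```

For 8-bit images, `gv_max=255` applies; for other word lengths, the maximum gray value has to be passed explicitly.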

VI. RESULTS

After an implementation in Matlab, the proposed architecture of the bilateral filter was implemented in VHDL and simulated with ModelSim. A test image was filtered by the Matlab implementation as well as the ModelSim simulation, and the filtered images were compared. The purpose of this comparison is to analyze the image quality drop due to the quantization of the filter coefficients in our FPGA design.

The test image Lighthouse shown in Fig. 12(a) is an 8-b grayscale image with a size of 512 × 512 pixels. Hence, in the following, GVmax = 255 is used.

In order to apply the bilateral filter to a color image, the color data have to be transformed into the CIELab color space [1]. The structure of the filter remains unchanged. However, processing of color images is beyond our research interest, so no results on this topic will be reported.

A. Performance Analysis

For the comparison of the filtering capability between the Matlab implementation and the ModelSim simulation, Gaussian noise with standard deviation σnoise = [10, 20, 30, 40, 50, 60] was added to the test image.

In Fig. 12, the test image is contrasted with its noisy counterpart with σnoise = 20 and two filtered images. The filter parameters σph = 3 · σnoise and σc = 1 were chosen for the photometric and geometric components, respectively. For filtering in Matlab, no quantization of the filter coefficients was applied. The corresponding filtered image is shown in Fig. 12(c). For the simulation with ModelSim, the coefficient word length W = 8 was used. The simulation result is shown in Fig. 12(d). Between the Matlab implementation and the ModelSim simulation, no visually distinguishable difference can be registered.

Fig. 12. (a) Original image. (b) Noisy image with σnoise = 20. (c) Filtering in Matlab. (d) Filtering in ModelSim.

Fig. 13. Performance comparison of the Matlab implementation and the ModelSim simulation.

The results of the quantitative comparison between the Matlab implementation and the ModelSim simulation are contrasted in Fig. 13 and summarized in Table I. As our recent research shows, by adjusting σph as a multiple of the measured standard deviation of noise rather than by a single constant, even better PSNRdB can be achieved. Thus, an optimal setting for the filter can be chosen which reduces noise and prevents blurring at the same time as far as possible. Exceeding this point causes oversmoothing, and choosing the adjusting parameter below this point leads to insufficient noise suppression. The discussion of this topic is important but beyond the scope of this paper. For more details, refer to [38].

Fig. 13 reveals that, for increasing noise levels, PSNRdB and MSSIM both increase after noise filtering. For higher standard deviation of noise, the gain is higher. Using our setting σph = 3 · σnoise, averaging with higher weights is performed for increasing noise levels. Owing to this fact, PSNRdB rises by a higher amount. MSSIM also increases because the geometrical component remains narrow, preventing oversmoothing.

TABLE I. FILTERING RESULTS

TABLE II. SYNTHESIS RESULT

The numbers in Table I show that applying the presented filter architecture delivers results almost as good as those of the Matlab implementation. The slight decrease of the image quality due to filtering by ModelSim simulation is explained by coefficient quantization and by rounding of the internal values during the shift operations. No artifacts caused by quantization are introduced into the filtered image. In summary, the simulation results are highly satisfying.

B. Verification

For verification, a Virtex-5 FPGA platform equipped with a Virtex XC5VLX50-1 device was used. The shortened synthesis report of the filter design is shown in Table II. A long-term trial proved that the design is suitable for real-time processing. The FPGA board was connected to a camera with a 12-b resolution depth, generating 30 fps at a full resolution of 1024 × 1024 pixels.

Due to the technical specification of the camera, pauses between the frames are necessary so that 30 fps is the maximally achievable frame rate. Thus, the maximal data flow reaches approximately 31.5 Mpixel/s. Consequently, we restricted the clock frequency of our design to 40 MHz in this application. The internal clock frequency is 160 MHz. With this clock rate, a maximal throughput of 38 fps is possible.

With a different camera, an even higher frame rate is achievable. Using our FPGA platform, the maximal possible internal frequency shown in Table II is 220 MHz. Hence, the maximal operating frequency of our filter design with the contemplated FPGA Virtex-5 equals 55 MHz. Considering the image resolution of 1024 × 1024 pixels, the following frame rate can be computed:

$$\left[(1024 \times 1024)\,\frac{\text{pixels}}{\text{frame}} \cdot 18.18\,\frac{\text{ns}}{\text{pixel}}\right]^{-1} = 52.45\,\frac{\text{frames}}{\text{second}}. \tag{12}$$

This calculation is valid only for a throughput of 1 pixel/cycle which is given by our design.
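The arithmetic behind (12) can be reproduced with a few lines. The helper below is a sketch with a hypothetical name; it assumes a throughput of 1 pixel per operating clock cycle and an operating clock equal to the internal clock divided by the group size, as stated in the text.

```python
def frame_rate_fps(f_internal_mhz, n_group_member, width, height):
    """Frames per second at 1 pixel per operating clock cycle."""
    f_operating_hz = f_internal_mhz * 1e6 / n_group_member
    return f_operating_hz / (width * height)


max_rate = frame_rate_fps(220, 4, 1024, 1024)      # 55 MHz operating clock
camera_rate = frame_rate_fps(160, 4, 1024, 1024)   # 40 MHz operating clock
print(round(max_rate, 2))     # 52.45
print(int(camera_rate))       # 38
```

The two printed values correspond to the 52.45 frames/s of (12) and the 38 fps quoted for the 40 MHz camera setup.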


TABLE III. CITED FPGA IMPLEMENTATIONS OF THE BILATERAL FILTER

The total delay of the output pixels of our architecture with a kernel size of 5 × 5 pixels applied to an image of 512 × 512 pixels is 2560 + 36 cycles. The time required for filling up the register matrix, depending on the kernel size and image width, results in a delay of 5 × 512 = 2560 cycles. The processing time from the multiplexers in the register matrix to the output of the normalization stage is constant and does not depend on the kernel size. The critical operations are performed at internal clock frequency. If the kernel size is changed, the pixel groups have to be reordered, and the internal clock has to be adjusted according to (7). In this case, the processing time still accounts for 36 cycles. The normalization by division costs 24 cycles, which makes up about 66% of the whole processing time.
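The delay figures above follow a simple rule that can be written down directly. In the sketch below, the constant of 36 processing cycles is taken from the text, while the function and constant names are ours.

```python
PROCESSING_CYCLES = 36        # constant pipeline delay reported in the text
NORMALIZATION_CYCLES = 24     # part of the 36 cycles spent on the division


def total_delay_cycles(kernel_size, image_width, processing_cycles=PROCESSING_CYCLES):
    """Fill-up time of the register matrix plus the constant processing delay."""
    fill_up = kernel_size * image_width
    return fill_up + processing_cycles


delay_5x5 = total_delay_cycles(5, 512)       # 2560 + 36 = 2596 cycles
normalization_share = NORMALIZATION_CYCLES / PROCESSING_CYCLES   # two thirds
```

Only the fill-up term grows with the kernel size and the image width; the processing term stays fixed.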

For the evaluation of the performance of the filter design, a comparison with other implementations from the references is given in Table III. Except for the authors of [32], all other authors implement the original bilateral filter from [1]. From [32], the full parallel architecture is used for the comparison in Table III. All filters are implemented on different FPGAs of different families and generations, which makes the comparison less significant, but still, itemizing some features like the maximum clock frequency of the design or the resource demand might give a good insight.

Our design works at the highest clock frequency. However, considering the kernel size of 5 × 5 pixels and the switching of the time domain, our architecture presents only the third highest frame rate. The picture changes if we implement a 3 × 3 filter kernel. In this case, the operating frequency is 110 MHz, and the resulting frame rate doubles, which puts the performance of our design in second place.

Regarding the resource demand, it should be clear that the logic elements of Altera and the logic slices of Xilinx are built differently. The values in Table III give merely a hint at the FPGA area used by each design. On the other hand, the number of required multipliers can be compared directly. In [30], the number of the multipliers is not available. According to the statement of the authors of [33], an efficient parallel implementation of a bilateral filter for a 5 × 5 mask requires 25 multipliers. We have shown that our design concept is efficient and it requires only 23 multipliers. Therefore, considering the implemented window size of 5 × 5 pixels, we use the resources more economically.

VII. CONCLUSION

In this paper, we have given a detailed description of an FPGA design of the bilateral filter for real-time image processing. The advantages of our design can be summarized in the following points.

1) The filter design for a kernel size of 5 × 5 shown here utilizes the FPGA resources economically, which makes it feasible to implement the filter on a common medium-sized FPGA.

2) The introduced register matrix at the first stage of the filter makes external image storage redundant, contributing to the decrease of the resource demand of the filter implementation.

3) The shown architecture is synchronous and capable of real-time processing supporting high clock frequencies. Maximal operating frequency depends on the chosen FPGA family.

4) Conceiving our filter architecture, we kept in mind the scalability of the design in order to enable the implementation of arbitrary filter window size with low effort.

5) The shown filter architecture assures a constant processing delay independent of the filter window size. The total delay is the sum of the processing delay and the fill-up time of the line storages which depends on the kernel size and image width.

6) Image quality assessment in terms of PSNRdB and structural similarity assured that the image quality loss due to coefficient quantization and due to rounding of the internal results is negligible.

REFERENCES

[1] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proc. IEEE ICCV, 1998, pp. 839–846.

[2] B. Zhang and J. P. Allebach, “Adaptive bilateral filter for sharpness enhancement and noise removal,” IEEE Trans. Image Process., vol. 17, no. 5, pp. 664–678, May 2008.

[3] B. Yan and A.-D. Saleh, “Structure enhancing bilateral filtering of images,” in Proc. IEEE PCSPA, 2010, pp. 614–617.

[4] M. de-Frutos-López, H. Medina-Chanca, S. Sanz-Rodríguez, C. Peláez-Moreno, and F. Díaz-de-María, “Perceptually-aware bilateral filter for quality improvement in low bit rate video coding,” in Proc. IEEE PCS, 2012, pp. 477–480.

[5] J. Won Lee, R.-H. Park, and S. Chang, “Noise reduction and adaptive contrast enhancement for local tone mapping,” IEEE Trans. Consum. Electron., vol. 58, no. 2, pp. 578–586, May 2012.

[6] J. Giraldo, Z. Kelm, L. Yu, J. Fletcher, B. Erickson, and C. McCollough, “Comparative study of two image space noise reduction methods for computed tomography: Bilateral filter and nonlocal means,” in Proc. Conf. IEEE EMBS, 2009, pp. 3529–3532.

[7] L. Yu, A. Manduca, J. Trzasko, N. Khaylova, J. Kofler, C. McCollough, and J. Fletcher, “Sinogram smoothing with bilateral filtering for low-dose CT,” in Proc. SPIE Med. Imag.: Phys. Med. Imag., 2008, vol. 6913, pp. 691329-1–691329-8.

[8] A. Gabiger, R. Weigel, S. Oeckl, and P. Schmitt, “Enhancement of CT image quality via bilateral filtering of projections,” in Proc. 1st Int. Conf. Image Formation X-ray Comput. Tomography, 2010, pp. 140–143.

[9] A. Gabiger-Rose, R. Rose, M. Kube, P. Schmitt, and R. Weigel, “Noise adaptive bilateral filtering of projections for computed tomography,” in Proc. 11th Int. Meet. Fully Three-Dimens. Image Reconstruction Radiol. Nucl. Med., 2011, pp. 306–309.

[10] A. Gabiger, M. Kube, and R. Weigel, “A synchronous FPGA design of a bilateral filter for image processing,” in Proc. IEEE IECON, 2009, pp. 1990–1995.

[11] T. Riesgo, Y. Torroja, and E. de la Torre, “Design methodologies based on hardware description languages,” IEEE Trans. Ind. Electron., vol. 46, no. 1, pp. 3–12, Feb. 1999.


[12] T. Q. Pham and L. J. van Vliet, “Separable bilateral filtering for fast video preprocessing,” in Proc. IEEE ICME, 2005, pp. 1–4.

[13] F. Durand and J. Dorsey, “Fast bilateral filtering for the display of high-dynamic-range images,” ACM Trans. Graph., vol. 21, no. 3, pp. 257–266, Jul. 2002.

[14] S. Paris and F. Durand, “A fast approximation of the bilateral filter using a signal processing approach,” in Proc. ECCV, 2006, pp. 568–580.

[15] J. Chen, S. Paris, and F. Durand, “Real-time edge-aware image processing with the bilateral grid,” ACM Trans. Graph., vol. 26, no. 3, pp. 1–9, Jul. 2007.

[16] Q. Yang, K.-H. Tan, and N. Ahuja, “Real-time O(1) bilateral filtering,” in Proc. IEEE CVPR, 2009, pp. 557–564.

[17] M. M. Bronstein, “Lazy sliding window implementation of the bilateral filter on parallel architectures,” IEEE Trans. Image Process., vol. 20, no. 6, pp. 1751–1756, Jun. 2011.

[18] B. Weiss, “Fast median and bilateral filtering,” ACM Trans. Graph., vol. 25, no. 3, pp. 519–526, Jul. 2006.

[19] F. Porikli, “Constant time O(1) bilateral filtering,” in Proc. IEEE CVPR, 2008, pp. 1–8.

[20] Y.-C. Tseng, P.-H. Hsu, and T.-S. Chang, “A 124 Mpixels/sec VLSI design for histogram-based joint bilateral filtering,” IEEE Trans. Image Process., vol. 20, no. 11, pp. 3231–3241, Nov. 2011.

[21] F. Hannig, M. Schmid, J. Teich, and H. Hornegger, “A deeply pipelined and parallel architecture for denoising medical images,” in Proc. IEEE FPT, 2010, pp. 485–490.

[22] L. Costas, P. Colodrón, J. J. Rodríguez-Andina, J. Fariña, and M.-Y. Chow, “Analysis of two FPGA design methodologies applied to an image processing system,” in Proc. IEEE ISIE, 2010, pp. 3040–3044.

[23] N. Sudha and A. R. Mohan, “Hardware-efficient image-based robotic path planning in a dynamic environment and its FPGA implementation,” IEEE Trans. Ind. Electron., vol. 58, no. 5, pp. 1907–1920, May 2011.

[24] R. Marin, G. León, R. Wirz, J. Sales, J. M. Claver, P. J. Sanz, and J. Fernández, “Remote programming of network robots within the UJI industrial robotics telelaboratory: FPGA vision and SNRP network protocol,” IEEE Trans. Ind. Electron., vol. 56, no. 12, pp. 4806–4816, Dec. 2009.

[25] E. Monmasson and M. N. Cirstea, “FPGA design methodology for industrial control systems - A review,” IEEE Trans. Ind. Electron., vol. 54, no. 4, pp. 1824–1842, Aug. 2007.

[26] J. J. Rodriguez-Andina, M. J. Moure, and M. D. Valdes, “Features, design tools, and application domains of FPGAs,” IEEE Trans. Ind. Electron., vol. 54, no. 4, pp. 1810–1823, Aug. 2007.

[27] H. Zhuang, K.-S. Low, and W.-Y. Yau, “Multichannel pulse-coupled neural-network-based color image segmentation for object detection,” IEEE Trans. Ind. Electron., vol. 59, no. 8, pp. 3299–3308, Aug. 2012.

[28] S. Jin, D. Kim, T. T. Nguyen, D. Kim, M. Kim, and J. W. Jeon, “Design and implementation of a pipelined datapath for high-speed face detection using FPGA,” IEEE Trans. Ind. Informat., vol. 8, no. 1, pp. 158–167, Feb. 2012.

[29] Y. Chen and V. Dinavahi, “Digital hardware emulation of universal machine and universal line models for real-time electromagnetic transient simulation,” IEEE Trans. Ind. Electron., vol. 59, no. 2, pp. 1300–1309, Feb. 2012.

[30] C. Charoensak and F. Sattar, “FPGA design of a real-time implementation of dynamic range compression for improving television picture,” in Proc. IEEE ICICS, 2007, pp. 1–5.

[31] A. Rosado-Muñoz, M. Bataller-Mompeán, E. Soria-Olivas, C. Scarante, and J. F. Guerrero-Martínez, “FPGA implementation of an adaptive filter robust to impulsive noise: Two approaches,” IEEE Trans. Ind. Electron., vol. 58, no. 3, pp. 860–870, Mar. 2011.

[32] T. Q. Vinh, J. H. Park, Y.-C. Kim, and S. H. Hong, “FPGA implementation of real-time edge-preserving filter for video noise reduction,” in Proc. IEEE ICCEE, 2008, pp. 611–614.

[33] H. Dutta, F. Hannig, J. Teich, B. Heigl, and H. Hornegger, “A design methodology for hardware acceleration of adaptive filter algorithms in image processing,” in Proc. IEEE ASAP, 2006, pp. 331–340.

[34] R. Chen, L. Chen, and L. Chen, “System design consideration for digital wheelchair controller,” IEEE Trans. Ind. Electron., vol. 47, no. 4, pp. 898–907, Aug. 2000.

[35] R. Turney, “Two-dimensional linear filtering,” in Application Note: Xilinx FPGAs, 2007, pp. 1–8.

[36] M. Zhang and B. K. Gunturk, “Multiresolution bilateral filter for image denoising,” IEEE Trans. Image Process., vol. 17, no. 12, pp. 2324–2333, Dec. 2008.

[37] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.

[38] A. Gabiger-Rose, M. Kube, P. Schmitt, R. Weigel, and R. Rose, “Image denoising using bilateral filter with noise-adaptive parameter tuning,” in Proc. IEEE IECON, 2011, pp. 4515–4520.

Anna Gabiger-Rose (S’09) was born in Ordshonikidse, Ukraine, in 1978. She received the Dipl.-Ing. degree in electrical engineering, electronics, and information technology from the Friedrich-Alexander University of Erlangen-Nuremberg, Erlangen, Germany, in 2007.

From 2001 to 2007, she was a Student Assistant with the Department of Contactless Test and Measuring Systems, Fraunhofer Institute for Integrated Circuits, Erlangen. She is currently a Research Assistant with the Institute for Electronics Engineering, University of Erlangen-Nuremberg. Her research interests include the design of embedded systems for image processing and the investigation of digital filtering techniques for image quality enhancement.

Mrs. Gabiger-Rose is a member of the IEEE Industrial Electronics Society. She served as a reviewer for the 35th Annual Conference of the IEEE Industrial Electronics Society (IECON09).

Matthias Kube was born in Mainz, Germany, in 1975. He received the Dipl.-Ing. FH (M.Sc.) degree in electrical engineering and microelectronics from the Georg-Simon-Ohm University of Applied Science of Nuremberg, Nuremberg, Germany, in 2002.

Since 2003, he has been working as a member of the research staff at the Department of Contactless Test and Measuring Systems, Fraunhofer Institute for Integrated Circuits, Erlangen, Germany. He has the technical leadership for the development of an innovative indirect converting X-ray detector with conventional optical sensors for scientific and industrial applications of non-destructive testing (NDT), which is optimized for tasks that require a high dynamic range, a high speed, and a long life cycle. His interests in research include optical sensors and cameras, field-programmable-gate-array design, embedded systems for image processing, and X-ray imaging for NDT.

Robert Weigel (S’88–M’89–SM’95–F’02) was born in Ebermannstadt, Germany, in 1956. He received the Dr.-Ing. and Dr.-Ing. habil. degrees in electrical engineering and computer science from the Munich University of Technology, Munich, Germany, in 1989 and 1992, respectively.

He was a Research Engineer from 1982 to 1988, a Senior Research Engineer from 1988 to 1994, and a Professor for RF Circuits and Systems from 1994 to 1996 with the Munich University of Technology. From 1996 to 2002, he was the Director of the Institute for Communications and Information Engineering, University of Linz, Linz, Austria. Since 2002, he has been the Head of the Institute for Electronics Engineering, University of Erlangen-Nuremberg, Erlangen, Germany.

Dr. Weigel was the recipient of the IEEE Microwave Applications Award in 2007. Within IEEE Microwave Theory and Techniques Society (MTT-S), he has been the Founder and Chair of the Austrian Communications/Microwave Theory and Techniques Society Joint Chapter and Region 8 Coordinator. He is the Chair of MTT-2 Microwave Acoustics and the MTT-S President-Elect in 2013.

Richard Rose (S’09) was born in Nuremberg, Germany, in 1981. He received the Dipl.-Ing. degree in electrical engineering, electronics, and information technology from the Friedrich-Alexander University of Erlangen-Nuremberg, Erlangen, Germany, in 2007.

In 2008, he joined the Institute for Electronics Engineering, University of Erlangen-Nuremberg, as a Research Assistant, and since 2010, he has been the Team Leader of the System Engineering group. His research interests include digital signal processing, receiver design, antenna design, localization techniques, and wireless communication systems.

Mr. Rose is a member of the IEEE Microwave Theory and Techniques Society, the IEEE Signal Processing Society, the IEEE Antennas and Propagation Society, and the IEEE Communications Society. He served as a reviewer for the journal of Mathematical Problems in Engineering and the International Journal of Electronics and Communications.