Resource-Efficient FPGA Architecture and Implementation of Hough Transform



  • IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 1419

    Resource-Efficient FPGA Architecture and Implementation of Hough Transform

    Zhong-Ho Chen, Alvin W. Y. Su, and Ming-Ting Sun, Fellow, IEEE

    Abstract: Hough transform is widely used for detecting straight lines in an image, but it involves huge computations. For embedded applications, field-programmable gate arrays (FPGAs) are among the most used hardware accelerators for achieving real-time implementation of Hough transform. In this paper, we present a resource-efficient architecture and implementation of Hough transform on an FPGA. The incrementing property of Hough transform is described and used to reduce the resource requirement. In order to facilitate parallelism, we divide the image into blocks and apply the incrementing property to pixels within a block and between blocks. Moreover, the locality of Hough transform is analyzed to reduce the memory access. The proposed architecture is implemented on an Altera EP2S180F1508C3 device and can operate at a maximum frequency of 200 MHz. It could compute the Hough transform of 512 × 512 test images with 180 orientations in 2.07–3.16 ms without using many FPGA resources (i.e., one could achieve the performance by adopting a low-cost low-end FPGA).

    Index Terms: FPGA, Hough transform, real-time.

    I. INTRODUCTION

    Hough transform [1] is a popular technique for detecting straight lines in images. In actual applications, through preprocessing and thresholding, the images are converted into binary feature images. The pixels with the pixel value 1 are called feature points. A line is one that passes through many feature points. Imagine that we draw lines with various angles passing through a feature point: each line can be represented as a point in the $(\rho, \theta)$ space, where $\rho$ is the perpendicular distance of the line to the origin and $\theta$ is the angle between a normal to the line and the positive $x$-axis. Each time a line is drawn for a feature point, it produces a $(\rho, \theta)$ value which can be considered as a vote for the specific $(\rho, \theta)$ value. After processing all of the feature points, the $(\rho, \theta)$ value that has the largest accumulated votes will correspond to the line that passes through the largest number of feature points. In the implementation, the votes for a specific $(\rho, \theta)$ value can be stored in a memory addressed by the specific $(\rho, \theta)$ value. The Hough transform is robust and performs well even in the presence of noise or missing data, but it also involves huge computations and excessive memory requirements.

    Manuscript received December 18, 2010; revised April 07, 2011; accepted June 01, 2011. Date of publication July 25, 2011; date of current version June 14, 2012. This work was supported in part by the National Science Council, Taiwan, under Grant NSC 98-2221-E-006-158-MY3.

    Z.-H. Chen and A. W. Y. Su are with the SCREAM Lab, Department of Computer Science and Information Engineering, National Cheng-Kung University, 701 Tainan, Taiwan. (e-mail: [email protected]; [email protected]).

    M.-T. Sun is with the Electrical Engineering Department, University of Washington, Seattle, WA 91895 USA (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TVLSI.2011.2160002

    Through Hough transform, the $\rho$ for a line with an angle $\theta$ passing through a feature point at the image coordinate $(x, y)$ can be calculated by

    $$\rho = x\cos\theta + y\sin\theta \tag{1}$$

    Practical implementations of Hough transform generally involve a voting procedure over the discrete parameter space. Algorithm 1 below shows the voting process of Hough transform. $K$ is the number of angles, and the rounding operation is applied to the results of (1) to get integer values of $\rho$.

    Algorithm 1. Voting Process of Hough Transform

    Initialize Votes as zeros
    for all feature points (x, y)
        for θ in [0°, 180°), using 180°/K as the step-size
            ρ = round(x cos θ + y sin θ)
            Votes(ρ, θ) = Votes(ρ, θ) + 1
        end for
    end for
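As an illustration of Algorithm 1, the following is a minimal software sketch of the voting loop, not the hardware design of this paper. The function name `hough_votes`, the index shift `offset` (used so that negative $\rho$ values map to valid list indices), and the small test image are assumptions made for the example. Python's built-in `round` is used for the rounding step.

```python
import math

def hough_votes(feature_points, num_angles, width, height):
    """Accumulate Hough votes for feature points given as (x, y) tuples."""
    # Largest possible |rho| is the image diagonal; shift indices by this
    # amount so that negative rho values remain valid list indices.
    rho_max = math.ceil(math.hypot(width - 1, height - 1))
    offset = rho_max
    votes = [[0] * (2 * rho_max + 1) for _ in range(num_angles)]
    step = math.pi / num_angles  # 180 degrees / K, in radians
    for x, y in feature_points:
        for k in range(num_angles):
            theta = k * step
            rho = round(x * math.cos(theta) + y * math.sin(theta))  # eq. (1)
            votes[k][rho + offset] += 1
    return votes, offset

# Eight feature points on the vertical line x = 3 of an 8 x 8 image.
points = [(3, y) for y in range(8)]
votes, offset = hough_votes(points, 180, 8, 8)
```

For this input, all eight points vote for $\rho = 3$ at $\theta = 0^\circ$, so `votes[0][3 + offset]` holds the peak.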

    After the voting process, the $(\rho, \theta)$ with local-maximum values of votes are considered as candidate lines. In this paper, we only focus on the voting process of the Hough transform. Given a CIF (352 × 288) video with 30 frames/second (fps) and 10% feature points, it needs 109 M multiplications per second to compute the $\rho$ values of 180 angles. For embedded applications, it requires hardware accelerators to achieve real-time Hough transform. Compared with application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) usually target smaller markets and require much less development time. In an FPGA, high throughput is often achieved by exploiting the parallelism of the design rather than by operating the chip at a very high clock frequency. In addition, a better architecture should have more efficient utilization of the function blocks in the FPGA. In this paper, we propose an architecture and the implementation of Hough transform on an FPGA by exploiting both angle-level and pixel-level parallelism. The goal is to achieve the highest throughput with the minimum hardware resource.
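The 109 M figure can be checked with a short calculation. The sketch below assumes two multiplications per $\rho$ value (one for $x\cos\theta$ and one for $y\sin\theta$, as in (1)); the variable names are chosen for illustration only.

```python
# Multiplications per second for a direct evaluation of eq. (1) on CIF video.
width, height = 352, 288      # CIF frame size
fps = 30                      # frames per second
feature_ratio = 0.10          # 10% of the pixels are feature points
num_angles = 180
mults_per_rho = 2             # assumed: x*cos(theta) and y*sin(theta)

feature_points = width * height * feature_ratio
mults_per_second = feature_points * num_angles * mults_per_rho * fps
print(f"{mults_per_second / 1e6:.1f} M multiplications per second")
```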

    1063-8210/$26.00 © 2011 IEEE


    The remainder of this paper is organized as follows. Section II provides a brief review of related works on the implementation of Hough transform. Section III describes our observations, which lead to an efficient architecture for implementing Hough transform. In Section IV, we describe the proposed Hough transform architecture and FPGA implementation. Section V evaluates the proposed architecture. Finally, a conclusion is given in Section VI.

    II. RELATED WORKS

    Because the general Hough transform is very computationally intensive, other line-detection schemes have been proposed, such as gradient-based Hough transform [2] and kernel-based transform [3]. These schemes require fewer computations than Hough transform, but they still require a high-end CPU, which is often unavailable in practical applications, or special hardware devices to achieve real-time performance. The gradient-based Hough transform has been implemented in special hardware [4], but kernel-based Hough transform, which adopts a linked-list data structure, is difficult to implement efficiently in hardware. There has been some research on implementing the Hough transform on special hardware, such as graphic processors [5], scan line array processors [6], and pyramid multiprocessors [7]. However, these devices are unsuitable for low-cost embedded systems. This paper focuses on the implementation of the general Hough transform using an FPGA.

    One straightforward method to implement Hough transform is using multipliers [8]. However, multipliers are less available on low-end FPGAs. Hence, some researchers implement Hough transform using a coordinate rotation digital computer (CORDIC) [9]–[13] or a simplified CORDIC algorithm such as the multisector algorithm [14]. CORDIC is an arithmetic technique developed by Volder [15] to solve trigonometric problems by rotating a vector in small angles until the desired angle is achieved. The CORDIC algorithms for FPGA are surveyed in [16]. CORDIC could implement the Hough transform using only shifters and adders rather than multipliers. The major disadvantage is that it requires multiple iterations to obtain one $\rho$ in the parameter space. Hence, pipelined CORDIC implementations are proposed to improve the throughput; however, the required resources are also increased. Another drawback of CORDIC is that the result produced is not the correct $\rho$ value but the correct value scaled by a constant gain. The gain can be eliminated by applying an inverse gain to the initialization vector. However, this also requires additional resources. In [17], an FPGA platform is proposed to implement Hough transform by using hybrid-log arithmetic. In [18], a distributed arithmetic (DA) architecture is proposed to implement Hough transform with shift-add operations. Unfortunately, it also requires multiple iterations to obtain one $\rho$ in the parameter space.

    An accumulator-based architecture is proposed in [19]. An angle-level parallelism is applied in this architecture. It could obtain a point in the parameter space by a single accumulation. However, all of the pixels in the binary feature image need to go through the computation pixel by pixel, which limits its throughput. Additive Hough transform [20] utilizes the properties of Hough transform. Although the pixel-level parallelism is facilitated in this architecture, the memory requirement is also increased in proportion to the parallelism. Another drawback is that it requires additional cycles to get the entire parameter space.

    Incremental Hough transforms [21]–[24] are modified Hough transforms for hardware implementation. They reuse previously computed values to derive another point in the parameter space.

    Hough transform is not only computation-demanding but also memory-demanding. In [25], a line-based implementation is proposed to reduce the bandwidth requirement on SIMD architecture. In [26], the authors propose a memory-efficient implementation of Hough transform by using circular buffers of DSP processors. In [27], the memory requirement is reduced by storing the coordinates of a binary feature image rather than the entire binary feature image. In [28] and [29], a modified cache-friendly Hough transform is proposed to reduce the memory requirement and the parallelism overhead on multiprocessors.

    In this paper, we utilize the incrementing property of Hough transform to achieve an efficient architecture for implementing Hough transform. The proposed architecture facilitates both pixel-level and angle-level parallelism. Unlike [20], the memory requirement is not increased in proportion to the parallelism. We propose using run-length encoding to skip unnecessary computations and memory accesses. Run-length encoding is widely used for data compression [30] but could also be used for reducing the computing complexity of image processing [31]. Another application of run-length encoding is the skew detection of documents [32], [33]. In this paper, we use run-length encoding to reduce the computing complexity of Hough transform. Moreover, we utilize the locality of the Hough transform and Vote Consolidation to reduce the memory requirement.

    To help later discussions, we summarize some terms we use in this paper as follows.

    specific $(\rho, \theta)$: represents a line with an angle $\theta$ and distance $\rho$ from the origin.

    $\rho_\theta(x, y)$: $\rho$ value for a line with angle $\theta$ which passes through the point at the coordinate $(x, y)$.

    $\rho_\theta(P)$: $\rho$ value for a line with angle $\theta$ which passes through the point $P$.

    Vote: a line with angle $\theta$ which passes through a feature point will generate a vote to the specific $(\rho, \theta)$ value.

    Vote-offset: difference of the integer part between $\rho_\theta(P)$ and $\rho_\theta(P_0)$, where $P$ is any point in a block and $P_0$ has the minimum integer part among all points in the block.

    III. OBSERVATIONS

    In Hough transform, only feature pixels produce votes. The number of feature pixels in an image is usually much less than that of nonfeature pixels. In order to perform the pixel-level parallelism of Hough transform, we divide a $W \times H$ image into blocks with a block-size $M$ by $N$, so that a relatively small processing element (PE) can process all of the pixels inside a block simultaneously. We call the blocks that do not contain feature pixels nonfeature blocks (i.e., all-zero blocks). The computation performance can be significantly improved if the nonfeature blocks are skipped.

  • CHEN et al.: RESOURCE-EFFICIENT FPGA ARCHITECTURE AND IMPLEMENTATION OF HOUGH TRANSFORM 1421

    Fig. 1. Example image and its zero-run-length encoded symbols.

    Fig. 2. Number of symbols versus the block-sizes for several 512 × 512 images.

    A. Run-Length Encoding

    A run-length encoding can encode an input binary feature image into a zero-run-length symbol stream, in order to skip the nonfeature blocks. We encode the binary image as a list of symbols before the calculation of the $\rho$ values. A symbol is represented as a triplet whose fields are: a bit that indicates the beginning of a block-row, the pixel values in a block, and the number of successive zero-blocks after the current block. Fig. 1 gives an example of a binary image and the encoded run-length symbols, where pixels with a value 1 represent feature pixels. In this example, the image size is 16 × 8 and the block size is 2 × 4. Note that the first block of each block-row is always encoded whether it is a nonfeature block or not. These 16 blocks are encoded into six symbols.
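The encoding rule can be sketched in software as follows. This is an illustrative model only: the function name `encode_blocks`, the field order `(row_start, pixels, zero_run)`, the behaviour when a run reaches `max_run` (a new symbol is started with the zero-block itself), and the test image are all assumptions, and the image dimensions are assumed to be multiples of the block size.

```python
def encode_blocks(image, block_w=2, block_h=4, max_run=15):
    """Encode a binary image (list of rows of 0/1) as zero-run-length symbols.

    Each symbol is a tuple (row_start, pixels, zero_run); the field names are
    hypothetical and chosen only for this sketch.
    """
    height, width = len(image), len(image[0])
    symbols = []
    for by in range(0, height, block_h):
        current = None  # symbol being built for this block-row
        for bx in range(0, width, block_w):
            pixels = tuple(image[by + dy][bx + dx]
                           for dy in range(block_h) for dx in range(block_w))
            is_zero = not any(pixels)
            if current is not None and is_zero and current[2] < max_run:
                current[2] += 1  # absorb the zero-block into the current run
            else:
                if current is not None:
                    symbols.append(tuple(current))
                # The first block of a block-row is always encoded.
                current = [int(bx == 0), pixels, 0]
        symbols.append(tuple(current))
    return symbols

# A 16 x 8 test image (not the image of Fig. 1) with three feature pixels.
image = [[0] * 16 for _ in range(8)]
image[1][4] = 1
image[5][0] = 1
image[6][10] = 1
symbols = encode_blocks(image)
```

With this test image, the 16 blocks collapse into four symbols, and the number of blocks covered by the symbols (one per symbol plus its zero run) still adds up to 16.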

    The maximum number of successive zero-blocks needs to be limited in order to limit the size of a lookup table (LUT) in our proposed architecture, as will be explained later. The coding efficiency depends on both the block-size and the maximum number of successive zero-blocks. Fig. 2 shows the total number of symbols for various gray-level 512 × 512 images with the maximum number of successive zero-blocks set to 15. The original images are preprocessed by the edge function of the MATLAB image processing toolbox: 1) a Sobel operator extracts the horizontal gradients and vertical gradients; 2) the magnitudes of the gradients are calculated from these two gradient components; 3) a cutoff value is set as four times the mean of all magnitudes; and 4) the magnitudes are thresholded by the threshold value, which is set as the square root of the cutoff value.

    Fig. 3. Illustration of (2).

    Typically, the number of symbols is roughly inversely proportional to the block-size. However, it is very likely that a larger block will contain few nonzero feature pixels. In our proposed architecture, we use a PE to compute the $\rho$ values for a specific $\theta$ for all of the pixels inside a block simultaneously. In order to limit the complexity of the PE, the block-size cannot be too large. Based on the simulation results in Fig. 2, we choose the block size of 2 × 4, which gives a good tradeoff between the resulting number of symbols and the PE complexity. The performance is similar to that of 4 × 2, but the block-size of 2 × 4 results in a smaller number of block-rows, which is preferred with our architecture as will be clear later. It should be noted that in practical applications, preprocessing different from the one we use in this example will be used depending on the application. So, this example only serves to illustrate the parameter selection process and considerations using our proposed architecture. In the following discussions, without loss of generality, we will use the image shown in Fig. 1 as an example to illustrate the operation of our proposed architecture.

    B. Incrementing Property of Hough Transform

    We can use parallel PEs to compute different $\theta$ in parallel. To further improve the computation speed, we also perform pixel-level parallelism for calculating Hough transform. For a specific $\theta$, given two points in the image with coordinates $(x_1, y_1)$ and $(x_2, y_2) = (x_1 + \Delta x, y_1 + \Delta y)$, one could directly calculate $\rho_\theta(x_2, y_2)$ by (1) or derive it from $\rho_\theta(x_1, y_1)$ as

    $$\rho_\theta(x_2, y_2) = \rho_\theta(x_1, y_1) + \Delta x\cos\theta + \Delta y\sin\theta \tag{2}$$

    The equation is illustrated in Fig. 3, where L1 and L2 are lines with angle $\theta$ passing through points A $(x_1, y_1)$ and B $(x_2, y_2)$, respectively. The $\rho$ values of L1 and L2 are their perpendicular distances to the origin. One could directly calculate the $\rho$ of L2 or compute it by adding $\Delta x\cos\theta + \Delta y\sin\theta$ to the $\rho$ of L1.

    From (2), it is easy to see that the pixel which gives the smallest $\rho$ value is the first pixel in the block for $0^\circ \le \theta < 90^\circ$ and is the upper rightmost pixel in the block for $90^\circ \le \theta < 180^\circ$. In the proposed architecture, the pixel $P_0$ has the smallest $\rho$ value. If we label the pixels inside a block as shown in Fig. 4, the $\rho$ values of all of the pixels in the block can be calculated from $\rho_\theta(P_0)$, which is the smallest $\rho$ value in the block. Furthermore, since $\cos(180^\circ - \theta) = -\cos\theta$, and with the labeling in Fig. 4(b), $\Delta x$ will be negative, the $\rho$ values for $90^\circ \le \theta < 180^\circ$ can be calculated using exactly the same circuits as those used for $0^\circ \le \theta < 90^\circ$. So, we will only use $0^\circ \le \theta < 90^\circ$ in the following discussion. Note that, from (1), for the first block in the image, $\rho_\theta(P_0) = 0$ for $0^\circ \le \theta < 90^\circ$, and $\rho_\theta(P_0) = (M - 1)\cos\theta$ for $90^\circ \le \theta < 180^\circ$.

    Fig. 4. Pixel labels in a block: (a) $0^\circ \le \theta < 90^\circ$. (b) $90^\circ \le \theta < 180^\circ$.

    TABLE I. MAXIMUM VOTE-OFFSET FOR DIFFERENT BLOCK-SIZES
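A small floating-point sketch can make the incrementing property concrete. It evaluates (1) once for the reference pixel of a block and derives the other seven values with (2). The function name `block_rhos` and the choice of the reference pixel (upper-left for angles below 90 degrees, upper-right otherwise, with a negative horizontal step) are written here as assumptions that follow the description in this section; the fixed-point arithmetic of the hardware is not modelled.

```python
import math

def block_rhos(x0, y0, theta_deg, block_w=2, block_h=4):
    """Return rho for every pixel of the block whose upper-left pixel is (x0, y0).

    Index 0 of the result is the reference pixel P0, which has the smallest rho.
    """
    theta = math.radians(theta_deg)
    c, s = math.cos(theta), math.sin(theta)
    if theta_deg < 90:
        px, dx_sign = x0, 1                  # P0 is the upper-left pixel
    else:
        px, dx_sign = x0 + block_w - 1, -1   # P0 is the upper-right pixel
    rho0 = px * c + y0 * s                   # eq. (1), evaluated once per block
    # eq. (2): every other pixel is rho0 plus a block-independent increment.
    return [rho0 + dx_sign * dx * c + dy * s
            for dy in range(block_h) for dx in range(block_w)]

rhos_30 = block_rhos(10, 20, 30)
rhos_120 = block_rhos(10, 20, 120)
```

In both cases the first entry is the minimum of the block, and the eight values agree with a direct evaluation of (1) at each pixel.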

    C. Locality of Hough Transform

    We observe that the differences among the $\rho$ values produced by pixels in the same block are small. Also, pixels in the same block may contribute votes to the same $(\rho, \theta)$ in the parameter space. We call this property the locality of Hough transform. The votes from the pixels of a block will be located in a small range. Instead of individually accumulating the vote from each feature pixel in the parameter space, pixels giving the same $\rho$ value can be jointly accumulated. We can utilize (2) to determine whether points in the same block give the same $\rho$ value.

    Let $P_0$, which has the minimum $\rho$ value among all pixels of a block, be at the coordinate $(x_0, y_0)$. Let $\rho_\theta(P_0) = I_0 + F_0$, where $I_0$ is the integer part and $F_0$ is the residual fractional part. Let $P_i$ be another pixel in the same block at coordinate $(x_0 + \Delta x_i, y_0 + \Delta y_i)$. The integer part $I_i$ of $\rho_\theta(P_i)$ can be derived by

    $$I_i = I_0 + \lfloor F_0 + \Delta x_i\cos\theta + \Delta y_i\sin\theta \rfloor \tag{3}$$

    where $\lfloor F_0 + \Delta x_i\cos\theta + \Delta y_i\sin\theta \rfloor$ is called the vote-offset from $P_0$. It represents the contribution from the accumulation of the fractional part $F_0$ with $\Delta x_i\cos\theta + \Delta y_i\sin\theta$, which adds an integer vote-offset to $I_0$. For a given $M$ by $N$ block and a given $\theta$, the maximum value for the vote-offset is

    $$\lfloor F_0 + (M - 1)\cos\theta + (N - 1)\sin\theta \rfloor \tag{4}$$

    Since Hough transform may be used in different applications, we should consider all possible angles. For any angle, we could determine the maximum by solving

    $$\max_{\theta,\, F_0} \lfloor F_0 + (M - 1)\cos\theta + (N - 1)\sin\theta \rfloor$$

    where $0^\circ \le \theta < 90^\circ$ and $0 \le F_0 < 1$. Since $(M - 1)\cos\theta + (N - 1)\sin\theta$ can be represented as $R\sin(\theta + \phi)$, where $R = \sqrt{(M - 1)^2 + (N - 1)^2}$ and

    $$\phi = \tan^{-1}\frac{M - 1}{N - 1} \tag{5}$$

    we can compute the maximum among all $\theta$ as

    $$\left\lceil \sqrt{(M - 1)^2 + (N - 1)^2} \right\rceil$$

    Table I shows the maximum vote-offset values for different block-sizes. For the block-size of 2 × 4, the largest vote-offset is 4, which means there can only be five different integer values for $\rho$ in the whole block. We will discuss the use of this locality property in our proposed implementation of Hough transform to reduce the memory bandwidth requirement.

    Fig. 5. Proposed architecture for Hough transform using FPGA.
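The bound on the vote-offset can be reproduced numerically. The sketch below assumes, as a reading of the derivation in this section, that the fractional part of the reference value is strictly less than 1, so the largest attainable offset is the ceiling of the block diagonal measured between its corner pixels. The function name `max_vote_offset` and the sample block sizes are illustrative, and only the 2 × 4 result (4) is stated in the text.

```python
import math

def max_vote_offset(block_w, block_h):
    """Largest integer vote-offset for a block_w x block_h block over 0..90 degrees."""
    # (M-1)cos(theta) + (N-1)sin(theta) peaks at the diagonal length R; adding a
    # fractional part below 1 and taking the floor gives at most ceil(R).
    r = math.hypot(block_w - 1, block_h - 1)
    return math.ceil(r)

for size in [(2, 2), (2, 4), (4, 2), (4, 4)]:
    print(size, max_vote_offset(*size))
```

For the 2 × 4 block the diagonal is about 3.16, so the largest offset is 4 and at most five distinct integer $\rho$ values occur in a block.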

    IV. PROPOSED HOUGH TRANSFORM ARCHITECTURE AND FPGA IMPLEMENTATION

    Based on the above observations, we propose an efficient architecture for implementing Hough transform. A block diagram of the proposed architecture is shown in Fig. 5, which will be discussed in detail as follows.

    Run-length encoding is a simple process which reads the binary pixel values from the feature image and outputs the symbol triplets. Efficient implementations of the run-length encoder can be found in the implementations of standard codecs, such as JPEG [34]. The run-length encoding also reduces the memory bandwidth requirement for the FPGA by a factor determined by the compression ratio achieved by the run-length encoding. Since it has little effect on the complexity of the overall circuits and can reduce the data bandwidth, in the system we implemented for our specific application, it was implemented by the preprocessor off-chip. The PE is run-time configurable for computing Hough transform of any angle. Each PE calculates the consolidated $\rho$ values for all the pixels in a block for a given $\theta$. The number of PEs is adjustable and depends on the performance requirement. The Vote Memory stores all the votes. These functional blocks are described in detail in the following. A block diagram of the PE is shown in Fig. 6.

    Fig. 6. Functional blocks in the proposed PE. The inter-block incrementing takes the run-length symbol as input and computes $\rho_\theta(P_0)$ of a nonzero block. $\rho_\theta(P_0)$ is divided into integer part $I_0$ and fractional part $F_0$. The integer part is used to access the Vote Memory, and the fractional part is used for intra-block incrementing. Vote Consolidation takes the code (the pixel values of the block) as input and consolidates votes. Finally, Vote Alignment aligns votes for accessing the Vote Memory.

    The incrementing property in (2) is utilized for both inter-block and intra-block incrementing. The inter-block incrementing calculates the $\rho_\theta(P_0)$ of the first blocks in block-rows and of the non-zero blocks in the run-length encoded symbols. The intra-block incrementing calculates the $\rho$ values of the other pixels after the inter-block incrementing. As shown in the above section, the eight $\rho$ values from all the pixels in a block can only have five different integer values, $I_0$ to $I_0 + 4$, where $I_0$ is the integer part of $\rho_\theta(P_0)$. So, some of the $\rho$ values from different pixels will have the same integer values. If the individual votes are directly saved into the memory, it will incur multiple memory accesses. To save the memory bandwidth, Vote Consolidation consolidates the eight single votes from the eight pixels into five consolidated votes ($V_0$ to $V_4$). Since the addresses for storing these five different votes are continuous, the Vote Alignment aligns the initial address so that the votes can be stored into the correct memory locations in one clock cycle. The functional blocks of the PE are described in detail as follows.

    A. Inter-Block Incrementing

    Because we divide an image into blocks with a fixed block-size, $\Delta x$ and $\Delta y$ are constants between the corresponding pixels of two blocks. Two accumulators can be used to implement the inter-block incrementing as shown in Fig. 7(a), where the increments can be precomputed. In order to skip zero-blocks, a step-table is introduced in the proposed architecture. The step-table stores all possible $\Delta x\cos\theta$. Two PEs for calculating the votes for $\theta$ and $180^\circ - \theta$ can share the same step-table. In our implementation, the maximum number of successive zero-blocks is set to 15 to limit the size of the step-table. Hence, there are only 16 entries in each step-table. The output of the step-table is called step, which is the $\Delta x\cos\theta$ component of (2) for computing $\rho_\theta(P_0)$ based on the $\rho_\theta(P_0)$ of the previous nonskipped block in the same block-row. Col-reg calculates the $\rho_\theta(P_0)$ values for the nonzero blocks in a block-row in the $x$-direction every clock cycle, and row-reg calculates the $\rho_\theta(P_0)$ values for the first blocks of block-rows in the $y$-direction every time after a block-row processing is completed. At the beginning of the frame, row-reg is initialized to 0 for $0^\circ \le \theta < 90^\circ$, or $(M - 1)\cos\theta$ for $90^\circ \le \theta < 180^\circ$. The result is represented in a fixed-point format with an integer part $I_0$ and a fractional part $F_0$. In Fig. 7(a), one control signal is asserted before the processing of a frame and another control signal is asserted before the processing of a block-row. The col-reg is only responsible for calculating the $\rho$ value of the first pixel in a block. The $\rho$ values of other pixels are calculated by the intra-block incrementing to be discussed in Section IV-B.
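The two-accumulator scheme can be modelled in software as follows. This is a behavioural sketch for angles below 90 degrees only, using floating point instead of the fixed-point registers. The symbol layout `(row_start, pixels, zero_run)`, the function name `inter_block_rhos`, and the assumption that step-table entry `k` holds the horizontal increment for skipping `k` zero-blocks are illustrative choices, not details confirmed by the text.

```python
import math

def inter_block_rhos(symbols, theta_deg, block_w=2, block_h=4, max_run=15):
    """Return rho of the reference pixel for each encoded block (theta < 90)."""
    theta = math.radians(theta_deg)
    # Step-table with max_run + 1 entries: advancing past k zero-blocks moves
    # the reference pixel (k + 1) blocks to the right.
    step_table = [(k + 1) * block_w * math.cos(theta) for k in range(max_run + 1)]
    row_step = block_h * math.sin(theta)   # increment between block-rows
    row_reg = 0.0                          # rho of the first block of a row
    col_reg = 0.0                          # rho of the current block
    pending = 0.0                          # step selected by the previous symbol
    first_row = True
    out = []
    for row_start, _pixels, zero_run in symbols:
        if row_start:
            if not first_row:
                row_reg += row_step
            first_row = False
            col_reg = row_reg
        else:
            col_reg += pending
        out.append(col_reg)
        pending = step_table[zero_run]
    return out

zero = (0,) * 8
some = (0, 0, 1, 0, 0, 0, 0, 0)
symbols = [(1, zero, 1), (0, some, 5), (1, some, 4), (0, some, 2)]
rhos = inter_block_rhos(symbols, 30)
```

The four encoded blocks have upper-left pixels at (0, 0), (4, 0), (0, 4), and (10, 4), and the accumulated values match a direct evaluation of (1) at those coordinates.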

    B. Intra-Block Incrementing and Vote Consolidation

    The computed $\rho_\theta(P_0)$ can be used to calculate all the other $\rho$ values of the pixels in the block simultaneously by using the corresponding precomputed increments $\Delta x\cos\theta + \Delta y\sin\theta$. This will result in seven more $\rho$ values. For the whole block, the eight votes in the memory addressed by the eight $\rho$ values will need to be accumulated. We observe that, based on the locality of Hough transform as discussed in Section III-C, the maximum vote-offset is 4 for the block-size of 2 × 4. So, there are at most five different $\rho$ values ($I_0$ to $I_0 + 4$) in the whole block.

    The computed $\rho_\theta(P_0)$ is divided into the integer part $I_0$ and the fractional part $F_0$. To obtain an efficient circuit, only the fractional part $F_0$ is used for calculating the vote-offsets relative to $I_0$ for the pixels in the block, as shown in Fig. 8. The first stage of Fig. 8 calculates the vote-offset for the $i$th pixel in the block. The vote-offsets range from 0 to 4 and are represented by 3-b numbers. These numbers are decoded by 3:8 decoders as shown in the second stage of Fig. 8. Each decoder output contains eight lines indicating the vote-offset value for each pixel. These lines will be used with combination logic to produce consolidated votes. In Fig. 8, we eliminate those signals at the output of the decoders which are always zero, and only keep those lines which may be nonzero. The constants in Fig. 8 are precomputed and stored in the registers before activating the PE. They are also shared between the two PEs calculating the angles $\theta$ and $180^\circ - \theta$.

    In Fig. 9, the outputs of the decoders are combined with the values of the corresponding pixels (1 for feature pixels and 0 for nonfeature pixels) using a combination logic circuit to determine $V_j$, which represents the consolidated number of votes for each different vote-offset $j$. For example, if $V_j = 3$, it represents three votes with the integer $\rho$ value $I_0 + j$.

    In summary, using the locality of Hough transform and Vote Consolidation, the circuits in Figs. 8 and 9 take the fractional part $F_0$ of $\rho_\theta(P_0)$ and produce the outputs $V_0$ to $V_4$, which represent the numbers of votes with integer $\rho$ equal to $I_0$ to $I_0 + 4$, respectively. Thus, instead of accessing the memory multiple times to accumulate the votes, we only need to access the memory once (with addresses $I_0$ to $I_0 + 4$), and each time we accumulate the memory contents in parallel. This is not possible without the Vote Consolidation, since the generated votes from the pixels may refer to the same memory location.

    Fig. 7. (a) Proposed accumulator architecture for the inter-block incrementing. (b) Basic dataflow of the circuit.

    Fig. 8. Intra-block incrementing.

    Fig. 9. Vote consolidation.
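Intra-block incrementing and Vote Consolidation can be sketched together in a few lines. The model below follows (3): it uses only the fractional part of the reference value to compute each pixel's vote-offset, then counts feature pixels per offset. It is a floating-point illustration for angles below 90 degrees; the function name `consolidate_votes`, the row-major pixel order, and the example values are assumptions.

```python
import math

def consolidate_votes(rho0, pixels, theta_deg, block_w=2, block_h=4):
    """Return (base address, [V0..V4]) for one 2 x 4 block, 0 <= theta < 90."""
    theta = math.radians(theta_deg)
    base = math.floor(rho0)    # integer part, used as the base address
    frac = rho0 - base         # fractional part, used for the vote-offsets
    counts = [0] * 5           # at most five distinct offsets for a 2 x 4 block
    for i, bit in enumerate(pixels):
        dy, dx = divmod(i, block_w)   # pixels listed row by row
        offset = math.floor(frac + dx * math.cos(theta) + dy * math.sin(theta))
        counts[offset] += bit         # only feature pixels (bit == 1) vote
    return base, counts

# A block whose eight pixels are all feature pixels, theta = 45 degrees.
base, counts = consolidate_votes(10.5, (1,) * 8, 45)
```

Here the eight single votes collapse into the counts `[1, 4, 2, 1, 0]` at addresses 10 to 14, so one parallel update replaces eight separate accumulations.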

    C. Vote Alignment

    To update the contents of the memory locations $I_0$ to $I_0 + 4$, instead of accessing the memory five times sequentially, we update the five memory contents in one clock cycle. This is possible since, with our vote consolidation scheme, the five memory contents which need updating are stored in continuous memory locations with addresses from $I_0$ to $I_0 + 4$. The $I_0$ is considered as the base address of these votes. We use two 4K RAM blocks in the FPGA to implement the Vote Memory. Because the accumulation requires one read and one write memory operation, the 4K RAM blocks are configured as dual-port. In the FPGA, each 4K RAM could be configured as 128 × 36, 256 × 18, 512 × 9, 1024 × 4, 2048 × 2, or 4096 × 1. The maximum bit-width of a 4K RAM is 36 b, thus two 4K RAMs can store eight votes (each vote with 9 b). Since we only have five vote-offsets, three zero-valued votes are added in Fig. 10(a). In the case that the block-size is larger than 2 × 4 or 4 × 2, the circuit can be modified accordingly or it can use multiple clock cycles to handle a symbol. The consolidated votes must be aligned based on the base address before they are accumulated to the Vote Memory. Since the $I_0$ value may not be a multiple of 8, we need to align the five vote-offsets with the correct memory locations before the accumulation. This is achieved with a barrel rotator controlled by the $I_0$ value (which is the base address), as shown in Fig. 10(a), to align the votes with the corresponding Vote Memory contents. The votes are grouped in two groups, each group containing four votes. The lower four votes are stored in one 4K RAM and the upper four votes are stored in the other 4K RAM. Fig. 10(b) shows the two 4K RAM configuration. In the circuit implementation, the upper bits of $I_0$ are used to address the memory and the lower bits of $I_0$ are used to align the votes. If $I_0$ is located in the upper 4K RAM, the address of the lower 4K RAM should be increased by 1 to give the correct addresses.
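The alignment step can be illustrated with a small behavioural model. It assumes a particular address mapping that the text does not spell out: vote address `a` lives in lane `a % 8` of word `a // 8`, lanes 0 to 3 belong to the lower RAM and lanes 4 to 7 to the upper RAM. The function name `aligned_update` and this mapping are therefore hypothetical; only the rotate-by-low-bits idea and the rule of incrementing the lower RAM address come from the description.

```python
def aligned_update(base, counts):
    """Return ((lower_addr, 4 lane increments), (upper_addr, 4 lane increments))."""
    padded = list(counts) + [0] * (8 - len(counts))  # five votes plus three zeros
    shift = base & 7                                 # lower bits: rotation amount
    # Barrel rotation: consolidated vote j ends up in lane (j + shift) mod 8.
    rotated = [padded[(lane - shift) % 8] for lane in range(8)]
    word = base >> 3                                 # upper bits: word address
    # If the base address falls in the upper RAM, the wrapped-around votes
    # belong to the next word of the lower RAM.
    lower_addr = word + 1 if shift >= 4 else word
    upper_addr = word
    return (lower_addr, rotated[:4]), (upper_addr, rotated[4:])

lower, upper = aligned_update(13, [1, 4, 2, 1, 0])
```

With base address 13, votes 0 to 2 land in lanes 5 to 7 of upper word 1 (addresses 13 to 15) and vote 3 wraps into lane 0 of lower word 2 (address 16), so both RAMs can be updated in the same cycle.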

    In our implementation, the image size is 512 × 512. Although the maximum vote can be $512\sqrt{2} \approx 724$, which is a 10-b number, in practice we limit it to a 9-b number, since, if a $(\rho, \theta)$ receives more than 511 votes, it is certain there is a line with that $(\rho, \theta)$ value in the input binary feature image.


    Fig. 10. (a) Vote alignment. (b) Vote memory and accumulators.

    D. Initialization

    Before a PE starts to process an angle, a host needs to initialize the PE by initializing three components: 1) the row-register, 2) the step table, and 3) the registers in the intra-block incrementing block for storing the increments $\Delta x\cos\theta + \Delta y\sin\theta$. If a processor is used to control the PE, these values could be computed by the processor. Otherwise, it requires a lookup table and an accumulator to compute these values. The lookup table stores the precomputed trigonometric values for all angles, and the remaining values of an angle are computed from them.

    V. EVALUATION

    Because the output of the proposed architecture is identical to the ideal Hough transform, we do not show the result of test images. Here, we evaluate the resource requirement, memory bandwidth, and computation time of the proposed architecture.

    A. Resource Requirement

    In the hardware implementation, the $\rho$ values and the increments are represented in the fixed-point format. Let the fractional part be represented in $n$ bits. Each increment introduces a bounded error that depends on $n$. There are $(W/M - 1)$ steps in the $x$-direction and $H/N$ steps, including the initialization, in the $y$-direction. Moreover, intra-block incrementing involves another increment. Therefore, the maximum error of $\rho$ is $(W/M + H/N)$ times the maximum error of a single increment.
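The number of increments, and an error bound under an explicit assumption, can be computed as below. The per-increment error of $2^{-n}$ used here is an assumption made for illustration (it corresponds to truncating to $n$ fractional bits) and is not taken from the text; the function name `rho_error_bound` is likewise hypothetical.

```python
def rho_error_bound(width, height, block_w, block_h, frac_bits):
    """Return (number of increments, worst-case accumulated error)."""
    x_steps = width // block_w - 1   # inter-block steps in the x-direction
    y_steps = height // block_h      # steps in the y-direction, incl. initialization
    intra = 1                        # one more increment inside the block
    increments = x_steps + y_steps + intra
    per_increment = 2.0 ** -frac_bits  # ASSUMPTION: truncation to frac_bits bits
    return increments, increments * per_increment

increments, bound = rho_error_bound(512, 512, 2, 4, 12)
```

For a 512 × 512 image with 2 × 4 blocks there are 384 increments, and under the stated assumption 12 fractional bits keep the accumulated error below 0.1.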

    Table II shows the accuracy, FPGA resources, and the maximum achievable frequency for PEs under different accuracy. ALUT stands for adaptive LUT, and it is the basic cell in Altera Stratix II FPGAs. The result is reported by the FPGA vendor's synthesis tool, Quartus II Version 9.1 with SP2, on the device EP2S180F1508C3. By increasing the number of bits in the fixed-point implementation, the maximum error is reduced. However, it also increases the required resources and reduces the maximum achievable frequency.

    In order to evaluate the proposed architecture, we also implement the architectures of previous works [9], [18], [19] on the same device. In the implementation, we use 12 b for the fractional part. Table III compares the throughput of these approaches, where the Throughput per Cycle is measured as the average number of computed $\rho$ values per cycle. The Throughput (M/s) is measured in millions of $\rho$ values per second. In the comparisons of the throughput, we do not consider the effect of the Vote Memory and Accumulators, since the main purpose of this work is to compare the throughput of the PE for the different architectures. Vote Memory is a common part for all architectures to store all of the votes. Also, the memory access can be sped up by using multiple reconfigurable memories on the FPGA. If the PE can run at a much higher speed than the memory access, it is possible to use one PE to process multiple angles and use multiple memories to match the throughput.

    TABLE II. ACCURACY, FPGA RESOURCES, AND MAXIMUM FREQUENCY UNDER DIFFERENT ACCURACY FOR THE PE

    The accuracy of the CORDIC algorithm [9] depends on both the number of fractional bits and the number of iterations. In our implementation, we keep 9 b for the fractional parts and use 13 iterations. The CORDIC algorithm produces two $\rho$ values per cycle. In the DA architecture [18], the accuracy depends on the fractional part, and we keep 3 b for the fractional parts. The throughput depends on the bits for the image coordinates. Since the image coordinates are 9 b, each $\rho$ value is computed in nine clock cycles. The accuracy of Chern's [19] method depends on the number of fractional bits, and it computes one $\rho$ value per cycle. We use two proposed PEs to simultaneously calculate two angles ($\theta$ and $180^\circ - \theta$) of one 2 × 4 block, since the two PEs could share some resources. So, 16 $\rho$ values are computed per cycle. In general, using more parallel PEs, higher throughput can be achieved; however, this will also use more resources. So, the more important number in the comparison in Table III is the Throughput/ALUTs. As can be seen from Table III, our proposed architecture can achieve much better Throughput/ALUTs compared to other architectures. Although the maximum frequency of the proposed architecture is lower, it could achieve higher throughput with the minimum resources.

    B. Memory and Bandwidth Requirement

    Previous research has reported that Hough transform is not only computation-demanding but also memory-bandwidth-demanding. Here, we analyze the memory bandwidth of the proposed architecture.

    The complete set of Hough transform votes is usually too large to be stored in internal temporary storage, and so is stored in an external memory. Hence, the direct implementation requires two external memory accesses to accumulate each nonzero pixel for an angle. In addition, all votes must be initialized before the accumulation. The required memory bandwidth therefore depends on the maximum ρ value of an angle, on the ratio of the number of nonzero pixels to the total number of pixels of the binary feature image, and on the number of angles.


    TABLE III. PERFORMANCE COMPARISON AMONG THE PEs OF DIFFERENT APPROACHES

    TABLE IV. MEMORY BANDWIDTH OF DIRECT IMPLEMENTATION OF HOUGH TRANSFORM AND THE PROPOSED ARCHITECTURE

    The proposed PE uses the Vote Memory to store the votes of one angle. While an angle of Hough transform is being processed, votes are temporarily stored in the Vote Memory to avoid accessing the external memory. After the angle has been calculated, a transfer from the Vote Memory to the external memory is initiated. The required memory bandwidth is thus that of this one transfer per angle.

    Since the memory bandwidth depends on the image, Table IV compares the required memory bandwidth for the test images between the direct implementation of Hough transform and the proposed architecture. All images are 512 × 512. The number of votes for an angle is 724, the size of a vote is limited to 9 b, and the total number of angles is 180. The results show that the proposed architecture requires much less memory bandwidth than the direct implementation of Hough transform.
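    The comparison can be illustrated with a rough access count. The Python sketch below is a simplified model built only from the statements in this subsection, not the exact expressions behind Table IV: the direct scheme initializes every vote and then makes two external accesses per nonzero pixel per angle, while the buffered scheme is assumed to write back one Vote Memory per angle. The function names and the 5% nonzero ratio in the usage section are hypothetical. The figure of 724 votes per angle matches the integer part of the diagonal of a 512 × 512 image.

```python
import math


def rho_bins(width, height):
    """Integer part of the image diagonal, in pixels."""
    return math.isqrt(width * width + height * height)


def direct_accesses(width, height, nonzero_ratio, num_angles):
    """External accesses when every vote lives in external memory.

    Per angle: initialise each vote once, then one read and one write
    for every nonzero pixel.
    """
    init = rho_bins(width, height)
    accumulate = 2 * nonzero_ratio * width * height
    return num_angles * (init + accumulate)


def buffered_accesses(width, height, num_angles):
    """ASSUMED model: one write-back of the on-chip vote buffer per angle."""
    return num_angles * rho_bins(width, height)


if __name__ == "__main__":
    VOTE_BITS = 9                      # vote size stated in the text
    HYPOTHETICAL_NONZERO_RATIO = 0.05  # placeholder, not a test image
    d = direct_accesses(512, 512, HYPOTHETICAL_NONZERO_RATIO, 180)
    b = buffered_accesses(512, 512, 180)
    print("direct bits:", d * VOTE_BITS)
    print("buffered bits:", b * VOTE_BITS)
```

    Under this model only the direct count grows with the number of nonzero pixels, which is why the gap between the two schemes varies from image to image.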

    C. Computation Time

    We use the proposed PE to compute the accumulated votes of Hough transform of a 512 × 512 image. Table V shows the synthesis result on the EP2S180F1508C3. The maximum frequency is bounded by the Vote Memory. One M-RAM is used for the run-length symbols and one 512-b RAM (M512) is used for the step-table. Four 4K RAMs (M4K), two for each angle, are used for the Vote Memory with each PE. The last 4K RAM

    TABLE V. SYNTHESIS RESULT OF THE PROPOSED ARCHITECTURE ON EP2S180F1508C3

    TABLE VI. IMAGE SPECIFICATIONS AND EXECUTION TIME

    (M4K) in the chip is used for storing the angle-dependent values and all other constants.

    The total execution time for computing Hough transform depends on the number of symbols and the number of angles. Before votes are accumulated in the parameter space, the Vote Memory must be initialized, which takes 128 cycles. The time to initialize the constants of the PEs and to output the content of the Vote Memory is overlapped with the initialization of the Vote Memory. Angles are computed two at a time, except for 0° and 90°, which are computed individually. Table VI shows the specification and execution time for each image. With the execution times shown in Table VI, real-time video processing can be easily achieved.
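    A back-of-the-envelope estimate of the execution time can be sketched as follows. This Python sketch is an illustrative model, not the method used to obtain Table VI. It assumes that one symbol is processed per clock cycle, that each pass over the symbols is preceded by the 128-cycle Vote Memory initialization, and that 0° and 90° each take their own pass while the remaining angles share passes in pairs. The symbol count in the usage section is a hypothetical placeholder.

```python
VOTE_MEMORY_INIT_CYCLES = 128  # stated in the text


def num_passes(num_angles=180):
    """Passes over the symbol stream (ASSUMED pairing model).

    0 and 90 degrees run alone; every other angle shares a pass
    with one partner angle.
    """
    paired_angles = num_angles - 2
    return 2 + paired_angles // 2


def estimated_cycles(num_symbols, num_angles=180):
    """One symbol per cycle plus the initialisation, for every pass."""
    return num_passes(num_angles) * (num_symbols + VOTE_MEMORY_INIT_CYCLES)


def estimated_ms(num_symbols, fmax_hz=200e6, num_angles=180):
    """Estimated execution time in milliseconds at clock fmax_hz."""
    return estimated_cycles(num_symbols, num_angles) / fmax_hz * 1e3


if __name__ == "__main__":
    HYPOTHETICAL_SYMBOLS = 5000  # placeholder, not taken from Table VI
    print(estimated_cycles(HYPOTHETICAL_SYMBOLS))
    print(estimated_ms(HYPOTHETICAL_SYMBOLS))
```

    Because the cycle count scales with the number of symbols, images with more nonzero blocks take proportionally longer under this model.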

    D. Extending the Proposed Architecture to Different Image Sizes

    The proposed architecture could be extended to process Hough transform of different image sizes. The first step is to


    decide the block size. The block size affects the number of symbols of an image and the computation time. For different block sizes, the Vote Memory should be carefully designed, as described in Section IV-C, to match the bandwidth requirement of the maximum vote-offset (Table I). The proposed PE can process one symbol per clock cycle, so the total number of cycles to process Hough transform of an image is determined by the number of symbols and the number of angles. Since the design of the Vote Memory varies with FPGAs, the clock cycles for the initialization of the Vote Memory and for the transmission between the Vote Memory and the external memory are not included.

    The resource requirement of a PE also varies with the image size and the block size. The inter-block incrementing requires two adders whose bit-width depends on the image size and the precision F. The intra-block incrementing requires adders whose bit-width depends on the block size and the precision F. The Vote Consolidation requires V (maximum vote-offset) adders to consolidate votes; both the number of votes and the bit-width of these adders are bounded. The Vote Alignment requires a rotator to rotate V votes.

    VI. CONCLUSION

    In this paper, we propose a resource-efficient architecture for calculating Hough transform. The incrementing property is exploited for both inter-block and intra-block incrementing to reduce the resource requirement. We use two accumulators to facilitate the inter-block incrementing, and zero-blocks are skipped by introducing a run-length coding scheme and a step-table. The intra-block incrementing efficiently reduces the resource requirement: instead of computing the ρ of every pixel in a block, it is more efficient to use the vote-offset to determine the corresponding votes. We observe that pixels in the same block may generate identical votes in the parameter space. The locality of a block is analyzed, and the votes corresponding to an identical ρ are consolidated in order to reduce the memory access and fully utilize the FPGA memory bandwidth. The results show that the proposed PE achieves the best throughput for the same amount of resources compared to previously reported architectures. The proposed PE is implemented on an Altera EP2S180F1508C3 device and the maximum frequency is 200 MHz. It can compute the Hough transform of a 512 × 512 image with 180 orientations in 2.07–3.16 ms. This performance is sufficient for real-time video processing.

    REFERENCES

    [1] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Commun. ACM, vol. 15, pp. 11–15, 1972.

    [2] F. O'Gorman and M. B. Clowes, "Finding picture edges through collinearity of feature points," IEEE Trans. Comput., vol. C-100, pp. 449–456, 1976.

    [3] L. A. F. Fernandes and M. M. Oliveira, "Real-time line detection through an improved Hough transform voting scheme," Pattern Recognit., vol. 41, no. 1, pp. 299–314, 2008.

    [4] L. Lin and V. K. Jain, "Parallel architectures for computing the Hough transform and CT image reconstruction," in Proc. Int. Conf. Applic. Specific Array Processors, 1994, pp. 152–163.

    [5] R. Strzodka, I. Ihrke, and M. Magnor, "A graphics hardware implementation of the generalized Hough transform for fast object recognition, scale, and 3D pose detection," in Proc. 12th Int. Conf. Image Anal. Process., 2003, pp. 188–193.

    [6] A. L. Fisher and P. T. Highnam, "Computing the Hough transform on a scan line array processor," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 3, pp. 262–265, Mar. 1989.

    [7] M. Atiquzzaman, "Pipelined implementation of the multiresolution Hough transform in a pyramid multiprocessor," Pattern Recognit. Lett., vol. 15, no. 9, pp. 841–851, 1994.

    [8] K. Hanahara, T. Maruyama, and T. Uchiyama, "A real-time processor for the Hough transform," IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, no. 1, pp. 121–125, Jan. 1988.

    [9] F. Zhou and P. Kornerup, "A high speed Hough transform using CORDIC," Univ. Southern Denmark, Tech. Rep. PP-1995-27, 1995.

    [10] S. M. Karabernou and F. Terranti, "Real-time FPGA implementation of Hough transform using gradient and CORDIC algorithm," Image Vis. Computing, vol. 23, no. 11, pp. 1009–1017, 2005.

    [11] J. D. Bruguera, N. Guil, T. Lang, J. Villalba, and E. L. Zapata, "Cordic based parallel/pipelined architecture for the Hough transform," J. VLSI Signal Process., vol. 12, no. 3, pp. 207–221, 1996.

    [12] D. D. S. Deng and H. Elgindy, "High-speed parameterisable Hough transform using reconfigurable hardware," in Proc. Pan-Sydney Area Workshop Vis. Inf. Process., Sydney, Australia, 2001, vol. 11, pp. 51–57.

    [13] K. Maharatna and S. Banerjee, "A VLSI array architecture for Hough transform," Pattern Recognit., vol. 34, no. 7, pp. 1503–1512, 2001.

    [14] E. K. Jolly and M. Fleury, "Multi-sector algorithm for hardware acceleration of the general Hough transform," Image Vis. Computing, vol. 24, no. 9, pp. 970–976, 2006.

    [15] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electron. Comput., vol. 8, no. 3, pp. 330–334, 1959.

    [16] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers," in Proc. ACM/SIGDA 6th Int. Symp. Field Programmable Gate Arrays, Monterey, CA, 1998, pp. 191–200.

    [17] P. Lee and A. Evagelos, "An implementation of a multiplierless Hough transform on an FPGA platform using hybrid-log arithmetic," in Proc. SPIE, 2008, vol. 6811, p. 68110G.

    [18] K. Mayasandra, S. Salehi, W. Wang, and H. M. Ladak, "A distributed arithmetic hardware architecture for real-time Hough-transform-based segmentation," Can. J. Electr. Comput. Eng., vol. 30, no. 4, pp. 201–205, 2005.

    [19] M.-Y. Chern and Y.-H. Lu, "Design and integration of parallel Hough-transform chips for high-speed line detection," in Proc. 11th Int. Conf. Parallel Distrib. Syst. Workshops, 2005, vol. 2, pp. 42–46.

    [20] S. S. Sathyanarayana, R. K. Satzoda, and T. Srikanthan, "Exploiting inherent parallelisms for accelerating linear Hough transform," IEEE Trans. Image Process., vol. 18, no. 10, pp. 2255–2264, Oct. 2009.

    [21] H. Koshimizu and M. Numada, "FIHT2 algorithm: A fast incremental Hough transform," IEICE Trans., vol. E74, pp. 3389–3393, 1991.

    [22] S. Tagzout, K. Achour, and O. Djekoune, "Hough transform algorithm for FPGA implementation," Signal Process., vol. 81, no. 6, pp. 1295–1301, 2001.

    [23] O. Djekoune and K. Achour, "Incremental Hough transform: An improved algorithm for digital device implementation," Real-Time Imaging, vol. 10, no. 6, pp. 351–363, 2004.

    [24] H. Bessalah, S. Seddiki, F. Alim, and M. Bencherif, "On line mode incremental Hough transform implementation on Xilinx FPGAs," in Proc. 8th Conf. Signal, Speech Image Process., Santander, Cantabria, Spain, 2008, pp. 176–179.

    [25] Y. He, Z. Zivkovic, R. Kleihorst, A. Danilin, and H. Corporaal, "Real-time implementations of Hough transform on SIMD architecture," in Proc. 2nd ACM/IEEE Int. Conf. Distrib. Smart Cameras, 2008, pp. 1–8.

    [26] M. Khan, A. Bais, K. Yahya, G. Hassan, and R. Arshad, "A swift and memory efficient Hough transform for systems with limited fast memory," Image Anal. Recognit., Lecture Notes in Computer Science, vol. 5627, pp. 297–306, 2009.

    [27] S. R. Geninatti, J. I. B. Benítez, and M. H. Calviño, "FPGA implementation of the generalized Hough transform," in Proc. Int. Conf. Reconfigurable Computing and FPGAs, 2009, pp. 172–177.

    [28] Y.-K. Chen, W. Li, J. Li, and T. Wang, "Novel parallel Hough transform on multi-core processors," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2008, pp. 1457–1460.

    [29] W. Li and Y.-K. Chen, "Parallelization, performance analysis, and algorithm consideration of Hough transform on chip multiprocessors," ACM SIGARCH Comput. Architecture News, vol. 36, pp. 10–17, 2008.

    [30] D. Salomon, Data Compression: The Complete Reference, 4th ed. New York: Springer, 2006.

    [31] C. H. Messom, G. Sen Gupta, and S. N. Demidenko, "Hough transform run length encoding for real-time image processing," IEEE Trans. Instrum. Meas., vol. 56, no. 3, pp. 962–967, Jun. 2007.


    [32] S. C. Hinds, J. L. Fisher, and D. P. D'Amato, "A document skew detection method using run-length encoding and the Hough transform," in Proc. 10th Int. Conf. Pattern Recognit., 1990, vol. 1, pp. 464–468.

    [33] H. Liu, Q. Wu, H. Zha, and X. Liu, "Skew detection for complex document images using robust borderlines in both text and non-text regions," Pattern Recognit. Lett., vol. 29, no. 13, pp. 1893–1900, 2008.

    [34] M. Kovac and N. Ranganathan, "JAGUAR: A fully pipelined VLSI architecture for JPEG image compression standard," Proc. IEEE, vol. 83, no. 2, pp. 247–258, Feb. 1995.

    Zhong-Ho Chen received the M.S. and Ph.D. degrees in computer science and information engineering from National Cheng-Kung University, Tainan, Taiwan, in 2005 and 2011, respectively.

    He was a Visiting Student with the University of Washington, Seattle, from May 2010 to January 2011. Currently, he holds a Postdoctoral position with the SCREAM Lab, Department of Computer Science and Information Engineering, National Cheng-Kung University, Tainan, Taiwan. His research activities include digital signal processing, VLSI/FPGA circuit design, computer architecture, and embedded systems.

    Alvin W. Y. Su was born in Taiwan, Taiwan, in 1964. He received the B.S. degree in control engineering from National Chiao-Tung University, Hsinchu, Taiwan, in 1986, and the M.S. and Ph.D. degrees in electrical engineering from Polytechnic University, Brooklyn, NY, in 1990 and 1993, respectively.

    From 1993 to 1994, he was with the Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA. From 1994 to 1995, he was with Computer Communication Laboratory, Industrial Technology Research Institute, Taiwan. In 1995, he joined the Department of Information Engineering and Computer Engineering, Chung-Hwa University, Taiwan, where he serves as an Associate Professor. In 2000, he joined the Department of Computer Science and Information Engineering, National Cheng-Kung University, Tainan, Taiwan, where he is a Professor. His research interests cover the areas of digital audio signal processing, musical signal analysis and synthesis, pattern recognition, data compression, image/video signal processing, and VLSI signal processing.

    Ming-Ting Sun (S'79–M'81–SM'89–F'96) received the B.S. degree from National Taiwan University, Taipei, Taiwan, in 1976, the M.S. degree from the University of Texas at Arlington in 1981, and the Ph.D. degree from University of California, Los Angeles, in 1985, all in electrical engineering.

    He joined the University of Washington, Seattle, in August 1996, where he is a Professor. Previously, he was the Director of the Video Signal Processing Research Group at Bellcore. He has been a Chaired/Visiting Professor with Tsinghua University, Tokyo University, National Taiwan University, National Cheng Kung University, National Chung Cheng University, National Sun Yat-sen University, and Hong Kong University of Science and Technology. He holds ten patents and has published over 200 technical papers, including 14 book chapters in the area of video and multimedia technologies. He coedited a book, Compressed Video Over Networks (CRC, 2000).

    Dr. Sun was the Editor-in-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA (TMM) and a Distinguished Lecturer of the Circuits and Systems Society from 2000 to 2001. He received an IEEE CASS Golden Jubilee Medal in 2000, and was the general co-chair of the Visual Communications and Image Processing 2000 Conference. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT) from 1995 to 1997. He received the TCSVT Best Paper Award in 1993. From 1988 to 1991, he was the chairman of the IEEE Circuits and Systems Society Standards Committee and established the IEEE Inverse Discrete Cosine Transform Standard. He received an Award of Excellence from Bellcore for his work on the digital subscriber line in 1987.