New Fast and Area-Efficient Pipeline 3-D DCT Architectures...3-D DCT, including visual tracking,...
Transcript of New Fast and Area-Efficient Pipeline 3-D DCT Architectures...3-D DCT, including visual tracking,...
-
1
NewFastandArea-EfficientPipeline3-DDCTArchitectures
SaadAl-Azawia,OmarNiboucheb,SaidBoussaktac,GayeLightbodydaCollegeofEngineering,DiyalaUniversity,Diyala,Iraq;b,dFacultyofComputingandEngineering,UlsterUniversity,
UK;cSchoolofEngineering,NewcastleUniversity,[email protected],[email protected],[email protected],
AbstractTheefficient implementationof 3-D transforms is a challenging taskdue to the computation complexity,memory and arearequirementsofsuchtransforms.Oneimportant3-Dtransformisthe3-DDiscreteCosineTransform(3-DDCT)usedinmanyimageandvideoprocessingsystems.Inthispaper,twonewpipelinearchitecturesforthe3-DDCTcomputationusingthe3-DDCTVector-Radixalgorithm(3-DDCTVR)arepresented.Thesearchitecturesarescalableandparameterisablewithregardstodifferentwordlengthsandpipelininglevels.Theirarithmeticcomponentrequirementsarereducedtotheorderof!(#$%&')incontrastwith!(')for3-DDCTarchitecturesintheliterature,whileatthesametimetheycankeepsimilarorbetterarea-timecomplexity.Keywords:3-DDiscreteCosineTransform(DCT),Row-Column(RC),Row-Column-Frame(RCF),Vector-Radix,FPGA1.Introduction
Transforms such as the Fourier Transform (FT) [1-5],WaveletTransform(WT)[6-9],andCosineTransform(CT)[10-14]playacriticalpartinvariousDigitalSignalProcessing(DSP)applications,includingaudio,imageandvideosystems.Muchof the usefulness of these transforms arises from theirfrequencyandtime-frequencyrepresentationsandpropertiesincluding the decorrelation property, energy compactness,and theavailabilityof fastalgorithms for their computation.Nevertheless,eventhefastalgorithmsthatimplementthesetransformsarestillverycomputationallyintensive.Thus,thesetransformscanbecomeabottleneckintermsofthesystem’sspeed,andcontributegreatlytotheirareausageandpowerconsumption[1-3,6-8,10-19].Foritsroleinmanyimageandvideo applications, including the JPEG, MPEGx and H.26xcompressionstandards,theDCThasreceivedagreatdealofresearch interest [20-24]; the 1-D and2-DDCT arenow theestablished transforms formanyapplicationsand standards.Further,therearemanynewandemergingapplicationsforthe3-D DCT, including visual tracking, video coding andwatermarking[25-29].
Numerous 1-D and 2-D DCT architectures have beensuggestedintheliterature[30-39].Exploitingtheseparabilityprincipleofthetransform,2-DDCTcoresbasedonthe1-DDCTRow-Column(RC)approacharesuggestedin[33-36];yetveryfewarchitectures that implement the3-DDCTcanbe found[38-45].Traditionally,the3-DDCThasbeenimplementedbycascading stages of the 1-DDCT as in thewell-known Row-Column-Frame (RCF) approach. Noteworthy differencesbetween architectures in the literature are their level ofparallelisation in terms of the number of stages and thenumber of 1-DCT cores per stage, which leads to differenttrade-offs between circuit complexity and throughput. Onecommon architecture employs three stages of one 1-D DCTcore and N3+N2–word transpose memory [40-42].Parallelisationcanbeapplied to the first two1-DDCTcores
whichinfactimplementsa2-DDCTtransform,leadingtotheutilisation of 2N+1 1-D DCT processors and N3+N-wordmemory[41,42].Anotherclassofthe3-DDCTarchitecturesmultiplexes the 1-D DCT transforms involved in itscomputationontoasingle1-DDCTarchitecture.Suchaclassof architecture requires N3-word memory [41, 42]. Thereductionachievedinhardwareutilisationcomesatthecostofalowerthroughput.Usingthree1-DCTcorestoimplementthe 3-DDCT achieves a throughput three times higher thanwhenemployingasingle1-DDCTprocessor.ThethroughputisN-foldaugmentedviaparallelisationofthe1-DDCTprocessors[42].The1-DDCTcoresemployedinthe3-DDCTarchitecturecanusethetransform’sfastalgorithm,distributedarithmeticorROMbaseddesigns[38].Sucharchitecturesexhibitirregularstructures, lackofmodularity,andcomplexcontrol.Anotherclassof the3-DDCTarchitecturereliessolelyonthesystolicapproachwithitswellestablisheddesignmethodology[43].In[44, 46], high speed and low complexity pipeline n-D DCTarchitectures are proposed using the regular 1-D DCT andtensor product operations. The proposed architectures arebasedonthe1-Dand2-DDCTarchitecturesin[47].
In thispaper, twonewpipelineandscalablearchitecturesthat implement the 3-D Discrete Cosine Transform Vector-Radix (3-D DCT VR) are introduced. The presentedarchitecturesareparameterisableintermsofwordlengthandpipeline stages. Further, no block memory is used for datatransposition. These architectures have been implementedandtested;forinstance,anFPGA-basedimplementationofa512×512×8-worddatausingatransformlengthof8×8×8-wordcube size and a 14-bit wordlength has achieved a workingfrequencyof330MHzandaprocessingtimeof6.4ms.Thus,80000framescanbeprocessedineverysecond.
The remainder of this paper is organised as follows. Insection2,thebackgroundoftheDCTtransformand3-DDCTVRalgorithmareprovided.Sections3,4and5presentthetwonew architectures for 3-D DCT computation. The results
-
2
obtainedarediscussedinsection6andconclusionsaregiveninsection7.
2.Backgroundandthe3-DDCTVRAlgorithm
The 3-D DCT coefficients of a N×N×N data cube arecomputedasfollows:) *+, *&, *-
=801+01&01-
'- 3 4+, 4&, 4-
56+
789:
56+
7;9:
56+
7
-
3
Fortheremainingcombinationsofodd/evenindices*+,*&and*-, one can divide the computation of the transform asfollows:
The set of equations (6)-(13) represents a single butterflycomputationofaDecimationInFrequency(DIF)VRalgorithmasshowninFigure2.Itcomputeseightpoints;abutterflycanreceive aN×N×N data cube at its input and outputs 8 datacubes of N/2×N/2×N/2-word each; the process can be
repeated until58
Rdata cubes of 2× 2× 2-word each are
computed. Thus, the flow graph of the whole butterfly
computationconsistsof#$%&'stageswith58
Rbutterfliesper
stage[25].Theoutputfromthelastbutterflystageisfedtothepostadditionstagesthenitisbit-reversed.Thepostadditionoperations, shownby the termsoutsidebraces in the setofequations (6)-(13), are then carried out. Further to the
)(2*+, 2*&, 2*- + 1) = gh h h [23c::+(4+, 4&, 4-) cos ∅-]
[
789:
[
7;9:
[
7
-
4
reductioninarithmeticcomplexityandprocessingtime,the3-DDCTVRalgorithmdoesnotrequiretransposememoryandexhibitsaregularbutterflystructurewhichismoresuitableforhardware and software implementation than the RCFapproach.Furtherdetailsaboutthe3-DDCTVRalgorithmcanbefoundin[25].
3.New3-DDCTVectorRadixArchitectures
Two new architectures are presented; namely the SinglePath Data Flow 3-D DCT Architecture (SPDFA) and theDualPathDataFlow3-DDCTArchitecture(DPDFA).Thedifferencebetweenthemliesinthenumberofwordsfedtotheadjacentbutterfly and how the arithmetic operations are scheduledwithineachbutterfly stage; thishas led to thederivationoftwo structures with different hardware requirements. Botharchitecturesarebuiltaccordingtothegenericblockdiagramof Figure 1. The butterfly calculation consists of #$%&'parameterisedandscalablestagesasdescribedbythesetofequations (6)-(13) and illustrated in Figure 2. The datareordering is common to both SPDFA andDPDFA, however,theinternalarchitectureofthebutterfly,post-additionstagesand the 3-D Bit Reverse Ordering (3-D BRO) stage arearchitecture-dependent. Of the two presented structures,SPDFA exhibits a single line of data between neighbouringbutterfly stages. It is more efficient in memory usage asintermediate results are fed back to memory elements;however, using these feedback loops prevents any furtherpipelining. DPDFA is a dual-path data flow feed-forwardarchitecture. There are two data lines between adjacentbutterflies and further pipelining is a simple task; thearchitecturehoweverrequiresmorememorythanSPDFA.
The presented architectures both partition the inputsequenceintocubesofN×N×N-wordorN-blocksofN×N-word.The inputdata is reorderedaccording to (2).The reorderingprocessisperformedbyshufflingwordsalongtherow,columnand frame dimensions. It includes dividing data into odd-indexedandeven-indexedwordsandretrogradeindexing.As
anexample,forNindicesarrangedas“0,1,2,3,4,5,6,…N-1”,thereorderedsequencewillbe“0,2,4,6,…,N-2,N-1,N-3,N-5,…..,1“.ThisstageisimplementedusingadualportblockRAM which permits writing and reading operations to beperformedondifferentlocationsduringthesamecycle.Thus,for anN×N×N-word cube, thememory size required for the
reorderingoperationis5
&+ 1 '
&–wordwithalatencyof58
&
cycles as onlywriting operations are carried out during thisperiod.
4.SinglePathDataFlow3-DDCTArchitecture
SPDFA is composed of a 3-D reordering stage, #$%&(')butterflycomputationstages, threepostadditionsub-stagesanda3-DBitReverseOrder(3-DBRO)stage.
4.1ButterflyStages
Thereordereddatafromthe3-Dreorderingstageisfedtothebutterflystagesatarateofonewordperclockcycle.AsshowninFigure3,#$%&'butterflystages(m=1,2,3,..,#$%&')areused.Eachbutterflystagecanbefurtherdividedinto threesub-stagesandamultiplierasshowninFigure4.Asub-stageconsists of two add/subtract elements for carrying outadditionandsubtractionoperationsbetweenthetwohalvesofeach inputalong the threedimensionsofdata, a registerand a switch. The multiplier is used to multiply the output
Figure2.Singlebutterflyofthe3-DDCTDIFVRalgorithm
-
5
words by appropriate Twiddle Factors (TFs) which are pre-computedandstoredinalookuptable(LUT).
For the sake of explaining, thewords3 4+, 4&, 4- at theinput of the first butterfly stage can be indexed as 3 4+ +4&×' + 4-×'
& . The first sub-stage performs addition andsubtractionbetweenthetwohalvesoftheinputdatacube;the
first half contains words indexed from 0to58
&− 1 and the
secondpartthewordswithindicesfrom58
&H$'
-− 1.During
thefirst58
&clockcycles,thefirstpartofthedataisstoredin
Register1(oflength58
&-word)beforebeingfedtoaddersalong
with the inputdata fromthesecondhalfduring thenext58
&
cycles.Duringthisperiod,thesubtractionoperationresultsarestored inRegister1whiletheadditionresultsare fedtothenext sub stage.Once this is completed, it is the turn of thesubtractionresultsstored inRegister1 tobefedtothenextsub-stage.TheselectionofwhichpartofthedatatobestoredinRegister1,fedtotheaddersorfedtothenextsub-stageismanagedbythecontrolsignalofSwitch1,whichchangesits
valueevery58
&cycles.
What the first sub-stagecarriesoutoncubesofdata, thesecond sub-stage performs it on blocks ofN×N-word of thesamedatacube.Omittingchangesalong4-,eachblockisagaindividedintotwohalves;onehalfincludesindices4+ + 4&×'
from0H$5;
&− 1whilethesecondhalfincludeswordsofthe
sameblockwithindicesfrom5;
&H$'
&− 1.Thebehaviourof
the second sub-stage is similar to the first one; except that
Register2 isof length5;
&-wordandtheperiodofthecontrol
signalforSwitch2is'&cycleswithadutycycleof50%.The third sub-stage implements addition and subtraction
between the twohalvesofeachcolumn ineachblockusing
Register3(oflength5
&-word).Omittingchangesof4-and4&,
thedataineachcolumnisdividedintotwohalveswithindices
rangingfrom0H$5
&− 1andfrom
5
&H$' − 1.Thewordsof
thefirsthalfofthecolumnarestoredinRegister3,theyarethen fed to theaddersalongwith thecolumn’ssecondhalf.The results of the addition operation are multiplied by theappropriate TFs. After which, the results of the subtractionoperationwhichwerefirststoredinRegister3arefedtothemultiplier for the multiplication by the TFs. The multiplieroutputisinputtothenextbutterflystage.Aswithsub-stages1 and 2, Switch 3 multiplexes data and its control signal is
periodicandchangesitsvalueevery5
&cycles.
In thegeneralcaseof themthbutterflystage,data is split
into 2-u6- cubes ofv
&wx<
-
words; the first butterfly sub-
stage is used to perform the addition and subtractionoperationsbetweenthetwohalvesofeachinputdatacube;
the words involved are indexed from0to58
&y− 1 and from
58
&yto
58
&yx<.Duringthefirst
58
&ycycles,thedatacube’sfirsthalf
isstoredinRegister1;inthenext58
&ycycles,theadditionand
subtractionoperationstakeplace;theresultsoftheadditionoperationarefedtotheadjacentbutterflysub-stagewhiletheresultsofthesubtractionoperationsarestoredinRegister1
before being fed to the adjacent sub-stage in the next58
&y
cycles.Inasimilarwayandwithanappropriateswitching,thesecondandthirdbutterflysub-stagesimplementtheadditionand subtraction operations between the first and secondhalves of the data blocks and columns, respectively. The
lengthsofregisters,Register2andRegister3,is5;
&y-wordand
5
&y-word, respectively, which allows for storing half of each
block and columnofdata as appropriate. Themultiplicationoperation by a twiddle factor (TF) takes place once allarithmeticoperationsofsub-stage3havebeencarriedout.
ButterflyStagelog2N(m=log2N)
ButterflyStage2(m=2)
ButterflyStage1(m=1)
Reordereddata
Out
Figure3.Theblockdiagramofbutterflystages
+
-
N3/2m
Switch1
Register1
+
-
N2/2m
Switch2
Register2
+
-
N/2m
Switch3
Register3
×
Sub-stage1 Sub-stage2 Sub-stage3
in outTF
Figure4.SPDFAmthbutterflyinternalarchitecture
a.
-Register 1 -
Mux 1
PIn
PRegister 2
Register 3
-Mux 2
PRegister 4
-Mux 3
2POut
Post addition Sub-stage 1
P=N
Post addition Sub-stage 2
P=1
Post addition Sub-stage 3
P=N23-D BROIn Out
b.
Figure5.a.Aparameterisedpostadditionsub-stageforSPDFA.b.Apostadditionstageand3-DBRO
4.2PostAdditionand3-DBROStages
The third part of SPDFA is the post addition stage whichperforms the computation of the terms outside the curlybrackets in (6)-(13). Reflecting the three dimensions of theinputdata,thepostadditionstagecanbedividedintothreesub-stages; each sub-stage carries out addition operationsover a given dimension. In the first, second and third postaddition sub-stage, the addition operations are carried outwithin the same N×N-word block, the same column or the
-
6
same data cube; respectively. Hence, the length of theregisters,used inFigure5and labelledasparameterP,mayvary. Still the internal architecture of each sub-stage isidenticallythesame.
Theoutputofthethirdpostadditionstageisfedtothe3-DBRO stage which performs data reordering as the fastalgorithmused introducesabit reversalpermutationon thebinary indicesof the results.Bit reversal is performedalongeachrow,columnandframedirectionsineachN×N×N-worddata cube using a regular bit reversal algorithm [25]. Theoutputfromthisstagerepresentsthe3-DDCTcoefficientsof
the input data. It is implementedusing a3'
4− 1 '
2-word
dual-port block RAM. This stage is placed next to the postadditionstagetoactasabufferforthesubsequentsystemifrequired;forinstance,itcanbeintegratedwithaquantizerasinconventionaldatacompressionalgorithms.
5.DualPathDataFlow3-DDCTArchitecture
DPDFAisadualdatapatharchitectureforthe3-DDCTVRcomputation. It is devised toproduceahigh speed3-DDCTarchitecturewhichcanbeeasilyretimedandpipelined.DPDFAconsistsof3-Ddatareordering,butterflystages,postadditionand3-DBROstages.The3-Ddatareorderingstageisthesameasthatpresentedearlierinthepaper. 5.1ButterflyStages
The scheduling of arithmetic operations in DPFDA isdifferent from SPDFA. Rather than feeding the subtractionoperations intermediate results back to the same sub-stageregisterasinSPDFA,theresultsoftheadditionandsubtractionarefedforwardtothenextsub-stageortothenextstage.Thisreducesthetimeduringwhichregistersareutilisedforstoringpartial results, adds another lineof data for communicationbetweenadjacentstagesandsub-stages,andhenceincreasesthe number of required multipliers to cope with thecomputationoftwocoefficientsperclockcycle.However,thissimplifiespipeliningandretimingintheDPDFA.
DPDFA comprises #$%&' butterfly stages; each can bedivided into three sub-stages, registers, switches and twomultipliers.Agenericsub-stageconsistsof twoadd/subtract
elementsforcarryingoutadditionandsubtractionoperationsbetween the two halves of each input along the threedimensionsofdata.Italsocontainstworegistersandaswitchfordataorderingandmultiplexing; theexception is the firstsub-stageofthefirstbutterflywhichutilisesonlyoneregisterasshowninFigure6.Thefirstbutterfly internalarchitecturetakesintoaccountthefactthatdataisreceivedatitsinputattherateofonewordperclockcyclewhicharethenstoredandprocessedattherateoftwowordsperclockcycle.
The first sub-stage performs addition and subtractionbetween the two halves of the input data cube; Register 1stores the first half that contains words of indices from
0to58
&− 1thenfeedsittotheaddercomponentsduringthe
next58
&cycleswhentheseconddatacubepartthatcontains
wordsindexedfrom58
&to'
-− 1isalsoavailableattheinput
of the adder components. Data multiplexing is carried outusingSwitch1which isusedto twofoldparallelise theserialinput.Itscontrolsignalisperiodicwithaperiodof'-cyclesanda50%dutycycle.
Insub-stage2,theregistersRegister2andRegister3,andSwitch2re-orderdatawiththeaimtoimplementtheadditionand subtraction operations in each block of data; a block is
dividedintotwohalves;wordswithindicesfrom0to5;
&− 1
arestoredinRegister3whilethesecondhalfwhichincludes
words of the same block with indices from5;
&H$'
&− 1 is
storedinRegister2.Theflowofdatabetweensub-stages1and2andtheselectionofwhereandwhenresultsarestored inregistersRegister2andRegister3 iscarriedoutbySwitch2.Suchaswitchhasa50%dutycyclecontrolsignalwithaperiodof'&cycles.
When sub-stage 2 processes blocks of data of the samecube,inasimilarwaythethirdstagecarriesouttheadditionandsubtractionoperationsoncolumnsofdatabelonging tothesameblockofdata.For'/2cycles,theadditionresultsofsub-stage 2 are fed toRegister 5; the subtraction operationresultsstoredinRegister4arefedtotheaddercomponentsofsub-stage 3; during the next '/2cycles, Register 4 isconnected to Register 5 while the results of the addition
Figure6.a.ThefirstbutterflyofDPDFA,b.ThemthbutterflyofDPDFA
-
7
operationofsub-stage2arefedtotheaddercomponentsofsub-stage3.BothregistersRegister4andRegister5areofalengthof'/2-word.Switch3whichallowsfordataswitchingand controls the flow of partial results in sub-stage 3 has aperiodic control signal which changes its value every '/2cycles.Oncealladditionandsubtractionoperationshavebeencarriedoutbythefirstbutterflythreesub-stages,twofurthertaskshavetobecarriedout,namely;themultiplicationbytheappropriateTFsandre-arrangingdatainanordersuitableforthenextbutterflystageoperationstobeexecuted.
Re-arrangingdatainSPDFAbutterfliesissimplycarriedoutbyfeedbackregisters.However,tore-arrangedata inDPSFAone has to cancel out the data order engendered by theselection and switching behaviour of Switch 2, Switch 3,Register 2, Register 3, Register 4and Register 5. Thedesignapproach adopted in thiswork is to use the same set-upofregistersandswitchestore-arrangetheorderofdataandthentoretimeformemoryoptimization.Hence,thebehaviourofSwitch4andSwitch5issimilartothatofSwitch2andSwitch3, respectively.The impactofusingretiming isshown inthelengthofRegister7ofthefirstbutterflyofFigure6.aand inthe length ofRegister 1 in the first sub-stage of the secondbutterflystageillustratedinFigure6.b.Hencetheorderofdatawhen it leavesthefirstbutterfly issimilarto itsorderattheadderelementsofthefirstsub-stage.
Inthegeneralcase,thetwodatainputspresentedatthemthbutterfly stage are the two halves of the '--word cube;howevereachhalfdataisorderedas2{6&setsof2{6+×2{6+
interleaved data cubes of sizev
&wx<
-
words. The control
signalsofallswitchesinFigure6.bareperiodicwitha50%dutycycle.ThecontrolsignalperiodofSwitch1,Switch2andSwitch
3is58
&yx<cycles,
5;
&yx<cyclesand
5
&yx<cycles,respectively.By
carefully controlling the flow of partial results into registersRegister 1,Register 2,Register 3,Register 4,Register 5 andRegister 6 in Figure 6.b, all the addition and subtractionsoperationscanbecarriedoutalongthethreedimensionsofthe data. Switches Switch 4 and Switch 5, share the controlsignalsofSwitch3andSwitch2,respectively.TheirswitchingbehaviourandtheuseofregistersRegister7andRegister8,re-arrangedatatothesameorderitwasreceivedattheinputof the adder elements of sub-stage 1; the multiplicationoperationscanthentakeplace.
5.2PostadditionStageand3-DBROStages
The post addition stage can be divided into three sub-stages.Tocopewithprocessingtwowordsperclockcycle,thefirst two sub-stages are built using two of the sub-stagesshowninFigure5.a;thethirdsub-stageisdepictedinFigure7.aandiscomposedoffiveadd/subtractelements,registersand a multiplexer. The post addition sub-stages areparameterised.TheparameterPshowninFigure7,referstothelengthofregistersused.
AswithSPDFA,the3-DBROstageisrequiredtore-ordertheoutput;itadjustsforthebitreversalpermutationengenderedby the fast transform algorithm used. The 3-D BRO is
implemented using5
&− 1 '
&-word dual port block RAM.
Thismemoryelementcanbemergedwithsystemswherethepresented3-DDCTcoreisused.
Mux 2
1
Register 3
-
Mux 3
1Register 5
-
P
Register 7
-Mux 4
1
P
Register 6
Register 1
-Mux 1
1
P
Register 2 +
3P-1
1
Out
Register 4
Register 8
Register 9
In0
In1
a.
Post addition Sub-stage 1
P=N
Post addition Sub-stage 2
P=1
Post addition Sub-stage 3
P=N23-D BRO Out
b.
In0
In1
Figure7:a.Thirdpostadditionsub-stageforDPDFA.b.A
postadditionstageand3-DBROforDPDFA.
6.PerformanceEvaluation
ThepresentedarchitectureshavebeendesignedusingXilinxSystem generator tool and they have been tested andimplementedonaXilinxVirtex55vlx50tff1136-3FPGAdevice.Variousvideosequencesandwordlengthshavebeenusedtotest and evaluate the presented architectures performanceandattributes.
6.1TestBenchandRateDistortionPerformanceDCTArepresentsthe3-DDCTofeachframecomputedusing
the presented architectures; as they implement the samealgorithm,bothstructuresexhibitvirtuallythesameaccuracy,asshowninTable1andTable2.Whenemploying|}~,theannotation(x,y)referstoafixed-pointwordlengthofx+y-bitswherexandyrepresentthenumberofbitsoftheintegerandfractionalparts,respectively.ResultsofDCTAimplementationusing(12,8),(12,4)and(12,2)-bitwordlengthsareshowninthis section. In comparison, DCTM represents coefficientscalculated usingMatlab code that implements the 3-DDCT,andÄ|}~[representsaMatlabimplementationoftheinverse3-DDCT.BothÄ|}~[and|}~[Matlabimplementationsarefloating-point based. For testing and validation purposes|}~[andDCTAhavebeenappliedonvariousMRIandvideosequences of 512×512×8-word [50]; the Ä|}~[ is thenappliedtotheoutputof|}~toyieldreconstructedframes.Thepeak signal-to-noise ratio (PSNR) and rootmean squareerror (RMSE) are used for evaluating the accuracy of thepresented architectures output. The RMSE between theoriginalandreconstructedframesisdefinedas:
-
8
Ç]ÉÑ *
=1
Ö×ÜÄ|}~[ |}~ L, X, * − Ä(L, X, *)
&
á
D9+
à
Y9+
(14)
WhereÄ(L, X, *) istheoriginalframeand* intherange 1 ≤* ≤ äistheframeindex.FisthenumberofframesinÄ,PandQ are the number of its rows and columns, respectively. Inaddition, the PSNR between the original and reconstructedframesiscomputedasfollows[51]:
ÖÉ'Ç * = 10#$%[ãå çé
è[êë 1
&
(15)
where]í3 Ä1 represents themaximum intensity value ofthekthframe.Further,theaverageofmaximumabsoluteerror(AvgMaxErr) of the coefficients for the presented 3-D DCTarchitectures |}~in comparison to the Matlabimplementation|}~[iscomputedas:
]í3ÑFF * = ]í3 íìM |}~[ L, X, * − |}~ L, X, *
(16)
AvgMaxErr =+
ö]í3ÑFF *
ö
19+ (17)
where]í3ÑFF * represents themaximum absolute errorforeachframe * .
Performance accuracy, for both presented architectures,wasstudiedoveraselectionofimplementationwordlengths.Table1andTable2showthatthePSNR increaseswhenthefractional part increases for both SPDFA and DPDFA,respectively.Assuchandasexpected,thehighestaccuracyisobtained using a 20-bit wordlength (namely, (12, 8)-bit),providing perfect accuracy. The presented architecturesproduce very good image quality using all the selectedwordlengths.TheaveragePSNRoftheeighttestsequencesforSPDFAare∞,57and45dBusing(12,8),(12,4)and(12,2)-bitwordlengths, respectively. DPDFA achieves very comparableresults.Further,inTable1andTable2,theAvgMaxErrofbotharchitecturesarealmostidentical.Fortheaimofusingvisualinspectionasasubjectivefidelitycriterion[52],theimagesoftheoriginalandreconstructedMRI2scansareshowninFigure8. For both architectures, the reconstructed images arecomputedusingwordlengths (12,2), (12,4)and (12,8)bits. Itcanbenoticedthatthe(12,2)-bitwordlengthproducedagoodquality imagewhere the visual error can hardly be noticed;longerwordlengthshoweverleadtoamuchbetterquality.
6.2AreaUsageandComputationTime
Thehardwareusage,speedandcomputationtimeofbotharchitecturesusingdifferentwordlengthsareshowninTable3.Itisimportantthatthepresentedarchitecturesareefficientin terms of area usage; in particular, in resource-limiteddevices such as FPGAs. As it can be seen from Table 3, theaveragedeviceresourcesusageofSPDFAandDPDFAisaslowas12%and18%,respectively.ThehardwareusageofDPDFA
ishigherthanthatoftheSPDFAduetoduplicatecircuitryformultiplication, addition and post addition stages. However,thisextrahardwareusageandthefactthatithasnofeedbackloops improve themaximumoperating frequency of DPDFAoverSPDFA.ItiseasiertoplaceandroutethecomponentsofDPDFA, including theFPGAdevicespecific resourcessuchasthe DSP elements for the implementation of arithmeticoperations.Assuch,thecomputationtimeof512×512×8-wordinDPDFAisshorterthanthatofSPDFA.Itisworthpointingoutthatthememoryrequirementsofbotharchitecturesarelowin comparison with other architectures due to the in-placecomputation and the lowmemory requirement for theBROand3-Dreorderingoperations.Thememoryelementsof5N2and3N
2-wordhavebeenusedfor3-BROforSPDFAandDPDFArespectively. Further, a memory of 5N2-word for 3-Dreordering operation has been used in both architectures.Thus, the total number of block memory used in eacharchitecture is less than 25% of the available memoryresourcesofthe5vlx50tff1136-3FPGAdevice.
Table1.AccuracyanddistortionperformanceofSPDFA
Reconstructedandoriginalframes 3-DDCTCoefficients
PSNR(dB) RMSE AvgMaxErr
Video (12,8)(12,4)(12,2) (12,8) (12,4)(12,2) (12,8) (12,4) (12,2)
MRI1 ∞ 59 48 ≈0 0.28 1.05 0.01 0.18 0.77
MRI2 ∞ 56 45 ≈0 0.40 1.49 0.01 0.23 0.88
Akiyo ∞ 58 47 ≈0 0.31 1.16 0.01 0.21 0.89
Stefan ∞ 56 45 ≈0 0.40 1.48 0.01 0.23 0.97
Suzie ∞ 56 45 ≈0 0.40 1.46 0.01 0.21 0.90
Bus ∞ 56 45 ≈0 0.40 1.49 0.02 0.23 0.92
Flower ∞ 56 45 ≈0 0.39 1.45 0.02 0.23 0.97
Mobile ∞ 56 45 ≈0 0.40 1.49 0.01 0.21 0.92
Average ∞ 57 45 ≈0 0.37 1.38 0.01 0.22 0.90
Table2.AccuracyanddistortionperformanceofDPDFA
Reconstructedandoriginalframes 3-DDCTCoefficients
PSNR(dB) RMSE AvgMaxErr
Video (12,8)(12,4)(12,2) (12,8) (12,4)(12,2) (12,8) (12,4) (12,2)
MRI1 ∞ 60 48 ≈0 0.25 1.05 0.01 0.18 0.77
MRI2 ∞ 57 45 ≈0 0.36 1.49 0.01 0.23 0.86
Akiyo ∞ 59 47 ≈0 0.28 1.16 0.01 0.21 0.89
Stefan ∞ 57 45 ≈0 0.35 1.48 0.02 0.23 0.97
Suzie ∞ 57 45 ≈0 0.35 1.46 0.01 0.21 0.91
Bus ∞ 57 45 ≈0 0.35 1.49 0.02 0.23 0.93
Flower ∞ 57 45 ≈0 0.35 1.45 0.02 0.23 0.97
Mobile ∞ 57 45 ≈0 0.35 1.40 0.01 0.21 0.92
Average ∞ 58 45 ≈0 0.33 1.37 0.01 0.22 0.90
-
9
Figure8.TheoriginalandreconstructedMRI2usingbothArchitecturesforvariouswordlengthsizes
6.3DynamicPowerConsumptionThepowerconsumptioninFPGAisclassifiedintostaticand
dynamicpower.Thestaticpowermainlycomesfromleakagecurrent,whereaschargingswitchcapacitorsandshortcircuitcurrentsarethemainsourcesofdynamicpower;henceitcanbe minimised by switching capacitance reduction [53]. Thedynamicpowerconsumptionofthepresentedarchitecturesisshown in Figure 9. The power consumption has beencomputed using Xilinx Xpower analyser for various clock
frequencies and different wordlengths. The dynamic powerconsumption ishigher inDPDFAbyaround25-100mWthanSPDFA for selected operating frequencies and wordlengths.The reason behind that is the additional multipliers in thebutterflystagesandtheduplicationofsomeresourcesinthefirst twopostadditionstages.Thus,SPDFA isoutperformingDPDFAintermsofpowerconsumptionwhichmakesitabetterchoiceforlowpowerconsumptionapplications.
Figure9:Dynamicpowerconsumptionofbotharchitectures.
6.4ComparisontoSimilarWork
The throughput of both architectures is 1 coefficient perclockcycle; thus,'--clockcyclesareneeded tocomputeallthe3-DDCTcoefficientsofa'--worddatacube.Acomparisonbetween the presented and similar architectures in theliterature isshowninTable4.OfSPDFAandDPDFA,Table4shows that the first architecture outperforms the second intermsofareausageasitrequiresfewermultipliers,addersandregisters. The extra hardware DPDFA utilises is needed toperform the dual line computation of the 3-D DCT.Nevertheless, DPDFA is easily pipelined and it has a lowerlatency andmemory requirement than SPDFA. Thememoryrequirements forSPDFAandDPDFA,as listed inTable4,areusedfordatareorderingandBROonly.
Thenumberofmultipliersandaddersemployed,memoryrequirements,controllercircuitscomplexity,andcomputationtimeofthepresentedarchitectures,arealsocomparedtotherequirementsandperformanceofthearchitecturesin[38-43].As shown in Table 4. It can be seen that the presentedarchitectures require the lowest number of multipliers andmemory usageof all architectures.Only #$%&' and2#$%&'multipliersarerequiredtoperformthe3-DDCTcomputationusing SPDFA and DPDFA, respectively; for instance, Nmultipliersarerequiredin[41,42].Inaddition,exceptforthearchitecturesin[43],thepresentedarchitecturescarryoutthe3-D DCT computation with the lowest latency. Table 4 alsoshows the performance of various architectures in terms ofcomputation time; although the presented architecturesexhibita longercomputationtimethanthework in[39,43],this is largely balanced by the presented architectures lowhardwareusage. This improvement over similarwork in the
20
70
120
170
220
270
320
370
420
200 100 66.67 50 40 33.33D
ynam
ic p
ower
(mW
)Clock Frequency (MHz)
SPDFA (12,8)
DPDFA (12,8)
SPDFA (12,2)
DPDFA (12,2)
Reconstructed Image (One Image)
Reconstructed Image (One Image)
Reconstructed Image (One Image)
Reconstructed; SPDFA; (x,y) = (12,2)
Reconstructed; SPDFA; (x,y) = (12,4)
Reconstructed; SPDFA; (x,y) = (12,8)
Original Image (One Image)
Original MRI2
Reconstructed; DPDFA; (x,y) = (12,2)
Reconstructed; DPDFA; (x,y) = (12,4)
Reconstructed; DPDFA; (x,y) = (12,8)
Reconstructed Image (One Image)
Reconstructed Image (One Image)
Reconstructed Image (One Image)
-
10
literatureismainlyduetothefactthatunlikethearchitecturesin[38-43],thefocusisonemployingandregularisingthedataflowofafastalgorithmwhiletraditionalDCTarchitecturesarebasedonthedirectalgorithm[25,38-43];thishoweverisnotthe only benefit of using a VR approach, in fact the controlcircuitsattachedtothepresentedarchitecturesaresimpleasthere is no data transpose. This makes the controllercomplexitycomparabletothatofparalleldirectapproachesinin[38,42]. 7.Conclusions
This paper has presented two new 3-DDCT architecturesbasedona3-DDCTVRalgorithm.Theuseofafastalgorithmhasyieldedarchitectureswithimprovedprocessingspeedanda reduced hardware usage as they both require the lowestnumberofarithmeticcomponentsandmemoryrequirementamongknownarchitecturesintheliterature;atthesametime,sucharchitectures avoid theneed formemory transpositionandhenceareeasytoimplementandemployasimplecontrolcircuitry.Thepresentedarchitecturesareparameterisable interms of word and transform lengths and exhibit variouspowerconsumption,hardwareusage,processingspeedsandlevels of pipelining, which provides the designer withmoreflexibility and a larger choice when selecting the rightarchitecturefortheapplicationunderconsideration.
References[1] M. Ayinala and K. K. Parhi, "Parallel pipelined FFT architectures withreducednumberofdelays,"inProceedingsoftheACMGreatLakesSymposiumonVLSI(GLSVLSI),2012,pp.63-66.[2]O.Nibouche,S.Boussakta,M.Darnell,andM.Benaissa,"Algorithmsandpipeline architectures for 2-D FFT and FFT-like transforms," Digital SignalProcessing:AReviewJournal,vol.20,pp.1072-1086,2010.[3]M.Ayinala,M.Brown,andK.K.Parhi,"PipelinedparallelFFTarchitecturesviafoldingtransformation,"IEEETransactionsonVeryLargeScaleIntegration(VLSI)Systems,vol.20,pp.1068-1081,2012.[4]S. Saponara and B. Neri, "Radar Sensor Signal Acquisition andMultidimensional FFT Processing for Surveillance Applications in TransportSystems,"IEEETransactionsonInstrumentationandMeasurement,vol.66,pp.604-615,2017.[5]S.SaponaraandB.Neri,"Designofcompactandlow-powerX-bandRadarfor mobility surveillance applications," Computers & Electrical Engineering,vol.56,pp.46-63,2016.[6]A.Das,A.Hazra,andS.Banerjee,"Anefficientarchitecturefor3-Ddiscretewavelet transform," IEEE Transactions on Circuits and Systems for VideoTechnology,vol.20,pp.286-296,2010.[7]B. K.Mohanty and P. K.Meher, "Memory-efficient architecture for 3-DDWT using overlapped grouping of frames," IEEE Transactions on SignalProcessing,vol.59,pp.5605-5616,2011.[8]B. K. Mohanty and P. K. Meher, "Memory efficient modular VLSIarchitectureforhighthroughputandlow-latencyimplementationofmultilevellifting 2-DDWT," IEEE Transactions on Signal Processing,vol. 59, pp. 2072-2084,2011.[9]S. Al-Azawi, "Low-Power, Low-Area Multi-level 2-D Discrete WaveletTransformArchitecture,"Circuits,Systems,andSignalProcessing,vol.37,pp.444-458,2018.[10] R.E.Atani,M.Baboli,S.Mirzakuchaki,S.E.Atani,andB.Zamanlooy,"Design and implementation of a 118 MHz 2D DCT processor," in IEEEInternationalSymposiumonIndustrialElectronics,2008,pp.1076-1081.[11] M.JridiandA.Alfalou,"Alow-power,high-speedDCTarchitectureforimage compression: Principle and implementation," in 18th IEEE/IFIP VLSISystemonChipConference(VLSI-SoC),2010,pp.304-309.
[12] M.ElAakif,S.Belkouch,N.Chabini,andM.M.Hassani,"Lowpowerandfast DCT architecture usingmultiplier-lessmethod," in2011 Faible TensionFaibleConsommation(FTFC),2011,pp.63-66.[13] B.Z.Guo,L.Niu,andZ.M.Liu,"Implementationof2-DDCTbasedonFPGA," in Proceedings of SPIE - The International Society for OpticalEngineering,2010.[14] G. K. a. S. V. Khurram Bukhari, "DCT and IDCT implementations ondifferentFPGAtechnologies,"ComputerEngineeringLab,DelftUniversityofTechnology,2009[15] O.Nibouche,S.Boussakta,andM.Darnell,"PipelineArchitecturesforRadix-2NewMersenneNumberTransform,"IEEETransactionsonCircuitsandSystemsI:RegularPapers,vol.56,pp.1668-1680,2009.[16] H.L.P.A.Madanayake,R. J.Cintra,D.Onen,V.S.Dimitrov,andL.T.Bruton, "Algebraic integerbased8×82-DDCT architecture for digital videoprocessing,"inIEEEInternationalSymposiumonCircuitsandSystems(ISCAS),2011,pp.1247-1250.[17] A.M.Shams,A.Chidanandan,W.Pan,andM.A.Bayoumi, "NEDA:Alow-powerhigh-performanceDCTarchitecture," IEEETransactionsonSignalProcessing,vol.54,pp.955-964,2006.[18] E.D.KusumaandT.S.Widodo,"FPGAimplementationofpipelined2D-DCT and quantization architecture for JPEG image compression," in 2010InternationalSymposiuminInformationTechnology(ITSim),2010,pp.1-6.[19] S.Al-Azawi,Y.A.Abbas,andR.Jidin,"LowcomplexitymultidimensionalCDF5/3DWTarchitecture," inCommunicationSystems,Networks&DigitalSignalProcessing(CSNDSP),20149th InternationalSymposiumon,2014,pp.804-808.[20] G. K. Wallace, "The JPEG still picture compression standard," IEEETransactionsonConsumerElectronics,vol.38,pp.xviii-xxxiv,1992.[21] D. J. Le Gall, "The MPEG video compression standard," in CompconSpring'91:DigestofPapers,1991,pp.334-335.[22] W. Li, "Overview of fine granularity scalability in MPEG-4 videostandard,"IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.11,pp.301-317,2001.[23] A.MadisettiandA.N.Willson,Jr.,"DCT/IDCTprocessordesignforHDTVapplications,"inInternationalSymposiumonSignals,Systems,andElectronics(ISSSE'95),1995,pp.63-66.[24] T.Wiegand,G.J.Sullivan,G.Bjøntegaard,andA.Luthra,"Overviewofthe H.264/AVC video coding standard," IEEE Transactions on Circuits andSystemsforVideoTechnology,vol.13,pp.560-576,2003.[25] S.BoussaktaandH.O.Alshibami,"Fastalgorithmforthe3-DDCT-II,"IEEETransactionsonSignalProcessing,vol.52,pp.992-1001,2004.[26] X.Li,A.Dick,C.Shen,A.vandenHengel,andH.Wang,"Incrementallearningof3D-DCTcompactrepresentationsforrobustvisualtracking,"IEEETransactionsonPatternAnalysisandMachine Intelligence,vol.35,pp.863-881,2013.[27] S.SawantandD.A.Adjeroh,"Balancedmultipledescriptioncodingfor3DDCTvideo,"IEEETransactionsonBroadcasting,vol.57,pp.765-776,2011.[28] H.Y.Huang,C.H.Yang,andW.H.Hsu,"Avideowatermarkingtechniquebased on pseudo-3-D DCT and quantization index modulation," IEEETransactionsonInformationForensicsandSecurity,vol.5,pp.625-637,2010.[29] R. Atta and M. Ghanbari, "Spatio-temporal scalability-based motion-compensated3-Dsubband/DCTvideocoding," IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.16,pp.43-55,2006.[30] S. C. Chan and K. L. Ho, "Direct methods for computing discretesinusoidaltransforms," IEEProceedingsonRadarandSignalProcessing,vol.137,pp.433-442,1990.[31] S. An and C. Wang, "Recursive algorithm, architectures and FPGAimplementationofthetwo-dimensionaldiscretecosinetransform,"IET,ImageProcessingvol.2,pp.286-294,2008.[32] G. Jiun-In and L. Chih-Chen, "A generalized architecture for the one-dimensionaldiscretecosineandsinetransforms,"IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.11,pp.874-881,2001.[33] Z.Wu, J. Sha, Z.Wang, L. Li, andM. Gao, "An improved scaled DCTarchitecture,"IEEETransactionsonConsumerElectronics,vol.55,pp.685-689,2009.[34] C. Yuan-Ho and C. Tsin-Yuan, "A high performance video transformenginebyusing space-time scheduling strategy," IEEETransactionsonVeryLargeScaleIntegration(VLSI)Systemsvol.20,pp.655-664,2012.
-
11
[35] S. Al-Azawi, S. Boussakta, and A. Yakovlev, "High precision and lowpower DCT architectures for image compression applications," in IETConferenceonImageProcessing(IPR),2012,pp.1-6.[36] A.AggounandI.Jalloh,"Two-dimensionalDCT/IDCTarchitecture,"IEEProceedingsonComputersandDigitalTechniques,vol.150,pp.2-10,2003.[37] J. I. Guo, "Efficient parallel adder based design for one-dimensionaldiscretecosinetransform,"IEEProceedingsonCircuits,DevicesandSystems,vol.147,pp.276-282,2000.[38] I. Jalloh, A. Aggoun, and M. McCormick, "3D DCT architecture forcompressionof integral3D images," in IEEEWorkshoponSignalProcessingSystems,SiPS:DesignandImplementation,2000,pp.238-244.[39] A. Aggoun and I. Jalloh, "A parallel 3D DCT architecture for thecompressionofintegral3Dimages,"inThe8thIEEEInternationalConferenceonElectronics,CircuitsandSystems(ICECS)2001,pp.229-232vol.1.[40] M. Bakr and A. E. Salama, "Implementation of 3D-DCT based videoEncoder/Decoder system," inMidwest Symposiumon Circuits and Systems,2002,pp.II13-II16.[41] S.Saponara,L.Fanucci,andP.Terreni,"Low-powerVLSIarchitecturesfor3Ddiscretecosinetransform(DCT),"inIEEE46thMidwestSymposiumonCircuitsandSystems,2003,pp.1567-1570Vol.3.[42] S.Saponara,"Real-timeandlow-powerprocessingof3Ddirect/inversediscrete cosine transform for low-complexity video codec," JournalofReal-TimeImageProcessing,pp.1-11,2012.[43] Y. Ikegaki,T.Miyazaki,andS.G.Sedukhin,"3D-DCTprocessorand itsFPGA implementation," IEICETransactionson InformationandSystems,vol.E94-D,pp.1409-1418,2011.[44] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Three dimensional DCTsimilar butterfly algorithm and its pipeline architectures," in 2016 IEEEInformation Technology, Networking, Electronic and Automation ControlConference,2016,pp.506-510.[45] S. Al-Azawi, "Efficient Architectures for Multidimensional DiscreteTransforms in Image and Video Processing Applications," PhD Thesis,NewcastleUniversity,UK,2013.[46] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Device-saving pipelinearchitecturesofmulti-dimensionalDCTsimilarbutterflyalgorithm," in2016International Conference on Integrated Circuits and Microsystems (ICICM),2016,pp.339-344.[47] J. A. Nikara, J. H. Takala, and J. T. Astola, "Discrete cosine and sinetransforms—regularalgorithmsandpipelinearchitectures,"SignalProcessing,vol.86,pp.230-249,2006.[48] O.AlshibamiandS.Boussakta,"Fastalgorithmforthe3DDCT,"inIEEEInternational Conference on Acoustics, Speech, and Signal ProcessingProceedings(ICASSP'01),2001,pp.1945-1948vol.3.[49] M.C. Lee,R. K.W.Chan, andD.A.Adjeroh, "Fast three-dimensionaldiscretecosinetransform,"SIAMJournalonScientificComputing,vol.30,pp.3087-3107,2008.[50] Xiph.org, "Video Test Media [derf's collection]," Xiph.org, [Online].Available:https://media.xiph.org/video/derf/Accessedon20/09/2016.[51] Q.Huynh-Thu andM.Ghanbari, "The accuracy of PSNR in predictingvideoqualityfordifferentvideoscenesandframerates,"TelecommunicationSystems,vol.49,pp.35-48,2012/01/012012.[52] H. R. Sheikh, A. C. Bovik, and G. d. Veciana, "An information fidelitycriterion for image quality assessment using natural scene statistics," IEEETransactionsonImageProcessing,vol.14,pp.2117-2128,2005.[53] S.McKeownandR.Woods,"Lowpowerfieldprogrammablegatearrayimplementationoffastdigitalsignalprocessingalgorithms:Characterisationandmanipulationofdatalocality,"IETComputersandDigitalTechniques,vol.5,pp.136-144,2011.
SaadAl-AzawireceivedtheB.Sc.DegreeinElectrical Engineering from University ofBaghdadandM.Sc.degreeinElectronicandCommunication Engineering from Al-Mustansiriya University, Baghdad, Iraq. Hereceived his Ph.D. degree in Electrical andElectronic Engineering from Newcastle
University, Newcastle Upon Tyne, England, 2013. AssistantProfessor Dr. Saad is currently working as a head of the
departmentofElectronicEngineering,CollegeofEngineering,UniversityofDiyala,Diyala,Iraq.HisresearchinterestsincludeHardware architectures for signal and image processingalgorithms and transforms, Digital Image Processing, DigitalSignalProcessing.
Omar Nibouche received his BEng degree inElectronic Engineering form the PolytechnicSchool of Algiers and Ph.D. degree inComputer Science from Queen's UniversityBelfast. He is a lecturer in computing in theSchoolofComputingandMathematics,UlsterUniversity at Jordanstown. His research
interests include machine learning, applications of artificialintelligence,computervisionandbiometrics.
Said Boussakta received the PhD degree inElectrical Engineering from NewcastleUniversity, U.K., in 1990. Since 1990, he hasbeen working in academia, fully involved inboth research and teaching. From2000-2006hewasattheUniversityofLeedsasaReaderin Digital Communications and Signal
Processing. Since 2006, he has been with the School ofEngineering, Newcastle University as a Professor ofCommunications and Signal Processing, lecturing inCommunicationNetworksandSignalProcessingsubjects.Hehas supervised over 50 students to Ph.D. completion andpublished over 200 conference proceedings and journalarticles. His research interests are in the areas of fast DSPalgorithms,DigitalCommunications,CommunicationNetworkSystems, Cryptography, and Digital Signal/Image Processing.He has also served as Chair in conferences and presentedseveralinvitedtalksincommunications,signalprocessingandsecurity. Prof Boussakta is a Fellowof the IEE, and a SeniorMember of the Communications and Signal ProcessingSocieties.
Gaye Lightbody hasbeena Lecturerwithin the School of Computing inUlster University since 2006. ShereceivedanM.Eng.(1995)andaPhD(2000) in Electrical and ElectronicEngineering fromQueen’sUniversityofBelfast.HerPhD,HighPerformanceVLSIArchitecturesforRecursiveLeast
Squares Adaptive Filtering, involved the research anddevelopment into a scalable efficient architecture for highlyintensive adaptive beamforming applications. Gaye thenworked in industry from 2000 to 2006 for AmphionSemiconductorLimiteddevelopingintellectualpropertycoresforASICandFPGAsolutionsintheareasofaudio,imageandvideoprocessing.Sheisaco-authorofthebookFPGA-basedImplementation of Signal Processing Systems published byWiley in 2008. A second edition has been released in 2017which provides an update to reflect the latest iterations ofFPGA theory, applications, and technology. This revisionincludescoverageofFPGAsolutionsforBigDataApplications.
-
12
TABLE3.Hardwareutilisationrate,maximumoperatingfrequenciesandcomputationtimesforbotharchitecturesusingvariouswordlengths
SliceLogicUtilization Available
SPDFA DPDFA
HardwareusageforwordlengthsizesHardwareusageforwordlengthsizes
(12,8) (12,6) (12,4) (12,2) (12,8) (12,6) (12,4) (12,2)
Hardwareusage
NoofSliceRegisters 28,800 1840 1726 1593 1459 3691 3414 3120 2826
NoofSliceLUTs 28,800 2351 2171 1991 1811 3309 3044 2781 2517
NoofoccupiedSlices 7,200 743 607 588 605 1229 1096 1053 1008
NoofbondedIOBs 480 30 28 26 24 30 28 26 24
Noof36kBlockRAMused 60
1 - - - 1 - - -
Noof18kBlockRAMused 14 15 15 15 15 16 16 16NoofDSP48Es 48 9 8 8 8 16 12 12 12
Averageutilizationrate 12% 12% 11% 11% 18% 16% 15% 15%
Maximumfrequencies(MHz) 241 230 244 226 266 338 258 333
Computationtimesfor512×512×8-pixeldata(ms) 8.7 9.1 8.6 9.3 7.9 6.2 8.1 6.3
-
13
Table4.ComparisontoSimilarArchitecturesintheLiterature
Architectures Adders/Sub. Multipliers Memory Registers InitialLatencyComputationTime(cycles)
ControllerComplexity DCTAlgorithm
[38] 3" 3" "# " + 1 N/R* "& + 3" "& Simple Regular;Row-Column-Frame,cascaded
[39]
5"# + "2
5"#2
"&(transposememory)
"registerbetweenthe2-DDCTandthe1-D-DCT-
framedirection
"& + 32" "# Complex
Regular,Parallel;Row-Column-FrameN×N1-DDCT+1-DDCTforframedirection
[40] 3" − 3 3" "# " + 1 N/R > "# 6"&** Medium Regular1-DDCT;Row-Column-Frame[42]
FullParallel(FP) " 2" + 1 " 2" + 1 ∗∗∗ " "# + 1 N/R N/R 2"# Medium
1-DDCTRadix2Row-Column-FrameCascaded(CS) 3" 3"
∗∗∗ "# " + 1 N/R N/R 2"& Simple
HardwareMultiplexed(HM) " "
∗∗∗ "& N/R N/R 6"& Complex
[43] InputMem.
Sequential "& "& 2"& "& 3" + 4 "& 3" Complex
Regular1DDCT;Row-Column-Frame
Pipelined1 ≈ 2"& 2"& 2"& "& 3" + 4 "& 3" Complex
Piplined2 ≈ 3"& 3"& 2"& "& 3" + 8 "& 3" Complex
Block N&
8 N&8 2N
& ≈ 16N& 3N + 4 N& 3N Complex
SPDFA 12 + 6234#" 234#""4 + " "
#
ForreorderingandBRO
"& + 5"# + 5" + 4 ≅ 2"& "& Simple Vector-Radix3-DDCT
DPDFA 20 + 6234#" 2234#""&
ForreorderingandBRO
3"&2 + 6"
# + 11" + 8 ≅ 32"& "& Simple Vector-Radix3-DDCT
*N/R:Notreportedintheirpaper.**Computedfor4×4×4datablock.***Multiplicationisperformedbyaserialdistributedarithmeticarchitecture.