Research Article Multiple Memory Structure Bit Reversal...
Transcript of Research Article Multiple Memory Structure Bit Reversal...
Research ArticleMultiple Memory Structure Bit Reversal Algorithm Based onRecursive Patterns of Bit Reversal Permutation
K K L B Adikaram123 M A Hussein1 M Effenberger2 and T Becker1
1 Group of Bio-Process Analysis Technology Technische Universitat Munchen Weihenstephaner Steig 20 85354 Freising Germany2 Institut fur Landtechnik und Tierhaltung Vottinger Straszlige 36 85354 Freising Germany3 Computer Unit Faculty of Agriculture University of Ruhuna Mapalana 81100 Kamburupitiya Sri Lanka
Correspondence should be addressed to K K L B Adikaram lasanthadaad-alumnide
Received 6 April 2014 Revised 4 June 2014 Accepted 5 June 2014 Published 17 July 2014
Academic Editor Ker-Wei Yu
Copyright copy 2014 K K L B Adikaram et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited
With the increasing demand for onlineinline data processing efficient Fourier analysis becomes more and more relevant Due tothe fact that the bit reversal process requires considerable processing time of the Fast Fourier Transform (FFT) algorithm it is vitalto optimize the bit reversal algorithm (BRA)This paper is to introduce an efficient BRAwith multiple memory structures In 2009Elster showed the relation between the first and the second halves of the bit reversal permutation (BRP) and stated that it maycause serious impact on cache performance of the computer if implemented We found exceptions especially when the said indexmapping was implemented with multiple one-dimensional memory structures instead of multidimensional or one-dimensionalmemory structure Also we found a new index mapping even after the recursive splitting of BRP into equal sized slots The four-array and the four-vector versions of BRAwith new indexmapping reported 34 and 16 improvement in performance in relationto similar versions of Linear BRA of Elster which uses single one-dimensional memory structure
1 Introduction
The efficiency of a bit reversal algorithm (BRA) plays acritical role in the Fast Fourier Transform (FFT) processbecause it contributes 10 to 50 of total FFT process time[1] Therefore it is vital to optimize the BRA to achieve anefficient FFT algorithm In 2009 Elster showed the relationbetween the first and the second halves of the BRP [2]but did not implement it Elster stated that implementationof this relation may cause serious impact on cache perfor-mance of modern computers As Elster stated use of a two-dimensional memory structure to implement this relationreduced the efficiency of the bit reversal permutation (BRP)In contrast the efficiency of the BRA increased when a one-dimensional memory structure was used for index mappingWhen two equal sided one-dimensional memory structureswere used the performance was even much better than withsingle one-dimensional memory structure Also it was foundout that bit reversal permutation can be further split intoequal size blocks recursively up tomaximumof log
2(n) times
where 119899 is the number of samples and 119899 = 2119896 119896 isin 119885+ These
two findings motivate us to introduce a BRAwhich is capableof using 2119896 (119896 = 2 3 log 2(119899)) equal-sized (1198992119896) one-dimensional memory structures
In 1965 Cooley and Tukey introduced the FFT algorithmwhich is an efficient algorithm to compute the DiscreteFourier Transformation (DFT) and its inverse [3] FFT isa fast algorithm that has replaced the process of DFTwhich had been used frequently in the fields of signal andimage processing [4ndash7] The structure of the FFT algorithmpublished by Cooley and Tukey known as radix-2 algorithm[3] is the most popular one [5] There are several otheralgorithm structures as radix-4 radix-8 radix-16 mixed-radix and split-radix [8]
To apply FFT to a certain signal there are basically twomajor requirements The first requirement is 119899 = 119887
119873 where119899 is the number of samples of the signal 119873 isin 119885
+ and 119887 isthe selected radix structure for example 119887 = 2 4 8 and16 for radix-2 radix-4 radix-8 and radix-16 respectivelyThe second requirement is that the input (or output) samples
Hindawi Publishing CorporationMathematical Problems in EngineeringVolume 2014 Article ID 827509 12 pageshttpdxdoiorg1011552014827509
2 Mathematical Problems in Engineering
must be arranged according to a certain order to obtain thecorrect output [3 5 8 9]The BRA is used to create the orderof input or output permutation according to the requiredorder The BRA used in most FFT algorithms including theoriginal Cooley-Tukey algorithm [3] is known as bit reversalmethod (BRM)TheBRM is an operation for exchanging twoelements 119909(119896) and 119909(119896) of an array of length 119898 as shown in(1) and (2) respectively where 119886
119895are either 0 or 1 and 119887 is the
relevant base 2 4 8 or 16 depending on the selected radixstructure
119896 =
119898minus1
sum
119895=0
119886119895119887119895 (1)
119896 =
119898minus1
sum
119895=0
119886119895119887119898minus1minus119895
(2)
All the later algorithms for creating BRP were named as BRA(bit reversal algorithm) though they used other techniqueslike patterns of BRP instead of bit reversing techniques
During the last decades many publications addressednew BRAs [10] by improving the already existing originalBRA (BRM) or using totally different approaches In 1996Karp compared the performance of 30 different algorithms[10] against uniprocessor systems (computer system with asingle central processing unit) with different memory sys-tems Karp found that the performance of a BRA dependedon the memory architecture of the machine and the wayof accessing the memory Karp stated two hardware factsthat influence the BRA namely the memory architectureand the cache size of the machine According to Karp amachine with hierarchical memory is slower than a machinewith vector memory (computers with vector processor) andalgorithms do not perform well when array size is largerthan the cache size Also Karp pointed out four featuresof an algorithm that influence the BRA namely memoryaccess technique data reading sequence size of the indexof memory and type of arithmetic operations Accordingto Karp an algorithm that uses efficient memory accesstechniques is the fastest among algorithms with exactly thesame number of arithmetic operations Algorithms are fasterif (i) they require only a single pass over the data (ii) they useshort indexes and (iii) they operate with addition instead ofmultiplication
Karp especially mentioned that the algorithm publishedby Elster [11] in 1989 was different from other algorithmsbecause it used a pattern of BRP rather than improving deci-mal to binary and binary to decimal conversion According tothe findings of Karp Elsterrsquos ldquoLinear Bit Reversal Algorithmrdquo(LBRA) performs much better in most of the cases Thepublication of Elster (1989) [11] consists of two algorithmsto achieve BRP One algorithm used a pattern of BRP andthe other one used bit shifting operations Both algorithmsare interesting because they eliminate the conventional bitreversingmechanism which needmore computing timeThealgorithm by Rubio et al (BRA-Ru) of 2002 [12] is anotherapproach that uses an existing pattern of BRP However thepattern described in Rubiorsquos algorithm is different from the
pattern described in Elsterrsquos [11] In 2009 Elster and Meyerpublished an improved version of ldquoLinear Register-Level BitReversalrdquo which was published in 1989 [11] as ldquoElsterrsquos BitReversalrdquo (EBR) algorithm Elster mentioned it is possible togenerate the second half of the BRP by incrementing the rel-evant element of the first half by one Also Elster mentionedthere can be a serious impact on cache performance of thecomputer if the said pattern (Figure 1) is used
Programming languages provide different data structures[13] which handle memory in different ways In additionthe performance of the memory depends on machine archi-tecture and operating system [14] Therefore the efficiencyof the memory is the resultant output of the performancesof hardware operating system programming language andselected data structure
Based on the physical arrangement of memory elementsthere are two common ways of allocating memory for aseries of variables ldquoslot of continuousmemory elementsrdquo andldquocollection of noncontinuous memory elementsrdquo commonlyknown as ldquostackrdquo and ldquoheaprdquo [15] In most programminglanguages the term array is used to refer to a ldquoslot ofcontinuous memory elementsrdquo Arrays are the simplest andmost common type of data structure [16 17] and due tocontinuous physical arrangement of memory elements pro-vide faster access than ldquocollection of noncontinuous memoryelementsrdquo memory types However with the development ofprogramming languages different types of data structureswere introduced with very similar names to the standardnames like array stack and heap The names of new datastructures sometimes did not agree with the commonlyaccepted meaning ldquoStackrdquo ldquoArrayrdquo and ldquoArrayListrdquo providedby Microsoft Visual C++ (VC++) [18 19] are good examplesAccording to the commonly accepted meaning they shouldbe a ldquoslot of continuousmemory elementsrdquo but they are in facta ldquocollection of noncontinuousmemory elementsrdquoThereforeit is not a good practice to determine the performance of acertain data structure just by looking at its name To overcomethis ambiguous situation we use ldquoslot of continuous memoryelementsrdquo to refer to ldquoprimitive arrayrdquo (or array) typememorystructures
Due to the very flexible nature vector is the mostcommon one among the different types of data structures[14] Vector was introduced with C++ which is one of themost common and powerful programming languages whichhas been used since 1984 [20] However as most of otherdata structures the term ldquovectorrdquo is used to refer to memoryin computers with processor architecture called ldquovectorprocessorrdquo In this paper the term vector is used to refer tothe vector data structure that is used in the C++ interface
Indexmapping is a technique that can be used to improvethe efficiency of an algorithm by reducing the arithmetic loadof the algorithm [8] If 119899 [0119873 minus 1] and 119873 is not prime 119873can be defined as 119873 = prod
119899119894minus1
119894=0119873119894 where 119899
119894 [0119873
119894minus 1] lt
119873 This allows the usage of small ranges of 119899119894instead of large
range of 119899 andmaps a function119891(119899) into amultidimensionalfunction 1198911015840(119899
1 1198992 119899
119894)
There are two common methods of implementing indexmapping one-dimensional or multidimensional memory
Mathematical Problems in Engineering 3
Normal order
Reverse order Pattern
Dec Bin Bin Dec
0 0000 0000 0
1 0001 1000 8
2 0010 0100 4
3 0011 1100 12
4 0100 0010 2
5 0101 1010 10
6 0110 0110 6
7 0111 1110 14
8 1000 0001 1 0 + 1
9 1001 1000 9 8 + 110 1010 0101 5 4 + 1
11 1011 1101 13 12 + 1
12 1100 0011 3 2 + 1
13 1101 1011 11 10 + 1
14 1110 0111 7 6 + 1
15 1111 1111 15 14 + 1
Bloc
k1
ndash8el
emen
tsBl
ock2
ndash8el
emen
tsFigure 1 Relation between first and second halves of the BRP for 119899 = 16
structures In addition it is also possible to implement theindex mapping using several equal size one-dimensionalmemory structures However this option is not popular asit is uncomfortable for programming The performance ofmodern computers is highly dependent on the effectivenessof the cache memory of the CPU [4] To achieve the best per-formance of a series of memory elements the best practice istomaintain sequential access [4] Otherwise the effectivenessof the cache memory of the central processing unit (CPU)will be reduced Index mapping with multidimensional datastructures violates sequential access due to switching betweencolumnsrows and thus reduces the effectiveness of thecache memory Therefore it is generally accepted that theuse of a multidimensional data structure reduces computerperformance [4]
In this paper an efficient BRA is introduced to createthe BRP based on multiple memory structures and recursivepatterns of BRPThe findings of this paper show that the com-bination of multiple one-dimensional memory structuresindex mapping and the recessive pattern of BRP can be usedto improve the efficiency of BRA These findings are veryimportant to the field of signal processing as well as any fieldthat is involved in index mapping techniques
2 Material and Methods
21 New Algorithm (BRA-Split) Elster stated that it is pos-sible to generate the second half of the BRP by increment-ing items in the first half by one [2] (Figure 1) withoutchanging the order and the total number of calculations of
the algorithm Due to the recursive pattern of BRP it canbe further divided into equal size blocks by splitting eachblock recursively (Maximum log
2N times) After splitting 119904
times BRP is divided into 2119904 equal blocks each containing1198992119904 elements The relation between the elements in blocks is
given as follows
119861 (2119898+ 119895) [119894] = 119861 (119895) [119894] + 2
119878minus119898minus1 (3)
where 119861(2119898 + 119895)[119894] is 119894th element of block 119861(2119898 + 119895) and119898 =
0 1 119904 minus 1 119895 = 1 2119898Table 1 shows the relationship between elements in blocks
according to the index mapping shown in (3) after splittingBRP one time and two times for 119899 = 16 Depending on therequirement the number of splitting can be increased
22 Evaluation Process of New Algorithm To evaluate algo-rithms we used Windows 7 and Visual C++ 2012 on a PCwith multicore CPU (4 cores 8 logical processors) and 12GBmemory Detailed specifications of the PC and the softwareare given in Table 2 To eliminate limits of memory andaddress space related to the selected platform the compileroption ldquoLARGEADDRESSAWARErdquo was set [21] and plat-form was set to ldquox64rdquo All other options of the operatingsystem and the compiler were kept unchanged
The new algorithm was implemented using single one-dimensionalmemory structure and themost commonmulti-dimensional memory structure Furthermore the new BRAwas implemented using several equal size one-dimensionalmemory structures (multiple memory structure)
4 Mathematical Problems in Engineering
Table1Re
lationbetweenelem
entsof
16-elementB
RP(119899=16)for
split
=1and
split
=2
Normal
order
Reverse
order
Split
(119878)=1
Split
(119878=2)
Blocks
(12119904)
Index
ofthe
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=0)
119895=12119898
119898=0119895=1
119894=11198992119904minus1
119894=17
Blocks
(12119904)
Indexof
the
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=01)
119895=12119898
119898=0119895=1119898=1119895=12
119894=11198992119904minus1
119894=123
00
1
00
Initialized
00
Initialized
18
18=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d
11
8=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d2
42
4=4+0
24=4+0
312
312
=12
+0
312
=12
+0
42
42=2+0
02
Initialized
510
510
=10
+0
21
10=8+2
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
66
66=6+0
26=4+2
For119898
=0119895=1119894=123
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
714
714
=14
+0
314
=12
+2
119861(2)[119894]=119861(1)[119894]+2
81
2
01
Initialized
01
Initialized
99
19=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
For119898
=0119895=1119894=17
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
119861(2)[119894]=119861(1)[119894]+1
31
9=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
105
25=4+1
25=4+1
For119898
=1119895=1119894=123
119861(21+1)[119894]=119861(1)[119894]+22minus1minus1
1113
313
=12
+1
313
=12
+1
119861(3)[119894]=119861(1)[119894]+1
123
43=2+1
03
Initialized
1311
511=10
+1
41
11=10
+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
147
67=6+1
27=6+1
For119898
=1119895=2119894=123
119861(21+2)[119894]=119861(2)[119894]+22minus1minus1
1515
715
=14
+1
315
=14
+1
119861(4)[119894]=119861(2)[119894]+2
Mathematical Problems in Engineering 5
Table 2 Hardware and software specifications of the PC
Specifications
Processor Intel Core i7 CPU 870 293GHz (4 cores8 threads)
RAM 12GB DDR3-1333 2 channelsMemory bandwidth 21GBsL1 L2 and L3 cache 4 times 64KB 4 times 256KB and 8MB sharedL1 L2 and L3 cacheline size 64 bit
Brand and type Fujitsu CelsiusBIOS settings Default (hyper threading enabled)OS and service pack Windows 7 professional with service pack 1System type 64 bit operating systemOS settings DefaultVisual Studio 2012 Version 110507271 RTMRELNET Framework Version 4550709
Thenext taskwas to identify a suitable data structure fromdifferent types of available data structures We consideredseveral common techniques as summarized in Table 3 Datastructure 1 mentioned in Table 3 is not supporting dynamicmemory allocation (need to mention the size of the arraywhen array is being declared) For general bit reversalalgorithm it is a must to have dynamic memory allocation tocater different sizes of samples Even after setting the compileroption ldquoLARGEADDRESSAWARErdquo [21] data structures 3and 4 mentioned in Table 3 were not supported for accessingmemory greater than 2GB Therefore structures 1 3 and 4were rejected and memory structures 2 (array) and 5 (vector)were used to create all one-dimensional memory structuresThe same versions of array and vector were used to createmultidimensional memory structures
The new algorithm mentioned in Section 21 was imple-mented usingC++ in 24 types ofmemory structures as shownin Table 4 The performance of these algorithms was evalu-ated considering the ldquoclocks per elementrdquo (CPE) consumedbyeach algorithm To retrieve this value first average CPE foreach sample size of 2119899 where 119899 [21 31] (11 sample sizes) werecalculated after executing each algorithm 100 timesThis gave11 CPE representing each sample size Finally the combinedaverageof CPEwas calculated for each algorithmby averagingthose 11 values along with ldquocombined standard deviationrdquoThe combined average of CPE was considered as the CPEfor each algorithm The built-in ldquoclockrdquo function of C++ wasused to calculate the clocks Combined standard deviationwas calculated using the following
120590119888= (
sum119896
119904=1(1198991199041205902
119904+ 119899119904(119883119904minus 119883119888)
2
)
sum119896
119904=1119899119904
)
12
(4)
where119883119888= sum119896
119904=1119899119904119883119904sum119896
119904=1119899119904 119904 is the number of samples 119899
119904
is number of samples in each sample and 120590119904is the standard
deviation of each sample
Algorithms 1 2 and 3 illustrate the implementation ofnew BRA with single one-dimensional memory structuremultidimensional memory structure and multiple mem-ory structures respectively The algorithm illustrated inAlgorithm 1 (BRA Split 1 1A) was implemented using prim-itive array for split = 1 The algorithm BRM Split 2 4A(Algorithm 2) was implemented using vectors for split =2 The algorithm BRM Split 2 4A (Algorithm 3) was imple-mented using primitive array for split = 2 A sample per-mutation filling sequence of algorithms with single one-dimensional memory structures is illustrated in Figure 2Figure 3 illustrates a sample permutation filling sequence ofboth multidimensional and multiple memory structures
Secondly arithmetic operations per element (OPPE)werecalculated for each algorithm Arithmetic operations withineach algorithm were located in three regions of the codeinner FOR loop outer FOR loop and outside of the loopsThen the total number of operations (OP) can be defined as
OP = 1198701lowast 1198621+ 1198702lowast 1198622+ 1198623 (5)
where 1198621 1198622 and 119862
3are the number of operations in inner
FOR loop outer FOR loop and outside of the loops 1198701and
1198702are the number of iterations of outer loop and inner loop
Equation (5) can be represented as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621+
log2(NS)minus119904minus1
sum
119905=0
1198622+ 1198623 (6)
where NS is the number of samples and 119904 is the number ofsplits
The main contribution to calculations comes from theinner loop Comparing with the contribution of operationsin the inner loop the contribution of operations in rest ofthe code is very small For example consider the algorithmBRA Split 1 1A shown in Algorithm 1 As sample size is 2311198701lowast 1198621asymp 215 lowast 10
9lowast 1198621and 119870
2lowast 1198622+ 1198623asymp 1000 where
1198621gt 1 Therefore only the operations of inner loop were
considered for evaluation Then (6) can be simplified as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621 (7)
The ldquooperations per elementrdquo (OPPE) can be defined as
OPPE =(sum
log2(NS)minus119904minus1
119905=02119905lowast 1198621)
NS
(8)
For FFT always NS = 2119896 119896 isin 119885+Then from (7)
OPPE =(sum119896minus119904minus1
119905=02119905lowast 1198621)
2119896
(9)
119896
sum
119905=0
2119905+ 1 = 2
119896+1 (10)
6 Mathematical Problems in Engineering
Table 3 Common memory allocating methods that are used in Visual C++
Number Name Syntax Nature of memory layout1 Array int BRP[1000] Slot of continuous memory elements2 Array intlowast BRP = new int[119899] Slot of continuous memory elements3 Array array⟨int⟩andBRP = gcnew array⟨int⟩(119899) Collection of noncontinuous memory elements4 ArrayList ArrayList andBRP = gcnew ArrayList() Collection of noncontinuous memory elements5 Vector stdvector⟨int⟩ BRP(119899) Collection of noncontinuous memory elements
Table 4 Different versions of new BRA implemented with different data structures
Split (s) Data structure type AlgorithmSingle memory structure Multidimensional memory structure Multiple memory structure
1 Array (A) BRA Split 1 1A BRA Split 1 2DA BRA Split 1 2A1 Vector (V) BRA Split 1 1V BRA Split 1 2DV BRA Split 1 2V2 Array (A) BRA Split 2 1A BRA Split 2 4DA BRA Split 2 4A2 Vector (V) BRA Split 2 1V BRA Split 2 4DV BRA Split 2 4V3 Array (A) BRA Split 3 1A BRA Split 3 8DA BRA Split 3 8A3 Vector (V) BRA Split 3 1V BRA Split 3 8DV BRA Split 3 8V4 Array (A) BRA Split 4 1A BRA Split 4 16DA BRA Split 4 16A4 Vector (V) BRA Split 4 1V BRA Split 4 16DV BRA Split 4 16VNaming convention for algorithms ldquoBRA Splitrdquo + ltNumber of splitsgt + ltnature of memory structuregt xA xV x number of arrays and x number of vectorsxDA xDV single x-dimensional array and single x-dimensional vector
void mf BRM Split 1 1A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 1ui EB = ui N2ui L = 1unsigned intlowast BRP = new unsigned int[ui N]BRP[0] = 0BRP[ui EB] = 1for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[j] = BRP[j minus ui L] + ui NBRP[ui EB + j] = BRP[j] + 1
ui L = ui L + ui L
delete[] BRP
Algorithm 1 C++ implementation of new BRA with single array for split = 1 (BRA Split 1 1A)
Mathematical Problems in Engineering 7
Void mf BRM Split 2 4DV (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DLui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1stdvectorltstdvectorltunsigned intgtgt BRP(4
stdvectorltunsigned intgt(ui EB))BRP[0][0] = 0BRP[1][0] = 2BRP[2][0] = 1BRP[3][0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[0][j] = BRP[0][j minus ui L] + ui NBRP[1][j] = BRP[0][j] + 2BRP[2][j] = BRP[0][j] + 1BRP[3][j] = BRP[1][j] + 1
ui L = ui L + ui L
Algorithm 2 C++ implementation of new BRA with four arrays for split = 2 (BRM Split 2 4A)
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Initialization
First iteration Seventh (last) iteration
Figure 2 Permutation filling sequence of new BRA with single memory structure for 119899 = 16 and split =1 (BRA Split 1 1A)
For large 119896 2119896 + 1 asymp 2119896 Then
119896
sum
119905=0
2119905= 2119896+1
(11)
Because the considered sample size is 221 to 231 the value 119896can be considered as large Then from (9)
OPPE =1198621lowast (2119896minus119904)
2119896
(12)
OPPE = 1198621
2119904 (13)
According to (13) OPPE is 119891(1198621 119904) The value 119862
1(oper-
ations in inner loop) and the value 119904 (number of splits) are aconstant for a certain algorithm
To evaluate the performance of new BRA we selectedthree algorithms (LBRA EBR and BRA-Ru) which used apattern instead of conventional bit reversing method Theperformance of vector and array versions of the best version
Table 5 The number of operations in the inner loop (1198621) of each
algorithm for different splits (119904) (The total number of operations ineach algorithm asymp 119862
1)
Memory structure type Value of 1198621
119904 = 1 119904 = 2 119904 = 3 119904 = 4
Single one-dimension (array and vector) 8 15 30 61Multidimension (array and vector) 7 11 19 35Multiple one-dimension (array and vector) 7 11 19 35
of new BRA was compared with the relevant versions ofselected algorithms
3 Results and Discussion
Our objective was to introduce BRA using recursive patternof the BRP that we identified We used multiple memorystructures which is a feasible yet unpopular technique toimplement index mapping According to Table 5 the num-bers of operations in all the array and vector versions of both
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
2 Mathematical Problems in Engineering
must be arranged according to a certain order to obtain thecorrect output [3 5 8 9]The BRA is used to create the orderof input or output permutation according to the requiredorder The BRA used in most FFT algorithms including theoriginal Cooley-Tukey algorithm [3] is known as bit reversalmethod (BRM)TheBRM is an operation for exchanging twoelements 119909(119896) and 119909(119896) of an array of length 119898 as shown in(1) and (2) respectively where 119886
119895are either 0 or 1 and 119887 is the
relevant base 2 4 8 or 16 depending on the selected radixstructure
119896 =
119898minus1
sum
119895=0
119886119895119887119895 (1)
119896 =
119898minus1
sum
119895=0
119886119895119887119898minus1minus119895
(2)
All the later algorithms for creating BRP were named as BRA(bit reversal algorithm) though they used other techniqueslike patterns of BRP instead of bit reversing techniques
During the last decades many publications addressednew BRAs [10] by improving the already existing originalBRA (BRM) or using totally different approaches In 1996Karp compared the performance of 30 different algorithms[10] against uniprocessor systems (computer system with asingle central processing unit) with different memory sys-tems Karp found that the performance of a BRA dependedon the memory architecture of the machine and the wayof accessing the memory Karp stated two hardware factsthat influence the BRA namely the memory architectureand the cache size of the machine According to Karp amachine with hierarchical memory is slower than a machinewith vector memory (computers with vector processor) andalgorithms do not perform well when array size is largerthan the cache size Also Karp pointed out four featuresof an algorithm that influence the BRA namely memoryaccess technique data reading sequence size of the indexof memory and type of arithmetic operations Accordingto Karp an algorithm that uses efficient memory accesstechniques is the fastest among algorithms with exactly thesame number of arithmetic operations Algorithms are fasterif (i) they require only a single pass over the data (ii) they useshort indexes and (iii) they operate with addition instead ofmultiplication
Karp especially mentioned that the algorithm publishedby Elster [11] in 1989 was different from other algorithmsbecause it used a pattern of BRP rather than improving deci-mal to binary and binary to decimal conversion According tothe findings of Karp Elsterrsquos ldquoLinear Bit Reversal Algorithmrdquo(LBRA) performs much better in most of the cases Thepublication of Elster (1989) [11] consists of two algorithmsto achieve BRP One algorithm used a pattern of BRP andthe other one used bit shifting operations Both algorithmsare interesting because they eliminate the conventional bitreversingmechanism which needmore computing timeThealgorithm by Rubio et al (BRA-Ru) of 2002 [12] is anotherapproach that uses an existing pattern of BRP However thepattern described in Rubiorsquos algorithm is different from the
pattern described in Elsterrsquos [11] In 2009 Elster and Meyerpublished an improved version of ldquoLinear Register-Level BitReversalrdquo which was published in 1989 [11] as ldquoElsterrsquos BitReversalrdquo (EBR) algorithm Elster mentioned it is possible togenerate the second half of the BRP by incrementing the rel-evant element of the first half by one Also Elster mentionedthere can be a serious impact on cache performance of thecomputer if the said pattern (Figure 1) is used
Programming languages provide different data structures[13] which handle memory in different ways In additionthe performance of the memory depends on machine archi-tecture and operating system [14] Therefore the efficiencyof the memory is the resultant output of the performancesof hardware operating system programming language andselected data structure
Based on the physical arrangement of memory elementsthere are two common ways of allocating memory for aseries of variables ldquoslot of continuousmemory elementsrdquo andldquocollection of noncontinuous memory elementsrdquo commonlyknown as ldquostackrdquo and ldquoheaprdquo [15] In most programminglanguages the term array is used to refer to a ldquoslot ofcontinuous memory elementsrdquo Arrays are the simplest andmost common type of data structure [16 17] and due tocontinuous physical arrangement of memory elements pro-vide faster access than ldquocollection of noncontinuous memoryelementsrdquo memory types However with the development ofprogramming languages different types of data structureswere introduced with very similar names to the standardnames like array stack and heap The names of new datastructures sometimes did not agree with the commonlyaccepted meaning ldquoStackrdquo ldquoArrayrdquo and ldquoArrayListrdquo providedby Microsoft Visual C++ (VC++) [18 19] are good examplesAccording to the commonly accepted meaning they shouldbe a ldquoslot of continuousmemory elementsrdquo but they are in facta ldquocollection of noncontinuousmemory elementsrdquoThereforeit is not a good practice to determine the performance of acertain data structure just by looking at its name To overcomethis ambiguous situation we use ldquoslot of continuous memoryelementsrdquo to refer to ldquoprimitive arrayrdquo (or array) typememorystructures
Due to the very flexible nature vector is the mostcommon one among the different types of data structures[14] Vector was introduced with C++ which is one of themost common and powerful programming languages whichhas been used since 1984 [20] However as most of otherdata structures the term ldquovectorrdquo is used to refer to memoryin computers with processor architecture called ldquovectorprocessorrdquo In this paper the term vector is used to refer tothe vector data structure that is used in the C++ interface
Indexmapping is a technique that can be used to improvethe efficiency of an algorithm by reducing the arithmetic loadof the algorithm [8] If 119899 [0119873 minus 1] and 119873 is not prime 119873can be defined as 119873 = prod
119899119894minus1
119894=0119873119894 where 119899
119894 [0119873
119894minus 1] lt
119873 This allows the usage of small ranges of 119899119894instead of large
range of 119899 andmaps a function119891(119899) into amultidimensionalfunction 1198911015840(119899
1 1198992 119899
119894)
There are two common methods of implementing indexmapping one-dimensional or multidimensional memory
Mathematical Problems in Engineering 3
Normal order
Reverse order Pattern
Dec Bin Bin Dec
0 0000 0000 0
1 0001 1000 8
2 0010 0100 4
3 0011 1100 12
4 0100 0010 2
5 0101 1010 10
6 0110 0110 6
7 0111 1110 14
8 1000 0001 1 0 + 1
9 1001 1000 9 8 + 110 1010 0101 5 4 + 1
11 1011 1101 13 12 + 1
12 1100 0011 3 2 + 1
13 1101 1011 11 10 + 1
14 1110 0111 7 6 + 1
15 1111 1111 15 14 + 1
Bloc
k1
ndash8el
emen
tsBl
ock2
ndash8el
emen
tsFigure 1 Relation between first and second halves of the BRP for 119899 = 16
structures In addition it is also possible to implement theindex mapping using several equal size one-dimensionalmemory structures However this option is not popular asit is uncomfortable for programming The performance ofmodern computers is highly dependent on the effectivenessof the cache memory of the CPU [4] To achieve the best per-formance of a series of memory elements the best practice istomaintain sequential access [4] Otherwise the effectivenessof the cache memory of the central processing unit (CPU)will be reduced Index mapping with multidimensional datastructures violates sequential access due to switching betweencolumnsrows and thus reduces the effectiveness of thecache memory Therefore it is generally accepted that theuse of a multidimensional data structure reduces computerperformance [4]
In this paper an efficient BRA is introduced to createthe BRP based on multiple memory structures and recursivepatterns of BRPThe findings of this paper show that the com-bination of multiple one-dimensional memory structuresindex mapping and the recessive pattern of BRP can be usedto improve the efficiency of BRA These findings are veryimportant to the field of signal processing as well as any fieldthat is involved in index mapping techniques
2 Material and Methods
21 New Algorithm (BRA-Split) Elster stated that it is pos-sible to generate the second half of the BRP by increment-ing items in the first half by one [2] (Figure 1) withoutchanging the order and the total number of calculations of
the algorithm Due to the recursive pattern of BRP it canbe further divided into equal size blocks by splitting eachblock recursively (Maximum log
2N times) After splitting 119904
times BRP is divided into 2119904 equal blocks each containing1198992119904 elements The relation between the elements in blocks is
given as follows
119861 (2119898+ 119895) [119894] = 119861 (119895) [119894] + 2
119878minus119898minus1 (3)
where 119861(2119898 + 119895)[119894] is 119894th element of block 119861(2119898 + 119895) and119898 =
0 1 119904 minus 1 119895 = 1 2119898Table 1 shows the relationship between elements in blocks
according to the index mapping shown in (3) after splittingBRP one time and two times for 119899 = 16 Depending on therequirement the number of splitting can be increased
22 Evaluation Process of New Algorithm To evaluate algo-rithms we used Windows 7 and Visual C++ 2012 on a PCwith multicore CPU (4 cores 8 logical processors) and 12GBmemory Detailed specifications of the PC and the softwareare given in Table 2 To eliminate limits of memory andaddress space related to the selected platform the compileroption ldquoLARGEADDRESSAWARErdquo was set [21] and plat-form was set to ldquox64rdquo All other options of the operatingsystem and the compiler were kept unchanged
The new algorithm was implemented using single one-dimensionalmemory structure and themost commonmulti-dimensional memory structure Furthermore the new BRAwas implemented using several equal size one-dimensionalmemory structures (multiple memory structure)
4 Mathematical Problems in Engineering
Table1Re
lationbetweenelem
entsof
16-elementB
RP(119899=16)for
split
=1and
split
=2
Normal
order
Reverse
order
Split
(119878)=1
Split
(119878=2)
Blocks
(12119904)
Index
ofthe
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=0)
119895=12119898
119898=0119895=1
119894=11198992119904minus1
119894=17
Blocks
(12119904)
Indexof
the
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=01)
119895=12119898
119898=0119895=1119898=1119895=12
119894=11198992119904minus1
119894=123
00
1
00
Initialized
00
Initialized
18
18=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d
11
8=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d2
42
4=4+0
24=4+0
312
312
=12
+0
312
=12
+0
42
42=2+0
02
Initialized
510
510
=10
+0
21
10=8+2
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
66
66=6+0
26=4+2
For119898
=0119895=1119894=123
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
714
714
=14
+0
314
=12
+2
119861(2)[119894]=119861(1)[119894]+2
81
2
01
Initialized
01
Initialized
99
19=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
For119898
=0119895=1119894=17
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
119861(2)[119894]=119861(1)[119894]+1
31
9=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
105
25=4+1
25=4+1
For119898
=1119895=1119894=123
119861(21+1)[119894]=119861(1)[119894]+22minus1minus1
1113
313
=12
+1
313
=12
+1
119861(3)[119894]=119861(1)[119894]+1
123
43=2+1
03
Initialized
1311
511=10
+1
41
11=10
+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
147
67=6+1
27=6+1
For119898
=1119895=2119894=123
119861(21+2)[119894]=119861(2)[119894]+22minus1minus1
1515
715
=14
+1
315
=14
+1
119861(4)[119894]=119861(2)[119894]+2
Mathematical Problems in Engineering 5
Table 2 Hardware and software specifications of the PC
Specifications
Processor Intel Core i7 CPU 870 293GHz (4 cores8 threads)
RAM 12GB DDR3-1333 2 channelsMemory bandwidth 21GBsL1 L2 and L3 cache 4 times 64KB 4 times 256KB and 8MB sharedL1 L2 and L3 cacheline size 64 bit
Brand and type Fujitsu CelsiusBIOS settings Default (hyper threading enabled)OS and service pack Windows 7 professional with service pack 1System type 64 bit operating systemOS settings DefaultVisual Studio 2012 Version 110507271 RTMRELNET Framework Version 4550709
Thenext taskwas to identify a suitable data structure fromdifferent types of available data structures We consideredseveral common techniques as summarized in Table 3 Datastructure 1 mentioned in Table 3 is not supporting dynamicmemory allocation (need to mention the size of the arraywhen array is being declared) For general bit reversalalgorithm it is a must to have dynamic memory allocation tocater different sizes of samples Even after setting the compileroption ldquoLARGEADDRESSAWARErdquo [21] data structures 3and 4 mentioned in Table 3 were not supported for accessingmemory greater than 2GB Therefore structures 1 3 and 4were rejected and memory structures 2 (array) and 5 (vector)were used to create all one-dimensional memory structuresThe same versions of array and vector were used to createmultidimensional memory structures
The new algorithm mentioned in Section 21 was imple-mented usingC++ in 24 types ofmemory structures as shownin Table 4 The performance of these algorithms was evalu-ated considering the ldquoclocks per elementrdquo (CPE) consumedbyeach algorithm To retrieve this value first average CPE foreach sample size of 2119899 where 119899 [21 31] (11 sample sizes) werecalculated after executing each algorithm 100 timesThis gave11 CPE representing each sample size Finally the combinedaverageof CPEwas calculated for each algorithmby averagingthose 11 values along with ldquocombined standard deviationrdquoThe combined average of CPE was considered as the CPEfor each algorithm The built-in ldquoclockrdquo function of C++ wasused to calculate the clocks Combined standard deviationwas calculated using the following
120590119888= (
sum119896
119904=1(1198991199041205902
119904+ 119899119904(119883119904minus 119883119888)
2
)
sum119896
119904=1119899119904
)
12
(4)
where119883119888= sum119896
119904=1119899119904119883119904sum119896
119904=1119899119904 119904 is the number of samples 119899
119904
is number of samples in each sample and 120590119904is the standard
deviation of each sample
Algorithms 1 2 and 3 illustrate the implementation ofnew BRA with single one-dimensional memory structuremultidimensional memory structure and multiple mem-ory structures respectively The algorithm illustrated inAlgorithm 1 (BRA Split 1 1A) was implemented using prim-itive array for split = 1 The algorithm BRM Split 2 4A(Algorithm 2) was implemented using vectors for split =2 The algorithm BRM Split 2 4A (Algorithm 3) was imple-mented using primitive array for split = 2 A sample per-mutation filling sequence of algorithms with single one-dimensional memory structures is illustrated in Figure 2Figure 3 illustrates a sample permutation filling sequence ofboth multidimensional and multiple memory structures
Secondly arithmetic operations per element (OPPE)werecalculated for each algorithm Arithmetic operations withineach algorithm were located in three regions of the codeinner FOR loop outer FOR loop and outside of the loopsThen the total number of operations (OP) can be defined as
OP = 1198701lowast 1198621+ 1198702lowast 1198622+ 1198623 (5)
where 1198621 1198622 and 119862
3are the number of operations in inner
FOR loop outer FOR loop and outside of the loops 1198701and
1198702are the number of iterations of outer loop and inner loop
Equation (5) can be represented as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621+
log2(NS)minus119904minus1
sum
119905=0
1198622+ 1198623 (6)
where NS is the number of samples and 119904 is the number ofsplits
The main contribution to calculations comes from theinner loop Comparing with the contribution of operationsin the inner loop the contribution of operations in rest ofthe code is very small For example consider the algorithmBRA Split 1 1A shown in Algorithm 1 As sample size is 2311198701lowast 1198621asymp 215 lowast 10
9lowast 1198621and 119870
2lowast 1198622+ 1198623asymp 1000 where
1198621gt 1 Therefore only the operations of inner loop were
considered for evaluation Then (6) can be simplified as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621 (7)
The ldquooperations per elementrdquo (OPPE) can be defined as
OPPE =(sum
log2(NS)minus119904minus1
119905=02119905lowast 1198621)
NS
(8)
For FFT always NS = 2119896 119896 isin 119885+Then from (7)
OPPE =(sum119896minus119904minus1
119905=02119905lowast 1198621)
2119896
(9)
119896
sum
119905=0
2119905+ 1 = 2
119896+1 (10)
6 Mathematical Problems in Engineering
Table 3 Common memory allocating methods that are used in Visual C++
Number Name Syntax Nature of memory layout1 Array int BRP[1000] Slot of continuous memory elements2 Array intlowast BRP = new int[119899] Slot of continuous memory elements3 Array array⟨int⟩andBRP = gcnew array⟨int⟩(119899) Collection of noncontinuous memory elements4 ArrayList ArrayList andBRP = gcnew ArrayList() Collection of noncontinuous memory elements5 Vector stdvector⟨int⟩ BRP(119899) Collection of noncontinuous memory elements
Table 4 Different versions of new BRA implemented with different data structures
Split (s) Data structure type AlgorithmSingle memory structure Multidimensional memory structure Multiple memory structure
1 Array (A) BRA Split 1 1A BRA Split 1 2DA BRA Split 1 2A1 Vector (V) BRA Split 1 1V BRA Split 1 2DV BRA Split 1 2V2 Array (A) BRA Split 2 1A BRA Split 2 4DA BRA Split 2 4A2 Vector (V) BRA Split 2 1V BRA Split 2 4DV BRA Split 2 4V3 Array (A) BRA Split 3 1A BRA Split 3 8DA BRA Split 3 8A3 Vector (V) BRA Split 3 1V BRA Split 3 8DV BRA Split 3 8V4 Array (A) BRA Split 4 1A BRA Split 4 16DA BRA Split 4 16A4 Vector (V) BRA Split 4 1V BRA Split 4 16DV BRA Split 4 16VNaming convention for algorithms ldquoBRA Splitrdquo + ltNumber of splitsgt + ltnature of memory structuregt xA xV x number of arrays and x number of vectorsxDA xDV single x-dimensional array and single x-dimensional vector
void mf BRM Split 1 1A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 1ui EB = ui N2ui L = 1unsigned intlowast BRP = new unsigned int[ui N]BRP[0] = 0BRP[ui EB] = 1for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[j] = BRP[j minus ui L] + ui NBRP[ui EB + j] = BRP[j] + 1
ui L = ui L + ui L
delete[] BRP
Algorithm 1 C++ implementation of new BRA with single array for split = 1 (BRA Split 1 1A)
Mathematical Problems in Engineering 7
Void mf BRM Split 2 4DV (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DLui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1stdvectorltstdvectorltunsigned intgtgt BRP(4
stdvectorltunsigned intgt(ui EB))BRP[0][0] = 0BRP[1][0] = 2BRP[2][0] = 1BRP[3][0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[0][j] = BRP[0][j minus ui L] + ui NBRP[1][j] = BRP[0][j] + 2BRP[2][j] = BRP[0][j] + 1BRP[3][j] = BRP[1][j] + 1
ui L = ui L + ui L
Algorithm 2 C++ implementation of new BRA with four arrays for split = 2 (BRM Split 2 4A)
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Initialization
First iteration Seventh (last) iteration
Figure 2 Permutation filling sequence of new BRA with single memory structure for 119899 = 16 and split =1 (BRA Split 1 1A)
For large 119896 2119896 + 1 asymp 2119896 Then
119896
sum
119905=0
2119905= 2119896+1
(11)
Because the considered sample size is 221 to 231 the value 119896can be considered as large Then from (9)
OPPE =1198621lowast (2119896minus119904)
2119896
(12)
OPPE = 1198621
2119904 (13)
According to (13) OPPE is 119891(1198621 119904) The value 119862
1(oper-
ations in inner loop) and the value 119904 (number of splits) are aconstant for a certain algorithm
To evaluate the performance of new BRA we selectedthree algorithms (LBRA EBR and BRA-Ru) which used apattern instead of conventional bit reversing method Theperformance of vector and array versions of the best version
Table 5 The number of operations in the inner loop (1198621) of each
algorithm for different splits (119904) (The total number of operations ineach algorithm asymp 119862
1)
Memory structure type Value of 1198621
119904 = 1 119904 = 2 119904 = 3 119904 = 4
Single one-dimension (array and vector) 8 15 30 61Multidimension (array and vector) 7 11 19 35Multiple one-dimension (array and vector) 7 11 19 35
of new BRA was compared with the relevant versions ofselected algorithms
3 Results and Discussion
Our objective was to introduce BRA using recursive patternof the BRP that we identified We used multiple memorystructures which is a feasible yet unpopular technique toimplement index mapping According to Table 5 the num-bers of operations in all the array and vector versions of both
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 3
Normal order
Reverse order Pattern
Dec Bin Bin Dec
0 0000 0000 0
1 0001 1000 8
2 0010 0100 4
3 0011 1100 12
4 0100 0010 2
5 0101 1010 10
6 0110 0110 6
7 0111 1110 14
8 1000 0001 1 0 + 1
9 1001 1000 9 8 + 110 1010 0101 5 4 + 1
11 1011 1101 13 12 + 1
12 1100 0011 3 2 + 1
13 1101 1011 11 10 + 1
14 1110 0111 7 6 + 1
15 1111 1111 15 14 + 1
Bloc
k1
ndash8el
emen
tsBl
ock2
ndash8el
emen
tsFigure 1 Relation between first and second halves of the BRP for 119899 = 16
structures In addition it is also possible to implement theindex mapping using several equal size one-dimensionalmemory structures However this option is not popular asit is uncomfortable for programming The performance ofmodern computers is highly dependent on the effectivenessof the cache memory of the CPU [4] To achieve the best per-formance of a series of memory elements the best practice istomaintain sequential access [4] Otherwise the effectivenessof the cache memory of the central processing unit (CPU)will be reduced Index mapping with multidimensional datastructures violates sequential access due to switching betweencolumnsrows and thus reduces the effectiveness of thecache memory Therefore it is generally accepted that theuse of a multidimensional data structure reduces computerperformance [4]
In this paper an efficient BRA is introduced to createthe BRP based on multiple memory structures and recursivepatterns of BRPThe findings of this paper show that the com-bination of multiple one-dimensional memory structuresindex mapping and the recessive pattern of BRP can be usedto improve the efficiency of BRA These findings are veryimportant to the field of signal processing as well as any fieldthat is involved in index mapping techniques
2 Material and Methods
21 New Algorithm (BRA-Split) Elster stated that it is pos-sible to generate the second half of the BRP by increment-ing items in the first half by one [2] (Figure 1) withoutchanging the order and the total number of calculations of
the algorithm Due to the recursive pattern of BRP it canbe further divided into equal size blocks by splitting eachblock recursively (Maximum log
2N times) After splitting 119904
times BRP is divided into 2119904 equal blocks each containing1198992119904 elements The relation between the elements in blocks is
given as follows
119861 (2119898+ 119895) [119894] = 119861 (119895) [119894] + 2
119878minus119898minus1 (3)
where 119861(2119898 + 119895)[119894] is 119894th element of block 119861(2119898 + 119895) and119898 =
0 1 119904 minus 1 119895 = 1 2119898Table 1 shows the relationship between elements in blocks
according to the index mapping shown in (3) after splittingBRP one time and two times for 119899 = 16 Depending on therequirement the number of splitting can be increased
22 Evaluation Process of New Algorithm To evaluate algo-rithms we used Windows 7 and Visual C++ 2012 on a PCwith multicore CPU (4 cores 8 logical processors) and 12GBmemory Detailed specifications of the PC and the softwareare given in Table 2 To eliminate limits of memory andaddress space related to the selected platform the compileroption ldquoLARGEADDRESSAWARErdquo was set [21] and plat-form was set to ldquox64rdquo All other options of the operatingsystem and the compiler were kept unchanged
The new algorithm was implemented using single one-dimensionalmemory structure and themost commonmulti-dimensional memory structure Furthermore the new BRAwas implemented using several equal size one-dimensionalmemory structures (multiple memory structure)
4 Mathematical Problems in Engineering
Table1Re
lationbetweenelem
entsof
16-elementB
RP(119899=16)for
split
=1and
split
=2
Normal
order
Reverse
order
Split
(119878)=1
Split
(119878=2)
Blocks
(12119904)
Index
ofthe
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=0)
119895=12119898
119898=0119895=1
119894=11198992119904minus1
119894=17
Blocks
(12119904)
Indexof
the
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=01)
119895=12119898
119898=0119895=1119898=1119895=12
119894=11198992119904minus1
119894=123
00
1
00
Initialized
00
Initialized
18
18=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d
11
8=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d2
42
4=4+0
24=4+0
312
312
=12
+0
312
=12
+0
42
42=2+0
02
Initialized
510
510
=10
+0
21
10=8+2
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
66
66=6+0
26=4+2
For119898
=0119895=1119894=123
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
714
714
=14
+0
314
=12
+2
119861(2)[119894]=119861(1)[119894]+2
81
2
01
Initialized
01
Initialized
99
19=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
For119898
=0119895=1119894=17
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
119861(2)[119894]=119861(1)[119894]+1
31
9=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
105
25=4+1
25=4+1
For119898
=1119895=1119894=123
119861(21+1)[119894]=119861(1)[119894]+22minus1minus1
1113
313
=12
+1
313
=12
+1
119861(3)[119894]=119861(1)[119894]+1
123
43=2+1
03
Initialized
1311
511=10
+1
41
11=10
+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
147
67=6+1
27=6+1
For119898
=1119895=2119894=123
119861(21+2)[119894]=119861(2)[119894]+22minus1minus1
1515
715
=14
+1
315
=14
+1
119861(4)[119894]=119861(2)[119894]+2
Mathematical Problems in Engineering 5
Table 2 Hardware and software specifications of the PC
Specifications
Processor Intel Core i7 CPU 870 293GHz (4 cores8 threads)
RAM 12GB DDR3-1333 2 channelsMemory bandwidth 21GBsL1 L2 and L3 cache 4 times 64KB 4 times 256KB and 8MB sharedL1 L2 and L3 cacheline size 64 bit
Brand and type Fujitsu CelsiusBIOS settings Default (hyper threading enabled)OS and service pack Windows 7 professional with service pack 1System type 64 bit operating systemOS settings DefaultVisual Studio 2012 Version 110507271 RTMRELNET Framework Version 4550709
Thenext taskwas to identify a suitable data structure fromdifferent types of available data structures We consideredseveral common techniques as summarized in Table 3 Datastructure 1 mentioned in Table 3 is not supporting dynamicmemory allocation (need to mention the size of the arraywhen array is being declared) For general bit reversalalgorithm it is a must to have dynamic memory allocation tocater different sizes of samples Even after setting the compileroption ldquoLARGEADDRESSAWARErdquo [21] data structures 3and 4 mentioned in Table 3 were not supported for accessingmemory greater than 2GB Therefore structures 1 3 and 4were rejected and memory structures 2 (array) and 5 (vector)were used to create all one-dimensional memory structuresThe same versions of array and vector were used to createmultidimensional memory structures
The new algorithm mentioned in Section 21 was imple-mented usingC++ in 24 types ofmemory structures as shownin Table 4 The performance of these algorithms was evalu-ated considering the ldquoclocks per elementrdquo (CPE) consumedbyeach algorithm To retrieve this value first average CPE foreach sample size of 2119899 where 119899 [21 31] (11 sample sizes) werecalculated after executing each algorithm 100 timesThis gave11 CPE representing each sample size Finally the combinedaverageof CPEwas calculated for each algorithmby averagingthose 11 values along with ldquocombined standard deviationrdquoThe combined average of CPE was considered as the CPEfor each algorithm The built-in ldquoclockrdquo function of C++ wasused to calculate the clocks Combined standard deviationwas calculated using the following
120590119888= (
sum119896
119904=1(1198991199041205902
119904+ 119899119904(119883119904minus 119883119888)
2
)
sum119896
119904=1119899119904
)
12
(4)
where119883119888= sum119896
119904=1119899119904119883119904sum119896
119904=1119899119904 119904 is the number of samples 119899
119904
is number of samples in each sample and 120590119904is the standard
deviation of each sample
Algorithms 1 2 and 3 illustrate the implementation ofnew BRA with single one-dimensional memory structuremultidimensional memory structure and multiple mem-ory structures respectively The algorithm illustrated inAlgorithm 1 (BRA Split 1 1A) was implemented using prim-itive array for split = 1 The algorithm BRM Split 2 4A(Algorithm 2) was implemented using vectors for split =2 The algorithm BRM Split 2 4A (Algorithm 3) was imple-mented using primitive array for split = 2 A sample per-mutation filling sequence of algorithms with single one-dimensional memory structures is illustrated in Figure 2Figure 3 illustrates a sample permutation filling sequence ofboth multidimensional and multiple memory structures
Secondly arithmetic operations per element (OPPE)werecalculated for each algorithm Arithmetic operations withineach algorithm were located in three regions of the codeinner FOR loop outer FOR loop and outside of the loopsThen the total number of operations (OP) can be defined as
OP = 1198701lowast 1198621+ 1198702lowast 1198622+ 1198623 (5)
where 1198621 1198622 and 119862
3are the number of operations in inner
FOR loop outer FOR loop and outside of the loops 1198701and
1198702are the number of iterations of outer loop and inner loop
Equation (5) can be represented as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621+
log2(NS)minus119904minus1
sum
119905=0
1198622+ 1198623 (6)
where NS is the number of samples and 119904 is the number ofsplits
The main contribution to calculations comes from theinner loop Comparing with the contribution of operationsin the inner loop the contribution of operations in rest ofthe code is very small For example consider the algorithmBRA Split 1 1A shown in Algorithm 1 As sample size is 2311198701lowast 1198621asymp 215 lowast 10
9lowast 1198621and 119870
2lowast 1198622+ 1198623asymp 1000 where
1198621gt 1 Therefore only the operations of inner loop were
considered for evaluation Then (6) can be simplified as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621 (7)
The ldquooperations per elementrdquo (OPPE) can be defined as
OPPE =(sum
log2(NS)minus119904minus1
119905=02119905lowast 1198621)
NS
(8)
For FFT always NS = 2119896 119896 isin 119885+Then from (7)
OPPE =(sum119896minus119904minus1
119905=02119905lowast 1198621)
2119896
(9)
119896
sum
119905=0
2119905+ 1 = 2
119896+1 (10)
6 Mathematical Problems in Engineering
Table 3 Common memory allocating methods that are used in Visual C++
Number Name Syntax Nature of memory layout1 Array int BRP[1000] Slot of continuous memory elements2 Array intlowast BRP = new int[119899] Slot of continuous memory elements3 Array array⟨int⟩andBRP = gcnew array⟨int⟩(119899) Collection of noncontinuous memory elements4 ArrayList ArrayList andBRP = gcnew ArrayList() Collection of noncontinuous memory elements5 Vector stdvector⟨int⟩ BRP(119899) Collection of noncontinuous memory elements
Table 4 Different versions of new BRA implemented with different data structures
Split (s) Data structure type AlgorithmSingle memory structure Multidimensional memory structure Multiple memory structure
1 Array (A) BRA Split 1 1A BRA Split 1 2DA BRA Split 1 2A1 Vector (V) BRA Split 1 1V BRA Split 1 2DV BRA Split 1 2V2 Array (A) BRA Split 2 1A BRA Split 2 4DA BRA Split 2 4A2 Vector (V) BRA Split 2 1V BRA Split 2 4DV BRA Split 2 4V3 Array (A) BRA Split 3 1A BRA Split 3 8DA BRA Split 3 8A3 Vector (V) BRA Split 3 1V BRA Split 3 8DV BRA Split 3 8V4 Array (A) BRA Split 4 1A BRA Split 4 16DA BRA Split 4 16A4 Vector (V) BRA Split 4 1V BRA Split 4 16DV BRA Split 4 16VNaming convention for algorithms ldquoBRA Splitrdquo + ltNumber of splitsgt + ltnature of memory structuregt xA xV x number of arrays and x number of vectorsxDA xDV single x-dimensional array and single x-dimensional vector
void mf BRM Split 1 1A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 1ui EB = ui N2ui L = 1unsigned intlowast BRP = new unsigned int[ui N]BRP[0] = 0BRP[ui EB] = 1for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[j] = BRP[j minus ui L] + ui NBRP[ui EB + j] = BRP[j] + 1
ui L = ui L + ui L
delete[] BRP
Algorithm 1 C++ implementation of new BRA with single array for split = 1 (BRA Split 1 1A)
Mathematical Problems in Engineering 7
Void mf BRM Split 2 4DV (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DLui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1stdvectorltstdvectorltunsigned intgtgt BRP(4
stdvectorltunsigned intgt(ui EB))BRP[0][0] = 0BRP[1][0] = 2BRP[2][0] = 1BRP[3][0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[0][j] = BRP[0][j minus ui L] + ui NBRP[1][j] = BRP[0][j] + 2BRP[2][j] = BRP[0][j] + 1BRP[3][j] = BRP[1][j] + 1
ui L = ui L + ui L
Algorithm 2 C++ implementation of new BRA with four arrays for split = 2 (BRM Split 2 4A)
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Initialization
First iteration Seventh (last) iteration
Figure 2 Permutation filling sequence of new BRA with single memory structure for 119899 = 16 and split =1 (BRA Split 1 1A)
For large 119896 2119896 + 1 asymp 2119896 Then
119896
sum
119905=0
2119905= 2119896+1
(11)
Because the considered sample size is 221 to 231 the value 119896can be considered as large Then from (9)
OPPE =1198621lowast (2119896minus119904)
2119896
(12)
OPPE = 1198621
2119904 (13)
According to (13) OPPE is 119891(1198621 119904) The value 119862
1(oper-
ations in inner loop) and the value 119904 (number of splits) are aconstant for a certain algorithm
To evaluate the performance of new BRA we selectedthree algorithms (LBRA EBR and BRA-Ru) which used apattern instead of conventional bit reversing method Theperformance of vector and array versions of the best version
Table 5 The number of operations in the inner loop (1198621) of each
algorithm for different splits (119904) (The total number of operations ineach algorithm asymp 119862
1)
Memory structure type Value of 1198621
119904 = 1 119904 = 2 119904 = 3 119904 = 4
Single one-dimension (array and vector) 8 15 30 61Multidimension (array and vector) 7 11 19 35Multiple one-dimension (array and vector) 7 11 19 35
of new BRA was compared with the relevant versions ofselected algorithms
3 Results and Discussion
Our objective was to introduce BRA using recursive patternof the BRP that we identified We used multiple memorystructures which is a feasible yet unpopular technique toimplement index mapping According to Table 5 the num-bers of operations in all the array and vector versions of both
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
4 Mathematical Problems in Engineering
Table1Re
lationbetweenelem
entsof
16-elementB
RP(119899=16)for
split
=1and
split
=2
Normal
order
Reverse
order
Split
(119878)=1
Split
(119878=2)
Blocks
(12119904)
Index
ofthe
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=0)
119895=12119898
119898=0119895=1
119894=11198992119904minus1
119894=17
Blocks
(12119904)
Indexof
the
block119894
Reverseo
rder
calculation
Calculationmetho
d119898=0119904minus1(119898=01)
119895=12119898
119898=0119895=1119898=1119895=12
119894=11198992119904minus1
119894=123
00
1
00
Initialized
00
Initialized
18
18=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d
11
8=8+0
Use
Elste
rrsquosLinear
BitR
eversalm
etho
d2
42
4=4+0
24=4+0
312
312
=12
+0
312
=12
+0
42
42=2+0
02
Initialized
510
510
=10
+0
21
10=8+2
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
66
66=6+0
26=4+2
For119898
=0119895=1119894=123
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
714
714
=14
+0
314
=12
+2
119861(2)[119894]=119861(1)[119894]+2
81
2
01
Initialized
01
Initialized
99
19=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
For119898
=0119895=1119894=17
119861(20+1)[119894]=119861(1)[119894]+22minus0minus1
119861(2)[119894]=119861(1)[119894]+1
31
9=8+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
105
25=4+1
25=4+1
For119898
=1119895=1119894=123
119861(21+1)[119894]=119861(1)[119894]+22minus1minus1
1113
313
=12
+1
313
=12
+1
119861(3)[119894]=119861(1)[119894]+1
123
43=2+1
03
Initialized
1311
511=10
+1
41
11=10
+1
119861(2119898+119895)[119894]=119861(119895)[119894]+2119904minus119898minus1
147
67=6+1
27=6+1
For119898
=1119895=2119894=123
119861(21+2)[119894]=119861(2)[119894]+22minus1minus1
1515
715
=14
+1
315
=14
+1
119861(4)[119894]=119861(2)[119894]+2
Mathematical Problems in Engineering 5
Table 2 Hardware and software specifications of the PC
Specifications
Processor Intel Core i7 CPU 870 293GHz (4 cores8 threads)
RAM 12GB DDR3-1333 2 channelsMemory bandwidth 21GBsL1 L2 and L3 cache 4 times 64KB 4 times 256KB and 8MB sharedL1 L2 and L3 cacheline size 64 bit
Brand and type Fujitsu CelsiusBIOS settings Default (hyper threading enabled)OS and service pack Windows 7 professional with service pack 1System type 64 bit operating systemOS settings DefaultVisual Studio 2012 Version 110507271 RTMRELNET Framework Version 4550709
Thenext taskwas to identify a suitable data structure fromdifferent types of available data structures We consideredseveral common techniques as summarized in Table 3 Datastructure 1 mentioned in Table 3 is not supporting dynamicmemory allocation (need to mention the size of the arraywhen array is being declared) For general bit reversalalgorithm it is a must to have dynamic memory allocation tocater different sizes of samples Even after setting the compileroption ldquoLARGEADDRESSAWARErdquo [21] data structures 3and 4 mentioned in Table 3 were not supported for accessingmemory greater than 2GB Therefore structures 1 3 and 4were rejected and memory structures 2 (array) and 5 (vector)were used to create all one-dimensional memory structuresThe same versions of array and vector were used to createmultidimensional memory structures
The new algorithm mentioned in Section 21 was imple-mented usingC++ in 24 types ofmemory structures as shownin Table 4 The performance of these algorithms was evalu-ated considering the ldquoclocks per elementrdquo (CPE) consumedbyeach algorithm To retrieve this value first average CPE foreach sample size of 2119899 where 119899 [21 31] (11 sample sizes) werecalculated after executing each algorithm 100 timesThis gave11 CPE representing each sample size Finally the combinedaverageof CPEwas calculated for each algorithmby averagingthose 11 values along with ldquocombined standard deviationrdquoThe combined average of CPE was considered as the CPEfor each algorithm The built-in ldquoclockrdquo function of C++ wasused to calculate the clocks Combined standard deviationwas calculated using the following
120590119888= (
sum119896
119904=1(1198991199041205902
119904+ 119899119904(119883119904minus 119883119888)
2
)
sum119896
119904=1119899119904
)
12
(4)
where119883119888= sum119896
119904=1119899119904119883119904sum119896
119904=1119899119904 119904 is the number of samples 119899
119904
is number of samples in each sample and 120590119904is the standard
deviation of each sample
Algorithms 1 2 and 3 illustrate the implementation ofnew BRA with single one-dimensional memory structuremultidimensional memory structure and multiple mem-ory structures respectively The algorithm illustrated inAlgorithm 1 (BRA Split 1 1A) was implemented using prim-itive array for split = 1 The algorithm BRM Split 2 4A(Algorithm 2) was implemented using vectors for split =2 The algorithm BRM Split 2 4A (Algorithm 3) was imple-mented using primitive array for split = 2 A sample per-mutation filling sequence of algorithms with single one-dimensional memory structures is illustrated in Figure 2Figure 3 illustrates a sample permutation filling sequence ofboth multidimensional and multiple memory structures
Secondly arithmetic operations per element (OPPE)werecalculated for each algorithm Arithmetic operations withineach algorithm were located in three regions of the codeinner FOR loop outer FOR loop and outside of the loopsThen the total number of operations (OP) can be defined as
OP = 1198701lowast 1198621+ 1198702lowast 1198622+ 1198623 (5)
where 1198621 1198622 and 119862
3are the number of operations in inner
FOR loop outer FOR loop and outside of the loops 1198701and
1198702are the number of iterations of outer loop and inner loop
Equation (5) can be represented as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621+
log2(NS)minus119904minus1
sum
119905=0
1198622+ 1198623 (6)
where NS is the number of samples and 119904 is the number ofsplits
The main contribution to calculations comes from theinner loop Comparing with the contribution of operationsin the inner loop the contribution of operations in rest ofthe code is very small For example consider the algorithmBRA Split 1 1A shown in Algorithm 1 As sample size is 2311198701lowast 1198621asymp 215 lowast 10
9lowast 1198621and 119870
2lowast 1198622+ 1198623asymp 1000 where
1198621gt 1 Therefore only the operations of inner loop were
considered for evaluation Then (6) can be simplified as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621 (7)
The ldquooperations per elementrdquo (OPPE) can be defined as
OPPE =(sum
log2(NS)minus119904minus1
119905=02119905lowast 1198621)
NS
(8)
For FFT always NS = 2119896 119896 isin 119885+Then from (7)
OPPE =(sum119896minus119904minus1
119905=02119905lowast 1198621)
2119896
(9)
119896
sum
119905=0
2119905+ 1 = 2
119896+1 (10)
6 Mathematical Problems in Engineering
Table 3 Common memory allocating methods that are used in Visual C++
Number Name Syntax Nature of memory layout1 Array int BRP[1000] Slot of continuous memory elements2 Array intlowast BRP = new int[119899] Slot of continuous memory elements3 Array array⟨int⟩andBRP = gcnew array⟨int⟩(119899) Collection of noncontinuous memory elements4 ArrayList ArrayList andBRP = gcnew ArrayList() Collection of noncontinuous memory elements5 Vector stdvector⟨int⟩ BRP(119899) Collection of noncontinuous memory elements
Table 4 Different versions of new BRA implemented with different data structures
Split (s) Data structure type AlgorithmSingle memory structure Multidimensional memory structure Multiple memory structure
1 Array (A) BRA Split 1 1A BRA Split 1 2DA BRA Split 1 2A1 Vector (V) BRA Split 1 1V BRA Split 1 2DV BRA Split 1 2V2 Array (A) BRA Split 2 1A BRA Split 2 4DA BRA Split 2 4A2 Vector (V) BRA Split 2 1V BRA Split 2 4DV BRA Split 2 4V3 Array (A) BRA Split 3 1A BRA Split 3 8DA BRA Split 3 8A3 Vector (V) BRA Split 3 1V BRA Split 3 8DV BRA Split 3 8V4 Array (A) BRA Split 4 1A BRA Split 4 16DA BRA Split 4 16A4 Vector (V) BRA Split 4 1V BRA Split 4 16DV BRA Split 4 16VNaming convention for algorithms ldquoBRA Splitrdquo + ltNumber of splitsgt + ltnature of memory structuregt xA xV x number of arrays and x number of vectorsxDA xDV single x-dimensional array and single x-dimensional vector
void mf BRM Split 1 1A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 1ui EB = ui N2ui L = 1unsigned intlowast BRP = new unsigned int[ui N]BRP[0] = 0BRP[ui EB] = 1for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[j] = BRP[j minus ui L] + ui NBRP[ui EB + j] = BRP[j] + 1
ui L = ui L + ui L
delete[] BRP
Algorithm 1 C++ implementation of new BRA with single array for split = 1 (BRA Split 1 1A)
Mathematical Problems in Engineering 7
Void mf BRM Split 2 4DV (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DLui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1stdvectorltstdvectorltunsigned intgtgt BRP(4
stdvectorltunsigned intgt(ui EB))BRP[0][0] = 0BRP[1][0] = 2BRP[2][0] = 1BRP[3][0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[0][j] = BRP[0][j minus ui L] + ui NBRP[1][j] = BRP[0][j] + 2BRP[2][j] = BRP[0][j] + 1BRP[3][j] = BRP[1][j] + 1
ui L = ui L + ui L
Algorithm 2 C++ implementation of new BRA with four arrays for split = 2 (BRM Split 2 4A)
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Initialization
First iteration Seventh (last) iteration
Figure 2 Permutation filling sequence of new BRA with single memory structure for 119899 = 16 and split =1 (BRA Split 1 1A)
For large 119896 2119896 + 1 asymp 2119896 Then
119896
sum
119905=0
2119905= 2119896+1
(11)
Because the considered sample size is 221 to 231 the value 119896can be considered as large Then from (9)
OPPE =1198621lowast (2119896minus119904)
2119896
(12)
OPPE = 1198621
2119904 (13)
According to (13) OPPE is 119891(1198621 119904) The value 119862
1(oper-
ations in inner loop) and the value 119904 (number of splits) are aconstant for a certain algorithm
To evaluate the performance of new BRA we selectedthree algorithms (LBRA EBR and BRA-Ru) which used apattern instead of conventional bit reversing method Theperformance of vector and array versions of the best version
Table 5 The number of operations in the inner loop (1198621) of each
algorithm for different splits (119904) (The total number of operations ineach algorithm asymp 119862
1)
Memory structure type Value of 1198621
119904 = 1 119904 = 2 119904 = 3 119904 = 4
Single one-dimension (array and vector) 8 15 30 61Multidimension (array and vector) 7 11 19 35Multiple one-dimension (array and vector) 7 11 19 35
of new BRA was compared with the relevant versions ofselected algorithms
3 Results and Discussion
Our objective was to introduce BRA using recursive patternof the BRP that we identified We used multiple memorystructures which is a feasible yet unpopular technique toimplement index mapping According to Table 5 the num-bers of operations in all the array and vector versions of both
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 5
Table 2 Hardware and software specifications of the PC
Specifications
Processor Intel Core i7 CPU 870 293GHz (4 cores8 threads)
RAM 12GB DDR3-1333 2 channelsMemory bandwidth 21GBsL1 L2 and L3 cache 4 times 64KB 4 times 256KB and 8MB sharedL1 L2 and L3 cacheline size 64 bit
Brand and type Fujitsu CelsiusBIOS settings Default (hyper threading enabled)OS and service pack Windows 7 professional with service pack 1System type 64 bit operating systemOS settings DefaultVisual Studio 2012 Version 110507271 RTMRELNET Framework Version 4550709
Thenext taskwas to identify a suitable data structure fromdifferent types of available data structures We consideredseveral common techniques as summarized in Table 3 Datastructure 1 mentioned in Table 3 is not supporting dynamicmemory allocation (need to mention the size of the arraywhen array is being declared) For general bit reversalalgorithm it is a must to have dynamic memory allocation tocater different sizes of samples Even after setting the compileroption ldquoLARGEADDRESSAWARErdquo [21] data structures 3and 4 mentioned in Table 3 were not supported for accessingmemory greater than 2GB Therefore structures 1 3 and 4were rejected and memory structures 2 (array) and 5 (vector)were used to create all one-dimensional memory structuresThe same versions of array and vector were used to createmultidimensional memory structures
The new algorithm mentioned in Section 21 was imple-mented usingC++ in 24 types ofmemory structures as shownin Table 4 The performance of these algorithms was evalu-ated considering the ldquoclocks per elementrdquo (CPE) consumedbyeach algorithm To retrieve this value first average CPE foreach sample size of 2119899 where 119899 [21 31] (11 sample sizes) werecalculated after executing each algorithm 100 timesThis gave11 CPE representing each sample size Finally the combinedaverageof CPEwas calculated for each algorithmby averagingthose 11 values along with ldquocombined standard deviationrdquoThe combined average of CPE was considered as the CPEfor each algorithm The built-in ldquoclockrdquo function of C++ wasused to calculate the clocks Combined standard deviationwas calculated using the following
120590119888= (
sum119896
119904=1(1198991199041205902
119904+ 119899119904(119883119904minus 119883119888)
2
)
sum119896
119904=1119899119904
)
12
(4)
where119883119888= sum119896
119904=1119899119904119883119904sum119896
119904=1119899119904 119904 is the number of samples 119899
119904
is number of samples in each sample and 120590119904is the standard
deviation of each sample
Algorithms 1 2 and 3 illustrate the implementation ofnew BRA with single one-dimensional memory structuremultidimensional memory structure and multiple mem-ory structures respectively The algorithm illustrated inAlgorithm 1 (BRA Split 1 1A) was implemented using prim-itive array for split = 1 The algorithm BRM Split 2 4A(Algorithm 2) was implemented using vectors for split =2 The algorithm BRM Split 2 4A (Algorithm 3) was imple-mented using primitive array for split = 2 A sample per-mutation filling sequence of algorithms with single one-dimensional memory structures is illustrated in Figure 2Figure 3 illustrates a sample permutation filling sequence ofboth multidimensional and multiple memory structures
Secondly arithmetic operations per element (OPPE)werecalculated for each algorithm Arithmetic operations withineach algorithm were located in three regions of the codeinner FOR loop outer FOR loop and outside of the loopsThen the total number of operations (OP) can be defined as
OP = 1198701lowast 1198621+ 1198702lowast 1198622+ 1198623 (5)
where 1198621 1198622 and 119862
3are the number of operations in inner
FOR loop outer FOR loop and outside of the loops 1198701and
1198702are the number of iterations of outer loop and inner loop
Equation (5) can be represented as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621+
log2(NS)minus119904minus1
sum
119905=0
1198622+ 1198623 (6)
where NS is the number of samples and 119904 is the number ofsplits
The main contribution to calculations comes from theinner loop Comparing with the contribution of operationsin the inner loop the contribution of operations in rest ofthe code is very small For example consider the algorithmBRA Split 1 1A shown in Algorithm 1 As sample size is 2311198701lowast 1198621asymp 215 lowast 10
9lowast 1198621and 119870
2lowast 1198622+ 1198623asymp 1000 where
1198621gt 1 Therefore only the operations of inner loop were
considered for evaluation Then (6) can be simplified as
OP =log2(NS)minus119904minus1
sum
119905=0
2119905lowast 1198621 (7)
The ldquooperations per elementrdquo (OPPE) can be defined as
OPPE =(sum
log2(NS)minus119904minus1
119905=02119905lowast 1198621)
NS
(8)
For FFT always NS = 2119896 119896 isin 119885+Then from (7)
OPPE =(sum119896minus119904minus1
119905=02119905lowast 1198621)
2119896
(9)
119896
sum
119905=0
2119905+ 1 = 2
119896+1 (10)
6 Mathematical Problems in Engineering
Table 3 Common memory allocating methods that are used in Visual C++
Number Name Syntax Nature of memory layout1 Array int BRP[1000] Slot of continuous memory elements2 Array intlowast BRP = new int[119899] Slot of continuous memory elements3 Array array⟨int⟩andBRP = gcnew array⟨int⟩(119899) Collection of noncontinuous memory elements4 ArrayList ArrayList andBRP = gcnew ArrayList() Collection of noncontinuous memory elements5 Vector stdvector⟨int⟩ BRP(119899) Collection of noncontinuous memory elements
Table 4 Different versions of new BRA implemented with different data structures
Split (s) Data structure type AlgorithmSingle memory structure Multidimensional memory structure Multiple memory structure
1 Array (A) BRA Split 1 1A BRA Split 1 2DA BRA Split 1 2A1 Vector (V) BRA Split 1 1V BRA Split 1 2DV BRA Split 1 2V2 Array (A) BRA Split 2 1A BRA Split 2 4DA BRA Split 2 4A2 Vector (V) BRA Split 2 1V BRA Split 2 4DV BRA Split 2 4V3 Array (A) BRA Split 3 1A BRA Split 3 8DA BRA Split 3 8A3 Vector (V) BRA Split 3 1V BRA Split 3 8DV BRA Split 3 8V4 Array (A) BRA Split 4 1A BRA Split 4 16DA BRA Split 4 16A4 Vector (V) BRA Split 4 1V BRA Split 4 16DV BRA Split 4 16VNaming convention for algorithms ldquoBRA Splitrdquo + ltNumber of splitsgt + ltnature of memory structuregt xA xV x number of arrays and x number of vectorsxDA xDV single x-dimensional array and single x-dimensional vector
void mf BRM Split 1 1A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 1ui EB = ui N2ui L = 1unsigned intlowast BRP = new unsigned int[ui N]BRP[0] = 0BRP[ui EB] = 1for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[j] = BRP[j minus ui L] + ui NBRP[ui EB + j] = BRP[j] + 1
ui L = ui L + ui L
delete[] BRP
Algorithm 1 C++ implementation of new BRA with single array for split = 1 (BRA Split 1 1A)
Mathematical Problems in Engineering 7
Void mf BRM Split 2 4DV (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DLui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1stdvectorltstdvectorltunsigned intgtgt BRP(4
stdvectorltunsigned intgt(ui EB))BRP[0][0] = 0BRP[1][0] = 2BRP[2][0] = 1BRP[3][0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[0][j] = BRP[0][j minus ui L] + ui NBRP[1][j] = BRP[0][j] + 2BRP[2][j] = BRP[0][j] + 1BRP[3][j] = BRP[1][j] + 1
ui L = ui L + ui L
Algorithm 2 C++ implementation of new BRA with four arrays for split = 2 (BRM Split 2 4A)
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Initialization
First iteration Seventh (last) iteration
Figure 2 Permutation filling sequence of new BRA with single memory structure for 119899 = 16 and split =1 (BRA Split 1 1A)
For large 119896 2119896 + 1 asymp 2119896 Then
119896
sum
119905=0
2119905= 2119896+1
(11)
Because the considered sample size is 221 to 231 the value 119896can be considered as large Then from (9)
OPPE =1198621lowast (2119896minus119904)
2119896
(12)
OPPE = 1198621
2119904 (13)
According to (13) OPPE is 119891(1198621 119904) The value 119862
1(oper-
ations in inner loop) and the value 119904 (number of splits) are aconstant for a certain algorithm
To evaluate the performance of new BRA we selectedthree algorithms (LBRA EBR and BRA-Ru) which used apattern instead of conventional bit reversing method Theperformance of vector and array versions of the best version
Table 5 The number of operations in the inner loop (1198621) of each
algorithm for different splits (119904) (The total number of operations ineach algorithm asymp 119862
1)
Memory structure type Value of 1198621
119904 = 1 119904 = 2 119904 = 3 119904 = 4
Single one-dimension (array and vector) 8 15 30 61Multidimension (array and vector) 7 11 19 35Multiple one-dimension (array and vector) 7 11 19 35
of new BRA was compared with the relevant versions ofselected algorithms
3 Results and Discussion
Our objective was to introduce BRA using recursive patternof the BRP that we identified We used multiple memorystructures which is a feasible yet unpopular technique toimplement index mapping According to Table 5 the num-bers of operations in all the array and vector versions of both
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
6 Mathematical Problems in Engineering
Table 3 Common memory allocating methods that are used in Visual C++
Number Name Syntax Nature of memory layout1 Array int BRP[1000] Slot of continuous memory elements2 Array intlowast BRP = new int[119899] Slot of continuous memory elements3 Array array⟨int⟩andBRP = gcnew array⟨int⟩(119899) Collection of noncontinuous memory elements4 ArrayList ArrayList andBRP = gcnew ArrayList() Collection of noncontinuous memory elements5 Vector stdvector⟨int⟩ BRP(119899) Collection of noncontinuous memory elements
Table 4 Different versions of new BRA implemented with different data structures
Split (s) Data structure type AlgorithmSingle memory structure Multidimensional memory structure Multiple memory structure
1 Array (A) BRA Split 1 1A BRA Split 1 2DA BRA Split 1 2A1 Vector (V) BRA Split 1 1V BRA Split 1 2DV BRA Split 1 2V2 Array (A) BRA Split 2 1A BRA Split 2 4DA BRA Split 2 4A2 Vector (V) BRA Split 2 1V BRA Split 2 4DV BRA Split 2 4V3 Array (A) BRA Split 3 1A BRA Split 3 8DA BRA Split 3 8A3 Vector (V) BRA Split 3 1V BRA Split 3 8DV BRA Split 3 8V4 Array (A) BRA Split 4 1A BRA Split 4 16DA BRA Split 4 16A4 Vector (V) BRA Split 4 1V BRA Split 4 16DV BRA Split 4 16VNaming convention for algorithms ldquoBRA Splitrdquo + ltNumber of splitsgt + ltnature of memory structuregt xA xV x number of arrays and x number of vectorsxDA xDV single x-dimensional array and single x-dimensional vector
void mf BRM Split 1 1A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 1ui EB = ui N2ui L = 1unsigned intlowast BRP = new unsigned int[ui N]BRP[0] = 0BRP[ui EB] = 1for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[j] = BRP[j minus ui L] + ui NBRP[ui EB + j] = BRP[j] + 1
ui L = ui L + ui L
delete[] BRP
Algorithm 1 C++ implementation of new BRA with single array for split = 1 (BRA Split 1 1A)
Mathematical Problems in Engineering 7
Void mf BRM Split 2 4DV (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DLui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1stdvectorltstdvectorltunsigned intgtgt BRP(4
stdvectorltunsigned intgt(ui EB))BRP[0][0] = 0BRP[1][0] = 2BRP[2][0] = 1BRP[3][0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[0][j] = BRP[0][j minus ui L] + ui NBRP[1][j] = BRP[0][j] + 2BRP[2][j] = BRP[0][j] + 1BRP[3][j] = BRP[1][j] + 1
ui L = ui L + ui L
Algorithm 2 C++ implementation of new BRA with four arrays for split = 2 (BRM Split 2 4A)
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Initialization
First iteration Seventh (last) iteration
Figure 2 Permutation filling sequence of new BRA with single memory structure for 119899 = 16 and split =1 (BRA Split 1 1A)
For large 119896 2119896 + 1 asymp 2119896 Then
119896
sum
119905=0
2119905= 2119896+1
(11)
Because the considered sample size is 221 to 231 the value 119896can be considered as large Then from (9)
OPPE =1198621lowast (2119896minus119904)
2119896
(12)
OPPE = 1198621
2119904 (13)
According to (13) OPPE is 119891(1198621 119904) The value 119862
1(oper-
ations in inner loop) and the value 119904 (number of splits) are aconstant for a certain algorithm
To evaluate the performance of new BRA we selectedthree algorithms (LBRA EBR and BRA-Ru) which used apattern instead of conventional bit reversing method Theperformance of vector and array versions of the best version
Table 5 The number of operations in the inner loop (1198621) of each
algorithm for different splits (119904) (The total number of operations ineach algorithm asymp 119862
1)
Memory structure type Value of 1198621
119904 = 1 119904 = 2 119904 = 3 119904 = 4
Single one-dimension (array and vector) 8 15 30 61Multidimension (array and vector) 7 11 19 35Multiple one-dimension (array and vector) 7 11 19 35
of new BRA was compared with the relevant versions ofselected algorithms
3 Results and Discussion
Our objective was to introduce BRA using recursive patternof the BRP that we identified We used multiple memorystructures which is a feasible yet unpopular technique toimplement index mapping According to Table 5 the num-bers of operations in all the array and vector versions of both
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 7
Void mf BRM Split 2 4DV (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DLui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1stdvectorltstdvectorltunsigned intgtgt BRP(4
stdvectorltunsigned intgt(ui EB))BRP[0][0] = 0BRP[1][0] = 2BRP[2][0] = 1BRP[3][0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP[0][j] = BRP[0][j minus ui L] + ui NBRP[1][j] = BRP[0][j] + 2BRP[2][j] = BRP[0][j] + 1BRP[3][j] = BRP[1][j] + 1
ui L = ui L + ui L
Algorithm 2 C++ implementation of new BRA with four arrays for split = 2 (BRM Split 2 4A)
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Initialization
First iteration Seventh (last) iteration
Figure 2 Permutation filling sequence of new BRA with single memory structure for 119899 = 16 and split =1 (BRA Split 1 1A)
For large 119896 2119896 + 1 asymp 2119896 Then
119896
sum
119905=0
2119905= 2119896+1
(11)
Because the considered sample size is 221 to 231 the value 119896can be considered as large Then from (9)
OPPE =1198621lowast (2119896minus119904)
2119896
(12)
OPPE = 1198621
2119904 (13)
According to (13) OPPE is 119891(1198621 119904) The value 119862
1(oper-
ations in inner loop) and the value 119904 (number of splits) are aconstant for a certain algorithm
To evaluate the performance of new BRA we selectedthree algorithms (LBRA EBR and BRA-Ru) which used apattern instead of conventional bit reversing method Theperformance of vector and array versions of the best version
Table 5 The number of operations in the inner loop (1198621) of each
algorithm for different splits (119904) (The total number of operations ineach algorithm asymp 119862
1)
Memory structure type Value of 1198621
119904 = 1 119904 = 2 119904 = 3 119904 = 4
Single one-dimension (array and vector) 8 15 30 61Multidimension (array and vector) 7 11 19 35Multiple one-dimension (array and vector) 7 11 19 35
of new BRA was compared with the relevant versions ofselected algorithms
3 Results and Discussion
Our objective was to introduce BRA using recursive patternof the BRP that we identified We used multiple memorystructures which is a feasible yet unpopular technique toimplement index mapping According to Table 5 the num-bers of operations in all the array and vector versions of both
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
8 Mathematical Problems in Engineering
Void mf BRM Split 2 4A (unsigned int ui NS int ui log2NS)
unsigned int ui Nunsigned int ui EBunsigned int ui tunsigned int ui Lunsigned int ui DL
ui N = ui NS Number of samplesui t = ui log2NS minus 2ui EB = ui N4ui L = 1unsigned intlowast BRP1 = new unsigned int[ui EB]unsigned intlowast BRP2 = new unsigned int[ui EB]unsigned intlowast BRP3 = new unsigned int[ui EB]unsigned intlowast BRP4 = new unsigned int[ui EB]
BRP1[0] = 0BRP2[0] = 2BRP3[0] = 1BRP4[0] = 3for (unsigned int q = 0 q lt ui t q++)
ui DL = ui L + ui Lui N = ui N2for (unsigned int j = ui L j lt ui DL j++)
BRP1[j] = BRP1[j minus ui L]+ ui NBRP2[j] = BRP1[j] + 2BRP3[j] = BRP1[j] + 1BRP4[j] = BRP2[j] + 1
ui L = ui L + ui L
delete[] BRO1delete[] BRO2delete[] BRO3delete[] BRO4
Algorithm 3 C++ implementation of new BRA with four one-dimensional arrays for split = 2 (BRM Split 2 4A)
0 8 4 12
2 10 6 14
1 9 5 13
3 11 7 15
Initi
aliz
atio
n
Third
(las
t) ite
ratio
n
First iteration
Figure 3 Permutation filling sequence of 4 individual and single 4-dimensional memory structure for 119899 = 16
multidimensional and multiple memory structures are thesame Also Figure 5 shows continuous decrement of OPPEwhen the number of splits increases Then the algorithmwith the highest number of splits and the lowest number ofoperations is the one which is expected to be most efficient
However results in relation with CPE (Figure 5) show thatthe new algorithm with four memory structures of array isthe fastest and most consistent in the selected range Twofour eight and sixteen multiple array implementations ofnew BRA reported 25 34 33 and 18 higher efficiency
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 9
0
1
2
3
4
5
6
Refalgorithms
Ope
ratio
ns p
er el
emen
t (O
PPE)
Split = 1 Split = 2 Split = 3 Split = 4
Corresponds to both array and vector versions of algorithm of rudioCorresponds to both array and vector versions of ldquolinear bit reversalrdquo algorithmCorresponds to both array and vector versions of ldquoelsterrsquos bit reversalrdquo algorithmCorresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
Corresponds to both array and vector versions of the new algorithm in
signal one-dimensional memory structure
signal multi-dimensional memory structure
signal multiple one-dimensional memory structure
Figure 4 Operations per element versus reference algorithms and new algorithm with different s (splits) where dashed column correspondsto both array and vector versions of algorithm of Rudio cross lines column corresponds to both array and vector versions of ldquoLinear BitReversalrdquo algorithm dotted column corresponds to both array and vector versions of ldquoElsterrsquos Bit Reversalrdquo algorithm vertical lines columncorresponds to both array and vector versions of the new algorithm in single one-dimensional memory structure inclined lines columncorresponds to both array and vector versions of the new algorithm in single multidimensional memory structure and horizontal linescolumn corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures
respectively in relation to the array version of LBRA Thealgorithm with eight memory structures has nearly the sameCPE as the four-array and four-vector versions but is lessconsistent On the other hand the four-vector implementa-tion of the new algorithm is the fastest and most consistentamong all vector versions Two four eight and sixteenmultiple vector implementations of new BRA reported 1316 and 16 higher and 23 lower efficiency respectively inrelation to the vector version of LBRAThis result proves thatat a certain point multiple memory structure gives the bestperformances in the considered domain Also usage of mul-tiple memory structures of primitive array is a good optionfor implementing index mapping of BRP compared to multi-dimensional or single one-dimensional memory structures
Due to the flexible nature of the vector it is commonlyused for implementing algorithms According to Figure 4there is no difference in OPPE between array and vectorversionsHowever our results in Figure 5 show that the vectorversions of BRA always required more CPE (44ndash142)than the array version The structure of vector gives priorityto generality and flexibility rather than to execution speedmemory economy cache efficiency and code size [22]There-fore vector is not a good option with respect to efficiency
The results in Table 5 and Figure 4 show that there is nodifference between the number of calculations and OPPEfor equal versions of algorithms with multidimensional andmultiple memory structure Structure and type of calcula-tions are the same for both types The only difference is thenature of the memory structure multidimension or multipleone-dimension When CPE is considered it shows 19ndash79
performance efficiency from algorithms with multiple one-dimension memory structures The reason for that observa-tion is that the memory access of multidimensional memorystructure is less efficient than one-dimensional memorystructure [22]
We agree with the statement of Elster about indexmapping of BRP [2] and the generally accepted fact (theusage of multidimensional memory structures reduces theperformance of a computer) [4] only with respect to multi-dimensional memory structures of vector Our results showthat even with multidimensional arrays there are situationswhere the new BRA performs better than the same typeof one-dimensional memory structure The four eight andsixteen dimensional array versions of new BRA perform 810 and 2 in relation to one-dimensional array version ofnew BRA Some results in relation to single one-dimensionalmemory structure implementation of new BRA are also notin agreement with the general accepted idea For examplesample size = 231 the two-dimensional vector version of newBRA (BRA Split 1 2DV) reported 542119864 minus 05 CPE which is389 higher in relation to average CPE of sample size rangeof 221 to 230 Also the inconsistency was very highThereforewe excluded values related to sample size = 231 for the two-dimensional vector version
We observed very high memory utilization with the two-dimensional vector version especially with sample size =231 Windows task manager showed that the memory uti-lization of all the considered algorithms was nearly thesame for all sample sizes except for multidimension versionsof vector The multidimensional version of vector initially
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
10 Mathematical Problems in Engineering
Cloc
ks p
er el
emen
t (CP
E)
000E + 00
500E minus 06
100E minus 05
150E minus 05
200E minus 05
250E minus 05
300E minus 05
BRM
split
01x
BRM
split
12x
BRM
split
24x
BRM
split
38x
BRM
split
416x
lowastBR
Msp
lit12
Dx
BRM
split
24
Dx
BRM
split
38
Dx
BRM
split
416
Dx
BRM
split
11x
BRM
split
21x
BRM
split
31x
BRM
split
41x
x-A or V (array or vector)Algorithms
Refe
renc
e alg
orith
m o
ne-
dim
ensio
nal s
ingl
e mem
ory
struc
ture
of s
izen
Multiple memorystructures
1
1
2
k
k
k
k = 2s
n sample size
Multidimensionalmemory structure
One-dimensionalmemory structure
nknk
nk
middotmiddotmiddot
middotmiddotmiddot
(s = 1 2 3 4)
Corresponds to array version of the algorithmCorresponds to vector version of the algorithmCorresponds to minimum reported CPE of the category
Figure 5 Clocks per element (combined average) versus algorithm for the sample size range from 221 to 231 where blue dotted columncorresponds to array version of the algorithm green column corresponds to vector version of the algorithm and red rhombus corresponds tominimum reported CPE of the category lowastFor vector version sample size 231 was excluded because at sample size 231 it showed huge deviationdue to memory limitation of the machine
utilizes highermemory and drops down to normal valueTheresults in relation to sample size = 2
30 showed that the extramemory requirement of two dimension vector was higherthan that of the four-dimensional vector Based upon that itcan be predicted that BRA Split 1 2DV needs an extra 3GB(total 13 GB) for normal execution at sample size = 2
31 butthe total memory barrier of 12GB of the machine slows theprocess down The most likely reason for this observation isthe influence ofmemory poolingmechanismWhen defininga memory structure it is possible to allocate the requiredamount of memory This is known as memory pooling [23]In all the memory structures used in algorithms discussedin this paper we used memory pooling Memory poolingallocates the required memory at startup and divides thisblock into small chunks If there is no memory poolingmemory will get fragmented Accessing fragmented memoryis inefficient When the existing memory is not enough forallocating then it switches to use fragmented memory forallocating memory elements In the considered situation theexisting memory (12GB) is not sufficient for allocating therequired amount (13GB) which switches to use fragmentedmemory
The total cache size of the machine is 825MB whichis less than the minimum memory utilization of consideredalgorithms (16MB to 8GB) in relation to the sample sizerange from 222 to 231 Only the sample size 221 occupies 8MBmemory which is less than the total cache memory ExceptBRA Split 4 1V structure all algorithms reported constantCPE in relation to the entire sample size range The bestalgorithms of each category especially reported very steady
behaviour This observation is in disagreement with thestatement of Karp ldquothat a machine with hierarchical memorydoes not performwell when array size is larger than the cachesizerdquo [10]
Comparison (Figure 6) of best reported version in theconsidered domain (four memory structure version) andthe selected algorithms shows that the array version ofEBR performs the best The four-array version of new BRAreported 1 lower performance than the array version ofEBR However the four-array version of new BRA reported34 and 23 higher performances than array versions ofLBRA and BRA-Ru Also the four-vector version of newBRA is reported to have the best performance among all thevector versions It reported 16 10 and 22 performancescompared to vector versions of LBRA EBR and BRA-Rurespectively
4 Conclusion and Outlook
Themain finding of this paper is the recursive pattern of BPRand the implementationmethod of it using multiple memorystructures With multiple memory structures especiallythe newly identified index mapping performs much betterthan multidimensional or single one-dimensional memorystructure Furthermore findings of this paper show that theperformance of primitive array is higher than vector typeThe result is in disagreement with the statement of Karp ldquothata machine with hierarchical memory does not perform wellwhen array size is larger than the cache sizerdquo Almost all thesample sizes we used were higher than the total cache size
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 11
120E ndash 05
100E minus 05
800E minus 06
600E minus 06
400E minus 06
200E ndash 06
000E + 00
Cloc
ks p
er el
emen
t (CP
E)
BRM
split
24x
(the b
est v
ersio
nof
the n
ewal
gorit
hm)
Line
ar b
it-re
vers
al1x
Elste
rrsquos b
it-
Rubi
o1x
reve
rsal
1x
x-A or V (array or vector)Algorithms
Corresponds to array version of the algorithmCorresponds to vector version of the algorithm
Figure 6 Clocks per element versus best version of new algorithm and selected algorithms where blue dotted column corresponds to arrayversion and green column corresponds to vector version of the algorithm
1
1
1
1
1
0
1
0
1
0
1
0
Process 1
Process 2
Process 5
Process 6Process 3
Process 4
Process 7
4 parallel processes
2 parallel processes
Single process
Tota
l sig
nal i
s div
ided
into
equa
l fou
r ind
epen
dent
mem
ory
struc
ture
s usin
g ne
w al
gorit
hm (B
RMsp
lit24x
)
x(0)
x(4)
x(2)
x(6)
x(1)
x(5)
x(3)
x(7)
minus1
w04
w14
w24
w34
w04
w14
w24
w34
A(0)
A(1)
A(2)
A(3)
B(0)
B(1)
B(2)
B(3)
x(0)
x(1)
x(2)
x(3)
x(4)
x(5)
x(6)
x(7)
minus1
minus1
minus1
Figure 7 Four-memory-structure version of algorithm and the parallel processes for sample size = 16
of the computer However multiple memory structure andthe multidimensional memory structure versions showedreasonable steady performancewith those samples In generalthese results show the effects of data structures and memoryallocation techniques and open a new window of creatingefficient algorithms with multiple memory structures inmany other fields where index mapping is involved
The new bit reversal algorithm with 2119904 independent
memory structures splits the total signal into 119904 independentportions and the total FFT process into 119904 + 1 levels Thenthese 119904 signal portions can be processed independently bymeans of 119904 independent processes on the first level Onthe next level the results from the previous level stored inindependent memory structures can be processed with 1199042
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
12 Mathematical Problems in Engineering
processes and so on until the last level Therefore we suggestusing the concept of multiple memory structures in total FFTprocess along with the new algorithm with multiple memorystructures and suitable parallel processing technique Weexpect that it is possible to achieve higher performance fromFFT process with proper combination of parallel processingtechnique and new algorithm compared to using the newalgorithm only to create bit reversal permutation Figure 7shows such approach with four (when 119904 = 2) independentmemory structures for sample size = 16
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This publication is financially supported by the UniversityGrants Commission Sri Lanka
References
[1] C S Burrus ldquoUnscrambling for fast DFT algorithmsrdquo IEEETransactions on Acoustics Speech and Signal Processing vol 36no 7 pp 1086ndash1087 1988
[2] A C Elster and J C Meyer ldquoA super-efficient adaptable bit-reversal algorithm for multithreaded architecturesrdquo in Pro-ceedings of the IEEE International Symposium on Parallel ampDistributed Processing (IPDPS 09) vol 1ndash5 pp 1ndash8 Rome ItalyMay 2009
[3] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965
[4] M L Massar R Bhagavatula M Fickus and J KovacevicldquoLocal histograms and image occlusion modelsrdquo Applied andComputational Harmonic Analysis vol 34 no 3 pp 469ndash4872013
[5] R G Lyons Understanding Digital Signal Processing PrenticeHall PTR 2004
[6] C Deyun L Zhiqiang G Ming W Lili and Y XiaoyangldquoA superresolution image reconstruction algorithm based onlandweber in electrical capacitance tomographyrdquoMathematicalProblems in Engineering vol 2013 Article ID 128172 8 pages2013
[7] Q Yang and D An ldquoEMD and wavelet transform based faultdiagnosis for wind turbine gear boxrdquo Advances in MechanicalEngineering vol 2013 Article ID 212836 9 pages 2013
[8] C S Burrus Fast Fourier Transforms Rice University HoustonTex USA 2008
[9] C van Loan Computational Frameworks for the Fast FourierTransform vol 10 of Frontiers in Applied Mathematics Societyfor Industrial and Applied Mathematics (SIAM) PhiladelphiaPa USA 1992
[10] A H Karp ldquoBit reversal on uniprocessorsrdquo SIAM Review vol38 no 1 pp 1ndash26 1996
[11] A C Elster ldquoFast bit-reversal algorithmsrdquo in Proceedings ofthe International Conference on Acoustics Speech and SignalProcessing pp 1099ndash1102 IEEE Press Glasgow UK May 1989
[12] M Rubio P Gomez and K Drouiche ldquoA new superfast bitreversal algorithmrdquo International Journal of Adaptive Controland Signal Processing vol 16 no 10 pp 703ndash707 2002
[13] B Stroustrup Programming Principles and Practice Using C++Addison-Wesley 2009
[14] B Stroustrup ldquoSoftware development for infrastructurerdquo IEEEComputer Society vol 45 no 1 Article ID 6081841 pp 47ndash582012
[15] B Stroustrup The C++ Programming Language ATampT Labs3rd edition 1997
[16] S DonovanC++ by Example ldquoUnderCrdquo Learning Edition QUECorporation 2002
[17] S Mcconnell Code Complete Microsoft Press 2nd edition2004
[18] Microsoft STL Containers Microsoft New York NY USA2012 httpmsdnmicrosoftcomen-uslibrary1fe2x6ktaspx
[19] Microsoft Arrays (C++ Component Extensions) MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryvstudiots4c4dw6(v=vs110)aspx
[20] B Stroustrup ldquoEvolving a language in and for the real worldC++ 1991ndash2006rdquo in Proceedings of the 3rd ACM SIGPLANHistory of Programming Languages Conference (HOPL-III rsquo07)June 2007
[21] Microsoft Memory Limits for Windows Releases MicrosoftNew York NY USA 2012 httpmsdnmicrosoftcomen-uslibraryaa36677828VS8529aspxmemory limits
[22] A Fog ldquoOptimizing software in C++rdquo 2014 httpwwwagnerorgoptimize
[23] Code Project ldquoC++ Memory Poolrdquo 2014 httpwwwcodepro-jectcomArticles15527C-Memory-Pool
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of