
Australian National University

Semester 1, 2017

FLAC Audio File Decoding Using a Serial Streaming Approach Accelerated via GPU

Jiajin Huang

supervised by Dr. Eric McCreath

May 26, 2017

Contents

1 Acknowledgment

2 Introduction and Motivation

3 Background
3.1 Introduction of FLAC Format
3.1.1 What’s FLAC
3.1.2 Feature of FLAC
3.1.3 FLAC Format and Its Structure
3.2 Linear Predictive Coding
3.2.1 Rice Coding
3.3 Graphics Processing Unit (GPU) programming
3.4 GPU Architecture and Its Working Principle

4 Methodology and Documentation
4.1 Framework
4.2 Algorithm
4.3 Documentation
4.3.1 Defect and Issues in My Code

5 Experiments and Discussion
5.1 Experimental Setup
5.1.1 Reference Declaration for Coding
5.2 Experiments and Discussion in Testing Audio File
5.3 Experiments and Discussion in Longer Recordings

6 Future work
6.1 Idea to Reduce serially getting frame headers positions
6.2 How array allocation and transfer can be reduced

7 Conclusion

8 Appendix

9 bibliography

1 Acknowledgment

I sincerely express my gratitude to my supervisor, Dr. Eric McCreath, who guided me throughout the whole project over the past year with patience and wisdom. He helped me build up a basic understanding of CUDA parallel programming, the FLAC format, report writing, and more. He also provided me with the necessary materials and an example report format. He always listened to and answered my questions, no matter how silly they were. When I met difficulties or barriers during the project, he encouraged me a lot and provided not only academic support but also mental support.

I also thank my colleagues, Yang Wang and Qin Tian. As our projects are all about FLAC encoding or decoding via CUDA, we took tutorials under Dr. Eric McCreath on the FLAC format and the basics of FLAC audio encoding and decoding. We helped each other understand CUDA and the FLAC format.

My thanks also go to Prof. Weifa Liang and Prof. Peter Strazdins, the course conveners, who held weekly lectures to help us understand the project and guided us onto the right track.

I also thank all my friends, and especially my family members, for their continuous support during the project.


Abstract

This research explores a method of decoding the FLAC lossless audio format with a streaming approach using a Graphics Processing Unit (GPU). FLAC is a widely used audio format due to its lossless and “free” features. It also provides good compression.

Traditionally, FLAC is decoded with a serial approach. However, for a long audio file the decoding time becomes significantly large. This is time-consuming for users editing or processing long recordings.

In order to overcome this, I implemented a streaming approach, via GPU, to decode long FLAC audio. More specifically, I used the GPU to coordinate and synchronize the decoding work of FLAC audio so that decoding performance is enhanced compared to the serial approach. In this way, streams of a FLAC audio file can be decoded concurrently, which leads to better performance. The performance of the different approaches has been compared and analyzed.

The decoding of FLAC audio shows a XXX% improvement for long recordings.


2 Introduction and Motivation

FLAC is an audio compression format, like some universally used formats such as MP3. However, the FLAC format is lossless, which means that its audio quality and listening experience are better than those of MP3 or other lossy audio formats; the difference is clearly noticeable when good listening equipment or a decent listening environment is available. [13] Thus, FLAC is commonly used by those who demand better audio quality.

Due to the increasing popularity of FLAC audio, research and development have been carried out to make FLAC serve users better. For example, an embedded FLAC decoder system was designed to play FLAC not only on PCs but also on embedded devices, at a lower price and with better sound quality. [14]

Moreover, from a computer science perspective, the decoding technique and algorithm can be another drawback of FLAC: the traditional serial approach decodes FLAC audio frame by frame, which is not optimal in terms of time efficiency. The serial approach also does not make full use of the FLAC format structure. The most obvious waste is that frames in FLAC are structurally independent, which provides an opportunity for parallel processing.

This report proposes a streaming approach to decode audio in FLAC format. As CUDA can manage concurrency on the GPU using its streaming functionality, a streaming approach can be an efficient way to decode the FLAC format.

The report is structured as follows. First come the acknowledgment, introduction and motivation of the project. Then the background of the project is presented, which covers the FLAC format, Linear Predictive Coding, Graphics Processing Unit (GPU) programming, and the general architecture and working principle of the GPU. Subsequently, the methodology (framework and algorithm) and documentation are given, together with some defects and bugs in my artifact. Next, experiments and the corresponding analysis and discussion are presented, with diagrams. Based on the discussion, the bottlenecks are identified and suggestions are made for future improvement, followed by the conclusion.


3 Background

This section introduces the FLAC format, Linear Predictive Coding, Rice coding, Graphics Processing Unit (GPU) programming, and the GPU architecture and how it works.

3.1 Introduction of FLAC Format

Most of this introduction to FLAC comes from the FLAC documentation website, “FLAC - Free Lossless Audio Codec”, Xiph.org, written by J. Coalson. [2] Other citations are noted accordingly, both in the text and in the references.

3.1.1 What’s FLAC

FLAC is an audio format in which audio is compressed without any loss in quality; in other words, it is lossless. FLAC stands for Free Lossless Audio Codec. More specifically, “Free” means not only that FLAC is available at no cost, but also that FLAC and its specification are fully open to the public and can be used for any purpose (commercial or noncommercial) with few restrictions of any kind. In particular, the implemented encoding and decoding methods are not covered by any known patent. [2]

3.1.2 Feature of FLAC

There are several significant features of FLAC:

• Lossless: FLAC encodes audio data with no loss of information, and the decoded output is bit-for-bit identical to the encoder’s input. What’s more, FLAC uses an MD5 signature to verify data integrity. [7]

• Fast: FLAC is asymmetric in favor of decoding speed, so decoding requires much less computation time than encoding, even with only simple embedded hardware.

• Hardware support: FLAC can be used on most kinds of common electronic devices.

• Flexible metadata: Metadata such as tags, tables and cue sheets can be added to FLAC files without breaking previous streams.

• Seekable: accurate seeking allows FLAC files to be used easily in editing applications.


Table 1: metadata of stream information[2]

bits   description
16     min block size
16     max block size
24     min frame size
24     max frame size
20     sample rate (Hz)
3      number of channels
5      bits per sample
36     number of samples in stream
128    MD5 signature

• Streamable: frames in a FLAC file are synchronized and independent, which means that any section of a stream can be accessed and decoded with minimal delay.

• Suitable for archiving: since no information is lost, FLAC data can conveniently be converted to other formats.

• Error resistant: because of FLAC’s framing, an error damages only the frame in which it occurs. This prevents a single error from ruining the whole stream.

• Convenient CD archiving: the cue sheet in the metadata block of FLAC can provide index and playlist information for a large audio file, for use in extracting or burning. [4]

3.1.3 FLAC Format and Its Structure

A FLAC stream is made up of the “fLaC” marker at the beginning of the stream, followed by one or more metadata blocks and then the audio frames. Additionally, all numbers in the FLAC format are integers. The stream structure of FLAC is shown in Table 2.

The structure of the mandatory metadata block, which holds the stream information, is shown in Table 1.
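The bit widths of Table 1 add up to 272 bits, that is, 34 bytes. As a minimal sketch of how such a block could be unpacked in C, the following assumes big-endian fields packed in the order of Table 1, and assumes (as in the FLAC specification) that the channel count and bits-per-sample fields store the value minus one. The type `StreamInfo` and the function `parse_streaminfo` are illustrative names, not part of the artefact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields of Table 1 (names are illustrative). */
typedef struct {
    uint16_t min_block_size;
    uint16_t max_block_size;
    uint32_t min_frame_size;   /* 24-bit field */
    uint32_t max_frame_size;   /* 24-bit field */
    uint32_t sample_rate_hz;   /* 20-bit field */
    uint8_t  channels;         /* stored as (channels - 1) in 3 bits */
    uint8_t  bits_per_sample;  /* stored as (bits - 1) in 5 bits */
    uint64_t total_samples;    /* 36-bit field */
    uint8_t  md5[16];          /* 128-bit MD5 signature */
} StreamInfo;

/* Unpack the 34-byte body of the stream information block. */
static void parse_streaminfo(const uint8_t *p, size_t len, StreamInfo *out)
{
    assert(p != NULL && out != NULL && len >= 34);

    out->min_block_size = (uint16_t)((p[0] << 8) | p[1]);
    out->max_block_size = (uint16_t)((p[2] << 8) | p[3]);
    out->min_frame_size = ((uint32_t)p[4] << 16) | ((uint32_t)p[5] << 8) | p[6];
    out->max_frame_size = ((uint32_t)p[7] << 16) | ((uint32_t)p[8] << 8) | p[9];

    /* 20 bits: p[10], p[11] and the high nibble of p[12] */
    out->sample_rate_hz = ((uint32_t)p[10] << 12) | ((uint32_t)p[11] << 4)
                        | (uint32_t)(p[12] >> 4);
    /* 3 bits: bits 3..1 of p[12] */
    out->channels = (uint8_t)(((p[12] >> 1) & 0x07) + 1);
    /* 5 bits: lowest bit of p[12] and the high nibble of p[13] */
    out->bits_per_sample = (uint8_t)((((p[12] & 0x01) << 4) | (p[13] >> 4)) + 1);
    /* 36 bits: low nibble of p[13] and p[14..17] */
    out->total_samples = ((uint64_t)(p[13] & 0x0F) << 32)
                       | ((uint64_t)p[14] << 24) | ((uint64_t)p[15] << 16)
                       | ((uint64_t)p[16] << 8)  | (uint64_t)p[17];

    for (size_t i = 0; i < 16; i++)
        out->md5[i] = p[18 + i];
}
```

The block size and sample size obtained this way are the values that the decoder described in Section 4 needs before it can locate frame headers.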

FLAC supports many kinds of metadata blocks. For example:

• “Streaminfo” contains overall information about the stream, such as the number of channels and the sample rate;


Table 2: FLAC stream format [2]

name                         description
FLAC stream marker           marker in ASCII
metadata block               mandatory metadata block
alternative metadata block   metadata blocks that can be added
frames                       one or more audio frames

• “Application” is used by third-party applications;

• “Padding” has no meaning of its own but can be used to make the encoder reserve a padding block that can be overwritten later;

• “Seektable” stores seek points for seeking;

• “Vorbis Comment” stores a list of name-value pairs which implement the Vorbis comment specification;

• “Cuesheet” can be used for track and index points for playback;

• “Picture” stores pictures associated with the file. Specifically, the header of the metadata.

Each frame in FLAC has a frame header, which contains a synchronization code and information for the decoder, and ends with a frame footer for error detection. The detailed structure of the frame header is shown in Table 3.
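To make the header layout concrete, here is a small sketch in C that tests for the 14-bit synchronization code and extracts the fixed-width fields that follow it in Table 3. The names `FrameHeaderBits` and `read_frame_header_bits` are hypothetical. The sketch only returns the raw field codes and does not check the reserved bits or the CRC-8; a real decoder should verify the CRC-8, because the synchronization pattern can also occur by chance inside audio data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw codes of the fixed-width fields at the start of a frame header. */
typedef struct {
    int blocking_strategy;   /* 1 bit  */
    int block_size_code;     /* 4 bits */
    int sample_rate_code;    /* 4 bits */
    int channel_assignment;  /* 4 bits */
    int sample_size_code;    /* 3 bits */
} FrameHeaderBits;

/* Returns 1 and fills *out if p starts with the 14-bit sync code, else 0. */
static int read_frame_header_bits(const uint8_t *p, size_t len,
                                  FrameHeaderBits *out)
{
    assert(p != NULL && out != NULL);
    if (len < 4)
        return 0;
    /* sync code 11111111111110: all of p[0] plus the top 6 bits of p[1] */
    if (p[0] != 0xFF || (p[1] & 0xFC) != 0xF8)
        return 0;

    out->blocking_strategy  = p[1] & 0x01;
    out->block_size_code    = p[2] >> 4;
    out->sample_rate_code   = p[2] & 0x0F;
    out->channel_assignment = p[3] >> 4;
    out->sample_size_code   = (p[3] >> 1) & 0x07;
    return 1;
}
```

A check of this kind is what allows header positions to be recorded in a first pass, so that each frame can later be handed to its own thread.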

3.2 Linear Predictive Coding

“Linear predictive coding (LPC) is a method for signal source modelling in speech signal processing”. [6] LPC is often used when compressing and encoding audio into the FLAC format. The signal is encoded by predicting each next value via a linear function of the previous values. Specifically, the predicted value is generated by the following linear function [9]:

\[ s'(t) = \sum_{i=1}^{n} a_i \, s(t-i) \]


Table 3: frame header in FLAC [2]

bits   description
14     11111111111110 (synchronization code)
1      reserved value (0 or 1)
1      blocking strategy
4      block size in inter-channel samples
4      sample rate
4      channel assignment
3      sample size in bits
1      reserved value (0 or 1)
?      if (variable blocksize)
8      CRC-8

Then the difference between the actual value and the predicted value can be computed [9]:

\[ e(t) = s(t) - s'(t) \]

This value is called the residual. The residual must be recorded losslessly so that the actual value can be recovered whenever the predicted value differs from the actual value in the audio. [2]
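The two formulas above can be sketched directly in C. The following is an illustration under simplifying assumptions, not the decoder from this project: the coefficients `a` are plain integers, the first `n` samples are stored verbatim as warm-up samples, and the additional coefficient scaling (right shift) that real FLAC streams apply to the predictor sum is left out. All function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* s'(t) = sum over i = 1..n of a_i * s(t - i); a[i-1] holds a_i. */
static int64_t lpc_predict(const int32_t *s, size_t t,
                           const int32_t *a, size_t n)
{
    assert(s != NULL && a != NULL && t >= n);
    int64_t acc = 0;
    for (size_t i = 1; i <= n; i++)
        acc += (int64_t)a[i - 1] * s[t - i];
    return acc;
}

/* Encoder side: e(t) = s(t) - s'(t) for t = n .. len-1. */
static void lpc_residuals(const int32_t *s, size_t len,
                          const int32_t *a, size_t n, int32_t *e)
{
    for (size_t t = n; t < len; t++)
        e[t] = (int32_t)(s[t] - lpc_predict(s, t, a, n));
}

/* Decoder side: s(t) = e(t) + s'(t); s[0..n-1] must already hold the
 * warm-up samples. Each sample depends on the n samples before it. */
static void lpc_restore(int32_t *s, size_t len,
                        const int32_t *a, size_t n, const int32_t *e)
{
    for (size_t t = n; t < len; t++)
        s[t] = (int32_t)(e[t] + lpc_predict(s, t, a, n));
}
```

Note that restoring is inherently sequential within one subframe, since every sample depends on its predecessors. This is one reason why the parallelism explored in this report is across frames rather than across samples.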

3.2.1 Rice Coding

The following introduction to Rice coding is based on Michael Dipperstein. [5] Rice coding is widely used for encoding in the audio and video domains. “It’s used to encode strings of numbers with a variable bit length for each number. If most of the numbers are small, fairly good compression can be achieved”. [8] For the decoding stage, it is worth understanding Rice encoding first so that it can be reversed. A value s can be represented as q ∗ m + r, where m = 2^k. Note that k (or m) is usually given in advance. The encoded version of s has two parts: the prefix is q = s >> k, that is, s shifted right by k bits, written in unary as q ones followed by a zero; the suffix is r = s & (m − 1), that is, s bitwise ANDed with (m − 1), written in k bits. For example, with k = 4, the value 18 (0b00010010) gives q = 1 and r = 2 and is encoded as 100010, which saves 2 bits.

As for decoding, q is determined by the number of 1s before the first 0, and r is determined by the remaining k bits. Thus, the decoded value is q ∗ m + r.
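The scheme described above can be written as a short C sketch. For readability it emits the code as a string of '0' and '1' characters rather than packed bits, and it follows the convention of this section (q ones, then a zero, then k suffix bits). The function names are illustrative, and signed residuals would first need a mapping to non-negative values, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode s with parameter k as '0'/'1' characters (NUL-terminated).
 * Returns the number of code bits, or 0 if the buffer is too small. */
static size_t rice_encode(uint32_t s, unsigned k, char *out, size_t cap)
{
    assert(out != NULL && k < 32);
    uint32_t q = s >> k;                 /* prefix, sent in unary */
    uint32_t r = s & ((1u << k) - 1u);   /* suffix, sent in k bits */
    size_t need = (size_t)q + 1 + k;
    if (need + 1 > cap)
        return 0;

    size_t pos = 0;
    for (uint32_t i = 0; i < q; i++)
        out[pos++] = '1';
    out[pos++] = '0';                    /* terminator of the unary part */
    for (unsigned b = k; b-- > 0; )
        out[pos++] = ((r >> b) & 1u) ? '1' : '0';
    out[pos] = '\0';
    return pos;
}

/* Decode one value from a '0'/'1' string produced by rice_encode. */
static uint32_t rice_decode(const char *bits, unsigned k)
{
    assert(bits != NULL && k < 32);
    uint32_t q = 0, r = 0;
    size_t pos = 0;
    while (bits[pos] == '1') {           /* count 1s before the first 0 */
        q++;
        pos++;
    }
    assert(bits[pos] == '0');
    pos++;
    for (unsigned b = 0; b < k; b++) {   /* remaining k bits form r */
        assert(bits[pos] == '0' || bits[pos] == '1');
        r = (r << 1) | (uint32_t)(bits[pos++] - '0');
    }
    return (q << k) + r;                 /* q * m + r with m = 2^k */
}
```

With k = 4, encoding 18 yields the six characters 100010 from the example in the text, and decoding that string returns 18.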


Table 4: Architecture of GPU

block00   block01   ...
block10   block11   ...
...       ...       ...

3.3 Graphics Processing Unit (GPU) programming

A Graphics Processing Unit (GPU) is an electronic circuit originally used for rendering 3D applications. Nowadays, however, GPUs are used to accelerate computational workloads in many areas. [3] Unlike a CPU, a GPU is “composed of hundreds of simpler cores that can handle thousands of concurrent hardware threads” [1], which means that a GPU supports thread-level parallelism and parallel computing.

In CUDA, specifically, the CPU is called the “host” and the GPU is called the “device”; a kernel is a C-like program written by the programmer that is executed on the device. [10]

3.4 GPU Architecture and Its Working Principle

A GPU basically consists of memory (device memory and shared memory) and many multiprocessors. At a high level, threads are grouped into blocks, and blocks can be organized in multiple dimensions; the blocks as a whole are called a grid. The number of blocks in each dimension and the number of threads per block are specified when the kernel is launched. The maximum number of threads in a block is limited to 512 [11]; devices of compute capability 2.0 and above raise this limit to 1024. In addition, both blocks and threads have their own IDs, which distinguish them from each other and are used for coordination and synchronization. Threads can be synchronized with “__syncthreads()”, which is called in the kernel and synchronizes all threads in the block. However, threads in different blocks cannot be synchronized in this way. [12] The diagrams below show the thread organization and memory model of the GPU grid, blocks and threads. [15]

Before any computation in CUDA threads, the input data must be copied from host to device over the PCIe (Peripheral Component Interconnect Express) bus, so that the CUDA device and its threads can access it in device memory, which is much quicker for them than accessing host memory.

Figure 1: 2-dimensional architecture of GPU (figure obtained from lecture slide, see reference [14])

Table 5: structure of block

thread00   thread01   ...
thread10   thread11   ...
...        ...        ...

Figure 2: 3-dimensional architecture of GPU (figure obtained from lecture slide, see reference [14])

Figure 3: workflow of serial GPU FLAC audio file decoder

4 Methodology and Documentation

In this section, the general framework and workflow are presented, and the algorithm used is discussed. There is also documentation of my work and of the problems I encountered.

4.1 Framework

For comparison, two workflows are shown: a serial CPU decoder for FLAC audio files and my CUDA decoder for FLAC audio files.

According to Figure 3, the serial way of decoding the FLAC format is quite time-consuming, as it has to go through the raw data array of the target FLAC audio file byte by byte and bit by bit to turn it into decoded information. However, there are iterations that we can take advantage of. The frames form the main body of a FLAC audio file and exist independently of each other. Therefore, we can turn these iterations of decoding work into parallel work, implemented on a CUDA device.

According to Figure 4, the metadata block is picked up and processed on its own, as it contains information we need, for example the block size and sample size. These in turn are used to locate the header position of each frame. Then the raw data array of the FLAC audio file is still traversed. However, it is not traversed bit by bit; instead, it is traversed by jumping, to locate the header position of each frame. It is worth mentioning that the subframes in each frame are still decoded iteratively, as the number of subframes is so small that they do not need to be decoded in parallel by further threads. Otherwise, more threads would be needed and the cost of the decoder would increase. For example, if the time consumed by each thread is t and the number of frames is n, then the cost of parallelizing over frames only is

\[ \mathrm{Cost} = t \cdot n \]

However, if more threads are created to also decode subframes in parallel, and the number of subframes in a frame is m, the cost is

\[ \mathrm{Cost} = t \cdot n \cdot m \]

and this is not cost-optimal.

Figure 4: workflow of CUDA FLAC audio file decoder

4.2 Algorithm

Algorithm: FLAC audio CUDA decoder.

int main(int argc, char **argv)
Input: raw data array of FLAC audio file data.
Output: residuals result[], all sorts of decode information

*data = read audio file
int byte_position, boff_position
int header_recorder[];
decode meta-data block;
while (position < data.size) {
    /* go through data to locate frame header position */
    get header_recorder[i]
    for (int s1; s1 < sumframe_number; s1++) {
        decode subframe
    }
}
allocate memories in device and host, *a_d, *a_h
copy header_recorder[] array across
copy data array across
/* kernel setup */
kernel_function <<< n_blocks, block_size >>> (pass arrays and parameters)

/* in kernel */
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < array_length - 1) {
    int bytes_position = header_recorder[idx + 1];
    decode frame header
    for (int s2; s2 < sumframe_number; s2++) {
        decode subframe
    }
}
copy a_d to a_h
present results

Assume that the number of frames is n, that the time taken to allocate and copy the arrays is T, and that the size of the data in bits is m.

As we can see from the general algorithm, the process of getting the frame header positions takes O(n), since we can ignore the iteration for decoding subframes because the number of subframes in each frame is small. The process of memory allocation and copying takes T. The kernel part takes O(m/n), where again the iteration for subframe decoding can be ignored. Thus, the upper bound on the total running time is:

\[ O(n) + T + O(m/n) \]

If we ignore the number of frames, as the number of frames is finite, this becomes:

\[ T + O(m/n) \]

Compared to the serial CPU decoder, which takes O(m), the speedup is

\[ S = \frac{O(m)}{T + O(m/n)} \]
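The index computation used in the kernel of the algorithm, blockIdx.x * blockDim.x + threadIdx.x, can be illustrated without a GPU. The sketch below emulates a one-dimensional launch with two nested loops in plain C; it is only an emulation of the indexing, not of parallel execution. The names `emulate_launch` and `decode_frame_stub` are hypothetical, and the guard `idx < n_frames` is the usual bounds check, which differs slightly from the `array_length - 1` guard in the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-frame decoding work: records that a frame was handled. */
static void decode_frame_stub(size_t frame, int *done)
{
    done[frame] += 1;
}

/* Emulate launching ceil(n_frames / block_size) blocks of block_size threads,
 * one thread per frame. Returns the number of blocks that were "launched". */
static size_t emulate_launch(size_t n_frames, size_t block_size, int *done)
{
    assert(block_size > 0 && done != NULL);
    size_t n_blocks = (n_frames + block_size - 1) / block_size;

    for (size_t block = 0; block < n_blocks; block++) {
        for (size_t thread = 0; thread < block_size; thread++) {
            /* same formula as blockIdx.x * blockDim.x + threadIdx.x */
            size_t idx = block * block_size + thread;
            if (idx < n_frames)          /* surplus threads do nothing */
                decode_frame_stub(idx, done);
        }
    }
    return n_blocks;
}
```

For 10 frames and a block size of 4, three blocks are needed and the last two threads of the third block are idle, which is why the bounds check is required.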

4.3 Documentation

The general method I take is to use the parallel computation features of a CUDA device to decode FLAC audio and so improve decoding performance. Firstly, based on an understanding of the FLAC format, a serial CPU version of the FLAC audio decoder was developed, which is able to decode all the information encoded in raw FLAC audio. Secondly, a simple version of the FLAC decoder was developed that eliminates all unnecessary computation and only records the byte positions of the frame headers in a FLAC audio file into an array. This array is later passed to the CUDA device, together with the original raw FLAC audio data, so that threads can obtain their starting points for decoding. Thirdly, a heavier CPU version of the FLAC audio decoder was developed: the simple decoder runs first, followed by a for-loop. Each iteration of the for-loop decodes one frame, so the iterations can be transferred to CUDA easily. The iterations of the for-loop determine the start positions of the frame headers by reading the frame header record array. However, this version is expected to take longer to decode. Finally, the two arrays above are transferred to the CUDA device, and each CUDA thread is responsible for decoding a single frame and recording the decoded results into a result array, which means that frames can be decoded concurrently. Later, the result arrays are merged and passed back to the host.

I chose the clock() function in C to record the running time. The serial CPU version of the FLAC decoder takes an average clock value of 26000 usec. The simple serial CPU version, which only records the positions of the frame headers, takes an average clock value of 20000 usec. Thus, the running time decreases by roughly 23%. However, the heavier (for-loop) version of the decoder takes an average clock value of 44000 usec. Performance is therefore worse in this situation, as the running time rises to about 169% of that of the serial CPU version (an increase of about 69%).
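As an illustration of this kind of measurement, the sketch below times a call with clock() and converts the tick difference to microseconds, and also computes the relative change between two timings. It is not the timing code of the artefact: `time_call_usec`, `busy_work` and `percent_change` are illustrative names, and `busy_work` is only a placeholder workload. Note that clock() reports processor time used by the program, not wall-clock time.

```c
#include <assert.h>
#include <time.h>

/* Convert a clock() tick interval to microseconds. */
static double ticks_to_usec(clock_t start, clock_t end)
{
    assert(end >= start);
    return (double)(end - start) * 1e6 / (double)CLOCKS_PER_SEC;
}

/* Time a single call of work(arg), in microseconds of processor time. */
static double time_call_usec(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    clock_t start = clock();
    work(arg);
    clock_t end = clock();
    return ticks_to_usec(start, end);
}

/* Placeholder workload: sums the integers below *arg. */
static void busy_work(void *arg)
{
    volatile unsigned long x = 0;
    unsigned long n = *(unsigned long *)arg;
    for (unsigned long i = 0; i < n; i++)
        x += i;
}

/* Relative change from `before` to `after`, in percent. */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the averages above, percent_change(26000, 20000) gives about -23% and percent_change(26000, 44000) gives about +69%.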

4.3.1 Defect and Issues in My Code

The main decoding information that I was supposed to obtain is the residual results. The residuals can indeed be obtained from the computation; in other words, they are there, since an individual array element can be assigned a residual, transferred back to the host and printed out:

/* code in CUDA kernel function */
...
a[11] = residual;
...
/* partial printed output information */
printf("%d %d\n", i, a_h[i]);
11 -97

However, there seem to be issues in getting the residual results into the device array in the CUDA kernel. If I index the array that stores the residuals according to the thread ID, the array does not receive the residuals, and the host prints out the initial values of the array.

a[idx * blocksize * 4 + i] = residual;

Nevertheless, this does not affect the timing measurements, as all residuals are computed correctly.


5 Experiments and Discussion

This section mainly presents the experimental results of my FLAC CUDA decoder for a variety of input recordings. Every experiment presented is followed by analysis and discussion.

5.1 Experimental Setup

The code is written in C and CUDA. The testing machine is a MacBook Pro (Retina, 13-inch, Mid 2014) running OS X Yosemite (version 10.10.1), with a 2.6 GHz Intel Core i5 processor, 8 GB of 1600 MHz DDR3 memory and Intel Iris 1536 MB graphics. Audacity 2.1.3 is used to convert recordings from other formats, for example WAV, into FLAC. It is free, open source, cross-platform software for recording and editing sounds, written by a worldwide team of volunteers. A screenshot of its interface is shown in Figure 5.

5.1.1 Reference Declaration for Coding

My supervisor, Dr. Eric McCreath, provided me with a Java version of a FLAC decoder, which is a simple serial decoder. I studied it and ported it to C myself; this is the serial CPU decoder mentioned before. I then implemented it in parallel in CUDA.

5.2 Experiments and Discussion in Testing Audio File

During code development, the original audio file “hellogpgpu” was used. It is a very short audio file that just contains my supervisor saying “hello world”, lasting roughly 1.55 seconds. It is short in order to save decoding time when developing and testing the code iteratively. From the results (Table 6) we can see that the GPU way of decoding the short “hello world” file performs much worse than a simple serial way of decoding on the CPU. Many factors could cause this. Memory allocation and array copying take time. The transfer of arrays (the raw data array and the frame header position array) between host and device takes time, especially for the raw data array. The output array that stores the decoded values on the device must also be copied back to the host, which again consumes time. Releasing the memory buffers also consumes time. Table 7 records timing data for the following blocks of code respectively: 1. the serial CPU part of the decoder, which collects the frame header positions into an array; 2. the execution time of the whole block of CUDA code; 3. the time to allocate memory and copy arrays across in the CUDA code; 4. the total running time of the decoder.

Figure 5: Audacity 2.1.3

Table 6: running time of the CPU and GPU code for the hellogpgpu file (usec)

run       CPU      GPU
1         123644   616798
2         138241   623195
3         136664   617894
4         146642   632349
5         141786   625069
6         139043   618867
7         138955   620244
8         145441   613137
9         152290   621918
10        140665   619462
average   140337   620893.3

From Table 7 we conclude that the serial CPU code, which gets the frame header positions, takes on average 2.30% of the decoder’s running time, while the CUDA part of the code takes on average 97.70%. It is worth mentioning that allocating and copying arrays in CUDA takes 99.9% of the time of the CUDA block and 97.6% of the entire decoder. From this we can conclude that the main reason why the CUDA decoder takes longer than the serial CPU decoder is the allocation and copying of arrays, especially the copying of the raw data array, which holds the raw bytes of the test FLAC audio file. However, because the test file at this stage is only a small file, array allocation and transfer are expected to take a significant proportion of the time. As the audio file gets larger, this proportion will fall.

5.3 Experiments and Discussion in Longer Recordings

The recording I used is from one of my supervisor’s lectures, the introduction to Software Design Methodologies (COMP2100). The audio file was converted to FLAC format using Audacity. Its length is 9 min 00 sec. [16]

Table 7: timing measurements across the code (usec)

run   serial CPU code         whole       array allocation   total running
      (get header position)   CUDA code   and copy           time of decoder
1     14498                   619691      619154             634189
2     14212                   610385      609871             624598
3     14175                   618562      617985             632738
4     14514                   610814      610266             625328
5     14846                   599390      598828             614236
6     14066                   604673      604109             618739
7     13788                   607660      607154             621449
8     14136                   600213      599670             614349
9     14446                   611329      610822             625776
10    14437                   616286      613033             630724
ave   14311.8                 609900.3    609089.2           624212.6

Further, the FLAC audio file was divided into pieces of different lengths in order to obtain a curve of decoding time against audio length. The time consumed by the CUDA code in the CUDA decoder, by data allocation and copying in the CUDA decoder, and by the serial CPU part of the CUDA decoder, together with the total time of the CUDA decoder and the total time of the CPU decoder, are listed in Table 8; the corresponding percentages are shown in Figure 7.

Further, a line chart is drawn in Figure 6 to better illustrate the trend in the time required by the CUDA decoder and the serial CPU decoder.

As we can see from Figure 6, for very short FLAC audio the serial CPU decoder performs better than the CUDA decoder. This is because the raw data of a short audio file is small, and the time the serial CPU decoder needs to go through it is shorter than the array allocation and copy process in the CUDA decoder. The crossover occurs at an audio length of roughly 0.5 min. Beyond that, the CUDA decoder needs less time than the serial CPU decoder, and the longer the input recording, the larger the difference.

From Figure 7 we can see that for very short audio, the CUDA code section of the CUDA decoder takes the major proportion of the total running time, while the serial CPU code section takes only a small part. As the length of the audio file increases, the proportion of the CUDA code section decreases while that of the CPU code section increases. According to Table 8, the two sections take equal proportions (50%) at an audio length between 2 and 3 min. For longer audio, the CPU code section takes a larger and larger proportion of the total running time, while the CUDA code section takes less and less. Eventually, the percentage of each code section becomes stable as the audio file becomes very long. Looking at the diagram, we can predict that the proportion of the CUDA code section will stabilize at around 30%, while that of the CPU code section will stabilize at around 70%. In addition, array copying and allocation account for the majority of the time consumed by the CUDA code section.

Table 8: timing measurements across the code (usec)

audio length   serial    whole       array allocation   total time for   total time for
(min:sec)      code      CUDA code   and copy           CUDA decoder     CPU decoder
01:00          370607    741412      721640             1112020          1257209
02:00          743059    864460      826503             1607519          2493914
03:00          1087937   985044      927567             2072983          3665659
04:00          1454574   1101494     1025933            2556069          4957457
05:00          1838306   1186842     1094508            3025149          6143912
06:00          2206805   1307023     1198747            3513829          7362516
07:00          2583807   1430106     1301621            4013915          8635175
08:00          2936351   1542562     1397865            4478914          10011936
09:00          3306155   1678796     1516878            4984952          11018142

Figure 6: Timing performance for different decoder

Figure 7: percentage information for different section of code in CUDA decoder


6 Future work

From the diagrams, results and discussion above, we can see that there areseveral bottlenecks in the CUDA decoder.

6.1 Idea to Reduce serially getting frame headers positions

Firstly, for longer recordings, as we can see from Figure 7, the serial CPU section of the code takes a significant part (70%) of the decoder’s time. The reason is that, although the serial CPU code only goes through the raw data array of the FLAC audio file to get the frame header positions, it still goes through it bit by bit. The longer the audio file, the longer it takes to locate the header positions. As every frame shares the same format and pattern, future work could identify fixed-size runs of bits that serve the same function in each frame and do not affect locating the frame header positions. A low-level function could then jump over these bits directly. As the size and number of these skippable blocks increase, the time consumed by serially getting the frame header positions decreases.

6.2 How array allocation and transfer can be reduced

Secondly, also from Figure 7 we can see that array allocation and copying between host and device also takes an important part of the time. It is worth mentioning that, within the CUDA section of the code, array allocation and copying takes a significant percentage of the time. Therefore, to reduce the time cost of the CUDA section, the key is to reduce the time cost of array transfer between device and host. One potential approach is to improve the hardware itself so that the transfer rate between device and host increases, which in turn reduces the time cost. For example, the bandwidth of the transfer between host and device could be increased, as the bandwidth obtained in the experiments is around 0.2 Mb/sec, which is not good enough.


7 Conclusion

This work presents a serial GPU streaming approach to decoding FLAC audio files in parallel. It shows that the features of the FLAC format, namely the fixed frame header pattern and the independent frames, can be exploited to implement parallel computation over frames. The experiments, result analysis and discussion show that the performance improvement of the artifact, the CUDA FLAC audio file decoder, over the serial CPU FLAC audio file decoder is significant. In detail, as the length of the FLAC recording increases, the CUDA decoder shows a more significant and obvious improvement.

8 Appendix

ReadMe

The artefact and a FLAC audio file are provided. The steps to execute my artefact are as follows:
1. Put the artefact and the testing FLAC audio file into the CUDA device directory.
2. Compile the artefact using nvcc -o compile_name artefact_name.
3. Execute the artefact using ./compile_name

Note that slight modifications are required for different FLAC audio files:
1. The path of the FLAC audio file differs in name.
2. Hard coding is required to eliminate the last frame of each FLAC file, as the size of the last frame differs.

INDEPENDENT STUDY CONTRACT
Note: Enrolment is subject to approval by the projects co-ordinator

SECTION A (Students and Supervisors)

UniID: ____u5819281____

SURNAME: _____Huang___________ FIRST NAMES: __Jiajin______________________

PROJECT SUPERVISOR (may be external): _Dr Eric McCreath___________________

COURSE SUPERVISOR (a RSCS academic): ________________________________________________

COURSE CODE, TITLE AND UNIT: _____COMP4560 _Advanced Computing Project

SEMESTER S1 S2 YEAR: Summer session 2016/2017 (6u) and Semester 1 2017 (6u)

PROJECT TITLE:

FLAC decoding using a serial streaming approach accelerated via GPU

LEARNING OBJECTIVES:

The student would gain a good understanding of binary formats, particularly the FLAC audio format, and GPGPU software development, with a focus on looking at performance relating to the decoding of a FLAC audio file. More generally the project would strengthen programming and problem solving abilities along with research skills associated with exploring approaches and ideas and then implementing, testing and evaluating these approaches and ideas.

Also it is expected that the student would gain general skills relating to: writing a report, and giving a seminar.

Research School of Computer Science Form updated Jun-12

PROJECT DESCRIPTION:

The project will explore the FLAC lossless audio format. This format has grown in popularity as a lossless audio format because the format is: “free”, relatively simple, and effective for compressing audio data. This project would explore a streaming approach for decoding of FLAC using a GPU. The approach would work using the main CPU to coordinate the decoding of FLAC and it would offload work, in a streaming fashion, to the GPU. So for example much of the computation involved in decoding relates to decoding the RICE encoded numbers; this work could be offloaded to the GPU, improving the overall decoding performance. The challenge would be coordinating and synchronizing the work for the GPU to complete. The performance and the performance bottlenecks will be evaluated for the proposed approach; in particular, memory transfer time and synchronization costs will be analysed.

Given the serial approach for FLAC decoding is fast, there is only a small room for improvement and this would only be worthwhile for long recordings. Such performance improvement would be most useful for audio editing or trans-coding software, as the simple serial approaches are easily fast enough for audio playing.

The project report will contain:
+ An introduction to the topic.
+ A background section which describes the FLAC format.
+ A section which provides a background to GPU computing.
+ A description of the algorithm for decoding.
+ A description of the implementation.
+ Experimental chapter which: describes the hardware used for evaluation, the experiments done, and the results tabulated/graphed.
+ Conclusion/discussion/limitations/future work chapter.


ASSESSMENT (as per course’s project rules web page, with the differences noted below):

Assessed project components: % of mark Due date Evaluated by:

Report: name style: _____________________________(e.g. research report, software description...) 50%

Artefact: name kind: ____________________________(e.g. software, user interface, robot...) 40%

Presentation:10%

MEETING DATES (IF KNOWN): During the summer session every few days, and then during semester 1 2017 weekly.

STUDENT DECLARATION: I agree to fulfil the above defined contract:

………………………………………………….. ………………………..Signature Date

SECTION B (Supervisor):

I am willing to supervise and support this project. I have checked the student's academic record and believe this student can complete the project.

………………………………………………….. ………………………..Signature Date

REQUIRED DEPARTMENT RESOURCES: + Most of the development can be done on the student's laptop.

SECTION C (Course coordinator approval)

………………………………………………….. ………………………..Signature Date

SECTION D (Projects coordinator approval)

………………………………………………….. ………………………..Signature Date


