COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING
![Page 1: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/1.jpg)
COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING
By Sudeep Gangavati
Department of Electrical Engineering
University of Texas at Arlington
Supervisor: Dr. K. R. Rao
![Page 2: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/2.jpg)
Outline
• Introduction to video compression
• Why H.264
• Overview of H.264
• Motivation
• Possible approaches
• Related work
• Theoretical estimation
• Proposed approach
• Parallel computing
• NVIDIA GPUs and CUDA Programming Model
• Complexity reduction using CUDA
• Results
• Conclusion and future work
![Page 3: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/3.jpg)
Introduction to video compression
• Video codec: software or a hardware device that can compress and decompress video.
• Need for compression: limited bandwidth and limited storage space.
• Several codecs: H.264, VP8, AVS China, Dirac, etc.
Figure 1 Forecast of mobile data usage
![Page 4: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/4.jpg)
Why H.264?
• H.264/MPEG-4 Part 10, or AVC (Advanced Video Coding), was standardized by ITU-T VCEG and ISO/IEC MPEG in 2003.
• Approximately 50% bit-rate reduction over MPEG-2 at comparable video quality.
• Most widely used standard.
• Built on the concepts of earlier standards like MPEG-2.
• Substantial compression efficiency.
• Network-friendly data representation.
• Improved error resiliency tools.
• Supports various applications.
![Page 5: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/5.jpg)
Overview of H.264
There are two parts:
◦ Encoder: carries out prediction, transform, quantization and entropy encoding to produce an H.264 bit-stream.
◦ Decoder: carries out entropy decoding, inverse quantization and inverse transform to reconstruct the encoded video.
![Page 6: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/6.jpg)
H.264 encoder
![Page 7: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/7.jpg)
H.264 decoder
![Page 8: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/8.jpg)
Intra prediction
• Exploits spatial redundancies
• 9 prediction modes (8 directional plus DC) for 4 x 4 luma blocks
• 4 modes for 16 x 16 luma blocks
• 4 modes for 8 x 8 chroma blocks
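As an illustration of how two of the 4 x 4 luma modes form a prediction from neighbouring pixels, here is a minimal C sketch of the vertical and DC modes. The function names and the `above`/`left` arrays are hypothetical, and both neighbours are assumed to be available; the slides do not show the author's code.

```c
#include <stdint.h>

/* Vertical mode: every row of the 4x4 prediction copies the four
   reconstructed pixels directly above the block. */
void intra4x4_vertical(const uint8_t above[4], uint8_t pred[4][4])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            pred[r][c] = above[c];
}

/* DC mode: every pixel is the rounded mean of the four pixels above
   and the four pixels to the left (both assumed available here). */
void intra4x4_dc(const uint8_t above[4], const uint8_t left[4],
                 uint8_t pred[4][4])
{
    int sum = 4; /* rounding offset applied before the shift by 3 */
    for (int i = 0; i < 4; i++)
        sum += above[i] + left[i];
    uint8_t dc = (uint8_t)(sum >> 3);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            pred[r][c] = dc;
}
```

An encoder would typically evaluate each candidate mode and keep the one with the lowest cost.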
![Page 9: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/9.jpg)
Intra prediction
9 modes for 4 x 4 luma blocks
4 modes for 16 x 16 luma blocks
![Page 10: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/10.jpg)
Inter prediction
• Exploits temporal redundancy
• Involves prediction from one or more previous frames called reference frames
![Page 11: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/11.jpg)
Motion estimation and compensation
• Motion estimation and compensation is the process of finding the matching block in a reference frame.
• A motion search is performed.
• Motion vectors are obtained that give the displacement of the block.
![Page 12: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/12.jpg)
Transform, Quantization and Encoding
• The prediction residuals are then transformed.
• H.264 employs an integer transform, which is an approximation of the DCT.
• After the transform, the values are quantized for compression.
• Entropy encoding: CAVLC / CABAC
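The integer transform mentioned above can be sketched in C as follows. The matrix is the 4 x 4 forward core transform of H.264; the scaling that the standard folds into quantization is omitted, and the function name is hypothetical.

```c
/* Forward 4x4 core transform matrix Cf (integer-only approximation
   of the DCT; post-scaling is left to the quantizer). */
static const int CF[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

/* Computes Y = Cf * X * Cf^T for a 4x4 residual block X. */
void forward_transform4x4(int x[4][4], int y[4][4])
{
    int tmp[4][4];

    /* tmp = Cf * X */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            tmp[i][j] = 0;
            for (int k = 0; k < 4; k++)
                tmp[i][j] += CF[i][k] * x[k][j];
        }

    /* y = tmp * Cf^T */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                y[i][j] += tmp[i][k] * CF[j][k];
        }
}
```

For a flat residual block all of the energy ends up in the top-left (DC) coefficient, which is what makes the subsequent quantization effective.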
![Page 13: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/13.jpg)
Motivation
[Pie chart: encoding time by module. Legend: Motion Estimation, Transform and quantization, Intra Prediction, VLC Encoding and others. Slice labels: 90%, 5%, 2%, 3%.]
Time profiling of the H.264 encoder showed:
• Motion estimation takes more time than any other module in H.264.
• This time needs to be reduced by an efficient implementation, without sacrificing video quality or bitrate.
• With reduced motion estimation time, the total encoding time is reduced.
![Page 14: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/14.jpg)
Possible approaches
Encoder optimization levels:
◦ Algorithmic level: develop new algorithms similar to the three-step search algorithm, fast mode decision algorithms, etc.
◦ Compiler level: efficient programming.
◦ Implementation level: parallel programming using CUDA or OpenMP to utilize the underlying hardware.
![Page 15: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/15.jpg)
Related work

| Author | Features | Advantages | Disadvantages |
|---|---|---|---|
| 1. Chan et al. [41] | Considers a pyramid algorithm for motion estimation. | Considers the predicted motion vector when calculating SAD. | 1. Video quality degradation. 2. RD performance is not considered. |
| 2. Lee et al. [40] | Multi-pass motion estimation. Generates local and global SADs in the first and second passes. A fast ME search algorithm is used. | 6 times speedup achieved compared to the standard implementation. | 1. Focuses only on speed, not on rate and distortion. 2. Threads are invoked for pixels. 3. Video resolution limits thread creation. |
| 3. Rodriguez et al. [42] | Considers a tree-structured motion estimation algorithm. | Three sequential steps: 1. SAD calculation. 2. Binary reduction algorithm. 3. Cost reduction. | 1. Implementation results in higher bitrate. 2. RD performance is not shown. |
| 4. Cheung et al. [44] | Based on simplified unsymmetrical multi-hexagon search. Divides the frame into tiles. | 3 times speedup. A thread is created for each tile. | Penalty in video quality. |
| 5. NVIDIA Encoder | Searching algorithm is not disclosed. No documentation on internal details. | Provides 4 times speedup. Very good visual quality. | Fixed search range. |
![Page 16: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/16.jpg)
Issues with previous work
• Focus only on the achievable speedup.
• Do not consider methods to decrease the bitrate.
• Do not consider techniques to maintain video quality.
• Thread creation overhead and limitations in some approaches.
![Page 17: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/17.jpg)
Theoretical estimation by Amdahl's Law [43]
• We use this law to find the maximum achievable speedup.
• Widely used in parallel computing to predict the theoretical maximum speedup using multiple processors.
• Amdahl's law states that if P is the proportion of a program that can be made parallel and (1 - P) is the proportion that cannot be parallelized, then the maximum speedup that can be achieved by using N processors is

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$
![Page 18: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/18.jpg)
Using Amdahl's Law
• Approximation of the speedup achieved upon parallelizing a portion of the code
◦ P: parallelized portion
◦ N: number of processor cores
• In the encoder code, motion estimation accounts for approximately two-thirds of the code.
• Applying the law, the maximum speedup that can be achieved in our case is 2.2 times, or a 55% time reduction.
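To experiment with the estimate, Amdahl's law can be written as a small C helper. This is only a sketch: the function names are hypothetical, and the value of N behind the 2.2 times figure is not stated on the slide, so the inputs are left to the reader.

```c
/* Amdahl's law: speedup for a parallel fraction p (0..1) of the
   program when that fraction runs on n processors. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Fraction of the original run time saved for a given speedup. */
double time_reduction(double speedup)
{
    return 1.0 - 1.0 / speedup;
}
```

Two properties are worth noting. As N grows, the speedup approaches 1 / (1 - P), which is 3 for P = 2/3. A speedup of 2.2 corresponds to a time reduction of 1 - 1/2.2, or about 55%, matching the figure quoted above.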
![Page 19: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/19.jpg)
Proposed work
We propose the following to address the problem:
◦ Use CUDA for a parallel implementation that computes the SAD (sum of absolute differences) faster, with one thread per block instead of one thread per pixel to address the thread creation overhead and limitation.
◦ Use a better search algorithm to maintain the video quality.
◦ Combine SAD cost values to save bitrate.
The above methods address all the issues mentioned earlier.
In addition, we utilize the shared and texture memory of the GPU, which reduces global memory references and provides faster memory access.
![Page 20: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/20.jpg)
Parallel Computing
• Multi-core and many-core processors improve efficiency through parallel processing.
• Parallel processing provides a significant performance improvement.
• Techniques to program software on multi-core processors:
◦ Data parallelism
◦ Task parallelism
![Page 21: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/21.jpg)
Parallel Computing
Data Parallelism
◦ Split the large data set into smaller parts and execute them in parallel. After the execution, the results are grouped.
![Page 22: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/22.jpg)
Parallel Computing
Task Parallelism
◦ Distribute threads to different processors
◦ Data could be common
◦ May execute the same or different code
![Page 23: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/23.jpg)
NVIDIA GPU And CUDA Programming Model
NVIDIA pioneered Graphics Processing Unit (GPU) technology. First GPU: GeForce 256 in 1999, had 128 MB of graphics memory.
GPUs, which consist of many processor cores, are used in applications requiring large amounts of computation.
CPU-GPU Heterogeneous Model
![Page 24: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/24.jpg)
Host-Device Connection
![Page 25: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/25.jpg)
Compute Unified Device Architecture (CUDA)
• NVIDIA introduced CUDA in 2006.
• A programming model that makes programs run on the GPU.
• The serial portions of our program are written as C/C++ functions.
• Parallel portions are written as GPU kernels.
• C/C++ functions execute on the CPU; kernels are sent to the GPU for processing.
![Page 26: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/26.jpg)
Problem decomposition
• Serial C functions run on the CPU
• CUDA kernels run on the GPU
![Page 27: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/27.jpg)
Hardware Architecture
• Main element: streaming multiprocessor (SM)
• The GT550M series has 2 SMs
• Each SM has 48 cores
• Each SM is capable of executing 1536 threads
• A total of 3072 threads running in parallel
![Page 28: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/28.jpg)
Threading
• Threads are grouped into blocks
• Blocks are grouped into grids
• All threads within a block execute on the same SM
![Page 29: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/29.jpg)
Complexity reduction using CUDA
Motion estimation: Process of finding the best matching block.
![Page 30: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/30.jpg)
Complexity reduction using CUDA
• To find the best matching block, a search is done in the search window (or region).
• The search finds the best matching block by computing the difference between blocks, i.e. the sum of absolute differences (SAD).
![Page 31: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/31.jpg)
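Complexity reduction using CUDA

$$\mathrm{SAD}(dx, dy) = \sum_{m=x}^{x+N-1} \sum_{n=y}^{y+N-1} \left| I_k(m, n) - I_{k-1}(m + dx, n + dy) \right|$$

Here $I_k$ is the current frame, $I_{k-1}$ is the reference frame, $(x, y)$ is the top-left corner of the $N \times N$ block and $(dx, dy)$ is the candidate displacement.
• Search through a search range of 8, 16 or at most 32.
• Select the block with the least SAD.
• The larger the block size, the more computations are required.
[Figure: a 352 x 288 frame]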
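The SAD computation and the "select the block with the least SAD" step can be sketched on the CPU in plain C. This is a serial reference sketch, not the author's CUDA code: the names `sad_block`, `full_search` and `MotionVector` are hypothetical, frames are assumed to be 8-bit luma in row-major order, and candidates that fall outside the reference frame are simply skipped.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int dx, dy, sad; } MotionVector;

/* SAD between the n x n block at (x, y) in cur and the block displaced
   by (dx, dy) in ref. Both frames are row-major with the given stride. */
int sad_block(const uint8_t *cur, const uint8_t *ref, int stride,
              int x, int y, int dx, int dy, int n)
{
    int sad = 0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            sad += abs(cur[(y + j) * stride + (x + i)] -
                       ref[(y + dy + j) * stride + (x + dx + i)]);
    return sad;
}

/* Exhaustive search over displacements in [-range, range], skipping
   candidates that would read outside the reference frame. */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height,
                         int x, int y, int n, int range)
{
    MotionVector best = { 0, 0, INT_MAX };
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            if (x + dx < 0 || y + dy < 0 ||
                x + dx + n > width || y + dy + n > height)
                continue;
            int sad = sad_block(cur, ref, width, x, y, dx, dy, n);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    return best;
}
```

The cost grows with both the block size and the square of the search range, which is why this step dominates encoding time and is the natural target for parallelization.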
![Page 32: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/32.jpg)
Standard algorithm
• Divide the frame into 16 x 16 blocks.
• Further divide each block into sub-blocks of 8 x 8.
• Search through the search area.
• Compute the SAD and obtain the MVs.
![Page 33: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/33.jpg)
Our approach
• Main idea is to:
– Minimize memory references and memory transfers
– Make use of shared memory and texture memory
– Use a single thread to compute the SAD for a single block
– Make thread block creation dependent on the frame size for scalability
– Invoke a large number of threads that run in parallel; each thread block consists of 396 threads that compute the SADs of 396 blocks of 8 x 8
![Page 34: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/34.jpg)
SAD mapping to threads
• Blocks for 352 x 288: (352/8) * (288/8) = 1584 blocks for which the SAD is to be computed.
• Total thread blocks = 4, each with 396 threads.
• This makes the approach scalable. For a video with higher resolution, like 704 x 480 (4SIF) or 704 x 576 (4CIF), we can create 16 thread blocks of 396 threads each (4CIF has (704/8) * (576/8) = 6336 = 16 * 396 blocks of 8 x 8).
• So the number of threads created depends on the video resolution.
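The arithmetic behind this mapping can be checked with a few lines of C. The helpers below are hypothetical, and the ceiling division and the row-major assignment of threads to 8 x 8 blocks are assumptions; the slides do not say how the author orders the blocks. On the GPU, `block_idx` and `thread_idx` would come from `blockIdx.x` and `threadIdx.x`.

```c
typedef struct { int x, y; } BlockOrigin;

/* Number of block x block tiles in a width x height frame
   (dimensions assumed to be multiples of block). */
int count_blocks(int width, int height, int block)
{
    return (width / block) * (height / block);
}

/* Thread blocks needed when each holds threads_per_block threads and
   every thread handles one tile (ceiling division). */
int thread_blocks_needed(int total_blocks, int threads_per_block)
{
    return (total_blocks + threads_per_block - 1) / threads_per_block;
}

/* Top-left pixel of the tile handled by a given thread, assuming
   tiles are numbered row by row across the frame. */
BlockOrigin tile_for_thread(int block_idx, int thread_idx,
                            int threads_per_block, int width, int block)
{
    int id = block_idx * threads_per_block + thread_idx;
    int tiles_per_row = width / block;
    BlockOrigin o = { (id % tiles_per_row) * block,
                      (id / tiles_per_row) * block };
    return o;
}
```

With 396 threads per thread block this gives 4 thread blocks for CIF and 16 for 4CIF. 4SIF has 5280 tiles, which the same ceiling division covers with 14 thread blocks, so the 16 quoted above is sufficient for it as well.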
![Page 35: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/35.jpg)
Performance enhancements
We consider rate-distortion (RD) criteria and employ the following techniques:
◦ To minimize bitrate: calculate the cost for the smaller 8 x 8 sub-blocks and combine 4 of these to form a single cost for the 16 x 16 block.
◦ To enhance video quality: incorporate the exhaustive full search algorithm, which calculates the matching block for the entire frame without skipping any blocks, as opposed to other algorithms. Previous studies [30] show that this algorithm provides the best performance. Although it is computationally expensive, it is used with video quality in mind.
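One plausible reading of the cost-combining step is that the SADs of the four 8 x 8 sub-blocks, taken at the same displacement, are added to give the 16 x 16 cost. The C sketch below assumes exactly that; the slides do not spell out the combination rule, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD of one 8x8 sub-block whose top-left corner is (x, y) in cur,
   against ref displaced by (dx, dy). Frames are row-major. */
int sad8x8(const uint8_t *cur, const uint8_t *ref, int stride,
           int x, int y, int dx, int dy)
{
    int sad = 0;
    for (int j = 0; j < 8; j++)
        for (int i = 0; i < 8; i++)
            sad += abs(cur[(y + j) * stride + x + i] -
                       ref[(y + dy + j) * stride + x + dx + i]);
    return sad;
}

/* Cost of the 16x16 macroblock at (x, y), formed by adding the SADs of
   its four 8x8 sub-blocks at the same displacement (assumed rule). */
int sad16x16_from_8x8(const uint8_t *cur, const uint8_t *ref, int stride,
                      int x, int y, int dx, int dy)
{
    return sad8x8(cur, ref, stride, x,     y,     dx, dy) +
           sad8x8(cur, ref, stride, x + 8, y,     dx, dy) +
           sad8x8(cur, ref, stride, x,     y + 8, dx, dy) +
           sad8x8(cur, ref, stride, x + 8, y + 8, dx, dy);
}
```

Under this assumption the 8 x 8 results are reused rather than recomputed, so the 16 x 16 cost comes at the price of three additions per candidate.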
![Page 36: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/36.jpg)
Memory access
Memory access from texture memory to shared memory
Memcpy API to move data into the array we allocated:
    cudaMemcpyToArray(a_before_dilated,             // array pointer
                      0,                            // array offset width
                      0,                            // array offset height
                      h_before_dilated,             // source
                      width*height*sizeof(uchar1),  // size in bytes
                      cudaMemcpyHostToDevice);      // type of memcpy
Texture Memory
Shared memory
![Page 37: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/37.jpg)
Performance Metrics
![Page 38: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/38.jpg)
Test Sequences
![Page 39: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/39.jpg)
Results
[Figure: bar chart, "Comparison of average encoding time for QCIF sequences". X axis: QCIF video sequences (Akiyo, Carphone, News, Container, Foreman). Y axis: time in seconds (0 to 700). Series: Reference Software, Optimized Software, NVIDIA Encoder.]
The CPU-GPU encoder performs better than the CPU-only encoder but falls short of the NVIDIA Encoder. This is because the NVIDIA Encoder is heavily optimized at all levels of H.264, not just motion estimation. NVIDIA has also not disclosed the search algorithm it uses, and the choice of motion search algorithm significantly affects quality, bitrate and speed.
The theoretical speedup was about 2.2 to 2.5 times; the results show a speedup of approximately 2 times. The gap can be attributed to factors not considered in the estimate, such as the time taken by load and store operations for functions, transfer of control to the GPU, memory transfers and references, and other H.264 calculations.
![Page 40: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/40.jpg)
Results for QCIF video sequences
[Figure: four plots of PSNR (dB) vs. bitrate (200 kbps to 1000 kbps) for the Akiyo, Carphone, Container and Foreman sequences, each comparing Reference software, Optimized software and NVIDIA Encoder.]
![Page 41: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/41.jpg)
Results
[Figure: bar chart of SSIM for QCIF sequences (Akiyo, Carphone, News, Container, Foreman). Y axis: SSIM (0.82 to 0.96). Series: Reference Software, Optimized software, NVIDIA Encoder.]
SSIM measures the structural similarity between the input and output videos. Values here range from 0.0 (lowest quality) to 1.0 (highest quality).
![Page 42: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/42.jpg)
Results
[Figure: bar chart, "Comparison of average encoding time for CIF sequences". X axis: CIF video sequences (Akiyo, Carphone, News, Container, Foreman). Y axis: time in seconds (0 to 2500). Series: Reference Software, Optimized software, NVIDIA Encoder.]
Similar behavior is observed in case of CIF video sequences.
![Page 43: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/43.jpg)
Results
[Figure: four plots of PSNR (dB) vs. bitrate (200 kbps to 1000 kbps) for CIF sequences, each comparing Reference software, Optimized software and NVIDIA Encoder. Recoverable panel titles: News, Carphone, Container.]
![Page 44: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/44.jpg)
Results
SSIM values for our optimized software and NVIDIA encoder are very close.
[Figure: bar chart of SSIM for CIF sequences (Akiyo, Carphone, News, Container, Foreman). Y axis: SSIM (0.91 to 0.97). Series: Reference software, Optimized software, NVIDIA Encoder.]
![Page 45: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/45.jpg)
Conclusions
• Nearly 50% reduction in encoding time on various sequences, close to the theoretical estimation.
• Little degradation in video quality is observed.
• A lower bitrate is obtained by combining the SAD costs of sub-blocks into the cost of the larger macroblock.
• SSIM, bitrate and PSNR are close to the values obtained without optimizations.
• Data parallelism is achieved.
• With little modification to the code, the approach scales to better hardware and higher video resolutions.
![Page 46: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/46.jpg)
Limitations
• In a sequential search, when the sum of absolute differences up to the kth row (k < 8) already exceeds the current best SAD, there is no need to compute further. Because the threads work concurrently, no best SAD is available until the threads have finished calculating, so this early termination cannot be used.
• The search range cannot be modified while encoding is in progress.
• Since the implementation is tied to the GPU hardware, the performance largely depends on the type of hardware used.
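For reference, the row-wise early termination described in the first limitation looks roughly like this in a serial C implementation. The function name and the `rows_done` output are hypothetical, and both pointers are assumed to address the top-left pixel of the 8 x 8 blocks being compared.

```c
#include <stdint.h>
#include <stdlib.h>

/* 8x8 SAD with row-wise early exit: stops as soon as the partial sum
   exceeds best_so_far, since the candidate can no longer win.
   Returns the (possibly partial) sum; rows_done reports rows visited. */
int sad8x8_early_exit(const uint8_t *cur, const uint8_t *ref, int stride,
                      int best_so_far, int *rows_done)
{
    int sad = 0;
    int j;
    for (j = 0; j < 8; j++) {
        for (int i = 0; i < 8; i++)
            sad += abs(cur[j * stride + i] - ref[j * stride + i]);
        if (sad > best_so_far) {
            j++;        /* count the row just processed */
            break;
        }
    }
    *rows_done = j;
    return sad;
}
```

The check depends on a running best value shared across candidates, which is exactly what is missing when every candidate is evaluated by an independent thread at the same time.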
![Page 47: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/47.jpg)
Future work
• Other operations in H.264, like filtering and entropy encoding, can be parallelized.
• Block dependencies are not considered in this approach. Handling them could be challenging but would result in higher compression efficiency.
• Different profiles, like the High and Main profiles, can be used for implementation.
• Different motion estimation algorithms can be implemented in parallel and later incorporated into H.264.
• CUDA can be applied to HEVC, the next-generation video coding standard and successor to H.264. HEVC is known to be more complex than H.264.
![Page 48: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/48.jpg)
Thank You
![Page 49: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/49.jpg)
References
[1] I. E. Richardson, “The H.264 advanced video compression standard”, 2nd Edition, Wiley, 2010.
[2] S. Kwon, A. Tamhankar and K. R. Rao, “Overview of H.264/MPEG-4 part 10”, Journal of Visual Communication and Image Representation, vol. 17, no. 2, pp. 186-216, April 2006.
[3] Draft ITU-T Recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC), Mar. 2003.
[4] G. Sullivan, “Overview of international video coding standards (preceding H.264/AVC)”, ITU-T VICA Workshop, July 2005.
[5] T. Wiegand, et al., “Overview of the H.264/AVC video coding standard”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003.
[6] M. Jafari and S. Kasaei, “Fast intra- and inter-prediction mode decision in H.264 advanced video coding”, International Journal of Computer Science and Network Security, vol. 8, no. 5, pp. 1-6, May 2008.
[7] W. Chen and H. Hang, “H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)”, 2008 IEEE International Conference on Multimedia and Expo, pp. 697-700, 26 April 2008.
![Page 50: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/50.jpg)
References
[8] Y. He, I. Ahmad and M. Liou, “A software-based MPEG-4 video encoder using parallel processing”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 909-920, November 1998.
[9] D. Marpe, T. Wiegand and G. J. Sullivan, “The H.264/MPEG-4 AVC standard and its applications”, IEEE Communications Magazine, vol. 44, pp. 134-143, Aug. 2006.
[10] Z. Wang, et al., “Image quality assessment: From error visibility to structural similarity”, IEEE Transactions on Image Processing, vol. 13, pp. 600-612, April 2004.
[11] G. Sullivan, P. Topiwala and A. Luthra, “The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions”, SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, 2004.
[12] A. Puri, X. Chen and A. Luthra, “Video coding using the H.264/MPEG-4 AVC compression standard”, Signal Processing: Image Communication, vol. 19, pp. 793-849, 2004.
[13] K. R. Rao and P. Yip, “Discrete cosine transform”, Academic Press, 1990.
[14] H. Yadav, “Optimization of the deblocking filter in H.264 codec for real time implementation”, M.S. Thesis, E.E. Dept, UT Arlington, 2006.
[15] Introduction to parallel computing, https://computing.llnl.gov/tutorials/parallel_comp/
![Page 51: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/51.jpg)
References
[16] J. Kim, et al., “Complexity reduction algorithm for intra mode selection in H.264/AVC video coding”, J. Blanc-Talon et al. (Eds.): ACIVS 2006, LNCS 4179, pp. 454-465, Springer-Verlag, Berlin, Heidelberg, 2006.
[17] B. Jung, et al., “Adaptive slice-level parallelism for real-time H.264/AVC encoder with fast inter mode selection”, Multimedia Systems and Applications X, edited by S. Rahardja, J. W. Kim and J. Luo, Proc. of SPIE, vol. 6777, 67770J, 2007.
[18] S. Ge, X. Tian and Y.-K. Chen, “Efficient multithreading implementation of H.264 encoder on Intel Hyper-threading architectures”, ICICS-PCM 2003.
[19] T. Rauber and G. Runger, “Parallel programming for multicore and cluster systems”, 2nd Edition, Wiley, 2008.
[20] D. Ailawadi, M. K. Mohapatra and A. Mittal, “Frame-based parallelization of MPEG-4 on Compute Unified Device Architecture (CUDA)”, IEEE Conference on Advanced Computing, pp. 267-272, 2010.
![Page 52: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/52.jpg)
References
[21] M. A. F. Rodriguez, “CUDA: Speeding up parallel computing”, International Journal of Computer Science and Security, November 2010.
[22] NVIDIA, NVIDIA CUDA Programming Guide, Version 3.2, NVIDIA, September 2010.
[23] J. Erickson, “GPU Computing Isn’t Just About Graphics Anymore”, online article, February 2008, http://drdobbs.com/high-performance-computing/206900471
[24] J. Nickolls and W. J. Dally, “The GPU Computing Era”, IEEE Micro, vol. 30, issue 2, pp. 56-69, April 2010.
[25] M. Abdellah, “High performance Fourier volume rendering on graphics processing units”, M.S. Thesis, Systems and Bio-Medical Engineering Department, Cairo University, 2012.
[26] J. Sanders and E. Kandrot, “CUDA by example: an introduction to general-purpose GPU programming”, Addison-Wesley Professional, 2010.
[27] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, White Paper, Version 1.1, NVIDIA, 2009.
[28] NVIDIA, Best Programming Practices, 2009.
[30] P. Kuhn, “Algorithms, complexity analysis and VLSI architectures for MPEG-4 motion estimation”, Kluwer Academic, 1999.
![Page 53: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/53.jpg)
References
[31] K. Shen and E. J. Delp, “A spatial-temporal parallel approach for real time MPEG video compression”, Proc. of 25th International Conference on Parallel Processing, pp. 100-107, 1996.
[32] JM 16.0 software, http://iphome.hhi.de/suehring/tml/
[33] JM Reference Software Manual, http://iphome.hhi.de/suehring/tml/JM Reference Software Manual (JVT-AE010).pdf
[34] D. Han, A. Kulkarni and K. R. Rao, “Fast inter-prediction mode decision algorithm for H.264 video encoder”, ECTICON 2012, Cha Am, Thailand, May 2012.
[35] S. Sun, et al., “A highly efficient parallel algorithm for H.264 encoder based on macro-block region partition”, Springer-Verlag, Berlin, Heidelberg, pp. 577-585, 2007.
[36] Test sequences, http://trace.eas.asu.edu/yuv/
[37] D. Kirk and W.-M. Hwu, “Programming massively parallel processors: A hands-on approach (Applications of GPU Computing series)”, Morgan Kaufmann, 2010.
[38] Flynn's Taxonomy, http://www.phy.ornl.gov/csep/ca/node11.html
[39] T. Saxena, “Reducing the encoding time of H.264 Baseline profile using parallel programming”, M.S. Thesis, E.E. Dept, UT Arlington, 2012.
[40] Lee et al., “Multi-pass and frame parallel algorithms of motion estimation of H.264/AVC for generic GPU”, Proc. of International Conference on Multimedia and Expo, Beijing, China, 2007.
[41] Chan et al., “Parallelizing H.264 Motion Estimation Algorithm using CUDA”, MIT IAP 2009.
[42] Rodriguez et al., “Accelerating H.264 Inter Prediction in a GPU by using CUDA”, Proc. of the 2010 ICCE, pp. 463-464, 2010.
[43] “Amdahl's Law”, http://www.futurechips.org/thoughts-for-researchers/parallel-programming-gene-amdahl-said.html
[44] Cheung et al., “Video coding on Multi-core graphics processors”, IEEE Signal Processing Magazine, March 2010.
![Page 54: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/54.jpg)
Appendix
![Page 55: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/55.jpg)
CUDA Memory Model
![Page 56: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/56.jpg)
GPU Hardware Specs
![Page 57: COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING](https://reader035.fdocuments.us/reader035/viewer/2022062323/56816806550346895ddd8858/html5/thumbnails/57.jpg)
SSIM
• Unlike the techniques mentioned previously, such as MSE or PSNR, which estimate perceived errors, SSIM considers image degradation as a perceived change in structural information. Structural information is the idea that pixels have strong inter-dependencies, especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
• The SSIM metric is calculated on various windows of an image. The measure between two windows x and y of common size N×N is:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu_x$ is the average of x; $\mu_y$ the average of y; $\sigma_x^2$ the variance of x; $\sigma_y^2$ the variance of y; $\sigma_{xy}$ the covariance of x and y; and $C_1$ and $C_2$ are two variables that stabilize the division when the denominator is weak.
• To evaluate image quality, this formula is applied only on luma. The resultant SSIM index is a decimal value between -1 and 1, and the value 1 is only reachable in the case of two identical sets of data. Typically it is calculated on window sizes of 8×8. The window can be displaced pixel by pixel over the image, but the authors propose to use only a subgroup of the possible windows to reduce the complexity of the calculation.
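The window measure described above can be sketched in C as follows. This is an illustrative implementation under stated assumptions: the function name is hypothetical, population (divide by n) statistics are used, and the stabilizers are set to the commonly used values C1 = (0.01 * 255)^2 and C2 = (0.03 * 255)^2 for 8-bit data, which the slide itself does not specify.

```c
/* SSIM between two windows x and y of n samples each (for example
   n = 64 for an 8x8 luma window). Uses population mean, variance
   and covariance. */
double ssim_window(const unsigned char *x, const unsigned char *y, int n)
{
    /* Assumed stabilizers for 8-bit data:
       C1 = (0.01 * 255)^2, C2 = (0.03 * 255)^2. */
    const double c1 = 6.5025, c2 = 58.5225;

    double mx = 0.0, my = 0.0;
    for (int i = 0; i < n; i++) {
        mx += x[i];
        my += y[i];
    }
    mx /= n;
    my /= n;

    double vx = 0.0, vy = 0.0, cov = 0.0;
    for (int i = 0; i < n; i++) {
        vx  += (x[i] - mx) * (x[i] - mx);
        vy  += (y[i] - my) * (y[i] - my);
        cov += (x[i] - mx) * (y[i] - my);
    }
    vx /= n;
    vy /= n;
    cov /= n;

    return ((2.0 * mx * my + c1) * (2.0 * cov + c2)) /
           ((mx * mx + my * my + c1) * (vx + vy + c2));
}
```

A frame-level score would typically be obtained by averaging this value over the chosen set of windows.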