GPU accelerated HEVC decoder on Mali™ T600
Ittiam Systems Introduction
DSP Professionals Survey by Forward Concepts
World’s most preferred DSP IP supplier
2004 2005 2006
DSP Systems IP Company
• Multimedia + Communication Systems
• Multimedia Components, Systems, Hardware
• Focus on Broadcast, Video Communication, Video Security, Mobile
IP Licensing Business Model
• Founded in 2001
• Venture funded
• Flexible mix of one time fees and royalties for licensing
300+ licensees
• Worldwide
• Fortune 100 companies, Tier 1 OEMs
• Consistently rated as Most Preferred DSP IP Supplier
250 strong Engineering Team
• World Class Talent
• Deep Multimedia and end application Expertise
• 29 patents issued, 30+ patents filed
Ittiam Multimedia Overview
Multimedia Components
• Audio Codecs
• Video Codecs / Image Codecs
• Algorithms for Audio Effects, Acoustics, Imaging
• ARM® CPU, NEON™ Optimized
• DSP + HW Accelerators + GPU expertise and capabilities
Middleware + SDKs
• System components: Parsers, Creators, Stacks, Subtitles
• Multimedia Integration: Android, Other Frameworks
• Use Case validation
• Enhancements to existing Middleware
• Application Specific SDKs
OEM Applications
• Complete Multimedia Applications
• Covers major Multimedia Use Cases
• Camera, Gallery, Editor, Players, Video Editor
• Production tested
• Customizable to requirements
Ittiam Multimedia Solutions and ARM
Strategic Platform
• Focus on Mobile, Home, Portable segments
• ARM® Connected Community Member
• Strong Portfolio of IP
• Expertise in ARM architecture and optimizations for ARM
Long Investment
• Many years of development on ARM® Platforms
• Covering ARM9E, ARM11, Cortex™ A8, A9, A15, A5, A7, A12 and NEON™
• In house developed reference C models for all IP
• Efficient, targeted for ARM, validated across multiple generations
Partnership
• Joint Benchmarking of implementations
• Early Access to Mali™/OpenCL™ information
• Early involvement on new platforms
Ittiam Media Processing Elements
Audio Codecs
• Stereo and Multichannel MP12, AAC-LC/HE v1 & v2, AC3, DD+
• High Quality Resampler
• Post Processing and Audio Effects
• Field Proven
Video Codecs
• MPEG2, MPEG-4, H.264, HEVC / H.265
• Scalable across Multiple ARM Cores
• Optimized for bandwidth and CPU + NEON
• Error Resilience for Streaming Use cases
• In Production
Acoustics
• Voice Quality Enhancements with Echo Cancellation (AEC), Noise Reduction (ANR)
• Equalizer for Microphone & Speaker
• AGC, AVC, Audio De-Reverb
• Mic Beam Forming
Image Processing
• De-noise, Face detection, Red-eye correction
• Panorama, HDR, Low Light, 3D
• B&W, Sepia, Cross Process
• Exposure, Colours, Geometric, Filters
HEVC Overview
HEVC / H.265 Standard
HEVC aka H.265 is a video compression standard, jointly developed by ISO/IEC MPEG and ITU-T VCEG
MPEG and VCEG have established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the HEVC standard
HEVC is the successor to the H.264 standard
HEVC can support ultra high resolutions up to 8192 x 4320 pixels
HEVC offers substantially higher video compression ratio compared to existing standards
H.265 vs H.264
Tool | H.264 | H.265
Coding unit | 16x16 macroblocks; block coding structure | Coding tree blocks (64x64); quadtree coding structure
Transforms | 4x4 and 8x8 | 4x4, 8x8, 16x16 and 32x32
Inter prediction | 4x4 to 16x16; symmetric partitions | 4x4 to 64x64; asymmetric partitions
Intra prediction | 9 modes | 35 modes
Motion prediction | Spatial median | Advanced Motion Vector Prediction (spatial + temporal)
Luma motion compensation | 6 taps for half-pel positions + bilinear filter for quarter-pel positions | 8 taps for half-pel positions + 7-tap filter for quarter-pel positions
Chroma motion compensation | 2 taps | 4 taps
Slices | Slices for parallel parsing | Wavefront parallel processing; tiles and slices for parallel parsing
In-loop filters | Deblocking | Deblocking and SAO
HEVC compression
[Figure: bit rate versus year (1990, 2000, 2010) for MPEG-2, H264/AVC and H265/HEVC]
35% reduction in bitrate for same PSNR output when compared to H.264
Perceptual video quality is subjective and cannot be measured with PSNR values
Subjective tests have shown around 50% reduction in bitrate for similar perceptual video quality when compared to H.264
About 50% bitrate reduction over H.264 for video resolutions of 1080p and above; 30-40% for lower resolutions
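The bitrate comparisons above rest on two simple quantities: PSNR between a reference picture and its decoded version, and the relative bitrate saving at equal quality. A minimal sketch of both follows; the function names and the sample bitrates are illustrative assumptions, not figures from the presentation.

```python
import math


def psnr(reference, decoded, max_value=255):
    """Return the PSNR in dB between two equally sized pictures.

    Pictures are lists of rows of integer samples (8-bit by default).
    """
    squared_error = 0
    count = 0
    for ref_row, dec_row in zip(reference, decoded):
        for r, d in zip(ref_row, dec_row):
            squared_error += (r - d) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical pictures
    mse = squared_error / count
    return 10 * math.log10(max_value ** 2 / mse)


def bitrate_saving_percent(h264_kbps, hevc_kbps):
    """Relative bitrate reduction of HEVC versus H.264, in percent."""
    return 100.0 * (h264_kbps - hevc_kbps) / h264_kbps


# Hypothetical example: the same clip at equal PSNR, 8000 kbps vs 5200 kbps.
print(bitrate_saving_percent(8000, 5200))
```

PSNR only captures sample-wise error, which is why the subjective tests quoted above can report a larger saving than the PSNR-matched comparison.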
HEVC Applications – Near Term
The over-the-top (OTT) video services market is growing at a rapid pace, thanks to Netflix, Hulu, YouTube, etc.
Smartphones and tablets contribute significantly to OTT growth, with consumers opting to view videos on the go
OTT video services are also popular on TVs and set-top boxes
Rapid growth in the OTT market chokes network bandwidth
One in five consumers abandons viewing due to slow feeds and a poor-quality viewing experience
HEVC will enable superior viewing experience with OTT video service
HEVC Applications – Long Term
Higher quality video in the traditional terrestrial and satellite broadcasts
Video recording in cameras and mobile phones, for saving storage space or higher quality
Broadcasting 1080p video at 50 or 60 frames per second for the same bandwidth as 1080i (25 or 30 fps)
4K and 8K Ultra-HD broadcasts for theatre-like quality
Need for Software HEVC Decoder
HEVC is a newly ratified standard and there is no hardware support in the current generation of Processors (Embedded / Mobile / Applications SoCs)
Dedicated HW accelerators for HEVC increase the silicon area and hence the cost significantly
Lack of HEVC content makes an early HW implementation risky
Software decoding is a simpler and economically viable option for HEVC deployment NOW
Handling the HEVC decoder complexity on a wide range of processors, under power consumption constraints, is the key challenge for the software decoder
Why use GPUs for Video Processing?
Decoding of high resolution videos in software involves high computational complexity and will load the CPU enormously
GPUs are highly compute capable and power efficient devices
GPUs are generally idle during video playout
GPU acceleration will free up the CPU to perform other (system) tasks
[Figure: CPU core(s) (ARM Cortex with NEON) alongside a Mali T600 / OpenCL-compliant GPU]
HEVC Decoding on Capable GPUs
GPUs are massively multithreaded devices capable of handling hundreds or thousands of threads in parallel at any given time
Only the highly data-parallel algorithms of a video codec can be efficiently offloaded to the GPU for processing
[Figure: decoder block diagram with the stages Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Recon, Deblocking & SAO. Legend: "Not suitable for GPU execution" and "Data parallel execution, suitable for GPU execution"]
Motion Compensation
The pixels of the current picture/frame are predicted from the reference frame's pixels
The reference picture can be from the past or the future
The prediction happens on a block-by-block basis
There can be multiple reference frames for each block
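The block-by-block prediction described above can be sketched for the simplest case, an integer-pixel motion vector. This is an illustrative sketch only: the function names are hypothetical, frames are plain lists of rows, reference samples outside the frame are taken from the nearest edge pixel, and two predictions are combined with a plain rounded average (weighted prediction is ignored).

```python
def clamp(value, low, high):
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def predict_block(reference, x, y, width, height, mv_x, mv_y):
    """Fetch a width x height block from `reference`, displaced by the
    integer motion vector (mv_x, mv_y) from block position (x, y)."""
    frame_h = len(reference)
    frame_w = len(reference[0])
    block = []
    for row in range(height):
        src_y = clamp(y + row + mv_y, 0, frame_h - 1)
        block.append([
            reference[src_y][clamp(x + col + mv_x, 0, frame_w - 1)]
            for col in range(width)
        ])
    return block


def average_predictions(block_a, block_b):
    """Combine predictions from two reference frames with rounding."""
    return [
        [(a + b + 1) >> 1 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(block_a, block_b)
    ]


# A 4x4 reference frame whose sample value encodes its position.
ref = [[r * 10 + c for c in range(4)] for r in range(4)]
print(predict_block(ref, 1, 1, 2, 2, 1, 0))  # block at (1,1) moved right by 1
```

Each block reads only from already decoded reference frames, so every call is independent of the others within the same picture.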
Motion Compensation
The most compute intensive part of Motion compensation is sub-pixel interpolation
• Luma – 8 or 7 tap filter
• Chroma – 4 tap filter
Sub pixel interpolation is data parallel, i.e., interpolation of each block within a frame can happen in parallel and hence suited for GPU computing
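As an illustration of the luma filters mentioned above, the sketch below applies the 8-tap half-pel and 7-tap quarter-pel coefficients of the HEVC luma interpolation filter to a single row. It is a simplified, horizontal-only sketch: the function name is hypothetical, rounding is done in one step for 8-bit samples, and the separable 2-D filtering with higher intermediate precision used by a real decoder is omitted.

```python
# HEVC luma interpolation coefficients; each set sums to 64.
LUMA_HALF_PEL = (-1, 4, -11, 40, 40, -11, 4, -1)   # 8 taps
LUMA_QUARTER_PEL = (-1, 4, -10, 58, 17, -5, 1, 0)  # 7 taps, padded to 8


def interpolate_row(row, taps):
    """Return the sub-pixel sample to the right of each integer position.

    Tap k is applied to the sample at offset k - 3; positions outside the
    row are replaced by the nearest edge sample.
    """
    last = len(row) - 1
    out = []
    for x in range(len(row)):
        acc = 0
        for k, tap in enumerate(taps):
            src = min(max(x + k - 3, 0), last)
            acc += tap * row[src]
        # Normalise by 64 with rounding, then clip to the 8-bit range.
        out.append(min(max((acc + 32) >> 6, 0), 255))
    return out


edge = [0, 0, 0, 0, 255, 255, 255, 255]
print(interpolate_row(edge, LUMA_HALF_PEL))
```

Every output sample depends only on a small window of input samples, which is the property that makes this stage a natural fit for one GPU work-item per sample or per block.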
Inverse Quantization and Transform
Inverse Quantization & Transform
• The residual values need to be inverse quantized
• A 2-D inverse DCT should be performed on the inverse quantized data
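The two steps can be sketched for a single 4x4 block. This is a simplified illustration, not the decoder described in the presentation: it assumes 8-bit video and flat scaling (scaling factor 16), uses the 4x4 integer core transform matrix and level-scale table as recalled from the HEVC specification, and omits intermediate clipping as well as the DST that HEVC uses for 4x4 intra luma blocks. Function names are hypothetical.

```python
LEVEL_SCALE = (40, 45, 51, 57, 64, 72)
CORE_4X4 = (
    (64, 64, 64, 64),
    (83, 36, -36, -83),
    (64, -64, -64, 64),
    (36, -83, 83, -36),
)


def dequantize_4x4(levels, qp, bit_depth=8):
    """Scale parsed coefficient levels back to transform coefficients."""
    bd_shift = bit_depth + 2 - 5  # 2 = log2 of the block size
    scale = (16 * LEVEL_SCALE[qp % 6]) << (qp // 6)
    rounding = 1 << (bd_shift - 1)
    return [[(lvl * scale + rounding) >> bd_shift for lvl in row]
            for row in levels]


def inverse_transform_4x4(coeffs, bit_depth=8):
    """Separable 2-D inverse transform: columns first, then rows."""
    shift1, shift2 = 7, 20 - bit_depth
    tmp = [[(sum(CORE_4X4[k][i] * coeffs[k][j] for k in range(4))
             + (1 << (shift1 - 1))) >> shift1
            for j in range(4)] for i in range(4)]
    return [[(sum(tmp[i][k] * CORE_4X4[k][j] for k in range(4))
              + (1 << (shift2 - 1))) >> shift2
             for j in range(4)] for i in range(4)]


# A block with only a DC level decodes to a flat residual.
levels = [[2, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
print(inverse_transform_4x4(dequantize_4x4(levels, qp=4)))
```

Each transform block is processed independently of its neighbours, so many blocks can be dequantized and transformed in parallel.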
Recon & In-Loop Filters
• Reconstruction: the output of motion compensation or intra prediction is added to the output of the inverse transform
• In-loop filters such as deblocking and SAO are applied to the reconstructed samples
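Reconstruction itself is a per-sample addition followed by a clip to the valid sample range. A minimal sketch, assuming 8-bit samples and a hypothetical function name (the in-loop filters are not shown):

```python
def reconstruct(prediction, residual, bit_depth=8):
    """Add the residual to the prediction and clip to the sample range."""
    max_value = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


print(reconstruct([[250, 5], [128, 128]], [[10, -9], [3, -3]]))
```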
[Figure: decoder pipeline: Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Recon, Deblocking & SAO]
Challenges in CPU+GPU Implementation
Efficient partitioning of work between CPU and GPU
• The effective FPS of the decoder will be the minimum of the FPS achieved by the CPU and the GPU for their respective work
• So the partitioning needs to be efficient, so that both perform their respective work at almost the same speed (FPS)
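The partitioning argument can be made concrete with a small model. The sketch below assumes the CPU and GPU run fully overlapped, so the slower side sets the frame rate, and it searches every assignment of stages to find the one with the smallest bottleneck. The stage names and per-frame costs are invented for illustration; they are not measurements from the presentation.

```python
from itertools import product


def effective_fps(cpu_ms_per_frame, gpu_ms_per_frame):
    """Frame rate when CPU and GPU work overlaps: the slower side wins."""
    return 1000.0 / max(cpu_ms_per_frame, gpu_ms_per_frame)


def best_partition(stage_costs):
    """Return (bottleneck_ms, set of stages placed on the GPU).

    stage_costs maps a stage name to (cpu_ms, gpu_ms); gpu_ms is None
    for stages that cannot run on the GPU.
    """
    names = list(stage_costs)
    best = None
    for choice in product((False, True), repeat=len(names)):
        if any(g and stage_costs[n][1] is None for n, g in zip(names, choice)):
            continue  # this assignment puts a CPU-only stage on the GPU
        cpu_ms = sum(stage_costs[n][0] for n, g in zip(names, choice) if not g)
        gpu_ms = sum(stage_costs[n][1] for n, g in zip(names, choice) if g)
        bottleneck = max(cpu_ms, gpu_ms)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, {n for n, g in zip(names, choice) if g})
    return best


# Hypothetical per-frame costs in milliseconds: (on CPU, on GPU).
costs = {
    "entropy": (10, None),
    "motion_comp": (12, 9),
    "inverse_transform": (6, 6),
    "deblock_sao": (8, 7),
}
bottleneck_ms, gpu_stages = best_partition(costs)
print(gpu_stages, effective_fps(bottleneck_ms, bottleneck_ms))
```

In this made-up example, moving every GPU-capable stage to the GPU is not the best choice: it overloads the GPU while the CPU sits idle, whereas keeping one stage on the CPU balances the two.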
Efficient pipelining of data between CPU and GPU
• The algorithms running on the CPU will depend on the output of algorithms running on the GPU and/or vice versa
• A good design should make sure that neither the CPU nor the GPU spends any time waiting for the output of the other
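One common way to keep both sides busy is to let the CPU work one or two frames ahead and hand results over through a small bounded queue. The sketch below models that idea with two Python threads; it is only an analogy for the CPU/GPU hand-off, and the stage functions, queue depth, and names are illustrative assumptions rather than the design used in the presentation.

```python
import queue
import threading


def run_pipeline(frame_ids, cpu_stage, gpu_stage, depth=2):
    """Run cpu_stage and gpu_stage concurrently on a stream of frames.

    The bounded queue lets the producer run at most `depth` frames ahead,
    so neither side waits unless the other is genuinely slower.
    cpu_stage must not return None, which is reserved as the end marker.
    """
    handoff = queue.Queue(maxsize=depth)
    results = []

    def gpu_worker():
        while True:
            item = handoff.get()
            if item is None:  # end-of-stream marker
                break
            results.append(gpu_stage(item))

    worker = threading.Thread(target=gpu_worker)
    worker.start()
    for frame_id in frame_ids:
        handoff.put(cpu_stage(frame_id))  # blocks only when the queue is full
    handoff.put(None)
    worker.join()
    return results


# Stand-ins for "parse and entropy decode" and "GPU reconstruction".
decoded = run_pipeline(
    range(4),
    cpu_stage=lambda n: ("parsed", n),
    gpu_stage=lambda item: ("decoded", item[1]),
)
print(decoded)
```

With a queue depth of two, the hand-off behaves like double buffering: while one frame is being consumed, the next is already being prepared.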
Cache coherency
• Cache coherency between CPU and GPU data needs to be ensured
Benefits of Mali T600 GPU
128-bit vector processing
• Suits DSP algorithms such as video processing.
GPU cache instead of local memory
• No requirement for data transfers from/to global memory. The memory hierarchy can be understood just like a CPU's.
Flexible OpenCL workgroup size
• Works optimally for a large range of OpenCL workgroup sizes. Multiple block sizes in a video frame can be handled efficiently.
No divergent threads
• As in CPU code, conditional code can be used in OpenCL kernels. Different filter types, filter lengths, etc. in video decode can be handled efficiently.
Unified memory
• CPU and GPU share the same memory. Video YUV buffers are large, and no costly memory transfers of those buffers are needed.
MALI GPUs are well suited for Video Acceleration with significant power/performance benefits
Thank You
For more information visit www.ittiam.com or contact us at [email protected]