Hierarchical Parallelization of an H.264/AVC Video Encoder
A. Rodriguez, A. Gonzalez, and M.P. Malumbres
IEEE PARELEC 2006

Outline
• Introduction
• Performance Analysis
• Hierarchical H.264 Parallel Encoder
• Experimental Results
• Conclusions

Introduction: Background Knowledge (1/5)
• Video Communication

Introduction: Background Knowledge (2/5)
H.264/AVC
• Removes sensitive redundant information
• Reaching the limits of compression efficiency requires intensive computation
• Applications: video on demand, video conferencing, live broadcasting, etc.

Introduction: Background Knowledge (3/5)
H.264/AVC encoder
• High CPU demand
• Low latency, real time response
Platforms with supercomputing capabilities
• Clusters
• Multiprocessors
• Special purpose devices

Introduction: Background Knowledge (4/5)
Cluster
• A group of linked computers
• Improves performance and/or availability over that provided by a single computer
• Categorizations
  • High-availability clusters
  • Load-balancing clusters
  • High-performance clusters

Introduction: Background Knowledge (5/5)
Message Passing Parallelism
• Message passing runtimes and libraries: MPI
Multithread Parallelism
• OpenMP
Optimized libraries
• SIMD extensions and graphics processing units
• Intel IPP, AMD ACML, etc.

Introduction: Main Purpose (1/6)
Apply parallel processing to H.264 encoders in order to reduce computation intensity.
Given video quality and bit rate
• Image resolution
• Frame rate
• Latency

Introduction: Main Purpose (2/6)
Hierarchical parallelization of the H.264 encoder
Two-level MPI message passing parallelization
• GOP level
• Slice level

Introduction: Main Purpose (3/6)
GOP level parallelism
• Good speed-up
• High latency
[Diagram: the video sequence split into consecutive GOPs]

Introduction: Main Purpose (4/6)
Example of latency
• 1 GOP = 10 frames
• Frame rate = 30 frames/sec
• Time for encoding 1 GOP = 3 seconds
• We have to encode 9 GOPs in parallel in order to achieve real time response
• Latency = 3 seconds
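
The figure of 9 GOPs follows from Little's law (N = X * R), which the Performance Analysis slides define later. A minimal Python sketch of the arithmetic, using a hypothetical helper name `gops_in_flight` that does not come from the slides:

```python
def gops_in_flight(frame_rate, frames_per_gop, gop_encode_time):
    """Number of GOPs that must be encoded concurrently for real-time output."""
    required_throughput = frame_rate / frames_per_gop   # X, in GOPs per second
    return required_throughput * gop_encode_time        # Little's law: N = X * R


n = gops_in_flight(frame_rate=30, frames_per_gop=10, gop_encode_time=3.0)
print(n)  # 9.0
```

Each GOP still takes 3 seconds to encode, so adding more GOPs in parallel raises throughput but leaves the latency unchanged.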

Introduction: Main Purpose (5/6)
Slice level parallelism
• Low latency
• Less coding efficiency

Introduction: Main Purpose (6/6)
Combination of both approaches
• Speed-up
• Efficiency

Performance Analysis: Overview (1/2)
• “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations” [2]
• “A Parallel implementation of H.26L video encoder” [1]
Combination
• Scalability and low latency

Performance Analysis: Overview (2/2)
Processing flow
[Diagram: the video sequence divided into GOPs, annotated with "Increase throughput" and "Reduce latency"]

Performance Analysis: Equation definition
Little’s law: N = X * R
• N : Number of GOPs processed in parallel.
• X : Number of GOPs encoded per second.
• R : Elapsed time between a GOP entering the system and the same GOP being completely encoded.

Performance Analysis: Analysis (1/2)
If we have np nodes in the cluster and every GOP is decomposed into ns slices:
N = np / ns
R = RSEQ / (ns * Es)
• RSEQ : Sequential encoding time of a GOP
• Es : Parallel efficiency at the slice level

Performance Analysis: Analysis (2/2)
GOP throughput of the combined parallel encoder:
$$X = \frac{N}{R} = \frac{n_p / n_s}{R_{SEQ} / (n_s E_s)} = \frac{n_p E_s}{R_{SEQ}}$$
If Es is significantly less than 1, throughput would be affected negatively
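
The model above can be written as a small function. This is a sketch under the definitions given on the analysis slides; the function name `combined_model` and the input values are illustrative assumptions, not taken from the presentation.

```python
def combined_model(n_p, n_s, r_seq, e_s):
    """Return (N, R, X) for n_p nodes and n_s slices per frame."""
    n = n_p / n_s               # N: GOPs encoded in parallel (one group per GOP)
    r = r_seq / (n_s * e_s)     # R: latency of one GOP with slice parallelism
    x = n / r                   # X: throughput from Little's law, N = X * R
    return n, r, x


# Illustrative values only; Es is held constant here for simplicity.
for n_s in (1, 2, 4):
    n, r, x = combined_model(n_p=16, n_s=n_s, r_seq=5.0, e_s=0.9)
    print(f"ns={n_s}: N={n:.0f}, R={r:.2f} s, X={x:.2f} GOPs/s")
```

For a fixed Es, changing ns trades the number of parallel GOPs against latency while the throughput stays at np * Es / RSEQ, which is why a low slice-level efficiency costs throughput directly.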

Performance Analysis: Example (1/4)
• Video sequence in HDTV format at 1280*720
• Frame rate = 60 frames/sec
• We suppose that the H.264 sequential encoder encodes one GOP (15 frames) in 5 seconds
• Only one slice per frame is defined, so the throughput reduces to
$$X = \frac{n_p}{R_{SEQ}}$$

Performance Analysis: Example (2/4)
• To get real time response, X has to be equal to 60 frames/sec, or 4 GOPs/sec
• np = X * RSEQ = 4 * 5 = 20 nodes

Performance Analysis: Example (3/4)
Combined with slice level parallelism
• Maximum allowed latency = 1 sec
• Slice parallelism efficiency = 0.8
$$n_p = \frac{X \, R_{SEQ}}{E_s} = \frac{4 \times 5}{0.8} = 25 \text{ nodes}$$
$$n_s = \frac{R_{SEQ}}{R \, E_s} = \frac{5}{1 \times 0.8} = 6.25 \text{ slices}$$

Performance Analysis: Example (4/4)
We set ns to 7 and N to 4, and the number of required nodes is adjusted to 28
Latency:
$$R = \frac{R_{SEQ}}{n_s E_s} = \frac{5}{7 \times 0.8} = 0.89 \text{ sec}$$
Throughput:
$$X = \frac{N}{R} = \frac{n_s N E_s}{R_{SEQ}} = \frac{7 \times 4 \times 0.8}{5} = 4.48 \text{ GOPs/sec}$$
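
The sizing steps of this example can be reproduced in a few lines. The sketch below assumes the values stated on the example slides (60 frames/sec, 15 frames per GOP, RSEQ = 5 sec, Es = 0.8, 1 sec maximum latency) and rounds fractional slice and group counts up; the variable names are illustrative.

```python
import math

R_SEQ = 5.0            # sequential encoding time of one GOP, in seconds
E_S = 0.8              # slice-level parallel efficiency
FRAME_RATE = 60        # frames per second
FRAMES_PER_GOP = 15
MAX_LATENCY = 1.0      # maximum allowed latency, in seconds

x_target = FRAME_RATE / FRAMES_PER_GOP            # 4 GOPs/sec for real time
n_s = math.ceil(R_SEQ / (MAX_LATENCY * E_S))      # 6.25 rounded up to 7 slices
latency = R_SEQ / (n_s * E_S)                     # R = RSEQ / (ns * Es)
n_groups = math.ceil(x_target * latency)          # N = X * R, rounded up
n_p = n_groups * n_s                              # total nodes
throughput = n_groups / latency                   # X = N / R

print(n_s, n_groups, n_p)                           # 7 4 28
print(f"{latency:.2f} s, {throughput:.2f} GOPs/s")  # 0.89 s, 4.48 GOPs/s
```

Rounding ns up is what pushes the node count from 25 to 28, and it leaves some throughput margin above the 4 GOPs/sec target.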

Performance Analysis: Efficiency Estimation (1/5)
Why do we have to estimate Es?
• Throughput
• Latency
How to estimate Es?
• PAMELA (PerformAnce ModEling LAnguage) model

Performance Analysis: Efficiency Estimation (2/5)
• Update the DPB (Decoded Picture Buffer) in every node using MPI_Allgather
• In this PAMELA model, MPI_Allgather is implemented using a binary tree

Performance Analysis: Efficiency Estimation (3/5)
The PAMELA model for encoding one frame in parallel is:
L = par (p = 1…ns)
      delay (ts); delay (tw)
      seq (i = 0…log2(ns)-1)
        par (j = 1…ns)
          delay (tL + tc * 2^i)
• ns : The number of slices processed in parallel
• ts : The mean slice encoding time
• tw : The mean wait time due to variations in ts and global synchronization
• tL : Start-up time
• tc : Transmission time of one encoded slice
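
The 2^i factor in the communication term suggests that the amount of data exchanged doubles at every step. The following sketch simulates one exchange pattern with that property, recursive doubling, in plain Python. It is an interpretation of the model and not the authors' MPI code; the function name and the power-of-two restriction are assumptions made for illustration.

```python
def allgather_recursive_doubling(slices):
    """Simulate an allgather among len(slices) nodes (must be a power of two).

    Returns the final per-node buffers and the number of slices each node
    sends in every step.
    """
    n = len(slices)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two node count"
    buffers = [{rank: s} for rank, s in enumerate(slices)]
    sent_per_step = []
    step = 1
    while step < n:
        sent_per_step.append(len(buffers[0]))   # every node holds the same count
        # Each node swaps its whole buffer with the partner whose rank
        # differs in exactly one bit.
        buffers = [{**buffers[r], **buffers[r ^ step]} for r in range(n)]
        step *= 2
    return buffers, sent_per_step


buffers, sent = allgather_recursive_doubling([f"slice{r}" for r in range(8)])
print(sent)  # [1, 2, 4]
```

With 8 nodes there are log2(8) = 3 steps and each node sends 1 + 2 + 4 = 7 slices in total, which is ns - 1.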

Performance Analysis: Efficiency Estimation (4/5)
The parallel time obtained solving this model is
T(L) = ts + tw + tAG
tAG = log2 (ns) * tL + (ns - 1) * tc
The efficiency Es can then be computed from this parallel time.
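
The closed form for tAG can be checked against the per-step delays of the PAMELA model, where step i contributes tL + tc * 2^i. A small sketch, with hypothetical function names and arbitrary parameter values chosen only to exercise the formulas:

```python
import math


def allgather_time_stepwise(n_s, t_l, t_c):
    """Sum the per-step delays: tL + tc * 2**i for i = 0 .. log2(ns) - 1."""
    steps = int(math.log2(n_s))
    return sum(t_l + t_c * 2**i for i in range(steps))


def allgather_time_closed(n_s, t_l, t_c):
    """Closed form from the slide: log2(ns) * tL + (ns - 1) * tc."""
    return math.log2(n_s) * t_l + (n_s - 1) * t_c


def frame_time(t_s, t_w, t_ag):
    """Parallel time for one frame: T(L) = ts + tw + tAG."""
    return t_s + t_w + t_ag


# Arbitrary illustrative values, not measurements from the presentation.
print(allgather_time_stepwise(8, 6.0, 50.0))  # 368.0
print(allgather_time_closed(8, 6.0, 50.0))    # 368.0
```

The two agree because the powers of two from 2^0 to 2^(log2(ns) - 1) sum to ns - 1.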

Performance Analysis: Efficiency Estimation (5/5)
The experimental estimations of parameter values

| tL | tc | ts | tw | tAG |
|---|---|---|---|---|
| 6.0 | 0.0133*4056 | 840000 | 20586 | 421 |

Estimated efficiency for a slice based parallel encoder
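
As a worked illustration, the tabulated estimates can be plugged into T(L) = ts + tw + tAG. The efficiency expression used below (encoding time divided by parallel time) is an assumption made for this sketch; the slide's own efficiency formula and figures did not survive in the transcript, so the result is not the presentation's reported value.

```python
# Parameter estimates as tabulated on the slide (time units are not stated there).
t_s, t_w, t_ag = 840000, 20586, 421

parallel_time = t_s + t_w + t_ag      # T(L) = ts + tw + tAG

# ASSUMPTION: efficiency taken as useful encoding time over parallel time.
efficiency = t_s / parallel_time

print(parallel_time)          # 861007
print(f"{efficiency:.3f}")    # 0.976
```

Under this assumption the communication term tAG is negligible next to ts, and the wait time tw accounts for most of the loss.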

Performance Analysis: Slice Parallelism Scalability (1/4)
The feasible number of slices will depend on the video resolution
[Figure: bit rate increment (%) vs. number of MBs per slice]

Performance Analysis: Slice Parallelism Scalability (2/4)
Bit rate overhead vs. number of slices per frame

Performance Analysis: Slice Parallelism Scalability (3/4)
PSNR loss vs. number of slices per frame

Performance Analysis: Slice Parallelism Scalability (4/4)
Encoding time vs. number of slices per frame

Hierarchical Parallel Encoder: Overview
In order to achieve scalability and low latency
• Combine GOP and slice level parallelism
In the first level
• Divide the sequence into GOPs (15 frames)
• Every GOP is assigned to a processor group inside the cluster
• Each group encodes independently
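
One way to picture the two levels is to map every process rank to a group and a slice position within that group. The sketch below assumes equally sized groups and a contiguous rank layout; the helper name `group_layout` is hypothetical, and the presentation does not say how the authors built their groups. In an MPI program a similar split could be obtained with `MPI_Comm_split`, using the group index as the color.

```python
def group_layout(n_p, n_s):
    """Map each process rank to (group index, slice index within the group)."""
    if n_p % n_s != 0:
        raise ValueError("n_p must be a multiple of n_s")
    return [(rank // n_s, rank % n_s) for rank in range(n_p)]


layout = group_layout(n_p=8, n_s=2)
print(layout)
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1)]
```

Here 8 processes form 4 groups of 2, so 4 GOPs are encoded at once and each frame is split into 2 slices.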

Hierarchical Parallel Encoder: GOP assignment method
Local manager
• Communicates with the global manager
Global manager
• Informs the requesting local manager of its GOP assignment by sending a message with the GOP number
Simple and load-balanced
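
The request-driven assignment described above can be sketched with threads and a shared queue standing in for MPI messages. This is an illustrative assumption, not the authors' implementation: the queue plays the global manager's list of unassigned GOPs, each thread plays one local manager, and the encoding step is replaced by a placeholder.

```python
import queue
import threading


def run_encoder(num_gops, num_groups):
    """Hand out GOP numbers on demand; returns {group_id: [GOP numbers]}."""
    pending = queue.Queue()
    for gop in range(num_gops):
        pending.put(gop)                      # GOPs not yet assigned

    assigned = {g: [] for g in range(num_groups)}

    def local_manager(group_id):
        while True:
            try:
                gop = pending.get_nowait()    # ask for the next GOP number
            except queue.Empty:
                return                        # nothing left to encode
            assigned[group_id].append(gop)    # placeholder for encoding the GOP

    workers = [threading.Thread(target=local_manager, args=(g,))
               for g in range(num_groups)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return assigned


assignment = run_encoder(num_gops=16, num_groups=4)
print(assignment)
```

Because a group only asks for a new GOP when it has finished the previous one, faster groups naturally receive more GOPs, which is where the load balancing comes from.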

Hierarchical Parallel Encoder: Framework
Hierarchical H.264 parallel encoder
[Diagram: a Global Manager above three processor groups, each containing processes P0, P1 and P2]

Experimental Results: Environments (1/2)
Mozart
• 4 biprocessor nodes with AMD Opteron 246 at 2 GHz, interconnected by switched Gigabit Ethernet
Aldebaran
• SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high performance proprietary network

Experimental Results: Environments (2/2)
720 * 480 standard sequence Ayersroc, composed of 16 GOPs

| Configuration | Cluster | #Groups | #Slices |
|---|---|---|---|
| 01_Gr_08S1 | Mozart | 1 | 8 |
| 02_Gr_04S1 | Mozart | 2 | 4 |
| 04_Gr_02S1 | Mozart | 4 | 2 |
| 08_Gr_01S1 | Mozart | 8 | 1 |
| 01_Gr_16S1 | Aldebaran | 1 | 16 |
| 02_Gr_08S1 | Aldebaran | 2 | 8 |
| 04_Gr_04S1 | Aldebaran | 4 | 4 |
| 08_Gr_02S1 | Aldebaran | 8 | 2 |
| 16_Gr_01S1 | Aldebaran | 16 | 1 |

Experimental Results: System Speedup (1/2)
Speed-up in Mozart

Experimental Results: System Speedup (2/2)
Speed-up in Aldebaran

Experimental Results: Encoding Latency
Mean GOP encoding time

Conclusions
• A hierarchical parallel video encoder based on H.264/AVC was proposed.
• Experimental results confirm the previous analysis, showing that a scalable, low-latency H.264 encoder can be obtained.
• Some issues remain open, as mentioned in the previous section.

References
[1] J.C. Fernández and M.P. Malumbres, “A Parallel implementation of H.26L video encoder”, in Proc. of EuroPar 2002 conf. (LNCS 2400), pp. 830-833, Paderborn, 2002.
[2] A. Rodriguez, A. González and M.P. Malumbres, “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations”, IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354-357, Dresden, 2004.
[3] Arjan J.C. van Gemund, “Symbolic Performance Modeling of Parallel Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.
[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.