Bitrate Reduction Techniques for Low-Complexity Surveillance Video Coding
by
Pushkar Gorur
Submitted to the
Department of Electrical Communication Engineering
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
INDIAN INSTITUTE OF SCIENCE
July 2016
© Pushkar Gorur
2016
All rights reserved
Abstract
High resolution surveillance video cameras are invaluable resources for effective crime
prevention and forensic investigations. However, the increasing communication bandwidth
requirements of high definition surveillance videos are severely limiting the number of
cameras that can be deployed. Higher bitrate also increases operating expenses due to higher
data communication and storage costs. Hence, it is essential to develop low complexity
algorithms which reduce the data rate of the compressed video stream without affecting the
image fidelity. In this thesis, a computer vision aided H.264 surveillance video encoder and
four associated algorithms are proposed to reduce the bitrate and computational complexity.
The proposed techniques are (I) speeded up foreground segmentation, (II) skip decision,
(III) reference frame selection and (IV) face Region-of-Interest (ROI) coding.
In the first part of the thesis, a modification to the adaptive Gaussian Mixture Model
(GMM) based foreground segmentation algorithm is proposed to reduce computational
complexity. This is achieved by replacing expensive floating point computations with low
cost integer operations. To maintain accuracy, we compute periodic floating point updates
for the GMM weight parameter using the value of an integer counter. Experiments show
speedups in the range of 1.33 - 1.44 on standard video datasets where a large fraction of
pixels are multimodal.
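The idea of deferring the floating point weight update can be sketched as follows. This is only an illustration under stated assumptions, not the derivation used in the thesis: it assumes the standard recursive weight update w <- (1 - alpha) * w + alpha * o (o = 1 when the mode matches the pixel, else 0), counts matches with an integer counter over an interval of t_w frames, and then applies a single floating point update that treats the counted matches as if they were spread uniformly over the interval. The names `alpha`, `t_w`, `exact_weight` and `batched_weight` are hypothetical.

```python
def exact_weight(w0, matches, alpha):
    """Per-frame floating point update: w <- (1 - alpha) * w + alpha * o."""
    w = w0
    for o in matches:
        w = (1.0 - alpha) * w + alpha * o
    return w


def batched_weight(w0, match_count, t_w, alpha):
    """One floating point update per interval of t_w frames.

    match_count is an integer counter incremented on every frame in which
    the mode matched the pixel. Assumption (not from the thesis): the
    matches are treated as uniformly spread over the interval.
    """
    decay = (1.0 - alpha) ** t_w
    return decay * w0 + (1.0 - decay) * (match_count / t_w)


if __name__ == "__main__":
    alpha, t_w, w0 = 0.01, 16, 0.7
    matches = [1, 1, 0, 1] * 4       # 12 matches in 16 frames
    counter = sum(matches)           # only integer work is needed per frame
    print(exact_weight(w0, matches, alpha))
    print(batched_weight(w0, counter, t_w, alpha))
```

Under this approximation the two results agree (up to rounding) when the mode matches in every frame of the interval or in none, and differ slightly for mixed patterns.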
In the second part, we propose a skip decision technique that uses a spatial sampler to
select a subset of pixels. The sampled pixels are segmented using the speeded up GMM algorithm.
The storage pattern of the GMM parameters in memory is also modified to improve cache
performance. Skip selection is performed using the segmentation results of the sampled
pixels. In the third part, a reference frame selection algorithm is proposed to maximize the
number of background Macroblocks (MB’s) (i.e. MB’s that contain background image content)
in the Decoded Picture Buffer. This reduces the cost of coding uncovered background
regions. Distortion over foreground pixels is measured to quantify the performance of the skip
decision and reference frame selection techniques. Experimental results show bitrate savings
of up to 94.5% over methods proposed in the literature on video surveillance data sets.
The proposed techniques also provide up to 74.5% reduction in compression complexity
without increasing the distortion over the foreground regions in the video sequence.
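A minimal sketch of a sampler based skip decision is given below. It assumes a fixed systematic sampling grid and a precomputed per-pixel foreground mask standing in for the output of the GMM classifier; the adaptive sampling, priors and cache layout of the proposed method are not modelled. The function name and the parameters `mb_size` and `step` are hypothetical.

```python
def skip_map(fg_mask, mb_size=16, step=4):
    """Return a 2D list of booleans: True where the macroblock may be skipped.

    fg_mask[y][x] is True when pixel (x, y) is classified as foreground.
    Only every `step`-th pixel in each direction is inspected.
    """
    height, width = len(fg_mask), len(fg_mask[0])
    mb_rows = (height + mb_size - 1) // mb_size
    mb_cols = (width + mb_size - 1) // mb_size
    skip = [[True] * mb_cols for _ in range(mb_rows)]
    for y in range(0, height, step):
        for x in range(0, width, step):
            if fg_mask[y][x]:
                # One sampled foreground pixel is enough to code the MB.
                skip[y // mb_size][x // mb_size] = False
    return skip


if __name__ == "__main__":
    mask = [[False] * 32 for _ in range(32)]
    mask[20][8] = True               # foreground pixel on the sampling grid
    for row in skip_map(mask):
        print(row)
```

With a fixed grid, a foreground pixel that falls between sample positions is not seen, which is the kind of miss that a denser or adaptive sampling stage would have to address.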
In the final part of the thesis, face and shadow region detection is combined with the
skip decision algorithm to perform ROI coding for pedestrian surveillance videos. Since
person identification requires high quality face images, MB’s containing face image content
are encoded with a low Quantization Parameter (QP) setting (i.e. high quality). Other regions
of the body in the image are considered as RORI (Regions Of Reduced Interest) and are
encoded at low quality. The shadow regions are marked as Skip. Techniques that use only
facial features to detect faces (e.g. Viola Jones face detector) are not robust in real world
scenarios. Hence, we propose to initially detect pedestrians using deformable part models.
The face region is determined using the deformed part locations. Detected pedestrians are
tracked using an optical flow based tracker combined with a Kalman filter. The tracker
improves the accuracy and also avoids the need to run the object detector on already detected
pedestrians. Shadow and skin detector scores are computed over super pixels. Bilattice
based logic inference is used to combine multiple likelihood scores and classify the super
pixels as ROI, RORI or RONI (Region Of No Interest). The coding mode and QP values of the
MB’s are determined using the super pixel labels. The proposed techniques provide a further
reduction in bitrate of up to 50.2%.
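The mapping from super pixel labels to macroblock coding decisions can be sketched as below. This is an illustrative assumption rather than the rule used in the thesis: an MB takes the most important label among the super pixels that overlap it, ROI and RORI MB’s are coded with example QP values of 24 and 32, and RONI MB’s are skipped. The names `PRIORITY`, `QP_ROI`, `QP_RORI` and `mb_decision` are hypothetical.

```python
PRIORITY = {"RONI": 0, "RORI": 1, "ROI": 2}
QP_ROI, QP_RORI = 24, 32   # example settings, not prescribed values


def mb_decision(super_pixel_labels):
    """Map the labels of the super pixels overlapping one MB to (mode, qp).

    Assumption: the highest priority label wins, so an MB that touches any
    face (ROI) super pixel is coded at high quality. An MB with no labelled
    super pixel is treated as RONI.
    """
    label = max(super_pixel_labels, key=PRIORITY.__getitem__, default="RONI")
    if label == "ROI":
        return ("CODE", QP_ROI)
    if label == "RORI":
        return ("CODE", QP_RORI)
    return ("SKIP", None)


if __name__ == "__main__":
    print(mb_decision(["RONI", "RORI", "ROI"]))
    print(mb_decision(["RORI", "RONI"]))
    print(mb_decision(["RONI"]))
```

Letting the highest priority label win errs on the side of face quality: a labelling error then costs extra bits rather than a degraded face region.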
Acknowledgements
Firstly, I would like to thank my adviser Prof. Bharadwaj Amrutur for the patient guidance
and freedom he has provided me throughout my PhD program. When I was sitting like a
frog in the VLSI (P & N) wells, he nudged me to come out and explore the field of video
signal processing. His support and reassurances during the early days of my PhD when I
was working on the H.264 encoder have been invaluable. His detailed feedback about my
writing has helped me to present my research work more clearly. His emphasis on solving
the practical problem of surveillance video bitrate reduction helped me to stay focussed.
Without his constant course corrections, I would have drifted away and lost track (like
early versions of my pedestrian tracker!)
I take this opportunity to thank Prof. P. S. Sastry and Prof. Vittal Rao for the mathematics
concepts they imparted to me through their courses. I am very fortunate to have had the
continuous support and guidance of Prof. A. G. Menon. He was instrumental in my decision
to join the PhD program at ECE IISc. I would like to thank Prof. K. R. Ramakrishnan for
having allowed me to join the surveillance related discussions with the Bengaluru City
police officers. I would also like to thank him for the discussions related to the lampTop
project.
I would like to thank TCS for supporting my research through the TCS fellowship
program. I would also like to thank Dr. Balamuralidhar for taking time to discuss my research
work at TCS labs, Bengaluru.
I have been fortunate to have received help from many wonderful colleagues. Suhas
Kashyap has helped a lot in collecting surveillance videos. He also provided ground truth
segmentation data for the test videos used to validate the skip decision and reference frame
selection algorithms. He features prominently in a lot of videos that I have used in this
thesis! I thank Harish for discussions about the inference for the Face ROI encoder. I thank
Bhargava for helping me collect surveillance videos and test the pedestrian ROI encoder. I
wish to learn a lot of deep learning from him now! Thanks to Anirudh for developing the
SSE based convolution code for DPM. Samik helped to perform experiments on shadows
and initial feasibility studies of ROI coding. I would like to thank Ajit Gupte for taking time
to discuss my research at TI and Qualcomm. Working with Doney on the lampTop project
was a lot of fun.
I thank the ECE staff for all their support. I would like to thank Srinivas Murthy Sir and
Radhika Madam in particular for all the help that they have provided me. I would like
to express my appreciation for the help and support I received from friends in our lab.
BT, PD, Rajath, Anand, Kaushik, Mohan, Manikandan, Satyam, Janaki, Hitesh, Vikram,
Viveka, Doney, Syam, Siva, Bhargava, Sagar, Karthik, Prachet, Akshay, Pratik, Mallikarjun,
Nagaraju, Auritro, Balram and Abhishek kept the lab environment enjoyable and fun. I used
to pull them out of lab to capture surveillance videos for testing!
I thank the Robert Bosch Center for Cyber Physical Systems for supporting my travel to
the ITS conference to present the lampTop project research work.
This dissertation would not have been possible without the peaceful walks at Sankey
tank. I thank all the people who have worked and continue to work to make it such a nice
place. I also thank all the staff in IISc who keep the gardens and campus beautiful. The
very thought of wading through Bengaluru traffic to get to work after the PhD is frightening.
Finally, I wish to thank my family for their support during these years.
List of publications from this thesis
Journal Articles
• Pushkar Gorur, Bharadwaj Amrutur, Skip Decision and Reference Frame Selection for
Low Complexity H.264/AVC Surveillance Video Coding, IEEE Transactions on Circuits
and Systems for Video Technology, vol. 24, no. 7, pp. 1156-1169, July 2014.
• Pushkar Gorur, Bhargava Srivatsa, Bharadwaj Amrutur, Region-of-Interest (ROI) Video
Coding for Pedestrian Surveillance Cameras (to be submitted).
Conference Proceeding
• Pushkar Gorur, Bharadwaj Amrutur, Speeded up Gaussian Mixture Model Algorithm
for Background Subtraction, IEEE Conference on Advanced Video and Signal Based
Surveillance (AVSS), pp. 386-391, 2011.
Abbreviations
AVC Advanced Video Coding
BG Background
CABAC Context-Adaptive Binary Arithmetic Coding
CCTV Closed Circuit Television
DCT Discrete Cosine Transform
DPCM Differential Pulse-Code Modulation
DPB Decoded Picture Buffer
DVR Digital Video Recorder
EM Expectation Maximization
FG Foreground
GMM Gaussian Mixture Model
HEVC High Efficiency Video Coding
HD High Definition
HOG Histogram of Oriented Gradients
HQF High Quality Frame
IDR Instantaneous Decoder Refresh
JM Joint Model
KL Kullback-Leibler
LAN Local Area Network
LLC Last Level Cache
MAD Mean Absolute Difference
MB Macroblock
MP Megapixel
MPEG Moving Picture Experts Group
MSE Mean Square Error
MV Motion Vector
NAL Network Abstraction Layer
PIR Passive Infrared
PMV Predicted Motion Vector
PoE Power over Ethernet
POC Picture Order Count
PSNR Peak Signal-to-Noise Ratio
QP Quantization Parameter
RC Rate Control
RD Rate Distortion
RDO Rate Distortion Optimization
ROI Region Of Interest
RONI Region Of No Interest
RORI Region Of Reduced Interest
RPB Reference Picture Buffer
RPLR Reference Picture List Reordering
SE Syntax element
S-MD Sampler based Motion Detection
SRL Statistical Relational Learning
SVM Support Vector Machine
VBR Variable Bitrate
VGA Video Graphics Array
VJ Viola Jones
Contents
Abstract
Acknowledgements
List of publications from this thesis
Abbreviations
1 Introduction
1.1 Recent Trends in Video Surveillance
1.2 Bitrate increase in HD surveillance
1.2.1 How much resolution is ‘good enough’?
1.2.2 Bitrate versus camera resolution
1.3 Bitrate increase in low light surveillance
1.3.1 Interplay between exposure, gain and noise
1.3.2 Bitrate versus noise
1.4 Challenges due to increased bitrate
1.5 Thesis Contribution
1.5.1 Proposed surveillance video encoder architecture
1.5.2 Bitrate & computational complexity reduction
1.6 Organization of the thesis
2 Background and Related Work
2.1 Introduction
2.2 A review of video encoding techniques
2.3 H.264 basics
2.3.1 Reference frames
2.3.2 Macroblock Skip mode
2.3.3 Macroblock QP signaling
2.4 Bitrate & complexity reduction techniques for video surveillance
2.4.1 Skip detection techniques
2.4.2 Background reference frame selection techniques
2.4.3 ROI coding techniques
2.4.4 Mode decision and motion estimation related techniques
2.4.5 Hardware related advancements
2.4.6 Distributed video coding based techniques
2.4.7 Wireless and/or Remote surveillance specific techniques
2.5 Summary
3 Speeded up GMM Algorithm for Background Subtraction
3.1 Introduction
3.2 Gaussian mixture model
3.2.1 Adaptive Mixture Learning with fast convergence
3.2.2 Automatic selection of number of components
3.3 Proposed Algorithm
3.4 Experimental results
3.4.1 Weight update interval Experiment
3.4.2 Adaptive Mixture Learning Experiment
3.4.3 Background subtraction experiment
3.5 Summary
4 Skip decision & Reference Frame Selection for H.264 Surveillance Coding
4.1 Introduction
4.2 Proposed Architecture
4.3 Sampling techniques
4.3.1 Basic sampling techniques
4.3.2 Adaptive sampling
4.4 Sampler based Background MB detection
4.4.1 GMM S-MD as a Stratified-Adaptive-Cluster sampler
4.4.2 Spatio-temporal priors
4.4.3 Cache performance optimization
4.5 Reference frame selection
4.6 Macroblock Classification
4.7 Skip Signalling
4.8 Optimum Reference Frame selection
4.8.1 Proposed Adaptive Reference Frame Selection Technique
4.9 Summary
5 Results: Skip Decision and Reference Frame Selection
5.1 Introduction
5.2 Experimental Setup
5.3 Skip Selection using GMM S-MD
5.4 Analysis of GMM S-MD Performance
5.5 Background PSNR and its Impact
5.6 Analysis of the Proposed Adaptive Reference Frame Selection Technique
5.7 Performance of the Proposed Adaptive Reference Frame Selection Technique
5.8 RD performance results
5.9 Summary
6 ROI video coding for Pedestrian Surveillance
6.1 Introduction
6.2 Proposed architecture
6.2.1 Low level inferencing
6.2.2 High level inferencing
6.3 Segmentation
6.4 Shadow detection
6.4.1 Weak shadow detector
6.4.2 Physics based shadow detection over super pixels
6.4.3 Texture based shadow detection
6.5 Skin detection
6.6 Pedestrian detection
6.6.1 DPM based pedestrian detection: A brief review
6.6.2 Proposed modifications to DPM
6.7 Geometry
6.8 Detection by Tracking
6.8.1 Components of a tracker
6.8.2 FG blob based tracking
6.8.3 Optic flow based tracker
6.9 Inference
6.9.1 Bilattice logic for ROI, RORI & RONI super pixel inference
6.10 Macroblock mode and quality parameter assignment
6.11 ROI, RORI & RONI video compression results
6.11.1 Experimental Setup
6.11.2 Bitrate reduction and accuracy
6.11.3 Impact of detector errors on ROI encoder performance
6.11.4 Computational complexity
6.11.5 Complexity control for ROI encoding
6.12 Summary
7 Conclusion
7.1 Future Challenges and Opportunities
7.1.1 Coding for surveillance cameras on drones
7.1.2 Power-Rate-Distortion optimization of ROI encoders
7.1.3 360° surveillance video coding
7.1.4 HDR surveillance video coding
A Alternate derivation of the Speeded up GMM update
B Sampler design
B.1 Analysis of a simple systematic sampler
B.1.1 Uniform versus Non Uniform sampling patterns
B.1.2 Uniform systematic sampler accuracy
B.2 Analysis of the proposed sampler
C Bilattice logic based inference
List of Tables
1.1 BSIA standard [1] recommendations for image detail
3.1 Average frame rate using proposed scheme, [2] & [3]. Average speedup obtained using proposed scheme measured over [3] (Zivkovic)
4.1 Notations
5.1 Average execution time reduction of encoder (QP set to 24)
5.2 Performance comparison of proposed GMM S-MD on ‘No activity’ datasets
5.3 Average execution time of Skip detection
5.4 Performance comparison of reference frame selection algorithms
List of Figures
1.1 Images with minimum image detail required to perform detection, observation, recognition & identification surveillance tasks
1.2 Pinhole camera model
1.3 Top view showing the coverage of a MOBOTIX surveillance camera with a 3.6 mm lens [4] for recognition tasks
1.4 Average typical optimized bitrate of Bosch security cameras [5, 6] plotted against the camera resolution
1.5 Low light scene with nominal exposure and gain settings
1.6 Surveillance snapshots with different camera exposure settings (camera gain has been increased to improve visibility and contrast)
1.7 Surveillance snapshots with different camera gain settings
1.8 Bitrate versus pixel noise in a low light scene
1.9 Architecture of the proposed surveillance video encoder
2.1 Architecture of the H.264 encoder [7]
2.2 Decoded Picture Buffer managed using MMCO commands
2.3 Reference frame list management in H.264
3.1 Weight update using proposed update for a monotonically (a) increasing and (b) decreasing case. The weights are plotted on the y axis with respect to time
3.2 Weight update using (a) original GMM update equations and (b) proposed weight update for a GMM with weights = [0.7, 0.25, 0.05] and Tw = 16. The weights are plotted on the y axis with respect to time. The graph shows that the proposed technique does not affect the learning rate. The increase in frame rate provided by the proposed technique is illustrated in Fig. 3.4
3.3 Frame rate (fps) and error % are plotted with respect to the weight update interval Tw
3.4 (a) Synthetic distribution based on commonly observed surveillance videos (b) KL divergence achieved by the proposed and the original method [2]
3.4 Instantaneous frame rates plotted against frame count using proposed scheme, [2] & [3]
3.5 Average precision-recall curves obtained using proposed scheme, [2] & [3] for the 10 dataset videos
3.6 Detection results on the Hall [8] and video7 [9] dataset videos. (a) & (e) are original images from the Hall and video7 datasets respectively. (b) & (f) are the corresponding Ground truth images. (c) & (g) are the segmentation masks obtained using Lee [2]. (d) & (h) are the segmentation masks obtained using the proposed scheme
4.1 Proposed surveillance specific video coding architecture
4.2 Basic sampling techniques (a) Random (b) Cluster (c) Stratified (d) Systematic
4.3 Stratified Adaptive Cluster Sampling (a) First stage (d) Second stage
4.4 GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-MD’) flow chart
4.5 Figure shows the sampling pattern of pixels in an image. The sampled pixels are partitioned into 4 sparse sets A1, A2, A3, & A4. Also shown are the GMM data structures of pixels mapped onto different cache lines to improve cache locality. The models of the dominant modes are arranged in a contiguous manner. Also, the data elements belonging to a single set of pixels are present in a contiguous array.
4.6 Salient MB’s and Sampled pixel plot
4.7 Sequence of frames in display order
4.8 Macroblock reference assignment
4.9 Skip Signalling
4.10 Pseudo Code of Proposed Reference Frame Selection Scheme
5.1 Snapshots from the video dataset (a) Entrance (b) Parking Lot (c) Access Door (d) Backyard1
5.2 RD data for the (a) Bridge & (b) Walkway video sequences
5.3 RD data for the (a) Access Door & (b) Entrance video sequences
5.4 RD data for the (a) PETS-1 & (b) PETS-2 video sequences
5.5 Encoded frames of the (a) Light Switch (b) Bridge and (c) Low light video sequences
5.6 Encoded frames of the (a) CDW (b) PETS-2 and (c) PETS-3 video sequences
5.7 Figure on the left shows a poorly lit corridor scene with increased camera gain settings. Also, on the right, 100 RGB sample values of a pixel (pixel P in the image) from the video are plotted in the 3D RGB space. In the same picture, the background GMM mode is shown, i.e. the points on the sphere are at a distance of 2.5σ (Mahalanobis distance) from the mean value of the mode.
5.8 Impact of varying Tsparse on GMM S-MD performance shown for the (a) Walkway and (b) Backyard2 sequences
5.9 Encoded frame from the ‘Parking lot’ video (a) Without Spatio-Temporal bias (object is missed) (b) With Spatio-Temporal bias (object is detected)
5.10 Encoded frame from the ‘Parking lot’ sequence with (a) Ddense = 4 (b) Ddense = 2
5.11 (a) Correctly detected foreground objects (marked in yellow) and (b) RD data for the ‘Parking lot’ video
5.12 Impact of varying learning rate on GMM S-MD performance on (a) Bridge and (b) Backyard2 sequences
5.13 Snapshots of the ‘Walkway’ dataset coded using (a) JM and (b) GMM S-MD show that the proposed method does not produce any conspicuous distortion in the background. The DPM detections (yellow rectangles) are overlaid on the images.
5.14 Figure shows the PSNR plots for the ‘Walkway’ dataset frame in Fig. 5.13. The PSNR plots have been computed over the (a) entire frame and (b) Foreground regions. Although the proposed technique reduces the total PSNR, it significantly improves the RD performance for foreground image regions.
5.15 (a) and (b) show two encoded frames (with different sunlight intensities) in the ‘Sunlight variation’ video. We observe that the proposed scheme does not wrongly mark FG MB’s as ‘Skip’ under fast illumination changes.
5.16 Slow reduction in illumination observed in the ‘Evening fade’ video. Encoded frames captured (a) before and (b) after the reduction
5.17 No. of MB’s in the set B as a percentage of NMB (Total No. of MB’s in a frame) for the ‘Entrance’ video
5.18 No. of FG→BG MB’s which require non-zero residual coding (i.e. No. of FG→BG 〈U〉 MB’s) in the ‘Entrance’ video
6.1 Number of bits required to encode MB’s of a surveillance frame at uniform quality
6.2 Skin pixel detection in a surveillance video frame
6.3 Face region detection using the Viola Jones detector
6.4 Architecture of the proposed ROI, RORI and RONI detector
6.5 Super pixels detected in a surveillance video frame
6.6 The shaded volume shown in the RGB color space is considered as shadow pixel values by the weak shadow detector
6.7 Pixel values of a surface are plotted from a video sequence. Intermittent foreground object motion causes shadows on the surface.
6.8 Shadow scores of super pixels plotted for a surveillance video frame
6.9 Skin scores of super pixels in a surveillance video frame
6.10 HOG feature computation
6.11 DPM part filters
6.12 Edge enhancement of blob boundaries
6.13 Proposed DPM cascade for pedestrian detection
6.14 Sample result of DPM cascade
6.15 Geometry of the surveillance camera system showing ground planes at different elevations.
6.16 Sample surveillance video snapshots showing feasible and infeasible pedestrian hypotheses
6.17 (a) Search region is initialized using the Kalman filter prediction. (b) Positive and negative filters are applied on the FG blob to determine the left and right bounds of the head region.
6.18 The five part-templates are shown here. Feature matching scores are accumulated over these part-templates. Correspondence vectors are computed for each part-template.
6.19 The template and the current frames (separated in time by 10 frames) are shown. NCC scores of the five part-templates for the image in (a) are (0.77, 0.9, 0.87, 0.8, 0.81). The order of the scores is (left-upper-body, right-upper-body, head-shoulder, torso, upper body). NCC scores for the image in (b) are (0.93, 0.69, 0.88, 0.8, 0.83). Here, the score of the right-upper-body template is lower due to occlusions.
6.20 Figure shows detector scores of a pedestrian on the Bilattice square
6.21 (a) Figure shows a pedestrian detection and different super pixels in the blob (b) The pedestrian bounding box is divided into face, torso and leg rectangles. The super pixels in the leg region are assigned a prior RORI score based on the distance ySP.
6.22 Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the ‘Entrance road’ video. The comparison results have been obtained using (I) Proposed method and (II) Only skip detection. The overall bitrate reduction using the proposed technique is 37.2%. The total face region distortion metrics using the proposed method and the FG skip detection encoder were both measured as 40.8 dB
6.23 Figure shows frames from the ‘Entrance road’ video compressed using (a) Only skip detection (b) Proposed ROI encoder.
6.24 Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the ‘Porch’ video. The comparison results have been obtained using (I) Proposed method and (II) Only skip detection. The overall bitrate reduction using the proposed technique is 50.2%. The total face region distortion metrics using the proposed method and the FG skip detection encoder were both measured as 40.9 dB
6.25 Figure shows frames from the ‘Porch’ video compressed using (a) Only skip detection (b) Proposed ROI encoder.
6.26 Figure shows that the proposed ROI encoder removes finer details in the RORI MB’s but maintains image quality of the face region.
6.27 The table shows the different MB labeling errors and their consequences (cells are color coded to signify the severity). Here, the rows correspond to true MB labels and columns to the MB labels assigned by the ROI detector.
6.28 Figure shows that the DPM detector has failed to detect pedestrians A & B.
Pedestrian A is severely occluded by B. The head region of pedestrian B has
poor contrast. Accurate detection of pedestrian B would have reduced bit
cost of the frame by 15kbits. In contrast, detection of pedestrian A would
reduce bit cost by only 1kbit. . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.29 (a) Figure shows a few more DPM detector failures on small pedestrians (b)
Here, the tracker has tracked the pedestrian (bounded by the green box)
based on a previous detection. If only the DPM detector was applied on the
current frame, the pedestrian would have been missed. . . . . . . . . . . . . 171
6.30 (a) Figure shows localization error of the DPM detector. The detector has
included the shadow regions below the pedestrian (due to incorrect shadow
detection) in the bounding box (b) The torso has been detected as a head
shoulder region. Again, the shadow region has been included in the bound-
ing box due to incorrect detection. . . . . . . . . . . . . . . . . . . . . . . . 172
6.31 Figure shows three frames (a), (b) & (c) (that are temporally ordered) in which the
DPM detector has detected the child before (i.e. in (a)) and after (i.e. in (c))
the occlusion. However, the detector has failed during the occlusion (i.e. in
(b)). If the child was not tracked by the tracker, his face image regions would
be encoded in low quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.32 Figure shows the tracker bounding box positions as the tracked pedestrian
gets occluded and reappears later. During the occlusion (i.e. in (b)), the
NCC score of the tracked pedestrian drops from 0.76 to 0.41. This would
trigger the execution of the DPM detector. However, since the template is not
updated, the tracker reassigns the correct bounding box when the pedestrian
reappears from occlusion in (c). The NCC score also increases to 0.7. . . . . 174
6.33 Bit count savings is plotted against the height of the pedestrian image in
pixels. QP = 24 for the video encoded without ROI coding. For the ROI
encoded video, QPROI = 24 and QPRORI = 32. . . . . . . . . . . . . . . . . 176
6.34 Scene shows multiple pedestrians in the scene. Pedestrians A & D cover a
large number of MB’s in the image. Hence, ROI detection on image regions
of these pedestrians provides higher bitrate savings. . . . . . . . . . . . . . . 177
B.1 GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-
MD’) flow chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
B.2 1D sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
B.3 ROC curve of the pixel level classifier for different values of v (or normalized
signal level) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
B.4 Sampler accuracy for different values of stride length (v = 3, L = 20, T = 2.5) . . . 196
B.5 Sampler accuracy for different values of pixel level classifier threshold (v =
3, L = 20, dsys = 10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
C.1 Double Hasse diagrams of different bilattices. In (c), a surveillance video
frame is shown. Also, the logic values of pedestrian and non pedestrian
image regions are shown in the double Hasse diagram. . . . . . . . . . . . . 207
C.2 Double Hasse diagrams show partial ordering based on belief and informa-
tion in bilattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
C.3 Construction of the square bilattice . . . . . . . . . . . . . . . . . . . . . . . 209
Chapter 1
Introduction
1.1 Recent Trends in Video Surveillance
The video surveillance market has undergone marked changes in the last decade. Early
surveillance camera networks were mostly installed and operated by government munici-
pal corporations, prisons, banks and casinos. However, since the beginning of this century,
we see an increased adoption of surveillance cameras in private premises including homes
and commercial buildings. Greater density of surveillance cameras has resulted in higher
coverage. This has significantly increased the use of surveillance footage in criminal investi-
gations and as evidence in judicial inquiries. Surveillance videos have been instrumental in
solving many crimes e.g. the murder of toddler James Bulger, the Boston marathon bomb-
ing & the London 7 July 2005 attacks. However, we frequently find instances where low
resolution surveillance footage has delayed or hindered investigation, for example in the
case of the Bangkok bombing & the Bodh Gaya bombings. Although consumer awareness
of the importance of high definition (HD) surveillance camera footage has increased, mar-
ket penetration of such cameras has been slow. This is primarily due to cost considerations
and insufficient security budgets.
Advances in VLSI manufacturing over the past decade have reduced the cost of HD cam-
eras. However, increasing the resolution of surveillance videos increases the bitrate of the
encoded streams. Also, low light conditions severely affect the compression performance
of video encoders. The increased bitrate due to these factors has resulted in higher data
communication and storage costs. Unlike camera and storage costs, data communication
expenses are recurring in nature. Hence they increase the operating costs of the surveil-
lance system. In addition, higher data communication bandwidth requirements of HD cam-
era streams would necessitate upgrading of the network infrastructure in some cases. Also,
multiple HD video streams routed to a central server would increase network congestion in
routers close to the server.
The aforementioned issues have hindered the adoption of high resolution surveillance
systems. Hence, it is very important to reduce the bitrate of HD camera videos to deliver the
operational requirement of the consumer. In this chapter, we discuss all the stated issues
in detail. We then describe the surveillance video encoder architecture that we propose to
address the challenges. We also enumerate the techniques proposed in this thesis to reduce
the bitrate and computational complexity of surveillance video encoders. The organization
of the thesis is provided in the concluding section of this chapter.
1.2 Bitrate increase in HD surveillance
1.2.1 How much resolution is ‘good enough’?
The British Security Industry Association (BSIA) code of practice for CCTV surveillance
systems [1] classifies tasks based on their intended objectives as: Monitor, Detect,
Observe, Recognise & Identify. Table 1.1 lists each task and the image detail (equivalent
pixels per meter) required at the target distance. It also lists the number of pixels that
should cover the face region in the image (the average width of the human face is 16 centimeters).
In this table, there is a subtle difference between recognition and identification tasks. The
recognition task requires that the observer has seen the pedestrian before. A typical use case
is to make the surveillance video public in media after the crime is committed (Many crimes
have been solved in which people recognise the accused in the publicly released videos and
report to the police). The identification task allows matching of the pedestrian image with
database records. Clearly, the identification task has higher utility in comparison with the
recognition task.
Table 1.1: BSIA standard [1] recommendations for image detail

Task       Description                                                    Pixels/m  Pixels/face
                                                                                    (horizontal)
Monitor    View the number, direction and speed of movement of people     12.5      -
           (given that their presence is known)
Detect     Determine presence of any target (e.g. a person or vehicle)    25        4
Observe    View characteristic details of an individual (e.g. clothing)   62.5      10
Recognise  Identify individuals if operator has seen individual before    125       20
Identify   Identify individuals beyond reasonable doubt                   250       40

To gain a better understanding of the data in Table 1.1, we extract pedestrian image
snapshots with image detail matching the resolution requirements. Fig. 1.1 shows these
images for the Detect, Observe, Recognise & Identify tasks. We can easily appreciate the
importance of capturing high resolution images to perform recognition and identification
operations. Before deployment of a surveillance camera system, the task that the system is
required to perform has to be decided. Based on this, the required resolution of the video
can be determined using the parameters of the camera.
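The relation between the two numeric columns of Table 1.1 can be checked with a short sketch. It assumes only the 16 cm average face width quoted above; the function names are illustrative and not part of the BSIA document.

```python
# Image detail per task in pixels per meter, copied from Table 1.1.
IMAGE_DETAIL_PX_PER_M = {
    "Monitor": 12.5,
    "Detect": 25.0,
    "Observe": 62.5,
    "Recognise": 125.0,
    "Identify": 250.0,
}

FACE_WIDTH_M = 0.16  # average human face width quoted in the text


def pixels_per_face(task):
    """Horizontal pixels covering a face at the image detail required by `task`."""
    return IMAGE_DETAIL_PX_PER_M[task] * FACE_WIDTH_M


def most_demanding_task(px_per_m):
    """Most demanding task whose image detail requirement is met, or None."""
    met = [t for t, d in IMAGE_DETAIL_PX_PER_M.items() if px_per_m >= d]
    return max(met, key=IMAGE_DETAIL_PX_PER_M.get) if met else None


if __name__ == "__main__":
    for task in ("Detect", "Observe", "Recognise", "Identify"):
        print(task, pixels_per_face(task))
```

For the Monitor task the table gives no face-pixel value, since faces are not resolved at that level of detail.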
(a) Detection (b) Observation
(c) Recognition (d) Identification
Figure 1.1: Images with minimum image detail required to perform detection, observation,
recognition & identification surveillance tasks
Fig. 1.2 shows the model for a pinhole camera. Let the image detail (i.e. the number of
pixels that cover a 1 meter length at the object distance) required for the surveillance task be
$d_I^{task}$. The coverage of the camera is defined as the region on the ground plane where the
image detail satisfies the requirements for the surveillance task (listed in Table 1.1). This
region is specified by two parameters, $D_{max}^{task}$ and $S_{max}^{task}$ (see Figs. 1.2 and 1.3).
$D_{max}^{task}$ is the maximum distance (from the camera) and $S_{max}^{task}$ is the maximum
horizontal span within which the image detail requirements are met. Let $f$ be the focal length
of the camera. Let $W$ & $H$ be the width and height of the image sensor respectively. Let
$R_{horz}$ & $R_{vert}$ be the horizontal and vertical camera resolutions respectively. Let
$\alpha_{horz}$ & $\alpha_{vert}$ be the horizontal and vertical angles of view respectively.
$D_{max}^{task}$ & $S_{max}^{task}$ can be computed from the camera parameters using Eqns. 1.1,
1.2 & 1.3 (We note that the camera tilt has not been considered while deriving the equations.
This would result in a small difference between the true coverage and coverage computed using
Eqns. 1.1, 1.2 & 1.3).

\begin{align}
D_{max}^{task} &= \min\left( \frac{R_{horz}\, f}{d_I^{task}\, W},\ \frac{R_{vert}\, f}{d_I^{task}\, H} \right) \tag{1.1} \\
&= \min\left( \frac{R_{horz}}{2\, d_I^{task} \tan(\alpha_{horz}/2)},\ \frac{R_{vert}}{2\, d_I^{task} \tan(\alpha_{vert}/2)} \right) \tag{1.2} \\
S_{max}^{task} &= \frac{R_{horz}}{d_I^{task}} \tag{1.3}
\end{align}
Eqns. 1.1 & 1.2 show that for a given task and fixed camera lens parameters, the
coverage region increases linearly with the resolution. Fig. 1.3 shows $D_{max}^{task}$ and
$S_{max}^{task}$ for recognition operations using a MOBOTIX surveillance camera with a B036
lens [4]. The figure shows that the coverage region of low resolution cameras
is very small. For example, at VGA resolution, the maximum distance at which we can
recognize the subject is 2.4m. This clearly indicates that HD cameras are required to provide
reasonable coverage for recognition and identification tasks.
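Eqns. 1.2 & 1.3 can be evaluated numerically with the sketch below. The function names are illustrative. The helper `vertical_angle_deg` assumes square pixels (so that H/W equals the ratio of vertical to horizontal resolution), which is an assumption made here and not a statement about any particular camera.

```python
import math


def max_distance(r_horz, r_vert, alpha_horz_deg, alpha_vert_deg, d_task):
    """Maximum coverage distance in meters, following Eqn. 1.2."""
    horz = r_horz / (2.0 * d_task * math.tan(math.radians(alpha_horz_deg) / 2.0))
    vert = r_vert / (2.0 * d_task * math.tan(math.radians(alpha_vert_deg) / 2.0))
    return min(horz, vert)


def max_span(r_horz, d_task):
    """Maximum horizontal span in meters, following Eqn. 1.3."""
    return r_horz / d_task


def vertical_angle_deg(alpha_horz_deg, r_horz, r_vert):
    """Vertical angle of view, ASSUMING square pixels (H / W = r_vert / r_horz)."""
    half_horz = math.radians(alpha_horz_deg) / 2.0
    return 2.0 * math.degrees(math.atan(math.tan(half_horz) * r_vert / r_horz))


if __name__ == "__main__":
    d_identify = 250.0  # pixels per meter for the Identify task (Table 1.1)
    for name, (rh, rv) in {"12MP": (4000, 3000), "VGA": (640, 480)}.items():
        av = vertical_angle_deg(70.0, rh, rv)
        dist = max_distance(rh, rv, 70.0, av, d_identify)
        print(name, round(dist, 1), "m distance,", max_span(rh, d_identify), "m span")
```

With a 70° horizontal angle of view and the Identify requirement of 250 pixels/m, this gives roughly 11.4 m for a 4000 x 3000 imager and roughly 1.8 m for VGA. Under the square pixel assumption the two terms of the minimum coincide, so the horizontal term alone determines the result.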
Also, previously, when operators were compelled to use low resolution cameras (due to
cost considerations), surveillance monitoring over large regions was achieved by increasing
the number of cameras. However, it is preferred to have larger coverage regions using fewer
cameras since reducing the number of cameras avoids the installation and wiring of multi-
ple devices. This further motivates the adoption of high definition cameras to achieve the
surveillance task. Anticipating these market requirements, almost all the leading surveil-
lance camera manufacturers have introduced 12MP (4000 * 3000) resolution imagers in
their latest product offerings.
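[Figure 1.2 diagram labels: image sensor of width $W$ and height $H$; focal length $f$;
horizontal angle of view $\alpha_{horz}$; maximum distance $D_{max}^{task}$; maximum horizontal
span $S_{max}^{task}$; image detail of object = $d_I^{task}$ pixels/meter]
Figure 1.2: Pinhole camera model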
[Figure 1.3 labels: Camera; 2.4m, 6m, 7.7m, 10.3m, 19.4m, 25.9m; coverage at 0.3MP (640x480),
3MP (1920x1536) and 6MP (3072x2048) resolution]
Figure 1.3: Top view showing the coverage of a MOBOTIX surveillance camera with a
3.6mm lens [4] for recognition tasks
1.2.2 Bitrate versus camera resolution
Fig. 1.4 shows the average typical optimized bandwidth of Bosch security cameras [5, 6]
plotted against the image resolution. This data suggests network bandwidth requirements
to be in the range of 4 - 6Mbps for 12MP H.264 surveillance videos. Such a camera used
for identification tasks can provide coverage up to 11m (when horizontal angle of view is
set to 70°). In comparison, the bitrate of a VGA stream is ≈ 600kbps. However, coverage
distance achieved by the VGA camera for the identification task is only ≈ 1.8m. Hence, a
VGA camera would not be suitable for such tasks. The graph also shows that increasing
the frame rate increases the bitrate only sublinearly. This is due to efficient removal of
temporal redundancy by the H.264 encoder when frames are more closely spaced in time.
Figure 1.4: Average typical optimized bitrate of Bosch security cameras [5, 6] plotted
against the camera resolution
1.3 Bitrate increase in low light surveillance
In the previous section, the average bitrate of Bosch surveillance cameras was found to
increase to as much as 6Mbps when 12MP cameras are used. In low light conditions, the encoded
video bitrate increases further due to increased noise in the camera image. To gain insight
into this issue, we first need to understand the interplay between image noise, camera blur
and scene lighting.
1.3.1 Interplay between exposure, gain and noise
In Fig. 1.5, we show a surveillance camera image captured under low light conditions. Fig.
1.6 shows snapshots from the scene captured with high and low exposure settings. We can
observe that the facial details are blurred when exposure setting is high. Recognition tasks
will not be possible with such videos. We now set the exposure such that the blur is reduced
(like in Fig. 1.6b) and vary the gain of the camera. When the gain of the camera is low,
the visibility is very poor. Hence, the gain has to be increased to enable recognition of the
person in the video.
However, increasing the gain of the camera increases the noise in the image. This can
be clearly seen in Fig. 1.7b. This noise is caused by amplification of the photon and
read noise of the imager [10]. Low light performance can be improved by using bigger
image sensors. However, the cost of cameras based on large image sensors is significantly
higher than that of small pixel size cameras. Hence, surveillance installations very
commonly utilize small image sensors.
Figure 1.5: Low light scene with nominal exposure and gain settings
(a) Long exposure time (b) Short exposure time
Figure 1.6: Surveillance snapshots with different camera exposure settings (camera gain
has been increased to improve visibility and contrast)
(a) Low gain (b) High gain
Figure 1.7: Surveillance snapshots with different camera gain settings
1.3.2 Bitrate versus noise
To understand the impact of noise on the bitrate of the compressed surveillance footage, we
have varied the camera gain and encoded the video using the x264 video encoder [11]. Fig.
1.8 shows the bitrate plotted against the noise. Here, noise is computed as the standard
deviation of a pixel time series (100 frames) averaged over the entire image. Only the gray
scale values of the pixels are considered. The scene did not have any foreground objects.
Fig. 1.8 shows that the bitrate increases drastically as the pixel noise increases. The high
frequency noise content in the image results in large residual content and hence severely
affects the compression performance (It is well known that explosions in gaming videos
result in very high bitrates).
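The noise measure described above (standard deviation of each pixel's time series, averaged over the image) can be sketched as follows. The text does not state whether the population or the sample standard deviation was used; the population form is assumed here. Frames are plain nested lists of grayscale values to keep the sketch dependency free.

```python
from statistics import pstdev


def temporal_noise(frames):
    """Mean over all pixels of the per-pixel temporal standard deviation.

    `frames` is a list of equally sized grayscale frames, each a list of rows.
    Population standard deviation is an ASSUMPTION; the text does not specify it.
    """
    height = len(frames[0])
    width = len(frames[0][0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            series = [frame[y][x] for frame in frames]  # pixel time series
            total += pstdev(series)
    return total / (height * width)


if __name__ == "__main__":
    # Two-pixel scene: one pixel flickers between 10 and 12, the other is static.
    frames = [[[10, 50]], [[12, 50]], [[10, 50]], [[12, 50]]]
    print(temporal_noise(frames))
```

Because the scene contains no foreground objects, any temporal variation measured this way is attributable to sensor noise rather than motion.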
Figure 1.8: Bitrate versus pixel noise in a low light scene
1.4 Challenges due to increased bitrate
From the discussion in Section 1.2, it is clear that higher resolutions (12MP) are required to
enable recognition and identification capabilities of the surveillance system. Unfortunately,
such high resolution video streams require large data communication bandwidths. In a
typical scenario where we need to support a large number of cameras (e.g. in a campus),
transporting such high bandwidth video streams necessitates high speed Ethernet LAN
infrastructure. Existing network systems which have been designed for VGA or slightly higher
resolutions might require upgrading. High bitrate video streams also increase the storage
cost. In the case of surveillance encoders that employ rate control, insufficient network
bandwidth would force the encoder to reduce the image quality or to reduce the frame
rate.
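To give a sense of scale, the sketch below converts the bitrates quoted in Section 1.2.2 (about 6Mbps for 12MP and about 600kbps for VGA) into daily storage per camera and into the number of streams a shared link can carry. The 100Mbps link capacity is a hypothetical value chosen for illustration, and protocol overhead is ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def storage_gb_per_day(bitrate_bps):
    """Storage needed for one day of continuous recording, in gigabytes (10^9 bytes)."""
    return bitrate_bps * SECONDS_PER_DAY / 8.0 / 1e9


def cameras_per_link(link_bps, bitrate_bps):
    """Number of streams that fit on a link, ignoring protocol overhead."""
    return int(link_bps // bitrate_bps)


if __name__ == "__main__":
    link_bps = 100e6  # HYPOTHETICAL shared link capacity, not from the text
    for name, rate in {"12MP": 6e6, "VGA": 600e3}.items():
        print(name, storage_gb_per_day(rate), "GB/day,",
              cameras_per_link(link_bps, rate), "cameras per link")
```

At 6Mbps a single camera produces about 64.8 GB per day, ten times the 6.48 GB per day of a 600kbps VGA stream.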
Also, the discussion in Section 1.3 showed that low light conditions exacerbate the issues
of high bandwidth requirement. Surveillance videos captured in poor lighting conditions
are very common. Identification tasks are inherently difficult in such videos. High bitrate
would cause rate control based encoders to further reduce the quality. This would make
such videos unusable.
‘On demand’ video surveillance systems have been used increasingly for applications
such as temporary asset monitoring & crisis monitoring. Surveillance camera coverage
could be increased temporarily during specific events (predictable events, e.g. sporting
events or unpredictable events, e.g. street protests) by building adhoc networks of low
power battery operated wireless cameras. Video surveillance data communication for such
applications in the market [12] currently uses 3G / 4G (LTE and HSPA+) cellular standards.
However, delivering high bandwidth video data over such wireless networks would severely
limit the number of cameras that can be deployed. Higher bitrate also increases operating
costs due to higher data communication costs and increased storage requirements.
In addition to bitrate reduction, it is also important to reduce the computational
complexity and power consumption of the camera platform. Computational complexity
reduction helps in reducing the cost of the camera. Also, reducing the complexity lowers the power
consumption. Although power is not a very big concern in fixed camera platforms (pow-
ered from a wall socket), it is very critical in On-Demand surveillance applications where
the camera is powered by a battery. The total power consumption of such platforms is equal
to the sum of the image sensor power, video encoder power and the network device (wired
/ wireless) power. The power consumption of the encoder and the network device dominates
the total platform power. For example, the Omnivision OV9712 720p resolution image sen-
sor consumes 110mW. In comparison, the 720p resolution Bosch TINYON IP 2000 camera
platform consumes 2.65W. Increasing the video resolution increases the power consumed
by the encoder and network device. For example, the 12MP resolution Bosch DINION IP
ultra 8000 MP camera consumes 9W of power. For On-Demand surveillance applications,
such high energy requirements would reduce the operational time of the camera.
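The effect of platform power on operational time can be illustrated with the figures quoted above. The battery capacity of 50 Wh is a hypothetical value introduced only for this sketch, and conversion losses are ignored.

```python
SENSOR_POWER_W = 0.110  # Omnivision OV9712 figure quoted in the text
PLATFORM_POWER_W = {"720p": 2.65, "12MP": 9.0}  # camera platform figures from the text
BATTERY_WH = 50.0  # HYPOTHETICAL battery capacity, not from the text


def operating_hours(battery_wh, platform_w):
    """Ideal run time in hours, ignoring conversion losses."""
    return battery_wh / platform_w


def sensor_share(sensor_w, platform_w):
    """Fraction of the platform power attributable to the image sensor."""
    return sensor_w / platform_w


if __name__ == "__main__":
    for name, watts in PLATFORM_POWER_W.items():
        print(name, round(operating_hours(BATTERY_WH, watts), 1), "hours")
    print("sensor share at 720p:", round(sensor_share(SENSOR_POWER_W, 2.65), 3))
```

Under these assumptions the image sensor accounts for only about 4% of the 720p platform power, and moving from the 2.65W platform to the 9W platform cuts the run time by a factor of about 3.4.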
1.5 Thesis Contribution
In the previous section, we have described the challenges of increased bitrate & computa-
tional complexity in surveillance camera systems. In this thesis, we propose four techniques
to alleviate these challenges. We now provide a brief introduction of our contributions in
this section.
1.5.1 Proposed surveillance video encoder architecture
Fig. 1.9 shows the high level architecture of the surveillance encoder system. The input
video frames from the camera sensor are stored in an image buffer. The proposed tech-
niques analyze the frames and generate control parameters (e.g. MB QP’s, skip decision of
MB’s, index of the reference frame to be replaced) that are signaled to the H.264 encoder.
The frames that have been analyzed by the proposed techniques are compressed using the
H.264 video encoder. The output NAL unit stream is transmitted using a wired / wireless
communication module.
[Figure 1.9 block labels: Camera; Video stream; Proposed techniques for bitrate reduction
(Speeded up GMM + Skip decision, Reference frame selection, Face ROI region detection);
H.264/AVC encoder (Motion Est. & Compensation, Mode decision, DCT and Quant., Entropy coding,
Reconstruct, Frame buffer); NAL Buffer]
Figure 1.9: Architecture of the proposed surveillance video encoder
1.5.2 Bitrate & computational complexity reduction
The four techniques we introduce in this thesis to reduce bitrate and computational com-
plexity of surveillance encoders are: (I) Speeded up foreground segmentation (II) Skip de-
cision (III) Reference frame selection & (IV) Face Region-of-Interest (ROI) coding. To offer
better insights of our contributions, we partition the bitrate cost and describe how the pro-
posed techniques reduce each cost component. The bit cost of a static-camera surveillance
video stream can be partitioned as follows:
• Background image region coding cost
• Uncovered background image region coding cost
• Shadow image region coding cost
• Non face image region (clothing, arms) coding cost
• Face image region coding cost
Background, uncovered background and shadow image regions do not contain useful
information required to perform surveillance tasks. Hence, coding of these regions can be
skipped. Image regions of foreground objects such as cars and pedestrians in the scene are
required. However, high image fidelity on all the foreground image regions is not required.
For example, in pedestrian surveillance, only the face image regions need to be encoded
in high quality. Non face image regions, i.e. images capturing clothing and arms can be
encoded in lower quality. From this discussion, it is clear that optimal bit allocation based
on regions of interest (ROI) can help to reduce the bitrate without affecting the surveil-
lance task. However, such a ROI encoder system should accurately label image regions as
‘background’, ‘foreground’, ‘non face’ and ‘face’. Incorrect marking of ‘face’ regions as ‘non
face’/‘background’ will severely impact the utility of the encoded video. Also, as already
noted in the previous section, the computational complexity of the ROI detector should
be minimized. In this thesis, we introduce multiple techniques to accurately determine
regions of interest required to reduce the bitrate.
The uncovered background image regions cannot be marked as ‘Skip’ if the reference
frames do not contain the appropriate image content required for reconstruction. Hence,
we introduce a technique to optimally select the reference frames required to reconstruct
uncovered background image regions. We also describe the computation of the H.264 en-
coder parameters (skip mode, QP and reference frame index) required to implement the
proposed techniques in this thesis.
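The mapping from region labels to encoder control parameters discussed above can be sketched as a small lookup. This is an illustration under stated assumptions and not the parameter computation described later in the thesis: the label names and the `MBDecision` type are hypothetical, the QP values 24 and 32 are borrowed from the settings quoted in the caption of Fig. 6.33, and the QP used for uncovered background that cannot be skipped is an arbitrary choice made here.

```python
from dataclasses import dataclass
from typing import Optional

QP_FACE = 24      # high quality, value taken from the Fig. 6.33 caption (QP_ROI)
QP_NON_FACE = 32  # reduced quality, value taken from the Fig. 6.33 caption (QP_RORI)


@dataclass(frozen=True)
class MBDecision:
    """Hypothetical per-macroblock control parameters passed to the encoder."""
    skip: bool
    qp: Optional[int]  # None when the macroblock is skipped


def mb_decision(label, background_in_reference=True):
    """Map a region label to a skip / QP decision."""
    if label in ("background", "shadow"):
        return MBDecision(skip=True, qp=None)
    if label == "uncovered_background":
        # Skip is only valid if a reference frame holds the background content.
        if background_in_reference:
            return MBDecision(skip=True, qp=None)
        return MBDecision(skip=False, qp=QP_NON_FACE)  # ASSUMED QP for this case
    if label == "face":
        return MBDecision(skip=False, qp=QP_FACE)
    if label == "non_face":
        return MBDecision(skip=False, qp=QP_NON_FACE)
    raise ValueError(f"unknown label: {label}")


if __name__ == "__main__":
    for lbl in ("background", "shadow", "uncovered_background", "non_face", "face"):
        print(lbl, mb_decision(lbl))
```

The `background_in_reference` flag makes explicit why reference frame selection matters: without suitable reference content, uncovered background has to be coded rather than skipped.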
We now provide a synopsis of the proposed techniques:
1. Speeded up foreground segmentation: To label background image regions, we pro-
pose to use the GMM based segmentation algorithm. The variance parameter of the
GMM algorithm models the noise statistics in low light conditions and the multiple
modes of the GMM capture environment noise, hence reducing false positives. How-
ever, the GMM algorithm is compute intensive. Hence, we propose a modification to
the adaptive Gaussian Mixture Model (GMM) based foreground segmentation algo-
rithm to reduce computational complexity. This is achieved by replacing expensive
floating point computations with low cost integer operations. To maintain accuracy,
we compute periodic floating point updates for the GMM weight parameter using the
value of an integer counter. Experiments show speedups in the range of 1.33 - 1.44
on standard video datasets where a large fraction of pixels are multimodal.
2. Skip decision: As we have already seen, bit cost of background regions is very high
under low light conditions. Even under good lighting, bit cost of background regions
increases with increase in environmental noise (e.g. shaking tree leaves). To detect
such noisy background regions, we propose a spatial sampler based skip decision
technique. The spatially sampled pixels are segmented using the speeded up GMM
algorithm. The storage pattern of the GMM parameters in memory is also modified
to improve cache performance. Skip selection is performed using the segmentation
results of the sampled pixels. Using a two stage sampler reduces the computation
complexity without affecting the accuracy of the skip detector. Experimental results
show bit rate savings of up to 94.5% over methods proposed in literature on video
surveillance data sets. The proposed techniques also provide up to 74.5% reduction in
compression complexity without increasing the distortion over the foreground regions
in the video sequence.
3. Reference frame selection: A reference frame selection algorithm is proposed to
maximize the number of background Macroblocks (MB’s) (i.e. MB’s that contain
background image content) in the Decoded Picture Buffer. This reduces the cost of
coding uncovered background regions. Distortion over foreground pixels is measured
to quantify the performance of skip decision and reference frame selection techniques.
4. Face Region-of-Interest (ROI) coding: A face ROI encoding technique for pedestrian
surveillance is proposed. Face and shadow region detection is combined with the skip
decision algorithm to perform ROI coding for pedestrian surveillance videos. Since
person identification requires high quality face images, MB's containing face
image content are encoded with a low Quantization Parameter (QP) setting (i.e. high
quality). Other regions of the body in the image are considered as RORI (Regions of
reduced interest) and are encoded at low quality. The shadow regions are marked as
Skip. Techniques that use only facial features to detect the ROI MB’s (e.g. Viola Jones
face detector) are not robust in real world scenarios. Hence, to accurately determine
the ROI, RORI & RONI MB’s, we combine the outputs of multiple detectors. We pose
the MB labelling task as a super pixel classification problem. Shadow and skin detec-
tor scores of super pixels are computed. Pedestrians are detected using deformable
part models. The face region is determined using the deformed part locations. De-
tected pedestrians are tracked using an optical flow based tracker combined with a
Kalman filter. The tracker improves the accuracy and also avoids the need to run
the object detector on already detected pedestrians. Bilattice based logic inference is
used to combine multiple likelihood scores and determine the labels of the super pix-
els. The coding mode and QP values of the MB’s are computed using the super pixel
labels. Results show that the proposed face ROI coding technique provides a further
reduction in bitrate of up to 50.2%.
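The idea in item 1 of replacing per-frame floating point weight updates with an integer counter and a periodic floating point update can be illustrated with a simplified sketch. It is not the algorithm of Chapter 3. It assumes the commonly used adaptive GMM weight recursion w ← (1 − α)w + αM, where M is 1 when the mode matches the pixel and 0 otherwise, and the batched closed form in `periodic_weight` is an approximation introduced here: it is exact when the mode matches in all or in none of the frames of a period and approximate otherwise.

```python
def per_frame_weight(w0, alpha, matches):
    """Reference: one floating point weight update per frame."""
    w = w0
    for matched in matches:
        w = (1.0 - alpha) * w + alpha * (1.0 if matched else 0.0)
    return w


def periodic_weight(w0, alpha, match_count, period):
    """One floating point update per period, driven by an integer match counter.

    ASSUMED approximation for illustration: the matches are treated as evenly
    spread over the period.
    """
    decay = (1.0 - alpha) ** period
    return decay * w0 + (1.0 - decay) * (match_count / period)


if __name__ == "__main__":
    alpha, period, w0 = 0.01, 16, 0.3
    matches = [i % 2 == 0 for i in range(period)]  # mode matched in every other frame
    count = sum(matches)  # the only per-frame work is an integer increment
    print(per_frame_weight(w0, alpha, matches))
    print(periodic_weight(w0, alpha, count, period))
```

In this sketch the per-frame cost for a mode reduces to incrementing a counter, and the floating point multiplication is paid once per period.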
1.6 Organization of the thesis
A review of state-of-the-art video coding techniques for surveillance is presented in Chapter
2. Chapter 3 describes the technique we propose to reduce computational complexity of
foreground segmentation. The results of the proposed algorithm are also included in the
same chapter. Chapter 4 introduces a skip decision technique for H.264 surveillance video
coding. A reference frame selection algorithm for H.264 surveillance video coding is also
described in Chapter 4. The results of the skip decision and reference frame selection
techniques are provided in Chapter 5. The Face ROI encoder that we propose is described
in Chapter 6. Chapter 7 concludes the thesis and discusses a few open problems for future
work.
Chapter 2
Background and Related Work
2.1 Introduction
In the previous chapter, we described the challenges faced by consumers due to increasing
bandwidth requirements of surveillance video cameras. We also listed the four techniques
we propose to reduce bitrate and computational complexity of surveillance encoders. We
achieve this by exploiting unique characteristics and requirements of static camera surveil-
lance video encoders. Before we delve into the details of these techniques in the successive
chapters, we review the state-of-the-art surveillance video compression algorithms here.
We begin by providing a concise summary of different video encoding techniques. We
briefly describe the H.264 standard and a few relevant aspects. Following this, we review
bitrate and computational complexity reduction techniques proposed by previous researchers.
2.2 A review of video encoding techniques
Video coding has evolved rapidly with the development of Motion JPEG, MPEG-1, MPEG-2,
MPEG-4 Part 2 (MPEG-4 Visual), H.264/MPEG-4 AVC, VP8, VP9 and HEVC video standards.
Along with the standardization of video coding techniques by committees, research groups
in industry and academia have explored various video coding ideas. All these techniques
can be broadly classified as lossy and lossless. Lossless compression permits the decoder
to reconstruct the original video from the compressed data. Lossy compression allows only
the reconstruction of an approximation of the original video. Well designed lossy com-
pression systems provide good bitrate savings without degrading the quality too much. All
surveillance systems use lossy compression techniques which we will now briefly review.
1. Discrete Cosine Transform (DCT) based intra frame coding: Each video frame of
the video sequence is compressed separately using DCT based image coding. The
quantized coefficients are reordered and losslessly packed into the output bit stream.
DCT intra frame encoders do not exploit the inter frame redundancy and hence have
low compression ratios. However, this coding technique is very robust against com-
munication channel errors. MJPEG is a popular DCT intra frame coding standard
which was used by early surveillance systems. Current surveillance systems continue
to offer capability of streaming MJPEG format videos [5].
2. Object based video coding: The scene image is considered to be a collection of
multiple video objects (VO). The background is also considered as a video object.
The VO’s are coded separately. The objects can have arbitrary shapes. This coding
technique also allows individual objects to be encoded with different quantization
parameters. It also allows different temporal resolutions to be specified for the objects.
Hence, object based video standards seamlessly allow encoders to implement ROI
compression.
MPEG-4 Visual (MPEG-4 Part 2) is one of the popular standards that support object based video
coding. This standard specifies shape coding, motion compensation and texture cod-
ing of arbitrary-shaped video objects. In shape coding, the shape of the edge of the
object needs to be specified by the encoder. The mask which indicates the pixels that
are part of the VO is coded using Context based binary Arithmetic Encoding. Each
pixel is assigned a transparency parameter. Block DCT of the gray scale transparency
data is quantized, reordered, run-level and entropy coded. Richardson [13] provides
a detailed discussion of the MPEG-4 standard.
3. Block based DPCM/DCT coding: This has been one of the most popular coding
techniques adopted by a majority of the successful video compression standards, e.g.
MPEG-1, MPEG-2, H.261 and H.263, H.264/MPEG-4 AVC, VP8, VP9, HEVC. Encoders
supporting these standards consist of four stages, i.e. motion estimation & compensa-
tion, transform computation, entropy coding and reference frame reconstruction. The
H.264 standard which is used in this thesis is described in more detail in Section 2.3.
4. Distributed video coding: Motivated by theoretical results of distributed source cod-
ing by Slepian and Wolf [14, 15], there have been a lot of attempts to shift the
computational complexity from the encoder to the decoder [16, 17, 18]. Here, tem-
poral redundancy of the video sequence is exploited only at the decoder, i.e. motion
estimation is performed at the decoder to obtain estimates for the side information.
The complexity of the encoder is very low since it only uses intra frames. This has
special appeal in the case of battery operated wireless video surveillance systems in
which reducing the complexity of the encoder increases the life of the system. The
decoder would be part of the network transcoder or a server connected to a power
line. The robustness of this technique against communication errors is also higher
than inter frame encoders since errors do not accumulate over multiple frames.
Until recently, H.264 has been the best performing video coding standard widely used by
surveillance video encoders. It is a block based DPCM/DCT coding standard that specifies a
large set of encoding techniques for efficient video compression, e.g. variable block-size motion
compensation, multiple reference frames, bi-directionally predicted frames, various intra
prediction modes, six tap filtering, an in-loop deblocking filter and CABAC. HEVC, the
newly finalized standard, has improved upon H.264 by introducing numerous advances, e.g.
large coding tree blocks, more intra prediction directions, improvements to the deblocking
filter, adaptive motion vector prediction. However, these techniques are computationally
very expensive and require dedicated hardware accelerators to implement the encoder. As
a result, surveillance camera manufacturers continue to mostly use the H.264 standard in
almost all their product offerings. Hence, in this thesis, all the proposed techniques have
been implemented and tested using the H.264 encoder. However, we note that the algo-
rithms introduced here can also be applied to the new DPCM/DCT based standards (i.e.
HEVC and VP9). In the next section, we describe the H.264 video encoding standard, with
greater emphasis provided on techniques relevant to this thesis.
2.3 H.264 basics
The architecture of the H.264 encoder [7] is shown in Fig. 2.1. Inter prediction (Motion
estimation & Motion compensation) and intra prediction are performed on the macroblocks
(MBs) in the input frame FC. Mode decision determines the best mode for each of the
macroblocks. The MB residuals for the best mode are computed and are then transformed
and quantized. The quantized coefficient data is reordered in a zigzag sequence to ensure
that the low frequency DCT coefficients are clustered. The reordered residual data, motion
vector values and associated header information for each macroblock are entropy coded
and packed into NAL units.
The encoder requires the decoded frame to perform Motion Estimation (ME) on
future frames. Hence, inverse quantization & inverse transform operations are performed on the
quantized data and the reconstructed frame is stored in the Decoded Picture Buffer (DPB).
An in-loop deblocking filter is also applied on the reconstructed frame before storing it into
the DPB.
[Figure 2.1 block diagram. Inputs: FC (current frame) and the reference frames F'x and F'y held in the DPB. Forward path: ME, MC, intra prediction, choice of intra prediction, mode decision, T, Q, reorder, entropy encoding, NAL and the encoded video stream, with rate control. Reconstruction path: Q-1, T-1, deblocking filter and F'C (reconstructed).
Abbreviations: ME: Motion estimation; MC: Motion compensation; T: Transform; Q: Quantize.]
Figure 2.1: Architecture of the H.264 encoder [7]
2.3.1 Reference frames
Inter predicted MB’s use image data in the reference frames (stored in the DPB) to reduce
the residual content. After the current frame is compressed, it is reconstructed and is stored
in the DPB for inter prediction by future frames. The H.264 standard allows storage of up
to 16 frames in the DPB. Along with this, the standard also allows the encoder to manage
the DPB, i.e. to insert or remove frames from the DPB. The standard specifies two ways to
achieve this: (I) Sliding window in which the oldest short term reference frame is removed
(II) Adaptive memory control which is supported through Memory Management Control
Operations or MMCO commands. Reference pictures in the DPB are marked as either short
term or long term. The oldest short term picture is removed from the DPB when the DPB
is full. Fig. 2.2 shows an example where MMCO commands have been used to manage
the DPB. At time instance n + 1, the reconstructed frame has been marked as a long term
reference picture and has been inserted into the DPB with long term frame index set to 4.
[Figure 2.2 panels, DPB contents by reference type:
(a) Frame n: short term 45, 48, 52, 54; long term 1
(b) Frame n + 1: short term 45, 48, 52, 54; long term 1, 4
(c) Frame n + 2: short term 48, 52, 54, 57; long term 1, 4
(d) Frame n + 3: short term 48, 52, 54, 57; long term 4, 5]
Figure 2.2: Decoded Picture Buffer managed using MMCO commands
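The two DPB management mechanisms described above can be illustrated with a small model. The following is a minimal sketch and not the normative marking process: the class name, its methods and the example capacity are hypothetical, long term pictures are keyed by their long term frame index, and MMCO operations other than long term marking (e.g. explicit removal of pictures) are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DecodedPictureBuffer:
    """Toy DPB: sliding window for short term frames, explicit long term marking."""

    capacity: int = 16  # the standard allows up to 16 frames
    short_term: List[int] = field(default_factory=list)      # frame_num values, oldest first
    long_term: Dict[int, int] = field(default_factory=dict)  # long term index -> frame_num

    def _make_room(self) -> None:
        # Sliding window: drop the oldest short term frame while the buffer is full.
        while len(self.short_term) + len(self.long_term) >= self.capacity:
            if not self.short_term:
                raise RuntimeError("DPB is full of long term frames")
            self.short_term.pop(0)

    def insert_short_term(self, frame_num: int) -> None:
        self._make_room()
        self.short_term.append(frame_num)

    def insert_long_term(self, frame_num: int, long_term_idx: int) -> None:
        # MMCO-style marking: a frame stored under an existing index is replaced.
        if long_term_idx not in self.long_term:
            self._make_room()
        self.long_term[long_term_idx] = frame_num
```

With a capacity of 6, four short term frames and one long term frame, marking a new frame as long term fills the buffer, and the next short term insertion evicts the oldest short term frame while both long term frames are retained.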
Prediction for each MB can be obtained from either a single frame (in a P type slice)
or from two frames (in a B type slice). H.264 specifies two reference lists which contain
‘Picture Order Count’ (POC) values of frames in the DPB. For MB’s in a P frame, only a single
reference picture is used for prediction. Hence, only the index into reference picture List 0
is signaled in the bitstream. B frame MB’s utilize two frames for prediction and consequently
require the encoder to specify indices into both reference picture lists. B frames increase
the latency and the complexity of the encoder. In this thesis, we utilize only P slices. Hence,
we review reference frame list management details for only P frames.
Fig. 2.3 shows the List 0 and a sequence of video frames. Macroblock MB1 in the
current frame uses a previous picture (frame_num 45) for motion compensation. The
previous picture with frame_num equal to 45 is marked as a short term reference frame
in the DPB. Its position in List 0 (zero indexed from top) is 1. Hence, the encoder signals
List0(1) as the reference frame in the mb_pred syntax element (SE) for MB1.
The default reference picture list order places the short term frames on top of the list
in decreasing order of PicNum. The long term reference frames are placed below the short
term frames in increasing order of LongTermPicNum. The H.264 standard allows the en-
coder to change the order in the list using Reference Picture List Reordering (RPLR) com-
mands. In the example in Fig. 2.3, we can see that reordering has been performed to bring
the long term frame 4 onto the top of List 0.
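The default ordering and the effect of reordering can be sketched as follows. This is an illustrative model only: the helper names are hypothetical, frame_num is used directly as PicNum (which holds only while frame_num has not wrapped), and move_to_front is a simplified stand-in for the RPLR commands rather than their actual syntax. The frame numbers are those of the example in Fig. 2.3.

```python
from typing import List, Tuple

# ("ST", PicNum) for short term pictures, ("LT", LongTermPicNum) for long term pictures
RefPic = Tuple[str, int]


def default_list0(short_term: List[int], long_term: List[int]) -> List[RefPic]:
    """Short term frames first (decreasing PicNum), then long term (increasing LongTermPicNum)."""
    st = [("ST", n) for n in sorted(short_term, reverse=True)]
    lt = [("LT", n) for n in sorted(long_term)]
    return st + lt


def move_to_front(list0: List[RefPic], pic: RefPic) -> List[RefPic]:
    """Simplified reordering: place one picture at index 0, keep the rest in order."""
    return [pic] + [p for p in list0 if p != pic]


list0 = default_list0([45, 42, 40, 39], [1, 4])
reordered = move_to_front(list0, ("LT", 4))
```

After the reordering, the long term frame 4 is at List0(0) and the short term frame 45 is at List0(1), which agrees with the index signaled for MB1 in the example.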
2.3.2 Macroblock Skip mode
Residual data in static scenes is very low. Since such static image blocks are commonly
found, the H.264 standard allows the encoder to mark these MB’s as skip. Neither transform
coefficient data nor motion vector data is transmitted for a skipped MB. The decoder
reconstructs the MB image using motion vector prediction (MVP). The reference picture for a
skip MB is always the frame indexed at the top of List 0, i.e. List0(0).
2.3.3 Macroblock QP signaling
The standard provides a combined definition for the transform, scaling and quantization
operations performed on the residual data of a MB (Richardson in [7] provides a very
good description of the transform and quantization computations in H.264). Let matrix X
represent a block of residual data in a MB (under default settings, X is a square matrix of size
4 × 4). Let Y denote the output of the transform, scaling and quantization operations. Then,
[Figure 2.3 content: the DPB holds short term frames 45, 42, 40, 39 and long term frames 4, 1. List 0 has been reordered so that LT 4 is at index 0 and ST 45 is at index 1; the remaining entries are ST 42, ST 40, ST 39 and LT 1. In the current frame (frame_num = 46), MB1 references List0(1) and MB2 references List0(3). The video sequence shows frame_num 42 to 45, with reference frames shaded in gray.]
Figure 2.3: Reference frame list management in H.264
\[
Y = \mathrm{round}\left( [C_f] \cdot [X] \cdot [C_f^T] \circ m(QP \,\%\, 6) \cdot \frac{1}{2^{15 + \lfloor QP/6 \rfloor}} \right) \tag{2.1}
\]
Here Cf is the forward core transform matrix and is primarily responsible for energy
compaction. The scaling and the quantization processes are combined to obtain the re-
maining terms in Eqn. 2.1. m(QP%6) represents a matrix whose element values depend on
the value of QP (see [7] for more details). Here, QP is called the ‘Quantization
Parameter’. Increasing the value of QP increases the quantization step size and hence reduces the
quality. As noted in [7], all the arithmetic operations in Eqn. 2.1 can be done using integer
arithmetic.
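The integer-only evaluation of Eqn. 2.1 can be sketched for a single 4 × 4 block as follows. This is a minimal sketch under stated assumptions: the function and table names are hypothetical, the multiplication factors are the commonly tabulated H.264 values (they should be checked against [7]), the shift is taken as 15 + floor(QP/6) so that the step size doubles every 6 QP values, and plain round-to-nearest is used, whereas practical encoders typically use a smaller, mode dependent rounding offset.

```python
from typing import List

Matrix = List[List[int]]

# Forward core transform matrix C_f
CF: Matrix = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]

# m(QP % 6) for the three position classes of a 4x4 block:
# (both indices even, both indices odd, all other positions)
M_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transpose(a: Matrix) -> Matrix:
    return [[a[j][i] for j in range(4)] for i in range(4)]


def position_class(i: int, j: int) -> int:
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 1
    return 2


def forward_transform_quantize(x: Matrix, qp: int) -> Matrix:
    """Core transform followed by combined scaling and quantization, integers only."""
    w = matmul(matmul(CF, x), transpose(CF))
    qbits = 15 + qp // 6
    offset = 1 << (qbits - 1)  # assumption: round to nearest
    y = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            m = M_TABLE[qp % 6][position_class(i, j)]
            level = (abs(w[i][j]) * m + offset) >> qbits
            y[i][j] = level if w[i][j] >= 0 else -level
    return y
```

For a flat block, only the DC position of the output is non-zero, and its magnitude shrinks as QP grows, which illustrates both the energy compaction of the transform and the coarser quantization at higher QP.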
The H.264 standard allows rate control at the MB level, i.e. the quantization parameter
of each MB can be set by the encoder [19, 20]. QP for each MB is specified in the MB layer
using the delta_qp SE (mb_qp_delta in the H.264 syntax). The element delta_qp signals a change in the QP from its previous
value. Here, the previous value refers to the QP of the previous macroblock in decoding
order in the current slice (slice data consists of a series of macroblocks). If there is no
change in QP from the previous value, delta_qp is set to 0.
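The differential signaling of the MB-level QP can be sketched as below. This is a simplified illustration with hypothetical function names: the first MB of the slice is assumed to use the slice QP as its previous value, and the range limits on the delta and the conditions under which the element is actually present in the bitstream are ignored.

```python
from typing import List


def qp_deltas(mb_qps: List[int], slice_qp: int) -> List[int]:
    """Encoder side: delta against the previous MB in decoding order."""
    deltas = []
    prev = slice_qp
    for qp in mb_qps:
        deltas.append(qp - prev)
        prev = qp
    return deltas


def qps_from_deltas(deltas: List[int], slice_qp: int) -> List[int]:
    """Decoder side: accumulate the deltas to recover the per-MB QP."""
    qps = []
    prev = slice_qp
    for d in deltas:
        prev = prev + d
        qps.append(prev)
    return qps
```

A run of MB’s that share the same QP produces a run of zero deltas, so a QP change costs bits only at the MB where it occurs.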
2.4 Bitrate & complexity reduction techniques for video surveil-
lance
The increasing importance of high resolution surveillance camera footage has prompted
researchers to develop surveillance specific bitrate and complexity reduction techniques
[16, 21, 22, 23, 24, 25]. We review some of these methods in this section and briefly
describe advances we make to the existing techniques (detailed comparisons are provided
in later chapters). ‘Skip decision’, ‘Reference frame selection’ and ‘ROI coding’ based bitrate
reduction techniques are most relevant to this thesis. Hence, we present a more detailed
literature survey of these techniques.
2.4.1 Skip detection techniques
Although traditional compression techniques (e.g. DPCM/DCT based methods) remove
spatial and temporal redundancy in surveillance videos, a lot of unwanted information
continues to exist in the encoded bit stream. For example, the noise of the sensor in the
static background image regions leads to increased bitrates. To remove this redundancy,
researchers have proposed to encode only the foreground regions in high quality (i.e.
with low QP setting). Regions in the image which do not contain useful information are
either marked as Skip or are encoded in low quality. These methods can be classified as
follows:
• Segmentation based: Many researchers [26, 21, 22, 23, 24] use background subtraction
[27, 3, 28, 29] to segment the foreground objects. The background regions are
marked as skip or else encoded at low quality. A good comparison of popular and effective
methods for background subtraction can be found in [30]. In [22], Vetro et al.
propose an MPEG-4 based surveillance video coding scheme which utilizes a two stage segmentation
method to detect interesting objects in motion. The Gaussian Mixture Model
(GMM) based background subtraction algorithm followed by image correlation is used to
filter the video frame. However, image correlation computation costs are prohibitive to
implement on low power embedded platforms [31]. In [23], Chien et al. have proposed
a low complexity moving object detector for object based video encoders. The method
in [23] maintains a background frame in a buffer. The difference between all the pixels
in the current frame and the background frame is modeled using a Gaussian distribution.
A threshold is applied on the frame difference to classify a pixel as either foreground or
background. More recently, Jin et al. [24] have proposed a motion detection based ‘Skip’
scheme for H.264/AVC surveillance video coding. The method in [24] uses chrominance
features to decide whether a MB needs to be skipped or coded using mode decision. The
mean values of the pixel chroma components are initially used to detect foreground MB’s.
When the mean pixel chroma values are similar to those in the previous frame, individual
chroma components are compared and thresholds are applied to decide whether the MB
can be skipped. MB’s are skipped only when the coarse motion search vector is equal to
the PMV (predicted motion vector). In [32], Yang Yu et al. use a codebook based back-
ground segmentation algorithm to determine moving regions. MPEG-4 object coding is
used to encode the foreground objects. The background is encoded using MPEG-4 frame-
based coding. Shih-Chang Hsia et al. [33] propose a segmentation based technique to
perform MPEG-4 surveillance video encoding. The difference image between adaptively
chosen frames is used to determine the shape of the objects. Spatial processing is used to
refine the shape. The objects are encoded using MPEG-4 video object coding. Venkatesh
Babu et al. [34] determine foreground objects using background subtraction. Object-based
motion compensation is performed and the shape adaptive DCT coefficients of the
compensation error are computed. In [35], Hwangjun Song et al. perform frame differencing
followed by median & morphological filtering to determine the foreground regions.
In [36], Pierpaolo Baccichet et al. compute the Mean Absolute Difference (MAD) be-
tween the pixels in the filtered input frame and the previous encoded frame. The MAD
values are thresholded to determine the ROI MB’s. Ching-Yu Wu et al. [37] use back-
ground segmentation for traffic surveillance video encoding. In [38], Liu et al. use the
Mean Absolute Difference (MAD) (with MV set to zero) to determine the ROI. Thomas et
al. in [39] separately transmit the segmented background and foreground object images
along with watermarks. The receiver authenticates the images using the watermarked
data.
• RD cost based: Skip mode selection techniques proposed for generic video content have
been primarily based on thresholding of the RD (Rate-Distortion) cost [40, 41, 42]. In
[41], Zeng et al. determine the threshold using the value of QP. If the RD cost for Skip
mode is less than the value of the threshold, the MB is marked as Skip and other modes
are not processed. In [40], Kannangara et al. maintain a running estimate of the RD cost
for all the modes. If the RD cost estimate for the skip mode is less than all the other
estimates, the MB is marked as Skip.
• Motion Vector based: In [43], Kannur et al. group regions based on the MVs into (I)
Moving regions and (II) Static regions. The moving regions are further classified into
multiple regions depending on the motion magnitude. In [44], Hang Li et al. determine
moving regions based on the motion vector values, i.e. MB’s whose motion vector magnitudes are
high are considered as foreground. The moving regions are encoded at higher quality.
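Several of the methods listed above (e.g. [36, 38]) reduce to thresholding a per-MB difference measure between the current frame and a reference frame. The following is a minimal sketch of that common idea, not the procedure of any cited paper: the function names and the threshold are hypothetical, frames are plain lists of luma samples, and the MAD is computed with the motion vector set to zero.

```python
from typing import List

Frame = List[List[int]]  # luma samples, indexed as frame[row][col]


def block_mad(cur: Frame, ref: Frame, top: int, left: int, size: int = 16) -> float:
    """Mean Absolute Difference of one MB against the co-located MB (zero MV)."""
    total = 0
    for r in range(top, top + size):
        for c in range(left, left + size):
            total += abs(cur[r][c] - ref[r][c])
    return total / (size * size)


def classify_macroblocks(cur: Frame, ref: Frame, threshold: float,
                         size: int = 16) -> List[List[bool]]:
    """True marks a foreground (ROI) MB; False marks a candidate for Skip."""
    rows = len(cur) // size
    cols = len(cur[0]) // size
    return [
        [block_mad(cur, ref, r * size, c * size, size) > threshold for c in range(cols)]
        for r in range(rows)
    ]
```

A single fixed threshold of this kind is sensitive to sensor noise in the static background, which is one reason the segmentation based methods above model the background statistically instead.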
In this thesis, we use a segmentation based technique to determine the background
image regions. We advance the state of the art in two ways. We first develop a speeded up
GMM segmentation algorithm. Next, we combine this segmentation algorithm with a two
stage sampler to efficiently and accurately mark skip MB’s. We also propose to rearrange
data structures to improve cache performance. We test the speeded up GMM algorithm
on standard video datasets. The skip decision technique is tested using an exhaustive set
of surveillance videos that we have captured. The proposed speeded up GMM algorithm
provides speedup (over the GMM algorithm proposed by Zivkovic in [3]) in the range of
1.33 - 1.44 on the standard video datasets. The skip detection algorithm that we propose
provides bit rate savings of up to 94.5% and compression complexity reduction of up to
74.5% (over methods proposed in literature) without increasing the distortion over the
foreground regions.
2.4.2 Background reference frame selection techniques
• Standard non compliant techniques: In [45], Xianguo et al. divide the video sequence
into Super Group of Pictures (GOP). Mean shift is applied on a training set chosen from
the super GOP to generate a background frame. The difference between each input
frame and the background frame is computed and encoded using the H.264 standard.
The background frame for the super GOP is also encoded and transmitted. In [46], Xi-
anguo et al. extend this further by including a background difference prediction model.
They also derive criteria for a block to be predicted by either the short term reference
picture, the background picture or the background difference data. In [47], Manoranjan
Paul et al. determine the background frame using a Gaussian Mixture Model (they refer
to this image as the Most Common Frame in Scene or McFIS). McFIS is encoded as a con-
ventional I-frame. All the frames in the video sequence are encoded using inter coding.
The McFIS along with the previous frame is used as the reference set for inter coding.
In [48], Manoranjan Paul et al. avoid transmitting the McFIS to the decoder. The McFIS
is generated by the decoder independently using the same dynamic background model
used by the encoder. In [49], Totozafiny et al. propose a JPEG2000 standard based en-
coder for road surveillance. The encoder uses the static background as a reference frame.
Video frames from the camera are segmented to determine objects. The segmented data
and the reference frame are transmitted to the decoder. At the decoder, the ROI binary
mask is implicitly inferred using the Maxshift method which is part of the JPEG-2000
standard. The reference frame update procedure is performed across multiple frames.
In [50], Shumin Han proposes a background reconstruction based coding technique for
a moving camera. A panorama image of the background is generated. Feature point
pairs in the panorama and the current frame are detected. These point pairs are used
to estimate the global motion transformation matrix which is later used to reconstruct the
background image for the current frame. The background panorama is intra coded using
the MPEG-4 standard.
Although non compliant schemes simplify the encoder complexity, they cannot be easily
adopted into commercial products. The surveillance system comprises the camera,
network devices, server side software (for visualization and analytics), server side hardware
(for decoding) and server side storage. Surveillance installations typically procure
these components from different vendors. Hence, interoperability is a very important
issue. Increasing vendor compliance to the recent ONVIF and PSIA standards [51, 52]
also clearly shows that successful market adoption depends on standard compliance.
• Standard compliant techniques: The MPEG-4 standard [32, 33] has been used by previous
researchers for ROI video compression. Here, the background image is transmitted
using MPEG-4 frame-based coding. The foreground regions are encoded using MPEG-4
Visual object coding. However, recent video standards such as H.264 and HEVC are block
based and do not support object based video coding. Instead, they allow motion compen-
sation using multiple long term and short term reference frames stored in the decoded
picture buffer.
Research on long term reference frame selection for H.264 has proposed the use of high
quality long term reference frames (HQF’s) to improve RD performance on generic videos
[53, 54, 55]. Liu et al. [55] proposed a scheme to select the HQF’s for generic video
content based on the predicted error variance of the coded picture (with the HQF set as
the reference). Experimental results of all these methods showed that the PSNR (Peak
Signal to Noise Ratio) of the low quality frames improved by referencing the long
term high quality frames. However, in static camera surveillance encoders, the position
of the objects in the coded frame would be very different from the position in the long
term reference frames. Hence, coding quality of foreground objects in motion does not
benefit from high quality long term reference frames. However, the selection of long term
reference frames influences the cost to encode uncovered background regions. This has
been recognized by researchers and a few techniques to optimally select the reference
frames have been proposed in [56, 57, 58]. Xianguo Zhang et al. [58] have proposed
a background model based technique for an HEVC surveillance video encoder. A running
average algorithm is used to generate the background frame. The background picture is
encoded using intra prediction. The ‘no display’ option provided by the HEVC standard
is used to transmit the background picture. However, the bitrate of the encoded video
would increase due to this intra coded background frame.
Li et al. in [56] proposed a technique to select reference frames for a High Efficiency
Video Coding (HEVC) video encoder. The method in [56] utilized cloud compute re-
sources to perform optimal reference frame selection for offline coding of generic video
content. Li et al. [57] extended the work in [56] by developing multiple, low-complexity
algorithms in addition to a quality-adjustment scheme for generic video content. The first
technique introduced in [57] is called the ‘r×’ algorithm which is essentially a greedy
strategy. It relies on the assumption that if a picture (if marked as reference) does not
provide benefit to the encoding process of the current frame, then it would not do so for
the following frames as well. Since the computation cost of the r× algorithm is r times
the cost of a normal encoder, they propose two lower complexity algorithms called ‘1×’ and
‘2×’.
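Of the background generation methods mentioned above, the running average used in [58] is the simplest to state. The following is a minimal sketch, assuming an exponential running average with a hypothetical learning rate alpha; the exact update rule and rate used in [58] may differ.

```python
from typing import List

Image = List[List[float]]


def update_background(background: Image, frame: Image, alpha: float = 0.05) -> Image:
    """Blend the new frame into the background estimate, pixel by pixel."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]
```

A small alpha makes the estimate slow to absorb moving objects, at the cost of slower adaptation to genuine background changes such as lighting variation.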
In this thesis, we propose a standard compliant reference frame selection technique
for H.264 surveillance video coding. The ‘1×’ and ‘2×’ complexity algorithms in [57] are
most relevant to the technique that we propose in this thesis. Although the ‘2×’ algorithm
provides bitrate savings almost equal to that obtained by the proposed technique, its com-
putational complexity is significantly higher (it requires a second encode pass). The 1×
technique has reduced complexity (compared to the ‘2×’ algorithm) but it does not pro-
vide bitrate savings. In contrast, the proposed technique determines the optimum reference
frame with very low complexity (30 to 40 µs per frame). Also, since it avoids the coding of
uncovered background regions, additional compute savings are obtained. The proposed
technique reduces bit rate by up to 24.7% and execution time by up to 7.3%.
2.4.3 ROI coding techniques
QP’s for the MB’s inside the ROI can be determined using a static assignment procedure or
using a rate control model based algorithm. Researchers have used ROI coding for movie
content and video conferencing applications. A lot of these ideas are quite generic and
can also be applied to surveillance. Hence, we will also include such references which are
relevant in the context of surveillance. We group various ROI based coding techniques and
present them here. Grois et al. have provided a detailed overview of some of the recent
ROI coding techniques in [59].
• Object detection and/or tracking based methods: Pattern recognition techniques
have been used to perform MB/blob level object detection and tracking [26, 60]. One
of the early methods proposed for ROI coding [61] used block level frequency domain
features to classify ROI. In the first step, ROI region proposals were obtained. These
proposals were used to train a fine detail neural network classifier. A similar system
was also proposed in [62]. In [26], Lai-Tee Cheok et al. use a vehicle/person classi-
fier to detect pedestrians in the scene. The detected objects are tracked. The person
detector output is used to modulate the weights in a MB-level rate control equation.
Pedestrians are assigned higher weight and hence higher quality. In [60], Fernandez
et al. combine MB level background segmentation, temporal and spatial filtering,
MB clustering and tracking to determine ROI’s for surveillance videos. In [63, 64],
Christopher et al. use Viola Jones face detection [65] to detect faces in each frame. An
iterative mean shift based object tracker is initialized for each new detection. Detec-
tions which match state objects are used to update the object representations. In [66],
Ming-Chieh Chi et al. use face detection to mark ROI regions.
• Skin detection based techniques: These methods have been mostly proposed for
video conferencing applications. In [67], Yang Liu et al. use direct frame difference
and skin-tone classification to determine ROI’s for a video conferencing application.
They use a low pass filter to dilate the skin-tone area to accurately mark the ROI.
In [68], Shu-Fen Huang et al. propose a ROI video transcoder. ROI’s are deter-
mined based on the MV value and the skin pixel probability. Pixel level classification
of skin/non skin is done by thresholding CbCr values. In [69], Douglas Chai et al.
overcome limitations of color segmentation based skin detection by combining it with
probability based morphology and luminance regularization. The detected face re-
gions are encoded using the H.261 video encoder.
• Moving camera related techniques: Absence of a static background increases the
complexity of ROI detection. Researchers have proposed to determine the camera
motion and use it to improve the coding efficiency.
– Pan tilt cameras: In [70] Dalei Wu et al. jointly consider the video coding, trans-
mission and camera control tasks for a pan-tilt camera installed in a wireless
network. A network-delay aware Kalman filter based tracker is used to con-
trol the pan tilt camera. Different video coding parameters result in different
packet lengths and packet loss rates, which will lead to different amounts of
transmission-induced distortion. They determine the set of coding parameters that
optimize the expected distortion.
– Aerial platforms [71, 72, 73]: In [72], Holger Meuel et al. perform global mo-
tion estimation using the Harris corner detector and KLT (Kanade-Lucas-Tomasi)
tracker. New areas are determined by global motion compensation. Projection
parameters are computed and used to align two frames. Regions in the current
frame, which are projected outside the previous frame, are detected as new ar-
eas (ROI-NA) and need to be encoded and transmitted. The difference image
between the current frame and the motion compensated frame is used to detect
moving objects in the scene.
• Surveillance Operator controlled methods: Several researchers propose to allow
the surveillance operator to control the ROI for video compression [74, 75, 76, 77]. In
[75], Mavlankar et al. propose an encoder which allows the user to define the region
of interest. The frame is divided into multiple slices, hence allowing transmission
of only the ROI. A temporal median filter is used to obtain the background frame
which is later intra coded. The optimal slice size is also determined. ROI can also
be determined based on the actions of the operator. In [71], Hui Cheng proposes an
algorithm which analyzes the camera operations such as pan, tilt and zoom control
performed by the operator. Based on this the ROI regions are marked.
• Saliency based techniques:
– Frame center ROI [44, 38]: In [44], Hang Li et al. propose to enhance the perceptual quality
of a video conferencing system. The central region of the frame is encoded at
higher quality since it is more important than the marginal regions.
– Eye tracker based: Fadi Boulos et al. [78] determine ROI’s using an eye tracker.
Fixation duration and fixation velocity parameters are thresholded to determine
salient regions. Such regions which are viewed by multiple viewers are marked
as the ROI.
– Saliency model based: With the development of saliency models [79], researchers
have proposed to use them to determine regions of interest for bit allocation
[80, 81, 82]. In [81], Laurent Itti et al. use saliency based attention predic-
tion to detect interesting regions in the video. Saliency of the image region is
used to determine the bit allocation. They show improvements of more than 2 dB
(eye-tracking-weighted PSNR or EWPSNR measure of subjective quality). How-
ever, for surveillance specific coding, the salient regions are the pedestrian face &
the vehicle number plate. Hence, ROI detectors based on object representations
would perform better than saliency model based techniques.
• FMO based techniques: These methods [60, 43, 36] detect ROI objects and encode
them using different slice groups. Flexible Macroblock Ordering (FMO) supported by the
H.264 standard is utilized to achieve this. In [43], Kannur et al. utilize the ‘explicit
slice group ordering’ option in FMO to define slice groups (SG). Here, different slices
correspond to groups of MBs with different motion properties. These SG’s are coded
with different quality.
• Error resilience and encryption: Detection of ROI’s enables the encoder to increase
resilience to data communication errors. The encoder can provide unequal protection
to the MB’s depending on their importance (e.g. face image regions of a pedestrian
are most important in surveillance). In [78], Boulos et al. encode ROI MB’s using the
Intra mode to reduce propagation of errors. Andreas Unterweger et al. [83] study the
impact of slice group coding on post-compression encryption for surveillance appli-
cations. In [84], Sourabh Khire et al. propose to use multiple down-sampled repre-
sentations to improve burst error resiliency. The technique ensures that errors due to
a burst loss do not impair co-located frames of all the representations. This allows
the receiver to conceal the error and improve the picture fidelity.
• Rate control for ROI encoding: When transmission bandwidth of the wired / wire-
less network drops, the surveillance video encoder will need to either reduce the
frame rate or increase the quantization parameters of MB’s in the frame. Video en-
coders employ rate control algorithms [35, 37, 43, 85, 38, 86, 87, 88, 89] to determine
these parameters. In an ROI encoder, the rate control algorithm can preferentially al-
locate bitrate so that the most important regions in the image (e.g. faces in pedestrian
surveillance) are compressed with high fidelity. In [35], frame-layer and macroblock-
layer rate control is performed using a moving-region-weighted MSE based distortion
model. In [38], Liu et al. assign more bits to MB’s which have higher MAD values.
Yu Sun et al. [89] propose a joint source-channel region based MB level rate
control algorithm for wireless video transport. Bitrate allocated to ROI is higher than
that assigned to non-ROI MB’s. In [90], Chung-Ming et al. use motion detection
and tracking to determine the foreground objects. The background quality is set to
a low value. The neighbours of ROI MB’s are considered as ROI-contour extensions
and are coded at a slightly higher quality. ROI MB’s are given highest priority and
coded with highest quality. Appropriate QP values for the ROI, ROI contour extensions
and background are determined. A recently proposed rate control approach in [91],
although not directly related to ROI encoding, is very interesting in the context of
surveillance applications. Here, the rate control algorithm preserves image features
that are required to perform computer vision tasks such as image retrieval.
• Commercial systems: Having realized the immense benefits of ROI detection and
coding, commercial vendors have included similar techniques in their cameras. The
DINION and FLEXIDOME HDR cameras from Bosch [92] detect objects such as faces,
people and vehicles and control the imager settings (e.g. auto exposure) to ensure
that high picture quality of the objects is obtained. Sony cameras [93] allow operators
to select the portion of an image they want to monitor in 4K resolution. The rest of
the image is streamed in lower resolution. Axis Zipstream technology [94] proposes
to dynamically determine ROI’s to preserve forensic details such as faces and tattoos.
The VideoBANDIT suite [95] by General Dynamics uses Dynamic region-of-interest
coding to transmit video over ultra low bandwidth communication links.
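The three-level quality assignment described under rate control above (ROI, ROI-contour extension, background) can be sketched as a static QP map. This is only an illustration of the idea: the function name and the three QP values are hypothetical, the contour extension is taken to be the 8-connected neighbourhood of ROI MB’s, and a rate control model, as used in [90], would choose the QP values adaptively.

```python
from typing import List


def roi_qp_map(roi_mask: List[List[bool]], qp_roi: int = 24,
               qp_contour: int = 28, qp_background: int = 36) -> List[List[int]]:
    """Assign a QP to every MB: lowest QP for ROI, intermediate for its neighbours."""
    rows, cols = len(roi_mask), len(roi_mask[0])

    def has_roi_neighbour(r: int, c: int) -> bool:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols and roi_mask[rr][cc]:
                    return True
        return False

    qp_map = []
    for r in range(rows):
        row = []
        for c in range(cols):
            if roi_mask[r][c]:
                row.append(qp_roi)
            elif has_roi_neighbour(r, c):
                row.append(qp_contour)
            else:
                row.append(qp_background)
        qp_map.append(row)
    return qp_map
```

The intermediate QP ring softens the visible quality step at the ROI boundary, and the resulting map can be turned into per-MB QP deltas in decoding order as described in Section 2.3.3.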
In this thesis, we combine multiple, low and mid level detectors to compute regions
of interest. We also integrate a tracker to reduce the computational complexity. We show
that object detection using only skin and face detection does not provide good ROI seg-
mentation. Hence, the proposed technique uses multiple visual cues to accurately mark the
regions of interest. The proposed scheme provides bitrate reduction of up to 50.2% over the
x264 video encoder. In the context of bitrate reduction under limited compute capability,
the possibility of an optimal processing order of the blobs in a video sequence is suggested.
The complete implementation of a complexity control mechanism for pedestrian ROI video
coding is left to future work.
While the proposed system assumes a static camera to perform foreground segmen-
tation, the techniques can be combined with image registration and adopted in Pan-tilt
surveillance cameras. Surveillance cameras mounted on drones present several challenges
(e.g. rolling shutter correction, image stabilization, limited compute and energy resources)
which we leave to future work. Operator controlled techniques to determine ROI (e.g. faces
and vehicle number plates) are not scalable. Hence, we suggest using operator assistance
only to mark regions in the images which are guaranteed to have no interesting content.
Infrequent tasks such as marking of control points required to compute the scene geome-
try can be performed by the operator. Saliency models are more suited for movie content
than video surveillance and hence are not discussed further in this thesis. Also, we have
not explored using FMO since it is not well supported by commercially available decoders.
The proposed face ROI encoder can be used to improve error resilience. The proposed
techniques will also need to be combined with rate control algorithms to enable surveil-
lance video streaming over limited bandwidth networks. We briefly discuss these ideas in
Chapter 5.
2.4.4 Mode decision and motion estimation related techniques
Unique characteristics of surveillance videos can be used to reduce the mode decision and
motion estimation complexity of the video encoder. Tong Gan et al. [96] propose a fast
H.264/AVC mode decision scheme for tunnel traffic surveillance where flashing lights pose
significant challenges. Significant change of luminance levels and appearance of new ob-
jects in the scene cause large number of MB’s to be intra coded. Such MB’s are usually
clustered. Hence, if three or more neighbours of a MB are coded as Intra, then the mode
for the MB is also marked as Intra. This reduces the ME compute cost of such MB’s. In [26],
Cheok et al. use a person detector to detect pedestrians in the scene. When a scene change
occurs, MB’s which contain people are encoded using inter prediction.
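The neighbour-based Intra rule of [96] described above can be illustrated with a short sketch. It is only an approximation of the idea, under the assumption that the "neighbours" are the four causal MB's (left, top-left, top, top-right) that have already been coded; the function name `force_intra` and the map layout are hypothetical.

```python
# Offsets (row, col) of the causal neighbours: left, top-left, top, top-right.
# Restricting the test to causal neighbours is an assumption of this sketch.
CAUSAL_NEIGHBOURS = ((0, -1), (-1, -1), (-1, 0), (-1, 1))


def force_intra(intra_map, row, col, min_intra=3):
    """Return True if the MB at (row, col) can be marked Intra without ME.

    intra_map[r][c] is True when the already coded MB at (r, c) was Intra.
    """
    cols = len(intra_map[0])
    count = 0
    for dr, dc in CAUSAL_NEIGHBOURS:
        r, c = row + dr, col + dc
        # Skip neighbours that fall outside the frame.
        if r >= 0 and 0 <= c < cols and intra_map[r][c]:
            count += 1
    return count >= min_intra
```

When the function returns True, the encoder can skip the motion search for that MB, which is where the compute saving comes from.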
Muhammad Akram et al. [97, 98, 99] propose three different Motion Estimation (ME)
techniques for surveillance encoders: (I) Selective ME - search is performed only on frames
which have some activity (II) Tracker based ME - Surveillance video tracker results are
used to perform ME (III) Multi frame ME - difference between current and previous refer-
ence frames is computed. Pixel locations where difference is non zero are considered as
candidate locations for matching blocks in the current reference frame.
In moving camera platforms, global motion estimation has been used to reduce ME
computation cost. In [100], Guili Xu et al. propose a block-based motion estimation tech-
nique for pan tilt cameras. Global motion is estimated and Kalman filtering is used to
determine the MV’s. Computation time reduction of about 95% is achieved. In [73, 101],
global motion estimation is applied to video captured by a quadrocopter mounted cam-
era. Global motion compensation based on the projective transform is performed. Only the
blocks which contain moving object detections are coded. At low altitudes, the projective
transform is replaced by a mesh-based global motion compensation technique.
In this thesis, background and uncovered background MB’s are reconstructed from image
content in the DPB. Motion estimation on these MB’s is not required. Hence, we obtain
significant computational savings of up to 74.5% without affecting the foreground image
quality.
2.4.5 Hardware related advancements
Stolberg et al. [102] have developed a single chip solution for surveillance applications. The
chip consists of three cores which can together perform MPEG-4 encoding and object track-
ing. The first core with a 16-datapath SIMD array is optimized for image and general digital
signal processing tasks. The second core is designed to perform macroblock processing for
the video encoder. The third core performs bit stream processing and, together with the
MB processor, implements the encoder. The object tracking algorithm is implemented
on the DSP core.
Researchers have integrated many of the necessary surveillance related data processing
elements into the focal plane. Pixel-level capacitors or photo-diode devices are used as
storage elements of the image. In [103], Chi et al. describe a capacitive motion detection
circuit which is built into the pixel. This work was extended further in [104] by using an
18.5 MHz micro-controller which computes the fast binary DCT of image blocks. With a
compression ratio of about 48:1, surveillance events of interest can be discerned. Bo Zhao
et al. [105] integrate on-chip moving object detection and localization capabilities into a
64 × 64 CMOS image sensor. A clustering algorithm is implemented in the image sensor
chip itself. The algorithm can localize up to three moving objects in the scene. Region of
interest picture capture is also supported. In [106], Mizuno et al. utilize the photo-diode
array itself as a frame memory. Ming Zhang et al. [107] propose two CMOS-based motion
detection circuits to perform ROI detection. Nicola Massari et al. [108] have demonstrated
edge detection, motion detection, image amplification, and dynamic-range boosting oper-
ations using pixel level analog processing. The imager uses switched capacitor techniques
to perform the image processing operations over a kernel of 3 × 3 pixels. Several research
groups [109, 110, 111, 112, 113] have integrated compression algorithms on the image
sensor. A comprehensive review of image sensors with on-chip image compression is avail-
able in [114].
Although this thesis does not implement specialized hardware elements, the proposed
techniques indicate several details which can be used to improve the performance of surveil-
lance specific hardware systems. In Chapter 3, we describe a speeded up GMM algorithm
that replaces expensive floating point computation with integer operations. The technique
provides speedup of up to 44% (over techniques proposed in literature) and also reduces
the memory bandwidth by a minimum of 16% for multimodal pixels. In Chapter 4, we
have optimized the cache performance of the sampler based skip detection algorithm. Results
show up to 12.3% reduction in execution time and 30.2% reduction in Last Level Cache
(LLC) references.
2.4.6 Distributed video coding based techniques
Puri and Ramchandran [18] propose a distributed coding architecture called PRISM which uses
channel coding concepts to shift the motion estimation complexity from the encoder side
to the decoder. The quantized codeword space of the input data is partitioned and the
syndrome of the quantized data is transmitted. Motion estimation is not performed at the
encoder. At the decoder, motion search is performed to obtain candidate predictors. The
decoder recovers the data using the received syndrome and the candidate predictor as side-
information. Encoding performance is shown to be between that of inter and intra coding
modes of H.263+. Chuohao Yeo et al. [115] extend PRISM to support multi-view video
compression. The key idea is to use predictors from other views when few predictors are lost
due to packet drops. If the block to be reconstructed is visible in the other view, its predictor
is used. In this method, the cameras do not need to know about the geometry/positions of
any other image sensors in the scene.
Liu et al. [16] proposed a surveillance specific Wyner-Ziv encoder in which intra frames
used in traditional Wyner-Ziv coding were replaced by backward predictively coded frames
(BP frames). However, as observed by Girod et al. in [17], distributed video coding al-
gorithms continue to lag behind conventional video coding schemes in rate-distortion per-
formance. [16] also requires a backward channel which will prevent adoption in cameras
which store the video in a local memory device. Video compression standards committees
and the surveillance industry also have not adopted these techniques. Hence, we do not
apply distributed coding techniques in this thesis.
2.4.7 Wireless and/or Remote surveillance specific techniques
Wireless commercial systems for homes that run on batteries have become very popu-
lar [116, 117]. They are triggered by motion and offer a complete remote video monitoring
solution. Wireless and remote surveillance systems operate under severe energy and band-
width resource constraints. The communication channel is also error prone. Some extreme
examples of such remote deployments have been described in [118, 119]. Lijuan and Qiang
propose a video surveillance system to monitor a large scale wind farm [118]. Carl Hartung
et al. [119] use a web camera and satellite communication to monitor weather conditions in
rugged wildland fire environments. Yun Ye et al. [120] provide a detailed survey of wireless
surveillance research.
Such systems can be classified as (I) Event driven and (II) Continuous transmission.
Event driven systems use passive infrared (PIR) sensors or low level image processing
to detect motion. When an event is detected, the video encoder is woken up to
encode the video. Lee et al. [21] have used background subtraction as an event detector
for surveillance. A scheduler for the encoder configurations is also proposed to determine
the optimal settings based on the estimate of future events and remaining battery charge.
In [121], Jongpil Jung et al. do not turn off the image sensor. Instead they continuously
capture images, encode them using a JPEG encoder and store them in DRAM. The system
is designed to store up to 10 s of surveillance video. When an event is detected, the
stored sequence of JPEG frames is transcoded to H.264 and transmitted over the wireless
channel.
Optimal allocation of system resources, i.e. energy and communication bandwidth, is critical
in continuous transmission wireless surveillance systems. Hence, several researchers
have studied cross-layer control (radio, encoder, imager, PTZ actuator) of such systems
[122, 70, 123, 124, 125, 126]. In [70], Dalei Wu et al. jointly optimize the video coding
quality, transmission bandwidth and camera control for a resource constrained pan-tilt wire-
less surveillance system. Zhihai He et al. [125] develop an analytic Power-Rate-Distortion
(P-R-D) model of the video encoder. The model is used to study the optimum power al-
location between video encoding and wireless transmission. In [126], Malisa Marijan et
al. determine the optimal power allocation among the image sensor, compression, and
transmission modules. A sigma-delta image sensor that allows easy control of P-R-D perfor-
mance of the imager is used. The distortion of the video is minimized under power budget
constraints.
In this thesis, we propose a continuous transmission surveillance system which is more
suited for cities and towns. The proposed skip decision method marks background regions
as skip. The bit cost of marking MB’s as skip is very small and hence the proposed technique
provides bitrate reduction of up to 94.5% (over techniques proposed in literature). Further,
we also propose a face ROI video encoding technique that provides up to 50.2% bitrate
reduction in pedestrian surveillance videos.
2.5 Summary
We have described different video compression techniques used for surveillance. The H.264
standard, which we adopt to implement our proposed techniques, has been described. In
particular, reference frame, skip mode and QP related aspects which are relevant to this
thesis have been discussed in detail. Also, we have categorized and presented the various
techniques proposed in literature. Skip detection, reference frame selection and ROI video
coding techniques have been discussed in more detail since they are more relevant to the
thesis. We have also briefly described some of the mode decision, motion estimation, dis-
tributed coding, wireless surveillance and hardware related schemes in literature that are
specific to video surveillance. We now provide a brief summary of advancements that the
proposed techniques achieve over existing systems.
The proposed speeded up GMM algorithm uses windowed weight updates to reduce
floating point complexity. It provides speedup of 1.33 - 1.44 (over Zivkovic [3]) on standard
video datasets without affecting segmentation accuracy. The speeded up GMM algorithm
is combined with a low computational complexity, sampler based skip detection scheme
to accurately determine skip MB’s. The skip detector combines stratification and adaptive
sampling techniques to achieve up to 94.5% bit rate reduction and 74.5% computational
complexity reduction. We also propose a very low complexity reference frame selection
technique for H.264 video surveillance encoding. Results show that the proposed reference
frame selection method reduces bit rate by up to 24.7% and execution time by up to 7.3%
(compared to the 1× algorithm [57]). Finally, we have proposed a face ROI encoding tech-
nique for pedestrian video surveillance. We combine multiple, low & mid level visual cues
using Bilattice logic to accurately determine face ROI’s. Face image regions are encoded in
high quality. Non-face regions are encoded in lower quality and the shadow regions are
marked as skip. The proposed technique has been integrated into the x264 video encoder.
Experiments show bitrate savings of up to 50.2%.
Chapter 3
Speeded up GMM Algorithm for
Background Subtraction
3.1 Introduction
Background subtraction is often the first step in static camera video surveillance applica-
tions. It reduces the computation required by the downstream stages of the surveillance
pipeline which usually comprises video coding, object detection and tracking. Consequently,
it constitutes the most active/resource demanding stage of the surveillance pipeline
since it processes each incoming pixel in the video stream. In Chapter 4, we propose a skip
selection scheme for an H.264 surveillance video encoder. Here, MB’s which contain only
background image content are marked as Skip. This reduces the bitrate of the encoded
video stream. However, accurate detection of foreground regions is essential to prevent
Skip-coding of the objects in the scene.
Shaking trees and foliage, sunlight intensity changes due to active cloud motion, and rain
have been the main causes of reduced accuracy in simple background subtraction algorithms.
The Gaussian mixture model (GMM) scheme proposed by Stauffer and Grimson [27] has
been one of the most successful techniques that works well in such uncontrolled outdoor
environments. However, the original GMM algorithm [27] suffered from slow learning
rates during the initial phase. KaewTraKulPong and Bowden [127] corrected this using a
two stage learning scheme where the GMM is updated initially using the sufficient statistics
based equations and is later switched to an ‘L-recent window’ version. Lee [2] further
improved upon this by using a modified schedule that gradually switches between the two
update modes.
Chapter 3. Speeded up GMM Algorithm for Background Subtraction 44
The good accuracy of the GMM approach comes at the cost of significantly high compu-
tation and memory bandwidth requirements. Benezeth [30] found that the GMM algorithm
with 3 modes is about 3.7 times slower compared to the single Gaussian scheme. Zivkovic
[3] described a significant improvement to reduce the computation time and memory band-
width. He formulated a Bayesian approach to select the required number of Gaussian modes
for each pixel in the scene. In scenes with static background (traffic sequence in their pa-
per), this approach assigns a single mode Gaussian to model most of the pixels which helps
to reduce average processing time by 32%. However, in the outdoor video (trees sequence),
results show only a 2% improvement since a significantly large portion of the scene requires
a multi-modal model.
Although real time performance of the adaptive GMM scheme has been demonstrated
on native PC’s, an increasing demand to move the analytics onto the camera itself requires
embedded platforms with low compute resources to support the algorithms. In this chapter,
we propose an orthogonal approach that provides computation time reduction by mini-
mizing floating point computations. We also combine the fast learning of [2] with the
automatic selection of number of modes in [3] to obtain a highly efficient and accurate
scheme.
In the next section, we review the modification proposed by D.S. Lee followed by a
very brief description of the improved AGMM algorithm proposed by Zivkovic. We refer
the reader to [2] and [3] for a detailed discussion of the algorithms. In Section 3.3, we present
our proposed modification to the GMM algorithm that significantly reduces the computation
time. Detailed experimental results of the proposed algorithm are discussed in Section 3.4.
3.2 Gaussian mixture model
3.2.1 Adaptive Mixture Learning with fast convergence
Each pixel in a frame is modelled using a Gaussian mixture model (GMM). The parame-
ters of the Gaussian mixture model (usually with 3 modes) are estimated using an online
version of the EM algorithm. To prevent the foreground pixels from corrupting the back-
ground model, Stauffer et al. [27] proposed to use modes with low weights to model the
foreground. The description of the algorithm is given below:
The modes of the GMM are arranged in decreasing order of their weights. A predefined
fraction of the weights is used to determine the modes that model the background. This
favors modes with higher weight to be selected as the background. A match of an incoming
pixel to any of the modes is defined to occur if the Mahalanobis distance from the pixel is
less than a predefined threshold Tσ. If the match occurs on one of the background modes,
the pixel is labelled as background, else it is classified as foreground. The following update
equations are applied for the parameters of the Gaussian mode ‘k’ with the highest weight
that matched the incoming pixel x(t):
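\[ w_k(t) = (1-\alpha)\, w_k(t-1) + \alpha \tag{3.1} \]
\[ \mu_k(t) = (1-\eta_k)\, \mu_k(t-1) + \eta_k\, x(t) \tag{3.2} \]
\[ \sigma_k^2(t) = (1-\eta_k)\, \sigma_k^2(t-1) + \eta_k \left(x(t) - \mu_k(t-1)\right)^2 \tag{3.3} \]
where ηk is the adaptive learning rate given by:
\[ \eta_k = \frac{1-\alpha}{c_k} + \alpha \tag{3.4} \]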
ck is a counter which is maintained independently for each mode. Its value is initialized
to 1 for a new mode and is incremented whenever a match with an incoming pixel occurs.
ηk is a parameter that controls the learning rate of the modes.
As can be observed from Eq. (3.4), the learning rate is initially set to match the sufficient
statistics based update. As time progresses, it converges to a L-recent window based update
mode with a fixed learning rate of α. The weight for the remaining modes is updated using
Eq. 3.5.
\[ w_k(t) = (1-\alpha)\, w_k(t-1) \tag{3.5} \]
If none of the modes match, then the mode with the least weight is replaced with a new
mode having low initial weight and large variance.
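The per-pixel updates of Eqs. (3.1) to (3.5) can be sketched as follows for a scalar (grayscale) pixel. This is a minimal illustration rather than the implementation used in [2]; the dictionary layout and the function names are assumptions of the sketch.

```python
import math


def learning_rate(c_k, alpha):
    """Adaptive learning rate of Eq. (3.4): close to 1/c_k early on, alpha later."""
    return (1.0 - alpha) / c_k + alpha


def matches(mode, x, t_sigma):
    """Match test: distance of x from the mode mean, in standard deviations."""
    return abs(x - mode["mu"]) < t_sigma * math.sqrt(mode["var"])


def update_matched_mode(mode, x, alpha):
    """Apply Eqs. (3.1) to (3.4) to the mode that matched pixel x."""
    mode["c"] += 1                      # match counter c_k
    eta = learning_rate(mode["c"], alpha)
    diff = x - mode["mu"]               # uses mu_k(t-1), as in Eq. (3.3)
    mode["w"] = (1.0 - alpha) * mode["w"] + alpha               # Eq. (3.1)
    mode["mu"] = (1.0 - eta) * mode["mu"] + eta * x             # Eq. (3.2)
    mode["var"] = (1.0 - eta) * mode["var"] + eta * diff * diff  # Eq. (3.3)


def decay_unmatched_mode(mode, alpha):
    """Apply Eq. (3.5) to a mode that did not match the pixel."""
    mode["w"] = (1.0 - alpha) * mode["w"]
```

For a freshly created mode (c_k = 1), the first match gives ηk ≈ 1/2, so the mean becomes roughly the average of the two samples seen so far. This is the sufficient statistics behaviour that speeds up the initial learning.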
3.2.2 Automatic selection of number of components
In the GMM algorithm described above, the weights of the Gaussian mixture represent
the fraction of the data samples ‘x(t)’ that belongs to the particular mode in the model.
Defining nm to represent the number of samples that belong to the mth mode, the weights
of the GMM can be considered to define a multinomial distribution for the nm’s. Instead
of using the ML estimate that results in the original GMM update equation, Zivkovic used
a Dirichlet prior with negative coefficients. This is done with an intention of accepting a
class only if there is enough evidence from the data samples for the existence of the class.
Solving for the MAP (maximum a posteriori) estimate, the final adaptive update
Eqs. (3.1) & (3.5) are modified as follows:
\[ w_k(t) = (1-\alpha)\, w_k(t-1) + \alpha - \alpha c_T \tag{3.6} \]
\[ w_k(t) = (1-\alpha)\, w_k(t-1) - \alpha c_T \tag{3.7} \]
cT is a parameter that represents the minimum fraction of samples required to support
the existence of a mode (set to be equal to 0.01 in [3]). We need to normalize the weights
after each update so that they add up to one. The modes whose weights become negative
are discarded. New modes are initialized with mean set to be equal to the pixel values that
didn’t match any of the existing modes. The variance is initialized to a large value. The
mean and the variance updates are similar to Eqs. (3.2) & (3.3) with ηk defined to be equal
to α/wk(t) instead of (3.4). This division by the weight significantly improves the learning
rate compared to [27], however as Lee mentions in [2], ηk is unbounded and hence might
lead to divergence.
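The weight update of Eqs. (3.6) and (3.7), together with the pruning and normalization steps described above, can be sketched as below. The order in which pruning and normalization are applied, and the function name, are assumptions of this sketch and are not taken from [3].

```python
def zivkovic_weight_update(weights, matched, alpha, c_t=0.01):
    """One per-frame weight update with the decay term alpha * c_T.

    weights: list of mode weights; matched: index of the matched mode.
    Returns a list of (original_index, new_weight) for the surviving modes.
    """
    updated = []
    for k, w in enumerate(weights):
        ownership = 1.0 if k == matched else 0.0
        # Eq. (3.6) for the matched mode, Eq. (3.7) for the others.
        updated.append((1.0 - alpha) * w + alpha * ownership - alpha * c_t)
    # Modes whose weight became negative are discarded.
    kept = [(k, w) for k, w in enumerate(updated) if w >= 0.0]
    total = sum(w for _, w in kept)
    if total <= 0.0:
        return []
    # Renormalize so that the surviving weights add up to one.
    return [(k, w / total) for k, w in kept]
```

A mode that never matches loses at least α·cT per frame, so modes without supporting evidence are eventually removed. This is how the number of modes adapts per pixel.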
3.3 Proposed Algorithm
From the description provided in sections 3.2.1 & 3.2.2 we list the main steps involved in
the GMM algorithm as: (A) Sort the Gaussian modes (B) Match the pixel to the modes and
(C) Update the parameters of the modes.
We also note 4 observations in [2] and [3] that suggest our modification:
1. The weights of the Gaussian modes change slowly, with a time constant of roughly
1/α, which is typically of the order of a few hundred frames
2. The set of background modes also doesn’t change rapidly since the weights change slowly
3. The mean and variance update Eqs. (3.2) & (3.3) are independent of the weight
values
4. A newly formed mode takes a minimum of a few tens of cycles to be removed. For
cT = 0.01 in [3], it takes of the order of 1/cT frames (100 frames) for a newly
formed mode to be removed in the case where none of the pixels match that mode.
Based on the above observations, we propose to update the weights only once in Tw
frames where Tw is a constant set to be equal to 16. The choice of Tw is
discussed in the results section. We refer to Tw as the ‘weight update interval’. The set of
modes that belong to the background are also determined only once in Tw frames. New
modes are allowed to be created for all the frames. However, we determine mode deletions
only once in Tw frames. We refer to the cycle when we perform the true weight update as
the ‘fine update cycle’. To ensure that learning is unaffected, we need to perform accurate
weight updates based on the values of the pixels in the past Tw frames. We enable this
by using a low resolution (4 bit) integer counter to count the number of matches to a
mode that occurs in the Tw frames. The counter values are used to perform an accurate
update during the ‘fine update cycle’ using modified equations derived below. Since we
now update the floating point weight values and determine the set of background modes
only during the ‘fine update cycle’ (once in Tw frames), the computational complexity is
reduced. Expensive floating point computations during the remaining Tw − 1 cycles are
replaced by simple integer increment operations. The derivation for the weight update is
provided now:
The Gaussian mixture distribution of the pixel x (we have dropped the time index t
here only for the sake of clarity) formulated in terms of discrete latent variables z is shown
in Eqn. 3.8 [128]. Here z is a K dimensional binary random variable having a 1-of-K
representation (z = [z1, z2, ..., zK]^T). zk = 1 indicates that the pixel x was generated from
the kth mode of the mixture model.
\[ p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(x \mid \mu_k, \sigma_k^2\right) \tag{3.8} \]
From the EM algorithm, the weight update at time instant t is given by Eqn. 3.9 [128].
γ (zk (i)) is the posterior probability of zk (i) = 1 (i is the time index) once we have observed
the incoming pixel x(i). γ (zk (i)) can also be interpreted intuitively as the responsibility that
the mode k takes to explain away the pixel data at time instant i.
\begin{align}
w_k(t) &= \frac{\sum_{i=1}^{t} \gamma(z_k(i))}{t} \tag{3.9} \\
&= \frac{\sum_{i=1}^{t-T_w} \gamma(z_k(i)) + \sum_{i=t-T_w+1}^{t} \gamma(z_k(i))}{t} \tag{3.10}
\end{align}
Stauffer et al. [27] proposed to set γ(zk(i)) to 1 for the mode k which matched the
incoming pixel. The responsibility for other modes is set to 0. Let Nk denote the number of
times the incoming pixel matched the mode k in the last Tw frames, so that the second sum
in Eqn. 3.10 equals Nk. By Eqn. 3.9 applied at time t − Tw, the first sum equals
(t − Tw) wk(t − Tw). Eqn. 3.10 can therefore be rewritten as shown in Eqn. 3.11.
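\begin{align}
w_k(t) &\approx \frac{(t-T_w)\, w_k(t-T_w) + N_k}{t} \tag{3.11} \\
&\approx \left(1 - \frac{T_w}{t}\right) w_k(t-T_w) + \frac{N_k}{t} \tag{3.12}
\end{align}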
(1/t) is set equal to α as proposed in [27]. We also add the weight decay term from [3]
to obtain the final weight update in Eqn. 3.13:
\[ w_k(t) = (1 - T_w\alpha)\, w_k(t-T_w) + N_k\,\alpha - T_w\,\alpha\, c_T \tag{3.13} \]
In Fig. 3.1, the weight update is plotted using the original GMM update equation and
the proposed method for the case where the weights are monotonically increasing and
decreasing with Tw = 16. The modified weight update is also shown applied to a realistic
scenario in Fig. 3.2. Here data points generated from a synthetic distribution with mass
function = [0.7, 0.25, 0.05] are used to update the weights using both the original GMM
Eqs. (3.6) & (3.7) and the proposed update Eq. (3.13). The initial weight is set arbitrarily
to [0.4, 0.4 ,0.2]. We observe that the learning rate and weight values obtained using
the proposed technique matches well with those obtained from the original GMM update
equations.
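The comparison described above can be reproduced in outline with the following sketch, which draws matches from the mass function [0.7, 0.25, 0.05], starts from the weights [0.4, 0.4, 0.2], and tracks the largest gap between the per-frame update (Eqs. (3.6) & (3.7)) and the windowed update (Eq. (3.13)). Weight normalization and mode deletion are omitted for brevity, and the function names and random seed are assumptions of this sketch, so it is not the exact experiment behind Fig. 3.2.

```python
import random


def per_frame_update(weights, matched, alpha, c_t):
    """Eqs. (3.6)/(3.7): update every weight once per frame."""
    return [(1 - alpha) * w + alpha * (k == matched) - alpha * c_t
            for k, w in enumerate(weights)]


def windowed_update(weights, counts, alpha, c_t, t_w):
    """Eq. (3.13): one update per T_w frames using the integer match counts."""
    return [(1 - t_w * alpha) * w + n * alpha - t_w * alpha * c_t
            for w, n in zip(weights, counts)]


def simulate(num_frames=4000, t_w=16, alpha=0.004, c_t=0.01, seed=0):
    rng = random.Random(seed)
    pmf = [0.7, 0.25, 0.05]
    w_ref = [0.4, 0.4, 0.2]   # updated every frame
    w_win = [0.4, 0.4, 0.2]   # updated once per window
    counts = [0, 0, 0]
    max_gap = 0.0
    for t in range(1, num_frames + 1):
        k = rng.choices(range(3), weights=pmf)[0]   # index of the matched mode
        w_ref = per_frame_update(w_ref, k, alpha, c_t)
        counts[k] += 1                              # integer-only work
        if t % t_w == 0:                            # fine update cycle
            w_win = windowed_update(w_win, counts, alpha, c_t, t_w)
            counts = [0, 0, 0]
            gap = max(abs(a - b) for a, b in zip(w_ref, w_win))
            max_gap = max(max_gap, gap)
    return w_ref, w_win, max_gap
```

Calling `simulate()` returns both weight vectors and the largest gap observed at the fine update cycles, which gives a quick numerical check that the two updates track each other.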
The complete pseudo code is described in Algorithm 1. Here maxModeFlag is used to
indicate that the mode with least weight was replaced during the Tw window. This is used
during the fine update cycle to reset the true weights of that mode. BG is a set that contains
the list of modes that belong to the Background model. This set is updated during the ‘fine
update cycle’. The integer weight counters are represented by weightCount_i. The ‘fine
update cycle’ is staggered across pixels and in time such that only npixels/Tw pixels receive
the fine update during each cycle (where npixels is the number of pixels in the frame). This
ensures that the processing time is uniform across all the frames.
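One simple way to obtain such a staggered schedule is sketched below. The specific rule (pixel index plus frame index modulo Tw) is an assumption made for illustration; the text above only requires that npixels/Tw pixels receive the fine update in each frame.

```python
def needs_fine_update(pixel_index, frame_index, t_w=16):
    """True when this pixel is due for its fine update in this frame.

    Hypothetical schedule: each pixel is updated once every t_w frames,
    and the pixels are spread evenly over the t_w phases.
    """
    return (pixel_index + frame_index) % t_w == 0
```

With this rule, a frame of 64 pixels and Tw = 16 performs exactly 4 fine updates in every frame, so the floating point work per frame stays constant.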
An alternate derivation using a heuristic is provided in Appendix A.
Algorithm 1: Proposed scheme for a single pixel x
Init: BG = { }, numModes = 1, Reset maxModeFlag,
      ∀i ∈ {1..maxNumModes}: µi = ∞, σi = σinit, wi = α
Data: input pixel x(t)
while New Data x(t) do
    for i ∈ {1..numModes} do
        if x(t) matches mode i then
            if i ∈ BG then
                x(t) is Background
            else
                x(t) is Foreground
            ci ← ci + 1 (refers to ci from D.S.Lee)
            Update µi, σi using Eqs. (3.2), (3.3) & (3.4)
            weightCounti ← weightCounti + 1
    if ∀i ∈ {1..numModes}, x(t) doesn’t match mode i then
        x(t) is Foreground
        if numModes < maxNumModes then
            numModes ← numModes + 1
            Initialize new mode j
        else
            Replace mode j where j = arg min_i {wi}
        cj ← 1, weightCountj ← 1
        µj ← x(t), σj ← σinit
        if numModes = maxNumModes then
            Set maxModeFlag
    Once in Tw frames:
    if t is a multiple of Tw then
        if maxModeFlag is Set then
            wi ← α where i = arg min_i {wi}
            Reset maxModeFlag
        for i ∈ {1..numModes} do
            Update wi using Eq. (3.13)
            if wi < 0 then
                delete mode
                numModes ← numModes − 1
        Normalize w
        Arrange modes in decreasing order of wi’s
        Determine Set of Background Modes, BG
        BG = {1..nBG} where nBG = arg min_b {Σ_{k=1}^{b} wk > TBG}
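The structure of Algorithm 1 can be sketched in Python for a single grayscale pixel as shown below. This is an illustrative sketch under several assumptions, not the implementation evaluated in this chapter: the model starts with an empty mode list instead of a placeholder mode, the flag is set only when a mode is actually replaced, the match counters are reset after each fine update, and the class name and `sigma_init = 15` are hypothetical.

```python
import math


class PixelGMM:
    """Single-pixel model with windowed weight updates (sketch of Algorithm 1)."""

    def __init__(self, alpha=0.004, t_w=16, c_t=0.01, t_bg=0.8,
                 t_sigma=3.0, sigma_init=15.0, max_modes=3):
        self.alpha, self.t_w, self.c_t, self.t_bg = alpha, t_w, c_t, t_bg
        self.t_sigma, self.sigma_init, self.max_modes = t_sigma, sigma_init, max_modes
        self.modes = []        # each mode: dict with w, mu, var, c, count
        self.n_bg = 0          # the first n_bg modes form the background set BG
        self.replaced = False  # plays the role of maxModeFlag
        self.t = 0

    def process(self, x):
        """Return True if x is foreground, then update the model."""
        self.t += 1
        foreground = True
        for i, m in enumerate(self.modes):
            if abs(x - m["mu"]) < self.t_sigma * math.sqrt(m["var"]):
                foreground = i >= self.n_bg
                m["c"] += 1
                eta = (1.0 - self.alpha) / m["c"] + self.alpha          # Eq. (3.4)
                diff = x - m["mu"]
                m["mu"] += eta * diff                                    # Eq. (3.2)
                m["var"] = (1.0 - eta) * m["var"] + eta * diff * diff    # Eq. (3.3)
                m["count"] += 1      # integer counter replaces the weight update
                break
        else:  # no mode matched: create or replace a mode
            fresh = {"mu": x, "var": self.sigma_init ** 2, "c": 1, "count": 1}
            if len(self.modes) < self.max_modes:
                self.modes.append(dict(fresh, w=self.alpha))
            else:
                j = min(range(len(self.modes)), key=lambda k: self.modes[k]["w"])
                self.modes[j].update(fresh)   # weight is reset at the fine update
                self.replaced = True
        if self.t % self.t_w == 0:
            self._fine_update()
        return foreground

    def _fine_update(self):
        a, tw = self.alpha, self.t_w
        if self.replaced:
            j = min(range(len(self.modes)), key=lambda k: self.modes[k]["w"])
            self.modes[j]["w"] = a
            self.replaced = False
        for m in self.modes:
            m["w"] = (1.0 - tw * a) * m["w"] + m["count"] * a - tw * a * self.c_t  # Eq. (3.13)
            m["count"] = 0
        self.modes = [m for m in self.modes if m["w"] >= 0.0]   # delete modes with w < 0
        total = sum(m["w"] for m in self.modes)
        if total > 0.0:
            for m in self.modes:
                m["w"] /= total
        self.modes.sort(key=lambda m: m["w"], reverse=True)
        cumulative, self.n_bg = 0.0, 0
        for m in self.modes:        # smallest b with cumulative weight > T_BG
            cumulative += m["w"]
            self.n_bg += 1
            if cumulative > self.t_bg:
                break
```

Between fine updates, the only weight related work per frame is the integer increment of `count`; all floating point weight arithmetic, sorting and background set selection happen once every `t_w` frames.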
[Figure 3.1 plots: panels (a) and (b), each showing the GMM and proposed weight curves; x axis 100 to 300, y axis 0.3 to 0.7]
Figure 3.1: Weight update using proposed update for a monotonically (a) increasing and
(b) decreasing case. The weights are plotted on the y axis with respect to time
[Figure 3.2 plots: panels (a) and (b), each showing the weights w1, w2 and w3; x axis 0 to 4000, y axis 0 to 1]
Figure 3.2: Weight update using (a) original GMM update equations and (b) proposed
weight update for a GMM with weights=[0.7, 0.25, 0.05] and Tw = 16. The weights are
plotted on the y axis with respect to time. Please note that the graph shows that proposed
technique does not affect the learning rate. The increase in frame rate provided by the
proposed technique is illustrated in Fig. 3.4
3.4 Experimental results
We initially describe experiments done to determine the optimum weight update interval
Tw. Later, we discuss results of experiments performed to obtain KL divergence on
1-dimensional synthetic datasets showing good learning performance of the proposed scheme.
Next, we present a quantitative evaluation of the proposed algorithm on a set of 10 standard
videos (5 outdoor and 5 indoor): video4, 6 & 7 from [9], fountain, hall, lobby, shoppingMall,
bootstrap & campus from [8] and HighwayI from [129]. Dataset [8] provides 20
frames of manually segmented foreground masks from each video set. VSSN 2006 provides
ground truth for all the frames. Ground truth was generated manually for 10 randomly
chosen frames in the HighwayI sequence. Precision-Recall curves for the proposed algorithm
are compared with those obtained using [2] and [3].
We measure the frame rate on a Core i5 processor running at 2.53 GHz with 4 GB of
system memory. All the programs are single threaded and have been compiled in Release
mode using Microsoft Visual C++. The following parameter values were found to work
well on all the videos: α = 0.004 & TBG = 0.8. The maxNumModes was set to 3 (two for
the background and one for the foreground).
3.4.1 Weight update interval Experiment
Fig. 3.3 shows the measured frame rate plotted as a function of Tw for the VSSN06 dataset
videos using the modified weight update technique. We observe that the speedup saturates
for Tw in the range of ≈ 12 - 18. The maximum error % of the proposed method is also
plotted where the error is defined as the maximum difference between the weights obtained
using ‘per cycle update’ Eqs. (3.6) & (3.7) and the weight obtained using the proposed
update Eq. 3.13 at the end of Tw frames. All possible combinations of matches during the
Tw frames are considered and the maximum deviation is plotted. The initial weight at the
beginning of the Tw frames is set to a realistic value winit = [0.7, 0.25, 0.05]. We can
observe that the error increases linearly as Tw is increased. Since speedup saturates for
Tw in the range of ≈ 12 - 18, we choose Tw to be 16. Choosing a higher Tw increases the
[Figure 3.3 plot: average frame rate (fps, axis 100 to 200) and maximum error % (axis 0 to 2.5) against the weight update interval Tw (1 to 18)]
Figure 3.3: Frame rate (fps) and error % are plotted with respect to the weight update
interval Tw
error without any benefit. On a dedicated hardware system, a weight update interval of 16
results in compact 4 bit counters for the coarse weight updates. We find that the small error
in weight doesn’t have any impact on the accuracy in real dataset videos. Detailed accuracy
data is described below for the chosen weight update interval of 16.
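The worst-case deviation discussed in this section can be computed by brute force for a single mode, as in the sketch below. It enumerates every match pattern of length Tw and compares the per-frame update (Eqs. (3.6) & (3.7)) with the windowed update (Eq. (3.13)). The function name is hypothetical, weight normalization is ignored, and the result is an absolute weight difference for one mode, so it is not necessarily the same error measure as the one plotted in Fig. 3.3.

```python
from itertools import product


def max_window_error(t_w, w_init=0.7, alpha=0.004, c_t=0.01):
    """Largest |per-frame weight - windowed weight| over all match patterns."""
    worst = 0.0
    for pattern in product((0, 1), repeat=t_w):   # 1 = the mode matched that frame
        w = w_init
        for matched in pattern:
            w = (1 - alpha) * w + alpha * matched - alpha * c_t      # Eqs. (3.6)/(3.7)
        n_k = sum(pattern)                                           # integer counter
        w_windowed = (1 - t_w * alpha) * w_init + n_k * alpha - t_w * alpha * c_t
        worst = max(worst, abs(w - w_windowed))
    return worst
```

For Tw = 16 the loop visits 65,536 patterns, which is still fast enough for an offline check.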
3.4.2 Adaptive Mixture Learning Experiment
The accuracy of the proposed method is first validated on one dimensional synthetic data.
Fig. 3.4a shows a typical pixel intensity distribution (plotted against frame count) observed
in surveillance videos. Here, a pixel which was initially unimodal (e.g. pixel belongs to the
‘sky’) changes to a multimodal process (wind causes tree leaves to vacillate on a static ‘sky’
background). Fig. 3.4b shows the KL divergence achieved by the original update equations
in [2] and by the proposed method during the phase where the model is learning the pa-
rameters for the new mode. We find the learning achieved by the proposed method to be
very similar to that obtained using the original update equations. The divergence has been
Figure 3.4: (a) Synthetic distribution based on commonly observed surveillance videos (b)
KL divergence achieved by the proposed and the original method [2]
computed using Monte Carlo sampling averaged over 5 datasets. Similar experiments per-
formed on slowly varying illumination models showed that the accuracies of the proposed
method matched well with the original scheme in [2].
3.4.3 Background subtraction experiment
The precision-recall curves for the 10 videos listed in section 5.2 have been determined. We
observed that the accuracy of the proposed algorithm matches the accuracy of the GMM
formulations of Lee [2] and Zivkovic [3] in all the videos. The average precision is plotted
against the recall rate in Fig. 3.5 showing no degradation of accuracy with the proposed
scheme. Since a false negative or a ‘miss’ is undesirable in surveillance applications, we
only show recall rates varied from 0.65 to 0.95. However, we verified that the proposed
scheme does not impact accuracy at lower recall rates either.
Fig. 3.4 (instantaneous frame rates) shows the frame rate of the different methods averaged
over 5 trials with Tσ = 3. The average frame rate, computed as the reciprocal of the average
computation time, is listed in Table 3.1. We find that the proposed scheme provides significant
speedup when multiple modes are required to model a significant fraction of the
scene. In Figs. 3.4a, 3.4b & 3.4d, we observe high speedups since frequent foreground
Table 3.1: Average frame rate using proposed scheme, [2] & [3]. Average speedup obtained
using proposed scheme measured over [3] (Zivkovic)
             Average frame rates (fps)
Dataset      D.S.Lee   Zivkovic   Proposed   Avg. Speedup
HighwayI     116       145        211        1.44
Campus       287       351        432        1.23
Hall         265       326        401        1.22
Lobby        307       375        440        1.17
Mall         115       153        204        1.33
Fountain     342       423        457        1.08
Bootstrap    291       333        458        1.37
video4       119       149        182        1.22
video6       116       143        185        1.28
video7       137       174        191        1.09
motion results in continuous creation of new modes. Similarly, large background motion in
the ‘campus’ sequence causes a significant fraction of pixels to require a multimodal model.
Hence, the proposed method provides speedup in this sequence as well. In Fig. 3.4c, we
notice that the initial speedup is low. This is due to the relatively static scene during the
initial phase of video4. Since a significant fraction of the scene requires only a single mode,
Zivkovic’s scheme itself provides high speedup, dominating the proposed method. However,
we observe that the speedup of the proposed method over Zivkovic’s scheme improves
beyond frame 400 since the number of multimodal pixels increases (due to shaking leaves
and appearance of foreground objects).
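The way the average frame rate is defined above (reciprocal of the average computation time) differs from a plain mean of instantaneous frame rates. A minimal sketch of the difference follows. The per-frame times are made-up values and the function names are hypothetical; only the 432 fps and 351 fps figures are taken from the Campus row of Table 3.1.

```python
def average_frame_rate(frame_times_s):
    """Average fps as the reciprocal of the mean per-frame computation time."""
    return len(frame_times_s) / sum(frame_times_s)


def speedup(fps_proposed, fps_baseline):
    """Ratio of two frame rates (above 1 means the proposed scheme is faster)."""
    return fps_proposed / fps_baseline


# Hypothetical per-frame computation times in seconds (illustration only).
times = [0.004, 0.005, 0.010, 0.005]
fps = average_frame_rate(times)
# Mean of the instantaneous rates 1/t, which weights fast frames more heavily.
mean_of_instantaneous = sum(1.0 / t for t in times) / len(times)

# Ratio of the Campus frame rates listed in Table 3.1 (Proposed over Zivkovic).
campus_speedup = speedup(432, 351)
```

With these numbers the reciprocal-of-mean rate is lower than the mean of instantaneous rates, because slow frames dominate the total computation time.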
The proposed technique also provides a minimum memory bandwidth reduction of 16%
for a pixel which has more than 1 mode. This is because the floating point weight variables
are fetched from the memory only once in 16 frames. Extra memory required (worst case)
to store weight counters, background set and a flag is equal to 11 bits/pixel (0.42 MB for a
VGA resolution input).
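The worst-case storage figure can be checked with a short calculation. The sketch below assumes a 640 x 480 VGA frame and decimal megabytes (10^6 bytes); the function name is hypothetical.

```python
WIDTH, HEIGHT = 640, 480      # VGA resolution (assumed 640 x 480)
EXTRA_BITS_PER_PIXEL = 11     # worst-case figure quoted in the text
UPDATE_INTERVAL = 16          # weight update interval Tw chosen in Sec. 3.4.1


def extra_memory_bytes(width, height, bits_per_pixel):
    """Additional storage in bytes for the per-pixel counters and flags."""
    return width * height * bits_per_pixel / 8


extra_mb = extra_memory_bytes(WIDTH, HEIGHT, EXTRA_BITS_PER_PIXEL) / 1e6
# Floating point weights are fetched once per update interval.
weight_fetches_per_frame = 1 / UPDATE_INTERVAL
```

Under these assumptions `extra_mb` evaluates to about 0.42, which agrees with the value stated above.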
[Figure 3.4 plots: instantaneous frame rate versus frame count for D.S.Lee, Proposed and Zivkovic on (a) HighwayI (b) Shopping Mall (c) video4 (d) bootstrap]
Figure 3.4: Instantaneous frame rates plotted against frame count using proposed scheme,
[2] & [3]
[Figure 3.5 plot: precision versus recall (0.65 to 0.95) for Proposed, Zivkovic and D.S.Lee]
Figure 3.5: Average precision-recall curves obtained using proposed scheme, [2] & [3] for
the 10 dataset videos
[Figure 3.6 images: panels (a) to (h)]
Figure 3.6: Detection results on the Hall [8] and video7 [9] dataset videos. (a) & (e) are original images from the Hall and
video7 datasets respectively. (b) & (f) are the corresponding Ground truth images. (c) & (g) are the segmentation masks
obtained using Lee [2]. (d) & (h) are the segmentation masks obtained using the proposed scheme
3.5 Summary
The computational complexity of modelling the background pixels using adaptive GMM
proposed in [3] can be significantly reduced for highly active pixels by our proposed scheme
of windowed weight updates. This method reduces processing time without affecting
segmentation accuracy. Experimental results show a speedup of up to 44% in scenes where
a large fraction of the pixels require multimodal Gaussian models. The proposed
modifications are also well suited to a hardware implementation. In the next chapter, we will
adopt the speeded up GMM algorithm to perform skip selection in the H.264 surveillance
video encoder.
Chapter 4
Skip decision & Reference Frame Selection for H.264 Surveillance Coding
4.1 Introduction
A substantial fraction of the Macroblocks (MB’s) in a static camera surveillance video stream
usually do not contain any objects of interest. Coding such MB’s using the Skip mode pro-
vided by the H.264/AVC standard provides a significant reduction in coding cost. Hence, it
is essential to accurately classify the MB’s into 2 sets: (1) MB’s which contain foreground
objects of interest (FG MB’s) and (2) MB’s which do not contain objects of interest (back-
ground MB’s or BG MB’s). Objects of interest in a surveillance scene typically are in a state
of motion (e.g. humans, cars, handbags). This naturally motivates the adoption of a mo-
tion detection algorithm to perform skip decision. However, the motion detection algorithm
needs to be computationally simple to operate on low power embedded camera platforms.
The method also needs to be accurate since skipping regions of interest will drastically
impact the utility of the encoded video stream. In this chapter, we describe a spatial sam-
pler based skip detection algorithm. Sampled pixels in the frame are segmented using a
speeded up GMM (Gaussian Mixture Model) algorithm that we proposed in the previous
chapter. The data structures of the GMM are rearranged to improve cache performance.
The remainder of the chapter is organized as follows: A high level overview of the pro-
posed sampler based skip decision technique is provided in Section 4.2. Different sampling
Chapter 4. Skip decision & Ref. Frame Sel. for H.264 Surveillance Coding 61
techniques used in practice are reviewed in Section 4.3. The proposed low cost skip de-
tection method is described in Section 4.4. MB classification & the H.264 skip signalling
procedure are provided in Sections 4.6 & 4.7 respectively. Experimental results of the pro-
posed technique are reported in Chapter 5.
4.2 Proposed Architecture
[Figure 4.1 block diagram: video stream from camera, spatial sampler, GMM based motion detection (GMM parameters), skip detection and skip signalling {Si}, reference frame selection & signalling with background model M = {B(n)}, H.264/AVC encoder with Ref. Pic. Buffer RPB = {R(n)} and Ref. Pic. List 0, transmission to an H.264/AVC decoder with Decoded Pic. Buffer DPB = {R(n)} ordered as specified by List 0]
Figure 4.1: Proposed surveillance specific video coding architecture
A block diagram of the proposed surveillance processing pipeline is shown in Fig. 4.1. A
spatial sampler is used to select pixels in the camera-captured image. Pixel level background
subtraction is performed at the sampled locations using the GMM algorithm. Motion detec-
tion is performed based on the segmented output from the GMM algorithm. Macroblocks
which do not have foreground pixels are considered as ‘Regions of no interest’ (RONI) and
are marked as BG MB’s. Accurate determination of blocks containing objects of interest
in the scene is essential to reduce bandwidth without distorting the foreground image.
However, the computational cost and power required to execute foreground segmentation
algorithms should be low enough to be feasible on embedded platforms. The proposed
method achieves this using a combination of 4 techniques: (A) Pixels in the frame are seg-
mented using a 2-Step adaptive sampling process (B) Spatio-Temporal priors are used to
bias the decision of marking a MB as BG or FG (C) The speeded up GMM based pixel level
background subtraction algorithm that we proposed in [130] is utilized to further reduce
the computational cost of motion detection (D) Data structures of the GMM are rearranged
to improve cache performance. A detailed description of the proposed Skip detection tech-
nique is provided in Section 4.4.
Let Bcurrent denote the set of indices of macroblocks that are marked as RONI (i.e. BG
MB’s) in the current frame. Si = {Mode, Ref, MV, Residual} is a set consisting of coding
decisions for the ith MB in the current frame. Here Mode ∈ {P16×16, SKIP} and MV
refers to the motion vector used to perform motion compensation utilizing the reference
frame indexed by Ref . The Residual is added to the motion compensated image to obtain
the reconstructed MB. Let N be the maximum number of reference frames allowed in the
reference picture buffer and RPB = {R(n) : 0 ≤ n < N} denote the ordered list of reference
frames R(n) present in the buffer. The ordering of R(n) in RPB is defined by the reference
picture ‘List 0’, i.e. R(0) denotes the frame in the buffer referenced by the top of ‘List 0’. The
set of indices of background MB’s in the nth reference frame of ‘List 0’ is denoted by B(n).
The ordered list M = {B(n) : 0 ≤ n < N} contains information of all the background MB’s
present in the reference buffer and is called the background model.
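The notation above maps naturally onto a few small containers. The following is a minimal sketch, not the encoder's actual data structures: the class names, field names and the string frame labels are all hypothetical, chosen only to mirror Si, RPB and M.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class MBDecision:
    """Coding decisions S_i for one macroblock."""
    mode: str                           # "P16x16" or "SKIP"
    ref: int = 0                        # index Ref into 'List 0'
    mv: Tuple[int, int] = (0, 0)        # motion vector MV
    residual: Optional[bytes] = None    # None when no residual is coded


@dataclass
class ReferenceBuffer:
    """Reference picture buffer RPB together with the background model M."""
    max_refs: int                                               # N
    frames: List[str] = field(default_factory=list)             # R(0) first
    background: List[Set[int]] = field(default_factory=list)    # B(0) first

    def refs_with_background(self, mb_index: int) -> List[int]:
        """Indices n for which MB mb_index is a background MB in R(n)."""
        return [n for n, b in enumerate(self.background) if mb_index in b]


# Illustrative buffer with N = 3 reference frames (labels are made up).
buf = ReferenceBuffer(max_refs=3,
                      frames=["F9", "F4", "F1"],
                      background=[{0, 1}, {0, 1, 2}, {2, 3}])
candidates = buf.refs_with_background(2)
```

Here `candidates` lists the reference frames in which macroblock 2 is background, which is the information the background model M is meant to provide.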
The proposed scheme takes the set of BG MB’s (determined using ‘Motion detection’) as
input and attempts to recreate the uncovered background regions using predicted data from
the frames in the Reference Picture Buffer. Since small residual errors in the background
regions do not impact surveillance systems, residual coding for BG MB’s is skipped. If the
reference frames in the Reference Picture Buffer contain the BG MB’s uncovered in the
current frame, the MB’s can be transmitted without residual coding and they incur a low
coding cost. Hence choosing the right pictures as reference is important to reduce bit rate.
An adaptive reference frame selection technique is proposed in Sec. 4.8 to optimally mark
encoded frames as Reference.
4.3 Sampling techniques
In our proposed technique, a sampler is used to determine foreground MB’s. Sampling
theory has been studied extensively in literature [131, 132, 133]. Different sampling tech-
niques have been used for various practical applications [133]. A few representative exam-
ples are (I) Environmental data collection which is further used to determine contamination
risks and (II) Estimation of oil reserves for which carefully chosen sample holes are drilled.
Increasing the accuracy of the estimate and reducing the cost to determine the estimate are
the main challenges of such sampling operations. For example, in the case of environmental
sampling, accurate contamination data is very critical. In the case of estimating oil reserves,
drilling holes is very expensive and hence optimal sampling is very important.
4.3.1 Basic sampling techniques
Basic sampling techniques that are commonly used are shown in Fig. 4.2. We now briefly
describe them here:
• Simple random sampling: As described by Steven K. Thompson in [132], distinct
units are selected from the population such that all possible combinations of the sam-
pled units are equally likely. This is effective when the population is homogeneous.
However, simple random sampling can be more expensive than other designs if the cost of
obtaining the sample is high in the randomly chosen locations (e.g. in the case of
estimating oil reserves).
• Cluster sampling: The population is split into primary and secondary units. Each
primary unit consists of one or more secondary units which are clustered in space
or time. Whenever a primary unit has been chosen as a sample, all its secondary units
are also included in the collection.
• Systematic sampling: Samples are chosen at regularly spaced intervals (in space
or time). This sampling pattern provides the largest coverage in a region for a fixed
number of units. Although systematic and cluster samplers appear to be very different,
they share an underlying design principle [132, 131]. The selection of the systematic
sampler can be considered to be the selection of a primary unit that constitutes the
whole sample. For example, we can divide the input 2D image data into four large
sets (or primary units) of pixels where each set consists of pixels sampled on a grid
which is offset from the other primary units. Sampling using the systematic sampler
involves the selection of one of these primary units. Systematic sampling has been
shown to be very effective in natural populations [131]. However, the accuracy of
this technique can degrade due to periodicity in the population.
We now consider the variance of the unbiased estimator for the population-total
obtained using a systematic sampler. The variance is related to sampled data parameters
as shown in Eqn. 4.1.
$$\mathrm{var}(\tau) \propto \sigma^2 \left[ 1 + (M - 1)\rho \right] \qquad (4.1)$$
Here ρ is the within-primary-unit correlation coefficient, M is the average number of
secondary units in a primary unit and σ² is the variance of the data (please refer to [132]
for more details).
Eqn. 4.1 shows that in the case of estimation, it is optimal to sample such that the
within-primary-unit correlation coefficient is low. In natural populations, similar data
characteristics are found in samples which are clustered in space and/or time. Sys-
tematic sampling which spreads the secondary sample points apart causes the within-
primary-unit correlation coefficient to be low. Hence, the systematic sampler has been
found to be very effective in real world applications.
• Stratified sampling: Here, the population is partitioned into sets which are called
strata. The strata are chosen based on existing prior information about the process
or from domain experience (e.g. ecological surveys stratify based on soil type and
vegetation). The sampler design is chosen differently for each stratum.
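The view of a systematic sample as a single primary unit can be illustrated with a small sketch. It assumes a rectangular pixel grid and a stride of 2, which yields the four offset grids mentioned in the systematic sampling example. The function names are hypothetical, and `variance_factor` evaluates only the bracketed term of Eqn. 4.1.

```python
def primary_units(width, height, stride):
    """Split a width x height pixel grid into stride*stride systematic samples.

    Each primary unit holds the pixels on a regular grid with spacing
    `stride`, offset by (dx, dy) from the image origin.
    """
    units = {}
    for dy in range(stride):
        for dx in range(stride):
            units[(dx, dy)] = [(x, y)
                               for y in range(dy, height, stride)
                               for x in range(dx, width, stride)]
    return units


def variance_factor(m, rho):
    """Bracketed term of Eqn. 4.1: 1 + (M - 1) * rho."""
    return 1.0 + (m - 1) * rho


units = primary_units(8, 8, 2)      # four primary units of 16 pixels each
uncorrelated = variance_factor(16, 0.0)
correlated = variance_factor(16, 0.2)
```

Selecting one key of `units` corresponds to drawing one systematic sample. The factor grows with the within-primary-unit correlation, which is why spreading the secondary units apart is beneficial.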
4.3.2 Adaptive sampling
Here, sampling pattern decisions are made based on the data that has been analyzed thus
far. Such designs can utilize the sampled dataset characteristics to refine future patterns.
They can be designed to very efficiently determine rare elements in the data. For example,
in pollutant estimation, the region is initially sampled sparsely. Units which are considered
interesting (i.e. pollution levels are greater than a threshold) will result in inclusion of
neighbouring points in the sample set. Natural populations are typically aggregated and
hence, such adaptive sampler designs significantly improve performance.
Different adaptive designs include (I) Adaptive cluster sampling (II) Systematic and
strip adaptive cluster sampling & (III) Stratified adaptive cluster sampling. We propose to
utilize a modified version of the stratified-adaptive-cluster sampler to perform skip detec-
tion. Hence, we describe only this technique here. The reader is referred to [132] for a
comprehensive description of other sampler designs.
• Stratified Adaptive Cluster Sampling
Stratified adaptive cluster sampling combines the ideas of adaptive and stratified
sampling [132]. Adaptive sampling utilizes sampled-unit characteristics to efficiently
choose future data points. In comparison, stratified sampling uses prior information
or domain knowledge to decide sampler patterns. Hence by combining both these
techniques, stratified adaptive cluster sampling improves the performance of the sys-
tem.
An example of this sampling technique is shown in Fig. 4.3. The population is first
sampled using the stratified design technique in Fig. 4.3a. When sampled units which
are of interest are found, additional samples from their neighbourhood are chosen.
This is shown in Fig. 4.3b where the added sample points are marked with a cross. In
the next section, we will describe the proposed skip detection algorithm based on the
Stratified adaptive cluster sampling technique.
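The two stages described above can be sketched on a small grid of measurements. This is an illustration under stated assumptions rather than the design in [132]: strata are taken to be whole rows, the neighbourhood is 4-connected, and all names and values are hypothetical.

```python
def stratified_initial(rows, cols, dense_rows, dense_stride, sparse_stride):
    """First stage: systematic sampling with a per-stratum stride."""
    picks = []
    for r in range(rows):
        stride = dense_stride if r in dense_rows else sparse_stride
        if r % stride == 0:
            picks += [(r, c) for c in range(0, cols, stride)]
    return picks


def adaptive_cluster_sample(values, initial, threshold):
    """Second stage: add the 4-neighbours of every unit of interest.

    A unit is of interest when its value exceeds `threshold`. Newly added
    units are examined in turn, so a cluster is followed to its edge.
    """
    rows, cols = len(values), len(values[0])
    sample = set(initial)
    frontier = list(sample)
    while frontier:
        r, c = frontier.pop()
        if values[r][c] <= threshold:
            continue
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in sample:
                sample.add((nr, nc))
                frontier.append((nr, nc))
    return sample


# Made-up measurements with one small cluster of high values.
values = [[0, 0, 0, 0],
          [0, 0, 9, 0],
          [0, 0, 9, 0],
          [0, 0, 0, 0]]
initial = stratified_initial(4, 4, dense_rows={3}, dense_stride=1, sparse_stride=2)
sample = adaptive_cluster_sample(values, initial, threshold=5)
```

The unit at (1, 2) is missed by the first stage but enters the sample because its neighbour at (2, 2) is of interest.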
Figure 4.2: Basic sampling techniques (a) Random (b) Cluster (c) Systematic (d) Stratified
Figure 4.3: Stratified Adaptive Cluster Sampling (a) First stage (b) Second stage
4.4 Sampler based Background MB detection
The duration in which foreground objects appear in a surveillance scene is typically a small
fraction of the total time. As a consequence, the energy consumption of the system is
strongly related to the energy required to detect foreground objects in the video sequence.
We propose to combine stratification and adaptive sampling techniques described in the
previous section to efficiently detect the background MB’s.
The input image is initially sampled sparsely. The sampled pixels are classified as either
background or foreground using the GMM algorithm. The regions surrounding the foreground
pixels are considered to be salient. These salient regions are further sampled using
a dense sampler. The sampled pixels are segmented to verify the presence of foreground
objects. MB’s which do not contain foreground pixels are included in BCurrent (the set of in-
dices of MB’s which contain only background objects). We abbreviate the proposed sampler
based motion detection scheme used to determine BCurrent as ‘GMM S-MD’. A flowchart of
the proposed method is shown in Fig. 4.4. A more detailed explanation of the proposed
GMM S-MD algorithm is provided below.
The motion detection process can be considered to be a cascade of two stages: (1)
Salient MB detection and (2) Background MB detection.
(1) Salient MB detection: Fig. 4.5 shows a macroblock (marked on a foreground object)
and the expanded view of the pixel grid. A set of sparsely located pixels A1 is obtained by
uniformly sampling the input image with inter pixel spacing set to Dsparse. The pixel posi-
tions are offset by a distance equal to Ddense to obtain multiple sparse pixel sets A2, A3, ...
Fig. 4.5 shows one such sampling pattern with 4 sets of pixels A1, A2, A3 and A4 inter-
spersed on a regular grid. In the first stage, the image is sparsely sampled by selecting only
one set of pixels i.e. A1, A2, A3 or A4. The selection is performed in a sequential fashion i.e.
pixels belonging to set A1 are sampled at frame n, pixels belonging to set A2 are sampled at
frame n+1 and so on. Background subtraction is performed on the sparsely sampled set of
pixels Ai using the GMM algorithm that we proposed in [130]. Chang et al. [134] showed
that sampling does not impact the learning performance of the GMM model. We denote
[Figure 4.4 flow chart: the salient MB detection stage (Sparse Sampler + BGS, union with Fprev obtained from Bprev through a 1 frame delay, morphological dilation using a 3x3 element) feeds the BG MB detection stage (Dense Sampler + BGS, erosion using a 2x1 element), which outputs BCurrent]
Notations:
Fsparse: Set of indices of MB’s which were detected as foreground by the sparse sampler
Fsal: Set of indices of MB’s which are candidates for dense sampling (before dilation)
F′sal: Set of indices of MB’s which are candidates for dense sampling (after dilation)
Fprev: Set of indices of MB’s which were marked as foreground in the previous frame, Fprev = (Bprev)^C
IFG: Output binary image obtained by segmenting densely sampled pixels
BCurrent: Set of indices of MB’s in current frame which contain only background content
Figure 4.4: GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-MD’)
flow chart
the threshold applied by the GMM algorithm on the sparse pixels as Tsparse. If one or more
pixels in MBi are classified as foreground, then the index i is included in the set Fsparse,
i.e. Fsparse ← Fsparse ∪ {i}. Fsparse contains the set of indices of MB’s which are detected
as FG MB’s by the sparse sampler (Fsparse is cleared before each frame in the input video
sequence is processed). The presence of foreground objects in the MB during the previous
frame increases the probability of the MB to contain foreground pixels in the current frame.
Hence, the union of Fsparse and Fprev (set of indices of FG MB’s in the previous frame)
is computed to obtain Fsal (set of indices of MB’s which have a large probability of con-
taining foreground objects). Likewise, the presence of foreground pixels in a MB increases
the probability of its neighbors to contain foreground objects. Hence, the final set of MB
candidates to be considered as salient, i.e. F′sal, is constructed by including the indices of
neighboring MB’s. This is performed by applying a morphological ‘dilation’ operator using
a 3x3 element.
(2) Background MB detection: F′sal contains the indices of the MB’s which are considered
to be salient. Background subtraction is performed on the pixels of all the four sparse
[Figure 4.5 illustration: pixel grid of a MB (MB boundary and an 8x8 block marked) with the interleaved sparse sets A1, A2, A3 and A4, Dsparse = 8 and Ddense = 4; cache mapping of the per-pixel GMM mode parameters and the weights w1, w2, w3 onto cache lines]
Figure 4.5: Figure shows the sampling pattern of pixels in an image. The sampled pixels are partitioned into 4 sparse sets A1,
A2, A3, & A4. Also shown are the GMM data structures of pixels mapped onto different cache lines to improve cache locality.
The models of the dominant modes are arranged in a contiguous manner. Also, the data elements belonging to a single set
of pixels are present in a contiguous array.
sets (A1, A2, A3 and A4) in the salient MB’s. We denote the GMM threshold used by the
dense sampler as Tdense. The classified output is represented using a binary image IFG
where IFG(x, y) = 1 indicates that the pixel at location (x × Ddense, y × Ddense) in the input
image is a foreground pixel. A morphological ‘erosion’ operator is applied on the image
IFG using a 2x1 element to filter out the noise. MBi is marked as a BG MB if all the sam-
pled pixels in the filtered output which belong to MBi are 0. Fig. 4.6 shows a set of salient
MB’s detected on a human in the scene. The figure also shows the pixels that are sampled
to detect the set of background MB’s.
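The two-stage cascade described above can be sketched as follows. This is a simplified illustration, not the thesis implementation. The per-pixel GMM classifiers are replaced by two caller-supplied predicates (stand-ins for the classifiers with thresholds Tsparse and Tdense), the frame dimensions are assumed to be multiples of 16, the four sparse sets are assumed to be offset as A1 A2 / A3 A4, and the 2x1 erosion element is assumed to be horizontal. All function names are hypothetical.

```python
MB = 16          # macroblock size in pixels
D_SPARSE = 8     # spacing within one sparse set (value shown in Fig. 4.5)
D_DENSE = 4      # offset between sparse sets (value shown in Fig. 4.5)
OFFSETS = [(0, 0), (D_DENSE, 0), (0, D_DENSE), (D_DENSE, D_DENSE)]  # A1..A4


def dilate3x3(mbs, mb_rows, mb_cols):
    """Return the MB set together with the 8 neighbours of each member."""
    out = set()
    for r, c in mbs:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if 0 <= r + dr < mb_rows and 0 <= c + dc < mb_cols:
                    out.add((r + dr, c + dc))
    return out


def gmm_smd(is_fg_sparse, is_fg_dense, width, height, frame_no, f_prev):
    """Return (B_current, F_current) as sets of (mb_row, mb_col)."""
    mb_rows, mb_cols = height // MB, width // MB
    ox, oy = OFFSETS[frame_no % 4]          # one sparse set per frame
    # Stage 1: salient MB detection.
    f_sparse = {(y // MB, x // MB)
                for y in range(oy, height, D_SPARSE)
                for x in range(ox, width, D_SPARSE)
                if is_fg_sparse(x, y)}
    f_sal = dilate3x3(f_sparse | f_prev, mb_rows, mb_cols)
    # Stage 2: dense sampling (all four sparse sets) inside salient MBs only.
    gw, gh = width // D_DENSE, height // D_DENSE
    i_fg = [[0] * gw for _ in range(gh)]
    for r, c in f_sal:
        for gy in range(r * MB // D_DENSE, (r + 1) * MB // D_DENSE):
            for gx in range(c * MB // D_DENSE, (c + 1) * MB // D_DENSE):
                i_fg[gy][gx] = int(is_fg_dense(gx * D_DENSE, gy * D_DENSE))
    # Erosion with a 2x1 element: a sample survives only if its right
    # neighbour on the dense grid is also foreground.
    f_cur = set()
    for gy in range(gh):
        for gx in range(gw - 1):
            if i_fg[gy][gx] and i_fg[gy][gx + 1]:
                f_cur.add((gy * D_DENSE // MB, gx * D_DENSE // MB))
    all_mbs = {(r, c) for r in range(mb_rows) for c in range(mb_cols)}
    return all_mbs - f_cur, f_cur


# A 16x16 foreground object covering exactly one MB of a 64x32 frame.
obj = lambda x, y: 16 <= x < 32 and 0 <= y < 16
b_cur, f_cur = gmm_smd(obj, obj, 64, 32, frame_no=0, f_prev=set())
# A single isolated foreground sample is removed by the erosion step.
noise = lambda x, y: (x, y) == (16, 0)
b_noise, f_noise = gmm_smd(noise, noise, 64, 32, frame_no=0, f_prev=set())
```

The returned foreground set of one frame would be passed as `f_prev` for the next frame, which is how the previous-frame result enters the salient set.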
[Figure 4.6 image: salient MB’s and sampled pixels marked on a human in the scene]
Figure 4.6: Salient MB’s and Sampled pixel plot
4.4.1 GMM S-MD as a Stratified-Adaptive-Cluster sampler
Although the proposed sampler would appear to be different from the Stratified-Adaptive-
Cluster sampler, the two techniques share the same set of underlying concepts. The
set of MB’s (in the current frame) that were marked as FG in the previous frame, and
their neighboring macroblocks constitute stratum 1. The remaining MB’s in the current
frame constitute stratum 2. Strata 1 & 2 have been sampled using a systematic sampler
with stride parameter set to Ddense & Dsparse respectively. Stratum 1 has been sampled
densely since the likelihood that MB’s in this image region contain foreground is high. Stratum 2 image regions
are sparsely sampled to detect new objects appearing in the scene. Sparsely sampling the
stratum 2 image regions reduces the computational complexity of the detector. Sampled
units in Stratum 2 which are marked as foreground result in further sampling of pixels in
their neighboring MB’s. This adaptive sampling technique helps to reduce the ‘miss rate’ of
foreground detection. The final set of pixels marked as foreground by the dense sampler is
filtered to reduce the false alarm rate.
Combining stratification and adaptive sampling helps to reduce the computational cost
of the skip detector without sacrificing accuracy. If we adopt only adaptive cluster sampling,
isolated FG MB’s (for example, FG MB’s of small objects) that are incorrectly marked
as BG by the sparse sampler would have been skip coded. However, since GMM S-MD uses
stratification based on the results of the previous frame, such isolated MB’s that were de-
tected in the previous frame would be considered as salient. Also, using only stratification
based on previous frame results does not cause all the MB’s of newly entered objects to be
detected. Even if a few MB’s on the object are not detected by the sparse sampler, adaptive
sampling will include these MB’s in stratum 1 (Densely sampled set). This is because FG
MB’s detected by the sparse sampler will cause their neighboring MB’s also to be marked as
salient. In Chapter 5, we will present experimental results which show the effectiveness of
combining stratification and adaptive cluster sampling techniques for skip detection.
Along with the standard Stratified-Adaptive-Cluster sampling techniques adopted, GMM
S-MD also incorporates features specific to skip detection, namely (I) Spatio-temporal priors
and (II) Cache performance optimization, which we will discuss here.
4.4.2 Spatio-temporal priors
The accuracy of the proposed multi stage sampler based BG MB detector depends upon
the sampling parameters of Strata 1 & 2 and the accuracy of the pixel level classifiers (i.e.
sparse and dense pixel classifiers). The precision and recall values of the pixel level classifier
depend upon the GMM thresholds. Reducing the GMM threshold increases the false positive
rate and improves the recall rate. Similarly, increasing the sampling stride reduces
the recall rate. A detailed analysis of these relationships is provided in Appendix B. GMM
S-MD allows different thresholds for the sparse and dense samplers. As we will later show
in Chapter 5, to detect camouflaged small objects, we reduce the threshold of the dense
sampler. The spatial prior introduced by this configuration of the GMM S-MD skip detector
closely resembles the priors imposed by doubleton clique potentials (spatial smoothness
priors) in Markov Random Field based image segmentation tasks. When a pixel is marked
as a FG pixel by the sparse sampler, a lower GMM threshold is applied on the neighboring
pixels, hence biasing the pixel to be marked as a FG. If any of the neighboring pixels are
also marked as FG, the morphological filter output marks the MB as FG. Since GMM S-MD
considers FG MB’s detected in the previous frame when computing the set of salient MB’s,
it also incorporates temporal priors along with the spatial bias.
As explained above, we observe that the GMM S-MD algorithm incorporates Spatio-
Temporal priors to influence the final classification of a MB (as FG/BG), i.e. an MB which
contains foreground-pixel-detections biases its spatial and temporal neighbors to be marked
as FG. This 2-Step sampling process improves the accuracy of Skip detection. A large frac-
tion of the MB’s belonging to the background are filtered by the sparse sampler. The dense
sampler is applied only on the set of Salient MB’s. Hence, the inclusion of Spatio-Temporal
priors is achieved with low computational cost. The five parameters of GMM S-MD are
Tsparse, Tdense, GMM learning rate α, Ddense and Dsparse. We provide a detailed discussion
on the selection of these parameters in Chapter 5. We also present experimental results
which show that the Spatio-Temporal priors introduced by GMM S-MD help to detect small
objects in surveillance videos.
4.4.3 Cache performance optimization
In our application, we note that a large fraction of MB’s in a video sequence usually do not
contain foreground. Hence, the only memory access involved in such cases is fetching the
parameters of the dominant modes which represent the background image (mode with the
highest weight). We therefore maintain the models of the dominant modes in a contiguous
manner (in memory) as shown in Fig. 4.5. The other observation is that the modes of only
one of the sets of pixels (A1, A2, A3 or A4) are fetched to perform skip decision of MB’s
which contain only background images. Hence, the modes of the pixels are reordered to
ensure that data elements belonging to a single set of pixels are present in a contiguous
array.
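One way to picture this reordering is as an index mapping from pixel position to a slot in a set-major array of dominant-mode records. The sketch below is only an illustration of that idea. The function names, the A1 A2 / A3 A4 offset layout and the row-major order within a set are assumptions, and Python itself gives no control over cache lines.

```python
D_SPARSE, D_DENSE = 8, 4   # spacings shown in Fig. 4.5


def sparse_set_id(x, y):
    """0..3 for A1..A4, for a pixel (x, y) on the dense sampling grid."""
    return ((y % D_SPARSE) // D_DENSE) * 2 + (x % D_SPARSE) // D_DENSE


def dominant_mode_slot(x, y, width, height):
    """Position of the pixel's dominant-mode record in a set-major array.

    All records of A1 come first, then A2, A3 and A4, so a frame that
    samples only one sparse set walks one contiguous range of slots.
    """
    per_row = width // D_SPARSE
    per_set = per_row * (height // D_SPARSE)
    return (sparse_set_id(x, y) * per_set
            + (y // D_SPARSE) * per_row + x // D_SPARSE)


# Slots of the A1 pixels on the first sampled row of a 32x16 image.
a1_row = [dominant_mode_slot(x, 0, 32, 16) for x in (0, 8, 16, 24)]
# Slots of every pixel on the dense grid of the same image.
all_slots = sorted(dominant_mode_slot(x, y, 32, 16)
                   for y in range(0, 16, D_DENSE)
                   for x in range(0, 32, D_DENSE))
```

Neighbouring A1 pixels land in neighbouring slots, and every sampled pixel receives a distinct slot.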
4.5 Reference frame selection
Consider a sequence of frames in display order as shown in Fig. 4.7. The first frame
in the sequence is an IDR frame. Let FC be the current frame being encoded. Let the
mth frame in the sequence be denoted by Fm. Let P denote the ‘key frame’ period in the
sequence. The set of reference frames available in the reference picture buffer when coding
the current frame are shaded in grey. In static camera surveillance videos, the previous
frame in display order typically provides the best prediction (least R-D cost) for foreground
MB’s. Hence every encoded frame is marked as reference for the successive frame and is
placed as the first entry of reference picture ‘List 0’ using Reference Picture List Reordering
(RPLR) commands. Let $F_m^{R(n)}$ indicate that the mth frame in the video sequence is the
nth reference frame in the reference picture buffer. As a consequence, the previous frame
is denoted by $F_{C-1}^{R(0)}$. After encoding the current frame, the encoder needs to replace a
reference frame in the buffer by the current frame. The index of the frame in the picture
buffer to be replaced is denoted by nR. Selection of the value of nR is discussed in Sec. 4.8.
Figure 4.7: Sequence of frames in display order
[Figure labels: time axis; IDR frame 0; IDR frame 1; reference frames in RPB; current frame $F_C$; the previous frame $F^{R(0)}_{C-1}$ is always a reference frame.]
4.6 Macroblock Classification
Fig. 4.8 shows a pictorial description of a surveillance video sequence. Consider a
macroblock $MB_i$ (ith MB in raster scan order) in the current frame which contains moving
foreground objects (e.g. MB1 in Fig. 4.8). The MB will not be skipped and the H.264/AVC
encoder will perform motion estimation (ME) and mode decision to determine the optimum
mode. Let $C^{FG}_i$ denote the number of bits required to code the FG macroblock $MB_i$. Since
almost all FG MB’s obtain predicted data from the previous frame, $C^{FG}_i$ depends only
upon the encoder complexity parameters (e.g. motion search range) and not on the
selection of reference frames. MB2 contains only background objects and hence is marked as
‘Skip’. The coding cost to mark such MB’s as ‘Skip’ is very low and is denoted by $C^{BG \to BG}_i$
(for the ith MB). Macroblock MB3 contains only background objects in the current frame.
However, the encoder cannot mark it as ‘Skip’ since the collocated block in the previous
frame contains foreground objects. The encoder will need to refer to other frames in the
Figure 4.8: Macroblock reference assignment
[Figure labels: macroblocks MB1, MB2, MB3 and MB4 shown across Ref. frame 2, Ref. frame 1, Ref. frame 0 (previous frame, $F^{R(0)}_{C-1}$) and the current frame $F_C$.]
RPB in which the collocated block is a background MB (i.e. the MB does not contain
foreground objects). If the background MB is available in the RPB, the current macroblock can
be coded using the inter mode with motion vector and residual set to 0. This would result
in a low coding cost $C^{FG \to BG\langle A\rangle}_i$ (for the ith MB). However, if none of the reference frames
contain a collocated background macroblock for $MB_i$ (e.g. MB4 in Fig. 4.8), then the
encoder will need to perform motion estimation and mode decision. It will also have to
encode the residual and will incur a bit cost denoted by $C^{FG \to BG\langle U\rangle}_i$. The notations used to
classify the macroblocks are summarized in Table 4.1.
The total coding cost (in bits) for the current frame is given by:
Table 4.1: Notations
| $MB_i$ type | Description |
|---|---|
| FG | $MB_i$ is a FG MB in the current frame |
| BG→BG | $MB_i$ is a BG MB in both the current and previous frames |
| FG→BG | $MB_i$ is a FG MB in the previous frame and a BG MB in the current frame |
| FG→BG〈A〉 | $MB_i$ is a FG→BG MB and ∃ n such that 0 < n < N & i ∈ B(n), i.e. a collocated BG MB is 〈A〉vailable in RPB |
| FG→BG〈U〉 | $MB_i$ is a FG→BG MB and i ∉ B(n) ∀ n such that 0 < n < N, i.e. a collocated BG MB is 〈U〉navailable in RPB |
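\[
\text{cost} = \sum_{i \in I_{FG}} C^{FG}_i
 + \sum_{i \in I_{BG \to BG}} C^{BG \to BG}_i
 + \sum_{i \in I_{FG \to BG\langle A\rangle}} C^{FG \to BG\langle A\rangle}_i
 + \sum_{i \in I_{FG \to BG\langle U\rangle}} C^{FG \to BG\langle U\rangle}_i
\tag{4.2}
\]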
Here, the first summation represents the total bit cost of all foreground image regions
in the current frame. The second term represents the cost to mark BG → BG MB’s as
skip and is hence very small. The third term represents the total cost to encode uncovered
background MB’s which can be directly reconstructed (without residual) using reference
pictures in the DPB. Since such MB’s do not need coding of residual information, this cost
is very small. The last term represents the coding cost of uncovered background MB’s that
need residual coding (due to unavailability of image content in the reference pictures in the
DPB).
Let $C^{FG}$, $C^{BG \to BG}$, $C^{FG \to BG\langle A\rangle}$ and $C^{FG \to BG\langle U\rangle}$ denote the bit cost summations for
FG, BG→BG, FG→BG〈A〉 and FG→BG〈U〉 macroblocks respectively in Eq. 4.2.
The total cost of coding FG→BG MB’s, $C^{FG \to BG}$, is equal to the sum of $C^{FG \to BG\langle A\rangle}$ and
$C^{FG \to BG\langle U\rangle}$.
4.7 Skip Signalling
MB’s which have been determined to contain only background image content (BG MB’s)
can be marked as Skip based on the availability of background MB’s in the model M. The
appropriate coding decisions $S_i$ for macroblock $MB_i$ are obtained as shown in the flowchart
in Fig. 4.9. Here, $I_{FG}$, $I_{BG \to BG}$, $I_{FG \to BG\langle A\rangle}$ and $I_{FG \to BG\langle U\rangle}$ denote the sets of all indices
of FG, BG→BG, FG→BG〈A〉 and FG→BG〈U〉 macroblocks in the current frame
respectively. MV and MVP represent the motion vector and motion vector predictor of the
MB respectively.
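The per-macroblock decision can be sketched as below. This is one plausible reading of the flowchart in Fig. 4.9 combined with the description in Sec. 4.6, not a transcription of the encoder code: the function name `skip_signalling`, the dictionary return values, and the choice of the first matching reference index r when several references contain the collocated background MB are assumptions made for illustration.

```python
def skip_signalling(i, I_FG, I_BG_BG, B, mvp_is_zero):
    """Return an illustrative coding decision for macroblock index i.

    I_FG, I_BG_BG: sets of MB indices classified as FG and BG->BG.
    B:             list of N sets; B[r] holds the indices of MBs that are
                   background in reference frame r (r = 0 is the previous frame).
    mvp_is_zero:   True if the motion vector predictor of the MB is zero.
    """
    if i in I_FG:
        # Foreground MB: regular H.264/AVC motion estimation and mode decision.
        return {"action": "ME_and_mode_decision"}
    if i in I_BG_BG:
        if mvp_is_zero:
            return {"mode": "SKIP"}
        # Skip would inherit a non-zero predicted MV, so code an explicit
        # zero-motion, zero-residual P16x16 block from the previous frame.
        return {"mode": "P16x16", "mv": (0, 0), "residual": 0, "ref": 0}
    # FG->BG MB: look for a reference r, 1 <= r < N, holding the background.
    for r in range(1, len(B)):
        if i in B[r]:
            # FG->BG<A>: background is available in the RPB.
            return {"mode": "P16x16", "mv": (0, 0), "residual": 0, "ref": r}
    # FG->BG<U>: background unavailable, fall back to the regular encoder path.
    return {"action": "ME_and_mode_decision"}
```

For example, with `B = [set(), {5}, {7}]`, an uncovered background MB with index 7 is coded as a zero-motion P16x16 block referring to reference 2, while an MB with index 9 falls back to motion estimation.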
Figure 4.9: Skip Signalling
[Flowchart for skip signalling of $MB_i$. Decision nodes (Yes/No): $i \in I_{FG}$?; $i \in I_{BG \to BG}$?; MVP == 0?; $i \in I_{FG \to BG\langle A\rangle}$?. Action nodes: Motion estimation and Mode decision (H.264/AVC encoder); Mode = SKIP; Mode = P16x16, MV = 0, Residual = 0 (*), with Ref = 0 or Ref = r such that $i \in B(r)$, $1 \le r < N$; Entropy coding; End.
* Residual and quantized coefficients for the entire 16x16 MB are set to 0.]
4.8 Optimum Reference Frame selection
In Chapter 2, we noted that high quality long term reference frames (HQF’s) do not reduce
the coding cost of FG objects (i.e. $C^{FG}$) in surveillance video encoders. However, the cost
to encode uncovered background regions, $C^{FG \to BG\langle U\rangle}$, depends upon the set of reference
frames in the DPB. We now propose an H.264 standard compliant reference frame selection
technique to reduce the cost of coding uncovered background regions in surveillance videos.
Later, in Chapter 5, we implement different reference frame selection strategies in Matlab
and analyze the performance of the proposed scheme. We also compare the proposed
technique with the 1× and 2× algorithms using real world surveillance videos.
4.8.1 Proposed Adaptive Reference Frame Selection Technique
From the discussion in Sec. 4.6, we observe that the coding cost of uncovered background
regions, i.e. $C^{FG \to BG}$, is dependent on the reference frames present in the RPB. Also, we
noted in Sec. 4.5 that, after coding $F_C$ (the current frame), the encoder will need to replace
an existing picture in the RPB with the current frame. The choice of the frame in the RPB
to be replaced will decide the set of reference pictures available for future frames and will
hence determine $C^{FG \to BG}$. The optimal selection procedure will attempt to maximize the
number of FG→BG macroblocks which can be reconstructed from the collocated positions
in the reference frames without coding any residual. However, this would require the
encoder to ‘look ahead’ and would also be computationally very expensive. We propose a low
computational cost reference frame selection algorithm to mark the reference picture in the
RPB which will be replaced by the current frame. We obtain the theoretical upper bound
in Section 5.6 and show that the performance of the algorithm is very close to the upper
bound.
Consider the state of the RPB after the current frame has been encoded. The complete
set of MB’s in the background model is $B = \bigcup_{n=0}^{N-1} B(n)$. B is referred to as the background
set. If the encoder marks the current picture as the $n_R$th reference picture, then the updated
background set would be $\bigcup_{n=0,\, n \neq n_R}^{N-1} B(n) \cup B_{current}$. The marking decision is made so as
Init: C = 0, M = ∅, B(n) = ∅ ∀ n ∈ {0 ... (N − 1)}
Data: Input frame F_C
while new frame F_C do
    GMM S-MD: determine B_current
    if F_C is an IDR frame then
        Clear RPB
        M ← ∅
        B(n) ← ∅ ∀ n ∈ {0 ... (N − 1)}
        n_R ← 0
        I-frame coding of F_C
    else
        RPLR: set previous frame F_{C−1} as R(0) (first entry in Ref. list 0). Apply same reordering to M
        for i ∈ (1 ... N_MB) do
            Skip signalling for MB_i (see Fig. 4.9)
        end for
        n_R ← argmax_n | ∪_{i=0, i≠n}^{N−1} B(i) ∪ B_current |
    end if
    B(n_R) ← B_current (update background model)
    R(n_R) ← F_C (insert current frame into RPB)
    C ← C + 1
end while
Figure 4.10: Pseudo Code of Proposed Reference Frame Selection Scheme
to maximize the number of background macroblocks in the set B. The pseudo code for the
proposed method is shown in Fig. 4.10. Here, $N_{MB}$ denotes the number of MB’s in a single
frame. The background model, $\{B(i) : 0 \le i < N\}$, is initialized to the empty set ∅ at the start
of the video sequence. It is updated after every frame has been encoded and the reference
frame marking decision has been completed. If the current frame is marked as an IDR frame,
the background model is reset to ∅ (since the DPB is flushed by the decoder when an IDR
frame is received). The worst case computational cost of the proposed algorithm is equal to
$N(N-1)N_{MB}$ logical bitwise OR operations (required to compute ‘Set Unions’) followed by
conditional increment operations (required to compute the cardinality of the ‘Set Union’).
However, we show in Sec. 5.6 that the coding performance approaches the upper bound
when 2 reference frames (i.e. N = 2) are used. Hence, the proposed algorithm has a very
low computational cost and can be implemented on embedded platforms. The method also
provides computational cost reduction by avoiding the coding (Motion estimation, mode
decision and residual coding) of several FG→ BG MB’s.
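The selection of $n_R$ in Fig. 4.10 can be sketched with bitmasks, in the spirit of the bitwise OR operations mentioned above. This is an illustrative sketch rather than the thesis code: the function name `select_replacement`, the use of Python integers as per-frame background bitmasks (bit i set means MB i is background), and the tie-breaking rule (lowest index wins) are assumptions.

```python
def select_replacement(B, B_current):
    """Pick the RPB slot n_R whose replacement maximizes the background set.

    B:         list of N ints; B[n] is the background bitmask of reference n.
    B_current: int; background bitmask of the frame that was just encoded.
    Returns the index n that maximizes |union of B[k] for k != n, plus B_current|.
    """
    best_n, best_count = 0, -1
    for n in range(len(B)):
        union = B_current
        for k, mask in enumerate(B):
            if k != n:
                union |= mask              # bitwise OR implements set union
        count = bin(union).count("1")      # cardinality of the union
        if count > best_count:             # strict '>' keeps the lowest index on ties
            best_n, best_count = n, count
    return best_n


# Example with N = 2 reference slots and 4 macroblocks per frame.
n_R = select_replacement([0b1100, 0b0001], 0b0001)
```

In the example, replacing slot 0 would leave only MB 0 covered, whereas replacing slot 1 keeps MBs 2 and 3 from slot 0 and MB 0 from the current frame, so slot 1 is selected.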
4.9 Summary
In this chapter, a low computational complexity, sampler based architecture has been
proposed to detect foreground MB’s in static camera surveillance videos. A brief introduction
to relevant sampling techniques has been given. A multi stage sampler that combines
stratification and adaptive sampling techniques has been developed. The proposed scheme
reduces the complexity without affecting the accuracy of the detector. H.264/AVC standard
compliant skip signalling techniques for background MB’s have also been described. We
also proposed a reference frame selection technique for a static camera surveillance video
encoder. The proposed scheme maximizes the number of BG MB’s available in the DPB and
hence reduces the cost of coding uncovered background regions. In Chapter 5, we present
RD performance results of the proposed scheme on real world surveillance videos.
Chapter 5
Results: Skip Decision and Reference
Frame Selection
5.1 Introduction
In Chapter 4, we described the skip decision and reference frame selection algorithms
that we propose to reduce the bitrate of surveillance videos. We now present experimental
results validating the performance of the proposed techniques. We first describe the
experimental setup and the test video dataset. Next, the RD performance of the proposed
GMM S-MD algorithm is compared with other techniques in the literature. Rate distortion
curves of 6 videos are plotted. Bitrate reduction data obtained on ‘No activity’
datasets (videos which do not contain FG objects) is provided. Complexity reduction
results of the proposed encoder (i.e. encode time reduction) are also described. We also show
encoded output frames for a few videos in the dataset. Performance of the GMM S-MD
algorithm in challenging conditions, such as the presence of obscured/camouflaged small objects
and low lighting, is analyzed. The necessity to incorporate spatio-temporal bias in the skip
detector architecture is illustrated using sample test cases. The impact of varying the
learning rate and threshold parameters of the GMM S-MD algorithm on the bitrate and accuracy
is described. Next, we show that the proposed technique increases the distortion computed
over the entire image but does not affect the utility of the encoded surveillance video. We
also show that slow and fast varying lighting conditions do not cause any ‘false miss’
outputs in the skip detection algorithm. Next, we provide an analysis of the proposed adaptive
reference frame selection algorithm. We then compare it with a recently proposed reference
frame selection technique. Finally, a summary of the experimental results is provided.
Chapter 5. Results: Skip Decision and Reference Frame Selection 84
5.2 Experimental Setup
Sixteen uncompressed 720p (1280x720 resolution) surveillance videos with a wide variety
of characteristics (indoor, outdoor, no foreground activity, fast motion, small objects,
persisting foreground, low lighting conditions, different white balance and exposure settings,
lighting change) have been collected at 10 fps in 4:2:2 format. The entire dataset along with
the encoded videos has been published on the Internet (see footnote 1). The videos are down-sampled
to the 4:2:0 format and used as the test set. Sample snapshots of the dataset are shown
in Fig. 5.1. Along with this dataset, we also use three videos from the PETS 2009 video
dataset (PETS-1: View 001 sparse crowd, PETS-2: View 006 & PETS-3: View 001 dense
crowd) [135] and one video from the CDW dataset (wetSnow) [136] (we resize/crop the
images to PAL resolution, i.e. 768 × 576 pixels). 100 frames of each video sequence are
encoded (a larger number of frames is encoded for the ‘Parking lot’ and the ‘Evening fade’
datasets) and QP values are varied to obtain different sample points on the R-D
(Rate-Distortion) plane. Foreground pixels for 25 randomly selected frames (in each video) are
manually annotated for 10 videos and the distortion is computed over their Luma values
(we use the ground truth of all the 100 frames provided in the CDW dataset). The settings
used for the GMM S-MD parameters are discussed in Section 5.4.
In this work, we use adaptive memory control to manage the DPB. To evaluate the
benefits of multiple reference frames, we have integrated the proposed methods into the
H.264/AVC reference software JM 18.0 [137]. The proposed techniques have been
implemented in the C++ programming language. The Windows operating system has been
used to perform all the experiments. Rate distortion optimization (RDO) has been enabled.
Main profile with P slices and CABAC (Context-adaptive binary arithmetic coding) entropy
coding is used for all the experiments. To measure speedup, we use the highly optimized
x264 video encoder [11]. Single pass mode with IPPP coding structure is used for low
delay and low complexity encoding. RD mode decision for all frames and fast skip detection
on P-frames have been enabled. Single threaded mode is chosen and the computation time
1http://chips.ece.iisc.ernet.in/index.php/Pushkar G
is measured on a Core i5 processor (having 2x64KB L1, 2x256KB L2 and 3MB L3 caches)
running at 2.53 GHz with 4 GB of system memory.
Figure 5.1: Snapshots from the video dataset (a) Entrance (b) Parking Lot (c) Access Door
(d) Backyard1
5.3 Skip Selection using GMM S-MD
Figs. 5.2 & 5.3 compare the RD performance of ‘Skip detection’ using the proposed GMM
S-MD technique with those in [24], [41] and JM [137]. GMM S-MD encodes a large number
of background MB’s as ‘Skip’ and hence provides a significant increase in R-D performance
of up to 2 dB at high bitrates (‘Bridge’ dataset). However, at low bitrates, we find that the
average reduction in data rate across the video dataset is not high. This is because the R-D
cost of the skip mode (at low bitrates) is low and hence the RDO based encoder chooses
the skip mode for most of the background MB’s. Nevertheless, GMM S-MD provides bitrate
reduction of 27.3% compared to [24] & [41] on the ‘No Activity1’ sequence (at low bitrate;
QP set to 32). On the same sequence, GMM S-MD also provides execution time reduction
of 40.8% compared to [24] & [41] (measured using x264 [11] with QP set to 32). Fig.
5.3 shows that the proposed technique reduces bitrate by 29.2% & 30.3% on the PETS-1 &
CDW videos (with QP set to 24). Experiments also show that GMM S-MD provides 11.4%
& 4.9% bitrate reduction on PETS-2 & PETS-3 videos respectively. Since R-D data indicates
that the methods in JM, [24] and [41] do not skip a significant number of BG MB’s, we
studied the impact of relaxing their skip selection thresholds: the Tc and Te values in [24]
and the value of Tlow in [41] were increased. We found that this causes a few foreground
regions to be incorrectly marked as ‘Skip’, hence reducing the foreground PSNR.
Table 5.1 lists the reduction in encoding time obtained by adopting the proposed
method. The proposed method provides up to 74.5% reduction in encoder execution time
over [41] & [24] (measured using x264 [11]). We observe that [41] provides good com-
putational complexity reduction on indoor scenes which contain objects with little texture.
However, it does not skip a large number of BG MB’s in scenes with rich texture (e.g. in
the ‘No activity1’ dataset). [24] provides nominal reduction in scenes which are brightly
illuminated. However, under relatively low lighting conditions, the increase in the pixel
noise causes a significant number of BG MB’s to be marked for mode decision. The variance
parameters of the GMM model used in the proposed approach track changes in the pixel
statistics and hence reduce the number of such false alarms across all the video datasets.
We note that a large number of surveillance cameras are monitoring scenes with little or
no foreground objects for a significant fraction of the time. Hence reducing the false alarm
rate is important to reduce the average bandwidth and the average power consumption.
Table 5.2 shows the bitrate reduction on two surveillance videos which have no foreground
activity. We find that the proposed skip decision method provides bitrate reduction of up to
94.5% (over [41] & [24]) by reducing the number of false alarms. The 2x1 morphological
erosion operation performed on IFG (to remove noise) was found to provide a 53.9% re-
duction in bitrate in the ‘No Activity1’ dataset. To quantify the benefit of the ‘cache aware
Table 5.1: Average execution time reduction of encoder (QP set to 24)
Note: ∆Encode time in % is with respect to x264 [11]
| Sequence | Zeng [41] ∆Encode time | Jin [24] ∆Encode time | GMM S-MD ∆Encode time |
|---|---|---|---|
| Entrance | 26.4% | 4.9% | 42.4% |
| Walkway | 37.4% | 11.3% | 51.7% |
| Access Door | 51.7% | 4.5% | 56.3% |
| BackYard1 | 11.4% | 14.4% | 31% |
| BackYard2 | 31.9% | 22.4% | 58.6% |
| Parking Lot | 21.8% | 0.1% | 77.9% |
| Bridge | 9.7% | 1.4% | 30.1% |
| No Activity1 | 18.9% | 23.1% | 80.4% |
| No Activity2 | 78.8% | 6.9% | 82.3% |
| CDW | 13.2% | 52.7% † | 36.5% |
| PETS-1 | 15.6% | 27.3% | 61.3% |
| PETS-2 | 40.4% | 17.7% | 44% |
| PETS-3 | 15.6% | 3.6% | 26.8% |
† Large number of FG MB’s in CDW are marked as ‘Skip’ by Jin [24]
placement’ of GMM parameters, we have coded GMM S-MD with cache optimization en-
abled and disabled. We have also measured the last level cache (LLC) references using CPU
counters. We find that cache optimization provides 12.3% reduction in execution time (in
the ‘No Activity2’ dataset) and 30.2% reduction in LLC references. We note that larger exe-
cution time savings would be obtained in embedded platforms (which typically do not have
L3 caches) since such LLC references would have to be serviced by the main memory.
Table 5.3 shows that the execution time of the proposed GMM S-MD method is in the
range of 1ms-3.6ms. The table also lists the computation time required when the sampler
is disabled i.e. pixel segmentation is performed using the modified GMM algorithm pro-
posed by Zivkovic [3] followed by a 2x2 morphological erosion operation. The proposed
GMM S-MD method provides speedup in the range of 22 - 33 over [3] for video datasets
containing foreground objects. The computation time of the proposed method measured
Table 5.2: Performance comparison of proposed GMM S-MD on ‘No activity’ datasets
Note: Bit rate reduction in % computed with respect to JM
| Sequence | JM [137] Bitrate (kbps) | Jin [24] ∆Bitrate | Zeng [41] ∆Bitrate | GMM S-MD ∆Bitrate |
|---|---|---|---|---|
| No Activity1 | 2596 | 22% | 0.95% | 86% |
| No Activity1 | 776 | 14.4% | 0.5% | 72.9% |
| No Activity1 | 277 | 7.2% | 0.2% | 55.3% |
| No Activity1 | 139 | 3.4% | 1.1% | 29.8% |
| No Activity2 | 1458 | 6.1% | 35.4% | 96.4% |
| No Activity2 | 233 | 4.1% | 59.2% | 89.6% |
| No Activity2 | 34 | 1.4% | 55.9% | 58.2% |
| No Activity2 | 21 | 33.3% | 34% | 34% |
on the ‘No Activity’ datasets is very low (1ms - 1.5ms) since most of the MB’s are classified
as ‘Non Salient’ and are hence sparsely sampled. The low computation cost of the GMM
S-MD algorithm enables execution on low power embedded camera platforms. Park et al.
proposed a random-sampler based method for background subtraction in [138]. A set of
sparsely sampled pixels is segmented as either foreground or background. The regions
surrounding the sampled pixels marked as foreground are further classified. Based on the
number of foreground pixels around the sampled locations, further spatial expansion is
performed. In [139], Lee et al. further improved upon [138] by classifying pixels in an
interleaved order. They showed speedup in the range of 2.3-3.4 over [3]. In comparison,
GMM S-MD utilizes a fixed 2-Step sampling structure to efficiently incorporate MB-level
Spatio-Temporal priors for Skip decision. The 2-Step sampling structure of GMM S-MD al-
lows different GMM thresholds to be applied on pixels of ‘Salient’ and ‘Non Salient’ MB’s. In
Section 5.4, we show that this enables accurate detection of small obscured objects. GMM
S-MD also samples pixels over a regular grid and hence enables cache performance opti-
mizations through data-structure rearrangements. Chang et al. [134] proposed to modify
the sampler density based on the foreground probability. However, the computational cost
required to determine the foreground probability model and the sampler map limited the
speedup obtained (over [3]) to 3. Guo et al. [140] used a hierarchical, block & pixel level
segmentation technique to reduce computational complexity. Unlike in GMM S-MD, spatial
samplers are not adopted and hence every pixel in the frame is accessed to compute the
block level features. Spatio-Temporal priors are also not used to bias block-level decisions.
Hence, a low threshold is applied on the block level features to ensure that foreground
detections are not missed. They showed speedup of 5.7 over GMM.
Table 5.3: Average execution time of Skip detection
| Sequence | Zivkovic [3] Time (ms) | Jin [24] Time (ms) | GMM S-MD Time (ms) |
|---|---|---|---|
| Entrance | 78.1 | 3.9 | 3.4 |
| Walkway | 66.8 | 3.5 | 2.3 |
| Access Door | 70.8 | 4 | 2.6 |
| BackYard1 | 67.5 | 3.4 | 2.8 |
| BackYard2 | 53.8 | 3.4 | 1.6 |
| Parking Lot | 68.4 | 4.4 | 2.9 |
| Bridge | 75.7 | 4 | 3.6 |
| No Activity1 | 53.6 | 3.4 | 1.5 |
| No Activity2 | 52.4 | 3.9 | 1 |
| CDW | 24.4 | 1 | 1.56 |
| PETS-1 | 19.6 | 1 | 0.8 |
| PETS-2 | 22.5 | 1 | 1 |
| PETS-3 | 22.4 | 1.3 | 1.2 |
We now analyze the relation between noise, resolution and bitrate of the encoded video.
We mentioned in Chapter 1 that, for scenes which are not well illuminated, we need to
increase gain. Fig. 5.7 shows one such example where the gain is increased to improve
visibility of the corridor. Fig. 5.7 also shows 100 RGB sample values of a pixel from the video
plotted in the 3D RGB space. We can see that the background GMM mode has accurately
modelled the probability distribution of the pixel. Due to the increased gain, the noise is
high. The bitrate of the video encoded using JM is measured to be 220 kbps. Using GMM
S-MD, the required bitrate reduces to 33 kbps. When the resolution of the video is reduced to
800 × 480 pixels (using bilinear interpolation), the bitrate of the JM coded output bitstream drops
to 30 kbps. At this reduced resolution, the GMM S-MD output video bitrate is measured to
be 14 kbps. The bilinear interpolation operation filters the noise and causes the drastic
reduction in bitrate: a large number of background MB’s are marked as skip since the
residual after quantization reduces to 0. However, the proposed technique can encode
the full resolution (1280 × 720 pixels) video at almost the same bitrate as JM achieves at
the reduced resolution. Also, at the lower
resolution, i.e. 800 × 480 pixels, GMM S-MD provides bitrate reduction of 53.3% over JM.
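The percentages quoted for this example follow directly from the measured bitrates. The short sketch below reproduces the arithmetic; the helper name `bitrate_reduction` is hypothetical, and the full resolution figure is derived here from the 220 kbps and 33 kbps measurements rather than quoted from the text.

```python
def bitrate_reduction(ref_kbps, test_kbps):
    """Percentage bitrate reduction of a test encoder relative to a reference."""
    return 100.0 * (ref_kbps - test_kbps) / ref_kbps


# Full resolution (1280 x 720): JM 220 kbps versus GMM S-MD 33 kbps.
full_res = bitrate_reduction(220, 33)

# Reduced resolution (800 x 480): JM 30 kbps versus GMM S-MD 14 kbps.
low_res = bitrate_reduction(30, 14)

print(f"full resolution: {full_res:.1f}%  reduced resolution: {low_res:.1f}%")
```

The reduced resolution value rounds to the 53.3% stated above, and the full resolution pair corresponds to a reduction of 85%.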
Figure 5.2: RD data for the (a) Bridge & (b) Walkway video sequences
Figure 5.3: RD data for the (a) Access Door & (b) Entrance video sequences
Figure 5.4: RD data for the (a) PETS-1 & (b) PETS-2 video sequences
Figure 5.5: Encoded frames of the (a) Light Switch (b) Bridge and (c) Low light video
sequences [annotation in the figure: "Object can be noticed in the video"]
Figure 5.6: Encoded frames of the (a) CDW (b) PETS-2 and (c) PETS-3 video sequences
Figure 5.7: Figure on the left shows a poorly lit corridor scene with increased camera gain
settings. Also, on the right, 100 RGB sample values of a pixel (pixel P in the image) from
the video are plotted in the 3D RGB space (axes R, G and B). In the same picture, the background GMM mode
is shown, i.e. the points on the sphere are at a distance of 2.5 σ (Mahalanobis distance)
from the mean value of the mode.
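The 2.5 σ sphere in Fig. 5.7 corresponds to the usual GMM match test: a sample belongs to a mode if its Mahalanobis distance from the mode mean does not exceed the threshold. The sketch below assumes a spherical covariance (a single σ shared by the R, G and B channels), which is what a sphere-shaped boundary suggests; the function name `matches_mode` and the example values are hypothetical.

```python
def matches_mode(pixel, mean, sigma, threshold=2.5):
    """Return True if an RGB sample lies within `threshold` sigma of a mode.

    For a spherical Gaussian, the squared Mahalanobis distance is
    ||pixel - mean||^2 / sigma^2, so the test compares the squared Euclidean
    distance against (threshold * sigma)^2 and avoids a square root.
    """
    d2 = sum((p - m) ** 2 for p, m in zip(pixel, mean))
    return d2 <= (threshold * sigma) ** 2


# Hypothetical background mode centred at (50, 50, 50) with sigma = 4.
inside = matches_mode((55, 52, 48), (50, 50, 50), 4.0)
outside = matches_mode((65, 50, 50), (50, 50, 50), 4.0)
```

Samples for which the test fails against the background mode are the ones treated as foreground candidates; a noisy pixel such as P simply has a larger σ, which widens the sphere.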
5.4 Analysis of GMM S-MD Performance
In this section we will initially discuss the nominal parameter-settings of the GMM S-MD
algorithm and their impact on its performance. These settings can be applied to videos in
which (a) Objects cover more than 3 - 4 MB’s and (b) Lighting conditions are good (i.e.
directly illuminated scenes unlike in Fig. 5.6c). Later, we discuss specific tuning of the
parameters required for videos which do not satisfy these constraints.
• Nominal parameter settings: Experiments show that setting Tsparse = Tdense = 2.5,
Dsparse = 8, Ddense = 4 and learning rate α = 0.004 works well. The GMM S-MD
classifier performance is not very sensitive to the precise values of Tsparse, Tdense & α and
works well within a reasonable range of settings. To verify the robustness of the sampler,
we have encoded the videos (sequences 1-5 in Table 5.3) with different values of Dsparse
and determined the number of FG MB’s that were incorrectly marked as BG. The value of
Ddense was set to 4. We observe only one mis-detection in the ‘Access Door’ video when
Dsparse was increased to 12 (and 3 mis-detections when Dsparse was set to 20). No
mis-detections were observed in any of the other video datasets even when Dsparse was set to a
high value of 20. This is due to the fact that even a single detection will trigger dense
sampling in the salient MB’s (around the foreground detected pixel).
Fig. 5.8 shows the impact of varying the sparse sampler threshold Tsparse on the perfor-
mance of the GMM S-MD algorithm for the ‘Walkway’ and ‘Backyard2’ video sequences.
In the case of the ‘Backyard2’ sequence, none of the foreground MB’s are wrongly marked
as ‘Skip’. In scenes with high environment noise, e.g. ‘Walkway’ sequence, increasing the
Tsparse threshold reduces the number of false alarms (i.e. background MB’s marked as
FG). However, it also causes a few missed detections. Hence, we recommend setting Tsparse
to a conservative value equal to 2.5.
• Obscured and camouflaged small objects (occupying an area of less than 3-4 MB’s and
similar in appearance to the background): In such cases (e.g. ‘Parking lot’ &
‘Bridge’ videos), we reduce the GMM threshold of the dense sampler i.e. we set Tdense =
2. As a result, the probability that salient MB’s are marked as FG increases. Salient
MB’s classified as FG in the current frame further cause their neighbors to be marked as
‘Salient’ in the next frame. Hence, even a few sparse sampler detections on the object in
the current frame would ensure successful detection in succeeding frames. Reducing the
value of Tdense (and not Tsparse) does not result in a drastic increase in the bitrate since a
large fraction of background MB’s are filtered out by the sparse sampler. For example, on
the ‘Parking lot’ video, setting Tsparse = 2.5 & Tdense = 2 resulted in detection accuracy
equal to the case when Tsparse = Tdense = 2. However, the bitrate in the former case
(when only Tdense was reduced to 2) was 41.5% lower compared to the case when both
the thresholds, i.e. Tsparse & Tdense were set to 2. To demonstrate these findings, we have
executed GMM S-MD with different combinations of Tsparse and Tdense. Fig. ?? shows
the encoded frames. We observe that when both Tsparse and Tdense are set to a high value
(i.e. 2.5), some regions of the foreground object are missed. By reducing only Tdense, all
the foreground objects are correctly detected. Reducing both Tsparse & Tdense increases
bitrate significantly as mentioned above.
As described in Section 5.4, the Spatio-Temporal priors incorporated in the GMM S-MD
algorithm assist in continuous detection of small objects. To analyze the importance
of the Spatio-Temporal bias in the GMM S-MD design, we disable it and determine its
impact, i.e. we set F ,sal to be equal to Fsparse. Fig. 5.9 shows that without the Spatio-
Temporal bias, we miss detection of one foreground object. With the Spatio-Temporal
bias enabled, GMM S-MD detects all the foreground objects accurately.
The maximum speed of images of small objects ( < 50 pixels wide) in surveillance videos
(measured in pixels/time) is typically about (10 × 16 pixels/sec) i.e. 1 MB width/frame
at 10fps. Hence, the Spatio-Temporal priors incorporated by the GMM S-MD algorithm
are found to successfully assist in continuous detection of small objects. We do note that
the sampler occasionally misses FG MB’s when objects with area smaller than that of a
MB (16× 16 pixels) move amidst foliage (in the ‘Parking lot’ video sequence). Detection
capability of such small objects is not required by a large fraction of surveillance systems.
However, systems which require such detection capabilities will need to use a higher
sampling density (e.g. Ddense = 2 detects very small objects in the ‘Parking lot’ sequence),
albeit with higher computational complexity and memory requirements.
• Irregular environment noise: Fig. 5.2a & Fig. 5.4b show that the proposed GMM S-
MD skip detection provides 27.1% & 30.3% bitrate reduction on the ‘Bridge’ & ‘CDW’
videos (QP set to 24). However, further analysis of the MB skip maps show that despite
the significant bitrate savings obtained, a large number of dynamic background image
regions are not marked as skip. This is due to the highly irregular motion of the
background objects. In the ‘Bridge’ video, the large motion of vegetation causes these MB’s
to be marked as FG. On the frame shown in Fig. 5.6b, GMM S-MD was found to provide
31.4% reduction of bit count compared to JM. To determine the best achievable bitrate
reduction, we manually annotated the frame and measured the bit cost of only the true
FG MB’s. These measurements show that the maximum achievable bitrate reduction is
89.3%. Similarly, in the ‘CDW’ video, irregular noise due to rain and snow is not
modeled by the GMM and is incorrectly marked as foreground. More elaborate techniques
can be adopted to improve the accuracy, albeit with greater computational cost.
• Persisting foreground: Continuous occlusions of the background scene due to objects
(as in the ‘Bridge’ sequence) cause false inclusions in the background modes of the GMM.
Similarly, slowly moving objects (as in the ‘Slow motion’ sequence) also introduce errors
in the background model. Fig. 5.12a shows that 3 FG MB’s are marked as Skip when α is
increased to 0.006 in the ‘Bridge’ sequence. Hence, to avoid wrongly marking foreground
regions as ‘Skip’ in such cases, we cannot set the learning rate α of the GMM algorithm
to a high value. However, increasing α for noisy background pixels (e.g. shaking foliage
in the ‘Bridge’ video) reduces the bitrate due to improved learning performance of the
GMM. On the ‘Bridge’ sequence, results show that the bitrate drops by 10% when α is
increased to 0.006 but 3 MB’s are wrongly marked as BG MB’s. From this discussion,
we clearly observe a tradeoff that exists in the choice of α. We observe that α set in the
range of 0.002 - 0.005 works well on all the videos (including ‘Slow Motion’ & ‘Bridge’).
We report results for the ‘Bridge’ video with α set to a conservative value of 0.002.
Fig. 5.12b shows that varying the learning rate on the ‘Backyard2’ sequence does not
have a large impact on the bitrate. It also does not cause any foreground MB’s to be
marked as BG. This is because of the absence of persisting foreground objects and low
noise characteristics of the ‘Backyard2’ sequence.
The tradeoff described above has been studied by multiple researchers in the past. Lin et
al. in [141] provide a comprehensive list of related work and also introduce an adaptive
learning-rate control scheme to resolve this tradeoff. The algorithm (in [141]) uses dif-
ferent learning rates for pixels at different locations. We note that this technique in [141]
can be easily adopted by GMM S-MD to improve performance.
• Dense foreground object presence: When the number of FG objects in the scene increases
and the noise in the background image regions is not high, the technique proposed by
Zeng et al. provides good bitrate reduction. Hence, the savings obtained over their
technique are reduced. As an example, on the PETS-3 video (in which a large number of
pedestrians walk across the scene), GMM S-MD provides 5% reduction in bitrate over the
RD cost based skip detection technique by Zeng et al. [41]. The computation required
to perform skip detection also increases as shown in Table 5.3 (PETS-3 is captured at
the same location as PETS-1 but with higher foreground motion and hence increased
execution time required for the skip detector). However, since GMM S-MD uses a 2 step
sampler based technique, the average execution time for skip detection (per frame) is
only 1.2ms. Interestingly, when the number of FG MB’s in motion is high, the savings
obtained by the proposed reference frame selection technique increases. We discuss this
in Sec. 5.7.
• Very low lighting: In such conditions, camera gain is usually increased to improve per-
ceptibility. Due to this, noise in low light surveillance videos is high. Hence, low light
conditions are particularly challenging. We need to lower Tsparse and Tdense to maintain
accuracy (Tsparse is set to 2 & Tdense is set to 1.5). The exposure control routine of cameras
can be easily adapted to lower the threshold values when illumination reduces. Fig.
5.6c shows an encoded frame from the ‘Low light’ sequence. We found that a significant
fraction of the background regions in the frames are not skipped. However, we note that
GMM S-MD provides 30% bitrate reduction compared to [24] and [41].
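The tradeoff in the choice of α discussed under ‘Persisting foreground’ can be made concrete with a small calculation. The sketch below is a simplified single-mode view and not the GMM S-MD implementation: it assumes the standard Stauffer-Grimson style weight update w ← (1 − α)w + α for a mode that is matched in every frame, a starting weight of 0, and a hypothetical weight threshold of 0.1 above which the mode would be treated as background. The threshold value and the function names are illustrative assumptions.

```python
import math


def weight_after(alpha, n_frames):
    """Weight of a mode matched in every one of n_frames frames.

    Assumes w starts at 0 and is updated as w <- (1 - alpha) * w + alpha,
    which gives the closed form w_n = 1 - (1 - alpha) ** n.
    """
    return 1.0 - (1.0 - alpha) ** n_frames


def frames_until_absorbed(alpha, weight_threshold=0.1):
    """Frames a stationary object needs before its mode reaches weight_threshold.

    weight_threshold is a hypothetical value chosen only for illustration.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 < weight_threshold < 1.0:
        raise ValueError("weight_threshold must lie in (0, 1)")
    return math.ceil(math.log(1.0 - weight_threshold) / math.log(1.0 - alpha))


if __name__ == "__main__":
    for alpha in (0.002, 0.006):
        n = frames_until_absorbed(alpha)
        print(f"alpha={alpha}: {n} frames ({n / 10:.1f} s at 10 fps)")
```

Under these assumptions, tripling α from 0.002 to 0.006 shortens the absorption time by roughly a factor of three. This agrees in direction with the observation above that persisting or slowly moving foreground is more readily marked as ‘Skip’ at the higher learning rate.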
Figure 5.8: Impact of varying Tsparse on GMM S-MD performance shown for the (a) Walkway and (b) Backyard2 sequences
Figure 5.9: Encoded frame from the ‘Parking lot’ video (a) without Spatio-Temporal bias
(object is missed) and (b) with Spatio-Temporal bias (object is detected)
Figure 5.10: Encoded frame from the ‘Parking lot’ sequence with (a) Ddense = 4 and (b) Ddense = 2
Figure 5.11: (a) Correctly detected foreground objects (marked in yellow) and (b) RD data
for the ‘Parking lot’ video
Figure 5.12: Impact of varying the learning rate α on GMM S-MD performance on the (a) Bridge and
(b) Backyard2 sequences
5.5 Background PSNR and its Impact
We note that the proposed approach achieves bitrate reduction by skipping MB’s which
belong to the background. While this would result in a reduction of the total PSNR, we
have shown in Section 5.3 that the PSNR computed on the foreground objects remains
unaffected. An informal study was also performed to verify that the proposed method does
not cause any distraction/irritation to viewers. The encoded videos were presented to five
subjects who were informed that the task was to monitor the surveillance zones. The subjects
indicated that they did not observe any distraction/irritation in the videos encoded using
GMM S-MD. Visual attention, which has been very actively studied as a part of experimental
psychology, is known to be highly dependent upon the task and context [142]. Numerous
experiments have also concluded that attention is completely masked by foreground objects
in motion [143]. Hence, the context, task and motion in the video mask small distortions
introduced in the background.
The reduction of the total PSNR due to the proposed approach was found to be high in
the ‘Walkway’ dataset. Hence, we show 2 images in Fig. 5.13 from the ‘Walkway’ dataset,
one obtained by encoding using the JM encoder and the other using the proposed
GMM S-MD encoder. Fig. 5.14 shows the total PSNR and the foreground PSNR (of
the displayed frame) plotted against the bit rate. We clearly find that the proposed method
does not impact the utility of the encoded video. Small blocking artifacts can be noticed if
the frames are viewed carefully. However, these artifacts are masked by the motion of foreground objects
as already mentioned.
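The distinction between total PSNR and foreground PSNR used in this section can be sketched as follows. This is a minimal illustration that assumes 8-bit luma samples stored as lists of rows and a binary foreground mask of the same size; the function name `psnr` and the toy frame are assumptions made for the example and are not taken from the thesis tooling.

```python
import math


def psnr(reference, decoded, mask=None, peak=255.0):
    """PSNR in dB between two equally sized 2-D luma arrays (lists of rows).

    If mask is given, only pixels whose mask entry is truthy contribute,
    which yields a foreground PSNR when mask marks the foreground pixels.
    """
    squared_error = 0.0
    count = 0
    for y, (ref_row, dec_row) in enumerate(zip(reference, decoded)):
        for x, (r, d) in enumerate(zip(ref_row, dec_row)):
            if mask is None or mask[y][x]:
                squared_error += (r - d) ** 2
                count += 1
    if count == 0:
        raise ValueError("no pixels selected")
    if squared_error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / (squared_error / count))


if __name__ == "__main__":
    # Toy 2x4 frame: left half is (skipped) background, right half is foreground.
    reference = [[100, 100, 50, 60], [100, 100, 70, 80]]
    decoded = [[110, 110, 51, 60], [110, 110, 70, 79]]
    fg_mask = [[0, 0, 1, 1], [0, 0, 1, 1]]
    print("total PSNR (dB):", round(psnr(reference, decoded), 2))
    print("foreground PSNR (dB):", round(psnr(reference, decoded, fg_mask), 2))
```

In the toy frame the stale background lowers the total PSNR while the foreground PSNR stays high, which is the behaviour the section describes for skipped background MB’s.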
Since surveillance video footage is increasingly being monitored by automatic computer
vision based methods, it is important that the proposed encoding scheme does not reduce
the performance of such algorithms. Most of the successful object detection algorithms such
as ‘Discriminatively Trained Deformable Part Models’ or DPM [144] use gradient features
(e.g. HOG or ‘Histogram of Oriented Gradients’). To verify that the block artifacts around
the object do not impact the accuracy of algorithms such as DPM, we have performed object
detection tests on the videos compressed using JM and the proposed scheme. Results show
that the proposed scheme does not affect the accuracy of DPM. Fig. 5.13 shows sample
detections in the ‘Walkway’ dataset obtained using the ‘person final’ model.
We also note that lighting changes do not impact the quality of the foreground images.
Encoders place IDR (Instantaneous Decoder Refresh or key) frames typically once in 5-10s
(50-100 frames at 10fps) to provide fast-seek and error recovery capabilities. Slowly varying
light changes are updated by the IDR frames and hence the MB’s which are skipped contain
pixels which appear very similar to those in the current frame. The ‘Evening Fade’ video (in
which illumination gradually reduces over a duration of 3 minutes) has been used to verify
that no noticeable distortion is introduced in the encoded frames. Very fast lighting changes
(e.g. switching on a light) are not tracked by the GMM model and hence the MB’s will be
marked as FG. Fig. 5.6a shows the encoded video frame (of the ‘Light Switch’ dataset)
immediately after the light switch was turned ON. Fig. 5.15 shows the encoded frames
from the ‘Sunlight variation’ video captured at two different time instants. We can observe
that the foreground MB’s have been coded without any artifacts.
Figure 5.13: Snapshots of the ‘Walkway’ dataset encoded using (a) JM and (b) GMM
S-MD show that the proposed method does not produce any conspicuous distortion in the
background. The DPM detections (yellow rectangles) are overlaid on the images.
Figure 5.14: PSNR plots for the ‘Walkway’ dataset frame in Fig. 5.13, computed over
(a) the entire frame and (b) the foreground regions.
Although the proposed technique reduces the total PSNR, it significantly improves the RD
performance for foreground image regions.
Figure 5.15: Two encoded frames, (a) and (b), with different sunlight intensities in
the ‘Sunlight variation’ video. We observe that the proposed scheme does not wrongly mark
FG MB’s as ‘Skip’ under fast illumination changes.
Figure 5.16: Slow reduction in illumination observed in the ‘Evening fade’ video. Encoded
frames captured (a) before and (b) after the reduction
5.6 Analysis of the Proposed Adaptive Reference Frame Selection Technique
We first provide an analysis of the proposed reference frame selection scheme to gain some
insight into its performance. Next, we provide detailed RD data and compare it with one of
the recently proposed reference frame selection techniques.
The best possible reference frame replacement policy involves a combinatorial search
over all possible selection options in the video sequence and hence is computationally in-
feasible even for an offline analysis. Instead, we obtain bounds and show that the perfor-
mance of the proposed reference frame selection algorithm (using 2 reference frames) is
very close to the optimum. We have obtained the foreground maps based on ‘GMM S-MD’
for all the frames in the video datasets. Different reference frame selection strategies have
been implemented in Matlab and analyzed as follows: Let us assume a hypothetical video
encoder which marks every frame after the IDR picture as a reference frame. The
number of FG → BG MB’s which can be skipped (i.e. the number of FG → BG 〈A〉
MB’s) in this case is greater than or equal to the number of FG → BG 〈A〉 MB’s in any
practical encoder (which only marks a limited number of frames as reference). The hypothetical
encoder would also maximize the number of MB’s in the set B, i.e. MB’s containing
only background pixels. Hence, we can obtain the upper bound for the number of MB’s in
set B (and likewise the lower bound for FG → BG 〈U〉 MB’s) by maintaining every frame
as a reference which can contribute to the background model. Fig. 5.17 shows the number
of background MB’s available in the background set B for the ‘Entrance’ video. We observe
that FG → BG MB’s for a large fraction of the scene are not present in B when only 1 reference
frame (i.e. the previous frame) is used. The use of two previous frames for reference
does not provide any significant improvement in coverage. However, when the proposed
algorithm is enabled to select the second reference frame, it marks a frame which does not
contain foreground objects as a reference. This enables the encoder to skip coding for a large
number of FG → BG MB’s. Hence, bit rate savings of up to 16.3% (over the single reference
frame encoder) are obtained. Further increase in the number of reference frames does not
provide benefit as the number of background MB’s has already reached its upper bound.
Fig. 5.18 shows the number of FG→BG MB’s that could not refer to the reference frames
in the RPB due to unavailability. If the MB’s which were unavailable in the background set
contain rich texture (as in this video), coding them will require a large number of bits and
will result in increased bit rate.
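The coverage analysis above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the Matlab analysis used for Fig. 5.17: it treats an MB position as belonging to the set B when it is background in at least one reference frame, and `previous_plus_cleanest` is a hypothetical stand-in for the proposed selection (it simply pairs the previous frame with the earlier frame that has the fewest FG MB’s). All function names and the toy foreground maps are assumptions.

```python
def background_coverage(fg_maps, ref_indices, n_mb):
    """Fraction of MB positions that are background in at least one reference frame.

    fg_maps[i] is the set of MB indices marked foreground in frame i.
    """
    all_mbs = set(range(n_mb))
    covered = set()
    for i in ref_indices:
        covered |= all_mbs - fg_maps[i]
    return len(covered) / n_mb


def previous_frames(current, k):
    """Indices of the k frames immediately preceding `current`."""
    return list(range(max(0, current - k), current))


def all_frames_since_idr(current):
    """Hypothetical encoder: every frame since the IDR picture (index 0) is a reference."""
    return list(range(current))


def previous_plus_cleanest(current, fg_maps):
    """Illustrative stand-in: previous frame plus the earlier frame with fewest FG MB's."""
    cleanest = min(range(current), key=lambda i: len(fg_maps[i]))
    return sorted({current - 1, cleanest})


if __name__ == "__main__":
    # Toy 1-D scene with 6 MB positions; an object enters and moves to the right.
    fg_maps = [set(), {0, 1}, {1, 2}, {2, 3}, {3, 4}]
    n_mb, current = 6, 4
    policies = {
        "1 Ref. (Previous)": previous_frames(current, 1),
        "2 Ref. (Previous)": previous_frames(current, 2),
        "2 Ref. (previous + cleanest)": previous_plus_cleanest(current, fg_maps),
        "Upper bound": all_frames_since_idr(current),
    }
    for name, refs in policies.items():
        print(f"{name}: {background_coverage(fg_maps, refs, n_mb):.0%} of MBs in B")
```

In this toy scene, two consecutive previous frames add little coverage because the object occupies almost the same region in both, whereas one object-free frame is enough to reach the upper bound.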
Figure 5.17: No. of MB’s in the set B as a percentage of NMB (Total No. of MB’s in a frame)
for the ‘Entrance’ video. x-axis: Frame number; y-axis: Number of MB’s in B (% of NMB);
curves: 1 Ref. (Previous), 2 Ref. (Previous), 2 Ref. (Proposed), Upper bound
Figure 5.18: No. of FG→BG MB’s which require non-zero residual coding (i.e. No. of FG→BG 〈U〉 MB’s) in the ‘Entrance’ video.
x-axis: Frame number; y-axis: No. of FG→BG MB’s which require non-zero residual coding;
curves: 1 Ref. (Previous), 2 Ref. (Previous), 2 Ref. (Proposed), Lower bound
5.7 Performance of the Proposed Adaptive Reference Frame Selection Technique
In this section, we compare the proposed technique with existing reference frame selection
algorithms. To the best of our knowledge, no surveillance-specific reference frame selection
algorithms for H.264 have been published in the literature. The ‘1×’ and ‘2×’ complexity
algorithms in [57] are closest in relevance to our method. The authors have also identified
occlusion as an important aspect of reference frame selection. Hence, we compare
the proposed scheme with these methods. However, we acknowledge that the 1× and 2×
algorithms have been developed without assuming a static-camera setup and hence can be
used for generic video content as well.
Before we present the data, we briefly describe the 1× and 2× algorithms. More details
can be obtained from the original reference [57]. Following the notation we introduced in
Chapter 4, let FC denote the current picture and FR(n)m denote the nth reference frame in
the DPB. The 2× algorithm estimates the cost of discarding FR(n)m after the completion of
coding the previous frame FC−1. To obtain this estimate, the current picture FC is encoded
assuming that the previous frame FC−1 and all the reference frames of FC−1 are present
in the DPB for motion compensation. The percentage of blocks used in the reconstruction
of the current frame from each of these frames is noted. We can consider this value as the
utility of the reference frame (it is denoted by β in [57]). The reference frame that has the
least utility (i.e. has least utilization for motion compensation) is discarded. FC is again
encoded using the newly computed set of reference frames to obtain the final output. The
cost of the 2× algorithm is high since two-pass coding is required to determine β. Hence,
for the 1× algorithm, the authors of [57] propose to estimate the utility of the frames using
statistics of the previously encoded picture, i.e. FC−1. Strong correlation between FC−1 and FC is assumed,
and the values of β (utilization ratios for the current frame FC) are approximated to be
equal to the utilization ratios obtained when coding the previous frame. As in the case of
the 2× algorithm, the reference frame in the DPB with the least utility is evicted from the DPB.
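The eviction rule shared by the 1× and 2× algorithms, as described above, can be sketched as follows. This is a minimal illustration and not the implementation from [57]: the function names and the representation of the per-block reference choices as a flat list are assumptions. The two algorithms differ only in where that list comes from, namely a trial encode of the current frame FC (2×) or the statistics of the previously coded frame FC−1 (1×).

```python
def utilization(block_refs, ref_frames):
    """Fraction of inter-coded blocks predicted from each reference frame.

    block_refs lists, for every inter-coded block, the id of the reference
    frame it used; the returned ratios play the role of beta in the text.
    """
    total = len(block_refs)
    return {f: (block_refs.count(f) / total if total else 0.0) for f in ref_frames}


def evict_least_used(block_refs, ref_frames):
    """Return the reference frame with the lowest utilization (ties: first listed)."""
    beta = utilization(block_refs, ref_frames)
    return min(ref_frames, key=lambda f: beta[f])


if __name__ == "__main__":
    refs = ["previous", "older"]
    # Hypothetical statistics: 9 of 10 blocks were predicted from the previous frame.
    block_refs = ["previous"] * 9 + ["older"]
    print("utilization:", utilization(block_refs, refs))
    print("evict:", evict_least_used(block_refs, refs))
```

With strongly correlated consecutive frames, the older frame is almost always the one evicted, so the reference set degenerates to consecutive previous frames.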
5.8 RD performance results
We now discuss the RD performance of the proposed technique and also determine the num-
ber of reference frames required for optimum performance. From Table 5.4, we find that
the proposed assignment procedure reduces bitrate by 13.1% - 24.7% (for the ‘Entrance’
and ‘Backyard1’ sequences) compared to the case when the 1× algorithm is used to select
the second reference frame. On the PETS-2 video, the bitrate reduction obtained by using
the proposed technique was measured as 3.6%. In comparison, bitrate reduction was 1.3%
when two previous frames were used as reference. The proposed assignment increases the
number of BG MB’s in set B and hence reduces CFG→BG. Since the number of background
MB’s in set B increases, the number of FG → BG MB’s which can be marked as ‘Skip’ also
increases. This also helps to reduce the computational complexity of the encoder. Using a
second reference frame marked by the 1× algorithm reduces bit rate by up to 2.3% (com-
pared to the single reference encoder). Analysis revealed that the 1× algorithm always
marked the previous frame FC−1 along with FC as the reference set for the next frame
FC+1. As also mentioned in [57], this is due to the strong temporal correlation between the
consecutive frames FC−1 and FC . Hence, the reference frame structure is identical to the
anchor in which consecutive previous frames are used as reference. Multiple consecutive
previous frames used as reference do not provide reduction in bit rate. This is due to the
fact that consecutive previous frames do not provide reduction in either CFG or CFG→BG.
The FG MB’s choose the previous frame over the other reference frames as a result of lower
R-D cost (due to smaller motion vectors and greater similarity in content). The cost to code
the FG → BG MB’s (CFG→BG) also does not reduce since consecutive previous frames
have foreground objects present in almost the same regions in the picture.
Using the 2× algorithm to select the second reference frame provided good bitrate re-
duction of up to 24.8% compared to the single reference frame encoder (in the Backyard1
sequence). However, the high computational complexity due to the second encode pass
(of the order of tens of milliseconds) prevents its applicability to low power embedded en-
coders. In comparison, the proposed method provides higher bitrate reduction of up to
25.9% and requires only 30 − 40µsec/frame to perform reference frame selection. Since
a larger number of uncovered BG MB’s can be skipped, the proposed method also achieves
computational cost reduction of up to 7.3% over the 1× algorithm.
We however note that the bit rate reduction obtained for the ‘Walkway’, ‘Access Door’
and ‘Backyard2’ sequences using the proposed scheme is not as significant as that obtained
for the ‘Entrance’ and ‘Backyard1’ sequences. We identify 3 reasons for this observation. A
detailed discussion of each of these is provided below:
1. GMM S-MD performance: We observe that in the case of the ‘Walkway’ dataset,
adding a second reference frame provides a very small improvement in performance (us-
ing either the previous frames as reference or using the proposed algorithm to mark the
reference frames). The reduced accuracy of skip detection due to complex shadows and
shaking vegetation was determined as the cause for this reduction in gain. A large number
of macroblocks which contained background objects were marked as foreground by ‘GMM
S-MD’ and hence were coded.
Table 5.4: Performance comparison of reference frame selection algorithms
             Baseline: 1 Ref. Frame  2 Ref. Frames           2 Ref. Frames                   3 Ref. Frames
             GMM S-MD                GMM S-MD + 1× [57]§     GMM S-MD + Proposed sel.        GMM S-MD + Proposed sel.
Sequence     Bitrate   FG PSNR       ∆Bitrate†   FG PSNR     ∆Bitrate†   FG PSNR   ∆Time¶    ∆Bitrate†   FG PSNR
             (kbps)    (dB)                      (dB)                    (dB)                            (dB)
Entrance     2421      47.08         0.9%        47.07       14%         47.03     4.8%      14.1%       47.02
             1339      44.83         0.6%        44.84       15.4%       44.78     5%        15.2%       44.8
             811       42.83         1.1%        42.81       15.9%       42.79     5.2%      15.7%       42.8
             500       40.26         0.8%        40.28       16.9%       40.21     5.3%      16.8%       40.21
Backyard1    2938      46.12         2%          46.11       21%         46.06     6.9%      21%         46.05
             1816      43.51         1.9%        43.49       22.5%       43.42     7%        22.3%       43.42
             1167      41.04         1.4%        41.04       22.8%       40.98     7%        23%         40.96
             739       38.13         1.5%        38.11       25.9%       38.03     7.3%      25.2%       38.06
Backyard2    468       43.44         2.3%        43.46       8%          43.45     4.4%      8.8%        43.44
             284       40.08         1.3%        40.07       6%          40.05     4%        6.5%        40.04
             182       37.05         1.1%        37.06       4.9%        37.03     3.7%      5.6%        37.02
             113       33.75         0.9%        33.78       6.5%        33.73     3.2%      6.6%        33.72
Access Door  1010      46.14         0.7%        46.14       2.3%        46.11     4.7%      3.6%        46.11
             523       43.75         0.9%        43.73       1.4%        43.72     4.4%      2.2%        43.73
             305       41.53         -0.1%       41.53       -0.1%       41.53     4.2%      0.4%        41.51
             183       38.78         0.2%        38.78       0.2%        38.76     3.9%      0.4%        38.77
Walkway      1628      44.93         1.1%        44.89       3.3%        44.88     3.6%      3.7%        44.87
             943       42.27         0.6%        42.22       2.5%        42.22     3.5%      2.5%        42.23
             567       39.72         0.6%        39.71       2%          39.71     3.2%      2.5%        39.71
             328       36.74         0.6%        36.72       2%          36.73     2.9%      1.7%        36.74
§ We obtain identical results when we use 2 Previous frames as Reference
† Bit rate reduction in % computed with respect to baseline: 1 Ref. Frame + GMM S-MD (measured using JM)
¶ Execution time reduction in % (measured using x264) computed with respect to 2 Ref. Frames + GMM S-MD + 1× [57]
2. Number of FG → BG MB’s: When the number of foreground objects moving across
the scene is high or when objects are close to the camera, a large number of FG → BG MB’s
are created. Since the proposed scheme reduces RD cost of FG → BG MB’s, significant
savings are observed for such videos. As a consequence, bit rate reduction due to the
proposed scheme for the ‘Entrance’ and ‘Backyard1’ datasets is higher compared to the
other sequences (e.g. ‘Backyard2’ in which the number of FG → BG MB’s is low).
3. Background texture of uncovered regions: Presence of complex texture in the background
results in increased bit rate if those regions are coded as FG → BG 〈U〉 MB’s. Since
the proposed method reduces the number of MB’s coded as FG → BG 〈U〉, greater bit
rate reduction is found in datasets with rich background texture (e.g. ‘Entrance’ and ‘Backyard1’).
As a consequence of the relatively low quantity of background texture in the ‘Access
Door’ video, adding a second reference frame for this sequence does not provide bitrate reduction
(compared to the encoder which uses previous frames as reference). However, the
proposed algorithm reduces execution time by up to 4.7% by avoiding mode decision for a
few FG → BG MB’s.
As noted earlier, the computational complexity of the reference frame selection al-
gorithm is dependent only on the maximum number of reference frames in the DPB. It
was measured to be 30 − 40µsec/frame when 2 reference frames were used and 60 −
70µsec/frame when the third frame was added. However, using 3 reference frames does
not provide a significant improvement on any of the 5 video sequences. This is in agreement
with the analysis performed earlier which showed that 2 reference frames are sufficient to
maximize the number of background MB’s in the background set B. We also performed an
experiment in which the QP of the BG MB’s was assigned a large value of 40. However, this
did not provide any further bitrate reduction and instead caused severe blurring/washout
of the background picture.
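As a reading aid for Table 5.4, the ∆Bitrate entries can be converted back to absolute bitrates using the footnote definition (reduction in % with respect to the single reference frame baseline). The symbols R and R_base are introduced here only for this illustration; the numbers are taken from the ‘Entrance’ (first row) and ‘Backyard1’ (last row) entries for 2 reference frames with the proposed selection.

```latex
\begin{align*}
R &= R_{\text{base}} \left(1 - \frac{\Delta\text{Bitrate}}{100}\right) \\
R_{\text{Entrance}}  &= 2421 \times (1 - 0.140) \approx 2082~\text{kbps} \\
R_{\text{Backyard1}} &= 739 \times (1 - 0.259) \approx 548~\text{kbps}
\end{align*}
```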
5.9 Summary
A surveillance specific distortion metric was computed to quantify the performance of the
proposed skip decision and reference frame selection techniques. The proposed algorithms
have been compared with relevant methods in literature. Experimental data shows that
the proposed skip selection technique reduces bit rate by up to 94.5% and computational
complexity by up to 74.5% without affecting the foreground image quality. The skip detec-
tion algorithm requires 1-3.6ms on a single core and hence can be easily implemented on
embedded camera platforms. Data also shows that coding cost of uncovered background
MB’s in static camera surveillance videos is not insignificant and depends upon the selection
of reference frames. We have implemented different reference frame selection strategies in
Matlab. Results showed that the number of BG MB’s in the DPB when the proposed tech-
nique is adopted is close to the upper bound. Results show that the proposed reference
frame selection method reduces bit rate by up to 24.7% and execution time by up to 7.3%.
Chapter 6
ROI video coding for Pedestrian Surveillance
6.1 Introduction
In Chapter 4, we proposed to use foreground segmentation to perform skip detection of
background MB’s. All the foreground macroblocks were encoded with uniform quality.
However, in pedestrian surveillance, the facial features are most useful to perform recognition
and identification tasks. High image detail of non face regions is not as important as
the features of the face regions. Applying equal quality parameter settings to all MB’s results
in sub optimal bitrate allocation. To illustrate this point, we have encoded a test video and
measured the number of bits allocated to the different regions. In Fig. 6.1, this data is
shown for different regions of a single frame in the video. The number of bits allocated to
the shadow and non face regions (torso, arms and legs) is almost 17× the number of bits
utilized to encode the face region. Further analysis of the encoded bit stream shows that
the high bitrate of non face regions is primarily due to four reasons:
• High bit cost of ‘FG border’ MB’s: Due to the block based coding architecture of
the H.264 standard, the encoder cannot effectively combine background and inter
predicted foreground image content of FG border MB’s. For example, in Fig. 6.1,
MB2, which is an FG border MB, requires 187 bits. In comparison, MB3 (which is not on
the border) requires only 10 bits.
• Deformations of textured clothing: Deformations of clothing reduce similarity between
adjacent frames. Hence, such MB’s cannot effectively utilize inter prediction
(e.g. MB1 in Fig. 6.1).
• Shadows on highly textured background regions: High frequency content of the
image (due to the texture) increases the energy of the residual.
• Strong shadows on background regions: Efficient inter prediction is not possible
due to significant change in the image. Hence, such MB’s (especially those on the
border of the shadow region) cause increased bitrate.
Figure 6.1: Number of bits required to encode MB’s of a surveillance frame at uniform
quality. Annotations in the figure: Shadow MB’s ~ 9000 bits; Face MB’s ~ 1700 bits;
Non Face MB’s ~ 20000 bits; MB1 ~ 217 bits; MB2 ~ 187 bits; MB3 ~ 10 bits
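The ‘almost 17×’ figure quoted above follows directly from the per-region bit counts annotated in Fig. 6.1. The short calculation below also gives the share of the bits spent on the face region, assuming that these three regions account for all the foreground and shadow bits of the frame:

```latex
\begin{align*}
\frac{\text{shadow} + \text{non face}}{\text{face}} &= \frac{9000 + 20000}{1700} \approx 17.1 \\
\frac{\text{face}}{\text{shadow} + \text{non face} + \text{face}} &= \frac{1700}{30700} \approx 5.5\%
\end{align*}
```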
By this analysis, it is clear that we can significantly reduce the bitrate of the encoded
video by differentially assigning QP values to MB’s covering the face and non face regions
of the pedestrians. MB’s covering the face regions are encoded with a low QP (i.e. high
quality). Higher QP is assigned to non face FG MB’s. Shadows on background surfaces can
be marked as skip to further reduce the bitrate. In Chapter 2, we have already reviewed
techniques based on this idea that have been published in the literature [26, 60, 61, 62,
63, 64, 66]. Most of the previously published methods [61, 62, 66] are targeted at video
telephony applications. For example, in [66], Ming-Chieh Chi et al. use skin color based face
detection to mark ROI regions in video teleconferencing videos. However, such techniques
do not work on real world surveillance videos. To gain a better understanding of these
challenges, we use the OpenCV adaptive skin detector to determine the regions of interest
in a surveillance video. The OpenCV algorithm is based on the technique proposed by
Farhad et al. in [145, 146]. Fig. 6.2 shows skin detections (pixels marked as yellow)
obtained on a sample frame in the video. We can clearly see that using only skin detection
to perform ROI marking would not be accurate. Also, variation of skin tone under different
lighting conditions reduces the accuracy of such techniques.
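The differential QP assignment outlined at the start of this discussion (low QP for face MB’s, higher QP for non face FG MB’s, skip for shadow and background MB’s) can be sketched as a simple per-MB mapping. The label strings, the function name and the two QP values below are hypothetical choices for illustration; only the H.264 QP range of 0 to 51 is taken as given.

```python
FACE, NON_FACE, SHADOW, BACKGROUND = "face", "non_face", "shadow", "background"


def assign_mb_coding(labels, qp_face=24, qp_non_face=34):
    """Map per-MB region labels to (mode, qp) decisions.

    qp_face and qp_non_face are illustrative values, not the thesis settings.
    Shadow and background MB's are marked as skip and carry no QP.
    """
    for qp in (qp_face, qp_non_face):
        if not 0 <= qp <= 51:
            raise ValueError("H.264 QP must lie in the range 0..51")
    if qp_face > qp_non_face:
        raise ValueError("face MB's should not be coded at lower quality than non face MB's")
    decisions = []
    for label in labels:
        if label == FACE:
            decisions.append(("code", qp_face))
        elif label == NON_FACE:
            decisions.append(("code", qp_non_face))
        elif label in (SHADOW, BACKGROUND):
            decisions.append(("skip", None))
        else:
            raise ValueError(f"unknown MB label: {label!r}")
    return decisions


if __name__ == "__main__":
    row = [BACKGROUND, SHADOW, NON_FACE, FACE, NON_FACE, BACKGROUND]
    for label, decision in zip(row, assign_mb_coding(row)):
        print(label, "->", decision)
```

The accuracy of such a mapping depends entirely on how reliably the MB labels are produced, which is the problem the rest of this chapter addresses.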
Figure 6.2: Skin pixel detection in a surveillance video frame
We now discuss two object detector based ROI coding techniques proposed in the literature
(the first proposed by Christopher et al. [63, 64] and the second introduced
by Lai-Tee Cheok et al. [26]) and compare them with the proposed method. Christopher et al.
[63, 64] use the Viola Jones detector [65] to detect faces in each frame. An iterative mean
shift based object tracker is initialized for each new detection. Detections which match state
objects are used to update the object representations. Face ROI MB’s are encoded at lower
QP using the H.264 encoder. However, this work has been applied to video conference
applications. In surveillance videos, low resolution and poor lighting conditions prevent
the adoption of face-feature based detectors. To study the performance of ROI coding using
face detection for surveillance, we use the Viola Jones face detector in OpenCV. Fig. 6.3
shows that face detection based ROI marking is also not accurate on real world surveillance
videos. Also, running the detector and updating the object representations on each frame
is computationally very expensive. Instead, as we show in this chapter, we only need to run
a detector once in a few frames (we set the interval to 1 second). Also, we do not update
the tracker model since we do not need to maintain identities across severely occluded
sequences.
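The saving from running the detector only once per interval, rather than on every frame, can be sketched as a simple schedule. The frame rate of 10 fps is an assumption carried over from the experiments in Chapter 5, and the function names are hypothetical; frames that are not listed would rely on the tracker alone.

```python
def detection_frames(n_frames, fps, interval_s=1.0):
    """Indices of the frames on which the detector is run.

    The detector runs once every `interval_s` seconds; all other frames
    are assumed to be handled by the tracker.
    """
    step = max(1, round(fps * interval_s))
    return list(range(0, n_frames, step))


def detector_load(fps, interval_s=1.0):
    """Fraction of frames on which the detector runs."""
    return 1.0 / max(1, round(fps * interval_s))


if __name__ == "__main__":
    print("detector runs on frames:", detection_frames(30, fps=10))
    print("fraction of frames with detection:", detector_load(fps=10))
```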
Figure 6.3: Face region detection using the Viola Jones detector
In [26], Lai-Tee Cheok et al. use the output of the video analytics module to modulate
the bit allocation to different image regions. Segmentation is used to determine foreground
blobs. A multi-class classifier is used to label the blobs as either pedestrian, vehicle or
animal. A tracker is used to track the blob labels. Blobs containing pedestrian images are
encoded at higher quality. However, as we have seen in Fig. 6.1, larger savings can be
obtained by using different QP’s for MB’s within a blob (i.e. larger QP for non face regions
and skip mode for shadow regions).
Low resolution, occlusion & poor lighting conditions pose significant challenges to ac-
curate ROI detection. Also, in surveillance applications, intruders typically attempt to avoid
appearing in the camera field of view. The number of frames in which the faces are visi-
ble would be lower in such cases. Hence, the ‘miss’ probability of the ROI detector should
be low. Clearly, using simple object detectors does not give good performance in unconstrained
environments. Joint reasoning based on multiple cues is required. A large body
of work on detecting and segmenting skin regions, face regions and pedestrians directly exists
in the computer vision literature. However, this work has not been studied in the
context of ROI video coding. ROI video coding for block based encoders like H.264
does not require pixel level segmentation. In this chapter, we use mid level super pixel
segmentation representations to efficiently and accurately determine the ROI, RORI and
RONI regions. We propose to combine pedestrian detection with skin and shadow detec-
tion to accurately mark ROI’s. We also integrate a tracker to reduce the ‘miss’ probability.
The tracker also serves to reduce the computational cost since ROI marking of successfully
tracked objects (objects with high association scores) does not require computation of de-
tector scores. Bilattice based logical reasoning is used to effectively combine all the detector
scores to accurately determine the ROI, RORI and RONI regions.
The remainder of the chapter is organized as follows: The architecture of the proposed
Region of Interest video encoder for pedestrian surveillance is described in Section 6.2. In
Section 6.3, we describe low and mid level segmentation. Computations of shadow and skin
scores on super pixels are described in Sections 6.4 & 6.5 respectively. Section 6.6 describes
the DPM pedestrian detector based score computation. Scene geometry based pruning of hypotheses is described in Section
6.7. Section 6.8 describes the proposed ‘detection by tracking’ technique. The technique
proposed to infer the face, non face and RONI regions is described in Section 6.9. Section
6.10 describes the ROI, RORI & RONI marking and QP signalling. Experimental results of
the proposed technique are provided in Section 6.11.
6.2 Proposed architecture
Fig. 6.4 shows the high level block diagram of the proposed technique. As already men-
tioned in the previous section, various visual cues are combined to perform inference. We
partition the system into low level and high level inference components.
6.2.1 Low level inferencing
The incoming image is first segmented using the sampler based technique proposed in Chapter 4.
Image regions tagged as RONI by users are not processed. Blob detection and super
pixel marking are performed on the foreground pixels. Shadow scores of super pixels,
i.e. the probabilities of super pixels covering shadow image regions, are computed. Independent
shadow probability scores are computed using physics based and texture based features.
The physics based features include the illumination attenuation and the angular orientations
between the pixel and the background cluster center in RGB color space (more details
are provided in Sec. 6.4.2). The skin probability map is used to determine skin probability
scores of all super pixels.
6.2.2 High level inferencing
The Deformable Part Model or DPM is used to determine head-shoulder, torso and leg part-
scores of foreground image regions. Inconsistent pedestrian hypotheses based on geometry
are pruned. A tracker is initialized for successfully detected pedestrians. Isolated pedes-
trians are localized using a simple blob geometry based tracker. For interacting groups of
pedestrians, an optical flow based tracker is combined with a Kalman filter to determine
association. The tracker reduces the miss probability and the computational complexity of
the ROI detector. The ROI, RORI & RONI assignment task is formulated as a super pixel
labelling problem. Bilattice logic reasoning is used to determine the set of ROI, RORI &
RONI super pixels. This labelling of super pixels is used to assign QP values to macroblocks.
Figure 6.4: Architecture of the proposed ROI, RORI and RONI detector
6.3 Segmentation
Since we consider a static surveillance camera, we reduce the search space for ROI detection
by detecting FG blobs. Also, the shape of the FG blob is used by the shape based pedestrian
detector which we describe later in Sec. 6.6.2. The final ROI, RORI & RONI inference is
performed on super pixels. Hence, we compute super pixels over the entire foreground
image regions. We discuss each of these operations in this section.
• Sampler based foreground segmentation: We use the sampler architecture that we
proposed in Chapter 4. We set Ddense as 2 to obtain finer FG masks. The pedestrian
detector and tracker outputs are used to prevent absorption of stationary foreground
object pixels into the background model. GMM parameter update of pixels on these
pedestrian detections is not performed.
• Blob detection: Connected components are used to extract foreground blobs in the
input image. The blob area is thresholded to remove very small blobs. The contours
of the blobs are extracted and are used for HOG feature edge enhancement (which is
explained later in Section 6.6).
• Super pixel marking: The term ‘super pixels’ (introduced by Xiaofeng
Ren and Jitendra Malik in [147]) refers to a collection of perceptually meaningful
image regions. Super pixel representations are increasingly being used in very
successful vision algorithms, e.g. in [148]. Recently, Meuel et al. [149] have also
used them in a ROI coding system for aerial vehicles. Super pixels provide compact
image region representations which we use to perform higher levels of inferencing.
We determine the super pixels on only the foreground image regions. We adopt the
efficient Simple Linear Iterative Clustering or SLIC algorithm developed by Achanta
et al. in [150]. Computational complexity of the SLIC algorithm depends upon the
search distance and the number of k-means iterations. As we show later, ROI & RORI
marking is performed on macroblocks and hence does not require pixel level accurate
segmentation. Hence, we reduce the number of iterations of the k-means clustering
procedure to minimize the computational complexity. Fig. 6.5 shows super pixels
computed on a sample image.
Figure 6.5: Super pixels detected in a surveillance video frame
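The blob detection step described above (connected components followed by an area threshold) can be sketched as follows. This is a minimal illustration and not the thesis implementation: the 4-connectivity, the function name `find_blobs` and the `min_area` parameter are assumptions made for the example.

```python
from collections import deque

def find_blobs(mask, min_area):
    """Return the 4-connected foreground blobs of a binary mask.

    Each blob is a list of (row, col) pixels. Blobs with fewer than
    min_area pixels are discarded (the area threshold).
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited FG pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_area:
                blobs.append(pixels)
    return blobs

mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
blobs = find_blobs(mask, min_area=3)  # the isolated pixel at (1, 4) is dropped
```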
6.4 Shadow detection
As we have already described, we mark face detections (computed using the DPM detec-
tor) as ROI. Non face regions of pedestrians are marked as RORI. Super pixels around the
pedestrian bounding box (obtained using the DPM detector) need to be classified as either
shadow / non shadow. MB’s that intersect with only shadow super pixels are marked as
skip. We now describe the procedure to determine these shadow super pixels.
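The rule that MB's intersecting only shadow super pixels are marked as skip can be sketched as below. The sketch assumes a per-pixel map of super pixel indices with -1 for background pixels and uses the 16 × 16 macroblock size of H.264. The function and variable names are hypothetical, and the final labels in the proposed system come from the higher level reasoning of Section 6.9 rather than from this rule alone.

```python
def skip_macroblocks(label_map, shadow_ids, mb_size=16):
    """Return the (mb_row, mb_col) indices of macroblocks that intersect
    at least one super pixel and only shadow super pixels.

    label_map[y][x] holds the super pixel index of a pixel, or -1 for
    background. shadow_ids is the set of super pixels labelled as shadow.
    """
    rows, cols = len(label_map), len(label_map[0])
    skip = set()
    for top in range(0, rows, mb_size):
        for left in range(0, cols, mb_size):
            ids = {label_map[y][x]
                   for y in range(top, min(top + mb_size, rows))
                   for x in range(left, min(left + mb_size, cols))
                   if label_map[y][x] >= 0}
            if ids and ids <= shadow_ids:
                skip.add((top // mb_size, left // mb_size))
    return skip

# 16 x 32 frame: the left MB contains only super pixel 0 (shadow),
# the right MB contains super pixels 0 and 1 (1 is not shadow).
label_map = [[0] * 16 + [0] * 8 + [1] * 8 for _ in range(16)]
result = skip_macroblocks(label_map, shadow_ids={0})
```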
Cast shadow detection in static camera videos has been extensively studied by researchers
[151, 152, 153, 154, 155, 156]. In [157], Sanin et al. provide a recent review of shadow
removal techniques. The techniques are classified as follows:
• Chromaticity based: assumes color constancy, i.e. shadows cause reduction in lumi-
nance but the chromaticity value is mostly invariant.
• Physical methods: use physical models of lighting (e.g. dichromatic reflection model)
to improve accuracy. Ambient light and multiple light sources are considered.
• Geometry based methods: use estimated geometric relations between the cast shad-
ows and the objects.
• Texture based methods: perform texture correlation between input image regions
and the background picture. Texture features are highly discriminative and are not
affected by shadows. However, this technique does not work on regions without
strong texture points.
In this thesis, we modify implementations of the physics based and texture based ap-
proaches (provided online by [157]) to generate shadow scores for each super pixel. A
weak shadow detector is used initially to generate shadow candidates. The physics and
texture based detectors are run on only these shadow candidate pixels.
6.4.1 Weak shadow detector
When light from a source incident on a surface is obstructed, the luminance of the pixel
decreases. However, the color (or chromaticity) of the shadow pixels remains close
to the original values. A simple filter based on these observations can be used to reject
pixels which do not satisfy these conditions, i.e. pixels whose luminance increases
or whose chromaticity change is large.
Fig. 6.6 shows the RGB space and a cone constructed with base centered at BG pixel
and apex at the origin. Pixels whose values lie inside the shaded region are considered as
shadow candidates. Here d1 & d2 are set equal to λ1dBG & λ2dBG respectively. Constants
θmax, λ1 and λ2 are threshold parameters of the candidate shadow detector.
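A minimal sketch of such a candidate test is given below. It assumes that the shaded region is the part of the cone whose angle to the background vector is at most θmax and whose length along the background direction lies between d1 and d2. The text does not state which of d1 and d2 is larger, so the sketch sorts them. The function name and argument names are hypothetical.

```python
import math

def is_shadow_candidate(pixel, bg, theta_max, lam1, lam2):
    """Weak shadow test for one RGB pixel against its background value.

    theta_max is in radians. lam1 and lam2 scale the background vector
    length d_BG to give the distances d1 and d2 along the cone axis.
    """
    d_bg = math.sqrt(sum(b * b for b in bg))
    d_px = math.sqrt(sum(p * p for p in pixel))
    if d_bg == 0 or d_px == 0:
        return False
    cos_angle = sum(p * b for p, b in zip(pixel, bg)) / (d_px * d_bg)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    angle = math.acos(cos_angle)                # chromaticity change
    proj = d_px * cos_angle                     # length along the BG direction
    lo, hi = sorted((lam1 * d_bg, lam2 * d_bg))
    return angle <= theta_max and lo <= proj <= hi

bg = (200, 180, 160)
darker_same_color = is_shadow_candidate((100, 90, 80), bg, 0.1, 0.3, 0.9)
brighter = is_shadow_candidate((210, 189, 168), bg, 0.1, 0.3, 0.9)
color_shift = is_shadow_candidate((100, 20, 160), bg, 0.1, 0.3, 0.9)
```

With these illustrative thresholds, only the darker pixel with unchanged chromaticity is kept as a shadow candidate.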
6.4.2 Physics based shadow detection over super pixels
To visualize the change in pixel values, we plot them in Fig. 6.7. The pixel values were
obtained from a video sequence in which objects occlude a light source. Early shadow
detection methods [156] have assumed that these shadow points lie on the line between
the origin and the illuminated pixel value. However, later approaches have incorporated
physics based principles [153, 154, 158] capable of detecting shadows in scenes illuminated
by multiple light sources. Features are extracted and parametric / non parametric statistical
methods are used to classify pixels.
[Figure 6.6 labels: R, G and B axes; background pixel value; shadow pixel value; θmax; dBG; d1; d2]
Figure 6.6: The shaded volume shown in the RGB color space is considered as shadow pixel
values by the weak shadow detector
We use the physics based implementation by Sanin et al. [157] which is based on the
work by Huang et al. [158]. Following the notation used in [157], we denote the vector
joining the shadow pixel and the center of the background cluster as vn. We write the
feature vector fn of the nth pixel in the image as:
$$ f_n = [\alpha_n, \theta_n, \Phi_n] \tag{6.1} $$
Here αn denotes the illumination attenuation. θn & Φn represent the angular orienta-
tions of the vector vn in 3D space (in spherical coordinates).
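The feature vector of Eqn. 6.1 can be illustrated with the sketch below. The exact feature definitions are not spelled out in this section, so the formulas used here are assumptions: the attenuation is taken as the ratio of the length of vn to the length of the background vector, θn as the azimuth of vn in the R-G plane and Φn as the angle between vn and the B axis. The function name is hypothetical.

```python
import math

def physics_features(pixel, bg):
    """Return (alpha, theta, phi) for one RGB pixel and its background value.

    Assumed definitions (not taken from the thesis text):
      v     = bg - pixel
      alpha = |v| / |bg|
      theta = atan2(v_G, v_R)
      phi   = acos(v_B / |v|)
    """
    v = [b - p for p, b in zip(pixel, bg)]
    norm_v = math.sqrt(sum(c * c for c in v))
    norm_bg = math.sqrt(sum(c * c for c in bg))
    alpha = norm_v / norm_bg if norm_bg > 0 else 0.0
    theta = math.atan2(v[1], v[0])
    phi = math.acos(v[2] / norm_v) if norm_v > 0 else 0.0
    return alpha, theta, phi

alpha, theta, phi = physics_features((100, 90, 80), (200, 180, 160))
```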
A Gaussian mixture model is learnt for the feature vector of each pixel using an online,
Winner-Takes-All version of the EM algorithm. The probability that the pixel value was
generated by the shadow cluster c is computed using the GMM and is represented by $p^{phy}_{n,c}$.
The probability of the pixel belonging to the shadow region is given by
$$ p^{phy}_n = \max_{c = 1, \dots, C} p^{phy}_{n,c} \tag{6.2} $$
Here C is the number of shadow clusters.
[Figure 6.7 plot: background pixels and shadow pixels plotted in RGB space (R, G and B axes)]
Figure 6.7: Pixel values of a surface plotted from a video sequence. Intermittent foreground
object motion causes shadows on the surface.
RONI marking of shadow regions does not require pixel level segmentation. We only
need to perform MB level classification. Shadow detection scores at the pixel level can be
aggregated to determine the probability that a macroblock contains only shadow pixels.
However, instead of assigning labels directly to MB’s, we aggregate shadow scores over
super pixels. Higher level reasoning of ROI, RORI & RONI regions is performed using these
scores computed over super pixels (this is described in Sections 6.9 & 6.10). We find that
the SLIC algorithm bins pixels of the shadow & feet boundary regions into separate super
pixels with good accuracy. Due to this, the number of true shadow pixels inside a shadow
super pixel is high. Hence, the aggregated shadow score of the super pixel provides good
discriminability for higher level inference.
The shadow score pphy for a super pixel is computed in Eqn. 6.3 as the average of the
scores of all its pixels. Here N is the number of pixels in the super pixel. Fig. 6.8b shows
the physics based super pixel scores obtained for a sample image.
$$ p^{phy} = \frac{1}{N} \sum_{i=1}^{N} p^{phy}_i \tag{6.3} $$
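The two aggregation steps (the maximum over shadow clusters for each pixel, followed by the average over the pixels of a super pixel) can be sketched as follows. The per-cluster probabilities are assumed to be already available from the GMM, and the function names are hypothetical.

```python
def pixel_shadow_prob(cluster_probs):
    """Shadow probability of one pixel: maximum over the C shadow clusters."""
    return max(cluster_probs)

def superpixel_shadow_score(per_pixel_cluster_probs):
    """Physics based shadow score of a super pixel: the average of the
    per-pixel shadow probabilities over its N pixels."""
    scores = [pixel_shadow_prob(p) for p in per_pixel_cluster_probs]
    return sum(scores) / len(scores)

# Three pixels, two shadow clusters each (illustrative values).
probs = [[0.2, 0.8], [0.5, 0.1], [0.9, 0.3]]
score = superpixel_shadow_score(probs)  # (0.8 + 0.5 + 0.9) / 3
```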
6.4.3 Texture based shadow detection
Texture features of surfaces are mostly invariant to shadows. Hence, they can serve as very
useful cues to perform shadow detection. Leone et al. [159] use Gabor features computed
on small image patches. Qin et al. [160] use the scale invariant local ternary pattern as
a texture descriptor. They also use a Markov Random Field to incorporate positive spatial
correlation. Sanin et al. [161, 157] showed that classifying large areas (as shadow / not
shadow) using gradient matching improves performance when rich texture is not present
in the entire background image. A weak shadow detector is used to propose these large
regions. In this thesis, we use the gradient matching based technique. However, we choose
the super pixels as the regions over which we aggregate gradient matching scores. The
fraction of the pixels with similar gradients is considered as the shadow score for the super
pixels.
Let the angle between the gradient vectors of the ith pixel (i is the index of the pixel
in the super pixel data structure) in the current image and the background frame be represented
by $\theta^{grad}_i$. Let N denote the number of pixels in the super pixel. The probability that
the super pixel lies only in the shadow region is denoted by $p^{tex}$ and is computed as follows
$$ p^{tex} = \frac{1}{N_1} \sum_{i=1}^{N} I_i m_i \tag{6.4} $$
Here, $m_i$ is set to 1 if $\theta^{grad}_i$ is less than a threshold and to 0 otherwise. $I_i$ is set to 1 if the gradient magnitude
of the pixel in the frame is greater than a threshold and to 0 otherwise. $N_1$ is the number of pixels in the super
pixel whose gradient magnitude is greater than the threshold, i.e.
$$ N_1 = \sum_{i=1}^{N} I_i \tag{6.5} $$
This gradient based technique works well when the background surface is textured.
When the image does not have texture (i.e. N1 is small), we ignore the texture based scores.
Figure 6.8c shows the texture shadow scores of SP’s. Texture and physics based shadow
features are complementary. Hence, integrating them improves the overall accuracy.
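The texture based score can be sketched as below. Gradients are assumed to be given as (gx, gy) pairs for each pixel of the super pixel in the current frame and in the background frame. The threshold values and the `min_count` parameter (standing in for "N1 is small") are hypothetical.

```python
import math

def texture_shadow_score(frame_grads, bg_grads, mag_thresh, angle_thresh, min_count):
    """Fraction of strongly textured pixels whose gradient direction matches
    the background. Returns None when too few pixels have texture, in which
    case the texture based score is ignored."""
    n1 = 0       # number of pixels with I_i = 1
    matches = 0  # number of pixels with I_i = 1 and m_i = 1
    for (fx, fy), (bx, by) in zip(frame_grads, bg_grads):
        mag_f = math.hypot(fx, fy)
        if mag_f <= mag_thresh:
            continue                      # I_i = 0: weak gradient, not counted
        n1 += 1
        mag_b = math.hypot(bx, by)
        if mag_b == 0:
            continue                      # angle undefined: treated as m_i = 0
        cos_a = (fx * bx + fy * by) / (mag_f * mag_b)
        angle = math.acos(max(-1.0, min(1.0, cos_a)))
        if angle < angle_thresh:
            matches += 1                  # m_i = 1
    if n1 < min_count:
        return None
    return matches / n1

frame = [(10, 0), (0, 10), (10, 10), (0.1, 0)]
background = [(8, 0), (9, 0), (5, 5), (3, 3)]
p_tex = texture_shadow_score(frame, background, 1.0, 0.2, 2)  # 2 of 3 match
```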
(a) Surveillance video frame
(b) Shadow scores of super pixels obtained using the physics based detector
(c) Shadow scores of super pixels obtained using the texture based detector
Figure 6.8: Shadow scores of super pixels plotted for a surveillance video frame
6.5 Skin detection
The unique appearance of human skin is a very useful cue to improve face detection performance.
The blending of the colors of the blood and melanin content decides the skin
tone. As a result, the range of hues is restricted. The appearances of skin can be broadly
categorized based on ethnicity as Asian, African and Caucasian. In [162], Elgammal et al.
provide density plots of skin pixels for these categories and show that they form clusters in
various color spaces. However, color of skin is similar to that of a few naturally occurring
surfaces (e.g. sand). Also, the accuracy of appearance based skin detection depends upon
lighting conditions. Hence, skin based detectors need to be integrated with other inference
algorithms to improve reliability. In Section 6.9 later in this chapter, we will describe the
proposed technique to integrate skin probabilities with other scores. In this section, we will
describe computation of skin probabilities for foreground super pixels in the input image.
Skin detection has been studied extensively by researchers and successfully applied to
perform face detection, objectionable image filtering and gesture recognition tasks. In
[163], Jones et al. proposed a histogram based approach to compute the probability of
skin pixels. They generated a large skin dataset which was used to obtain the histogram.
Greenspan et al. [164] proposed a parametric GMM technique. In [165], Phung et al.,
after a detailed analysis, showed that the accuracy of the Bayesian classifier based on the
histogram technique is higher than parametric methods using Gaussian models. They also
showed that the accuracy does not vary with different choices of the color spaces.
We choose the histogram based Bayesian classifier to generate pixel level skin probabil-
ities since it is fast and accurate. Also, as we will see later in Section 6.9, the probability
scores that are generated by the Bayesian approach facilitate easier integration with other
detector scores.
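Let the probability of a pixel with color ‘c’ capturing a skin image be represented as
p(skin/c). It is computed using Bayes' rule as
$$ p(\mathrm{skin}/c) = \frac{p(c/\mathrm{skin})\, p(\mathrm{skin})}{p(c/\mathrm{skin})\, p(\mathrm{skin}) + p(c/\mathrm{nonSkin})\, p(\mathrm{nonSkin})} \tag{6.6} $$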
Here p (c/skin), p (c/nonSkin), p (skin) and p (nonSkin) are obtained from the histogram
table as follows:
$$ p(c/\mathrm{skin}) = \frac{H_S(c)}{T_S} \tag{6.7} $$
$$ p(c/\mathrm{nonSkin}) = \frac{H_N(c)}{T_N} \tag{6.8} $$
$$ p(\mathrm{skin}) = \frac{T_S}{T_S + T_N} \tag{6.9} $$
$$ p(\mathrm{nonSkin}) = \frac{T_N}{T_S + T_N} \tag{6.10} $$
Here HS (c) is the skin histogram bin count for color c and HN (c) is the non skin histogram
bin count for color c. TS & TN represent the sums of all entries of the ‘skin’ & ‘non
skin’ histograms respectively.
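As a worked step, substituting Eqns. 6.7 to 6.10 into Eqn. 6.6 shows that the totals cancel, so the posterior for a color bin depends only on the two bin counts (it is defined whenever the bin is non-empty in at least one histogram):

```latex
\begin{aligned}
p(\mathrm{skin}/c)
 &= \frac{\dfrac{H_S(c)}{T_S}\cdot\dfrac{T_S}{T_S+T_N}}
         {\dfrac{H_S(c)}{T_S}\cdot\dfrac{T_S}{T_S+T_N}
          + \dfrac{H_N(c)}{T_N}\cdot\dfrac{T_N}{T_S+T_N}} \\[4pt]
 &= \frac{H_S(c)}{H_S(c) + H_N(c)}
\end{aligned}
```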
We use the skin probability map provided in the LTI-Lib library [166]. This map was
generated using the Compaq Cambridge research lab image-database [163] with bin size set
to 32. The probability values for each bin are precomputed and stored in memory. Hence,
computation of the pixel skin probability only involves indexing into the 32 ∗ 32 ∗ 32 array
of floating point numbers. Fig. 6.9 shows the super pixel skin scores obtained on the test
image in Fig. 6.8a.
Figure 6.9: Skin scores of super pixels in a surveillance video frame
The super pixel skin score $p^{skin}$ is defined as the average skin probability of all the pixels it contains.
$$ p^{skin} = \frac{1}{N} \sum_{i=1}^{N} p(\mathrm{skin}/c_i) \tag{6.11} $$
Here ci represents the color of the ith pixel in the super pixel and N represents the total
number of pixels in the super pixel.
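The histogram based skin score can be sketched as follows. The sketch uses small illustrative histograms rather than the LTI-Lib probability map, and it assumes that the 32 ∗ 32 ∗ 32 table corresponds to 32 bins per 8-bit color channel (8 intensity values per bin). All names and counts are hypothetical.

```python
def skin_posterior(hs, hn, ts, tn):
    """p(skin/c) from the bin counts hs, hn and the histogram totals ts, tn,
    following Eqns. 6.6 to 6.10."""
    p_c_skin, p_c_non = hs / ts, hn / tn
    p_skin, p_non = ts / (ts + tn), tn / (ts + tn)
    den = p_c_skin * p_skin + p_c_non * p_non
    return (p_c_skin * p_skin) / den if den > 0 else 0.0

def bin_index(rgb):
    """Map an 8-bit RGB color to its bin (assumes 32 bins per channel)."""
    r, g, b = rgb
    return (r >> 3, g >> 3, b >> 3)

def superpixel_skin_score(pixels, table):
    """Average of the precomputed per-pixel skin probabilities (Eqn. 6.11)."""
    return sum(table.get(bin_index(p), 0.0) for p in pixels) / len(pixels)

# Illustrative histograms keyed by bin index.
skin_hist = {(28, 20, 16): 30}
nonskin_hist = {(28, 20, 16): 10, (2, 2, 2): 50}
ts, tn = sum(skin_hist.values()), sum(nonskin_hist.values())

# Precompute the probability table once, as done for the skin probability map.
table = {key: skin_posterior(skin_hist.get(key, 0), nonskin_hist.get(key, 0), ts, tn)
         for key in set(skin_hist) | set(nonskin_hist)}

score = superpixel_skin_score([(230, 165, 130), (20, 20, 20)], table)
```

Since the table is precomputed, scoring a pixel at run time reduces to one lookup.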
6.6 Pedestrian detection
Detection of pedestrians based on shape is more robust compared to techniques that only
use facial features. In this section, we will describe the proposed scheme to obtain pedes-
trian part scores. The high level inference engine places requests for these part scores on
foreground regions which do not have tracker associations. The details of the integration
of the pedestrian detector with the tracker are discussed later in Section 6.9.
Motivated in part by the large number of important applications for pedestrian de-
tection, many methods have been proposed in literature. Techniques such as the use
of improved features [167, 168, 144, 169, 170], efficient classifier learning algorithms
[171, 172, 169, 170, 173], motion cues [174, 175] and deep learning [176, 177, 178, 179]
have steadily improved detection accuracy over the past decade. A detailed survey of the
state of the art pedestrian detection methods can be found in [180]. In this work, we
adopt a modified version of the DPM technique. DPM uses a part based model and hence
allows higher level inference techniques to perform occlusion reasoning. Also, it models
deformations, which improves the accuracy of pedestrian detection. DPM was one of the
best performing techniques before the emergence of deep learning based detectors. Recent
techniques of integrating deformation models into deep architectures (e.g. Deep ID net by
Ouyang et al. [178]) have considerably improved detection accuracy. However, the compu-
tational complexity of deep architectures is considerably higher than that of DPM. Reducing
complexity of these techniques is a very active area of research. Deep learning based de-
tectors can be integrated into the proposed ROI encoder when they become feasible (on
embedded platforms) in the future.
6.6.1 DPM based pedestrian detection: A brief review
The DPM technique models part positions as latent variables. The unknown part locations
are learnt using a supervised latent-SVM framework. Please refer to [144] for details about
latent-SVM based model training. Here, we will only briefly sketch details of using a trained
model to detect pedestrians. We use the notation of [144] with a few modifications.
DPM uses the sliding window technique where the base detector operates on an image
window which is scanned across the entire frame. Detection at multiple scales is performed
by rescaling the input image. Detection of a pedestrian in a window involves two stages:
(I) Feature computation and (II) Score computation.
(I) HOG feature generation:
The HOG features used in [144] are conceptually similar to the original method proposed
in [168]. The image is partitioned into square cells of edge length equal to 8 pixels. A set of 2 × 2
adjacent cells is called a block. Fig. 6.10 shows this division of the image into cells and
blocks.
[Figure 6.10 labels: image; cell (8x8 pixels); block (2x2 cells); contrast sensitive histogram; contrast insensitive histogram. Pixels in the yellow region of the center cell vote (weighted voting) into the cell histograms in this block.]
Figure 6.10: HOG feature computation
The 32 dimensional feature vector is composed of:
• 18 contrast sensitive histogram bins
• 9 contrast insensitive histogram bins
• 4 values capturing the overall gradient energy in the four blocks containing the cell
• 1 placeholder entry
Gradients of pixels are computed and they vote into histogram bins of neighbouring
cells, i.e. in Fig. 6.10, the pixels inside the yellow region vote into four cell histograms:
(I) The cell it belongs to and (II) The three adjacent cells of the yellow region. Gradient
energy of each block that contains the cell is also stored separately as 4 entries in the 32
dimensional HOG feature vector. These 32 dimensional features for the cells are computed
at multiple scales to form a multi scale feature map which we denote by H. In the original
DPM implementation, the dimensionality of the features was reduced using PCA. This is
particularly useful when a large number of classes need to be tested, i.e. when the filter
score computation cost is very large. We found that we do not require this in our application
since we only detect pedestrians.
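The difference between the 18 contrast sensitive and the 9 contrast insensitive orientation bins can be illustrated with the sketch below. It uses hard assignment to equal width bins, which is a simplification assumed for the example. The weighted voting into neighbouring cells and the block energy normalization are not shown, and the function name is hypothetical.

```python
import math

def orientation_bins(gx, gy):
    """Return (contrast sensitive bin, contrast insensitive bin) of a gradient.

    Contrast sensitive bins cover the full circle in 18 bins of 20 degrees.
    Contrast insensitive bins cover a half circle in 9 bins, so that a
    gradient and its negation fall into the same bin.
    """
    angle = math.atan2(gy, gx) % (2 * math.pi)
    sensitive = int(angle / (2 * math.pi) * 18) % 18
    insensitive = sensitive % 9
    return sensitive, insensitive

# 18 + 9 orientation bins, 4 block energies and 1 placeholder entry.
FEATURE_DIM = 18 + 9 + 4 + 1

right = orientation_bins(1, 0)    # dark-to-bright towards +x
left = orientation_bins(-1, 0)    # opposite contrast, same edge orientation
```

The two gradients above land in different contrast sensitive bins but share a contrast insensitive bin.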
(II) Score computation:
Pedestrians are detected by measuring the response of filters applied on the feature
maps. Felzenszwalb et al. use two sets of filters:
• Root filters: Root filters model the overall shape of the pedestrian. They operate
on a rectangular window of the 32 dimensional feature vectors in the feature map.
Let F0 denote the concatenated root filter vector (concatenated in row major order).
Response of the root filter at position (x, y) in pyramid level l is given by:
$$ R_0(x, y, l) = F_0 \cdot \phi(H, (x, y, l)) \tag{6.12} $$
Here, φ(H, (x, y, l)) is the vector obtained by concatenating feature vectors of the
rectangular window (in row major order) in the feature map with top left corner at
(x, y, l).
• Part filters: Part filters capture the more detailed shapes of individual parts of the
pedestrian image. They are computed over the feature map at a higher resolution to
capture finer details. Response of the ith part filter at position (x, y) in pyramid level
l is given by:
$$ R_i(x, y, l) = F_i \cdot \phi(H, (x, y, l)) \tag{6.13} $$
Along with part appearance modelled by the filters, DPM also takes into account the
feasible arrangements of parts of the pedestrian image. This geometric arrangement
is specified in the model using anchor locations of parts with respect to the root filter.
The anchor position for the ith part relative to the root position is denoted by the
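vector $v_i = (v_{i,x}, v_{i,y})$. DPM also allows the parts to be displaced from the anchor
positions. A displacement of (dx, dy) is described by the deformation feature vector $\phi_d(dx, dy)$ where
$$ \phi_d(dx, dy) = (dx, dy, dx^2, dy^2) \tag{6.14} $$
The cost of the displacement is $d_i \cdot \phi_d(dx, dy)$, where $d_i$ is the vector of deformation
coefficients of the ith part. A local search around the anchor position for the best location
of the part is performed. The resulting score of the part with index i is given as follows:
$$ D_i(x, y, l) = \max_{dx, dy} \left[ R_i(x + dx, y + dy, l) - d_i \cdot \phi_d(dx, dy) \right] \tag{6.15} $$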
To model different poses, three components of root filters and the corresponding sets
of part filters are obtained using a latent SVM training framework [144]. Fig. 6.11
shows the part filters of the three components. The vertically mirrored counterparts
of these components are also included in the final model.
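The local deformation search for a part can be sketched as a brute force maximization over a small window. This is an illustration only: the search radius, the example response map and the function name are hypothetical, and an efficient implementation would not enumerate all displacements in this way.

```python
def part_score(response, x, y, d, radius):
    """Best part filter response near (x, y), penalized by the deformation
    cost d . (dx, dy, dx^2, dy^2).

    response[row][col] is the part filter response map at one pyramid level
    and d is the 4-vector of deformation coefficients of the part.
    """
    rows, cols = len(response), len(response[0])
    best = float("-inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            if not (0 <= px < cols and 0 <= py < rows):
                continue  # displaced location falls outside the map
            phi = (dx, dy, dx * dx, dy * dy)
            cost = sum(di * f for di, f in zip(d, phi))
            best = max(best, response[py][px] - cost)
    return best

response = [
    [0, 0, 0],
    [0, 1, 5],
    [0, 0, 0],
]
cheap = part_score(response, 1, 1, d=(0, 0, 1, 1), radius=1)     # moves to the 5
costly = part_score(response, 1, 1, d=(0, 0, 10, 10), radius=1)  # stays at the anchor
```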
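The total score of the pedestrian hypothesis at position (x, y, l) is given by the sum of the
root filter response, the part scores and a bias term b:
$$ \mathrm{score}(x, y, l) = R_0(x, y, l) + \sum_{i=1}^{n} D_i(2x + v_{i,x}, 2y + v_{i,y}, l - \lambda) + b \tag{6.16} $$
Here n is the number of part filters and λ is the number of pyramid levels per octave, so that
level l − λ has twice the resolution of level l.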
(a) (b) (c)
Figure 6.11: DPM part filters
The score is thresholded to obtain the final set of detections in the image.
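The combination of the root response, the part scores and the bias can be sketched as below. The part score maps are assumed to be already computed at the pyramid level with twice the resolution of the root level, which is why the root position is doubled before the anchor offset is added. All names and values are illustrative.

```python
def total_score(root_resp, part_maps, anchors, bias, x, y):
    """Score of a hypothesis with root position (x, y).

    root_resp[y][x] is the root filter response at the root level.
    part_maps[i][row][col] is the deformation-adjusted score map of part i
    at the level with twice the resolution. anchors[i] = (v_x, v_y).
    """
    score = root_resp[y][x] + bias
    for part_map, (vx, vy) in zip(part_maps, anchors):
        score += part_map[2 * y + vy][2 * x + vx]
    return score

def threshold_detections(scores, threshold):
    """Keep the hypotheses whose score exceeds the detection threshold."""
    return [pos for pos, s in sorted(scores.items()) if s > threshold]

root_resp = [[1.0, 2.0], [3.0, 4.0]]
part_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
s = total_score(root_resp, [part_map], [(1, 0)], bias=-0.5, x=1, y=1)
kept = threshold_detections({(1, 1): s, (0, 0): 0.2}, threshold=1.0)
```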
6.6.2 Proposed modifications to DPM
To reduce computational cost, HOG features are computed on only the foreground blocks.
Also, the detector sliding window is placed only over the foreground blobs. Windows that
do not have sufficient foreground support are ignored. High scoring detections are marked
as hypotheses for further inference (described later in Section 6.9).
Foreground edge enhancement:
We use the foreground blob edges to improve the performance of the HOG classifier. Let
the contour pixel coordinates of the blob be $(x_0, y_0), (x_1, y_1), \dots, (x_{N-1}, y_{N-1})$. The tangent
vector of the ith pixel in the contour is computed as $(x_{i+L} - x_{i-L},\; y_{i+L} - y_{i-L})$ (modular
arithmetic is used on indices here, i.e. $x_{0-1} = x_{N-1}$). The gradient vector is perpendicular
to the tangent. We skip (2L − 1) pixels in the contour sequence to reduce the effects of
noise. Fig. 6.12 shows the gradient vector for a contour with L = 2. Similar to image
gradients, the edge based gradient vectors also vote into the cell histograms. We found that
edge enhancement significantly improves detector performance.
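The contour based gradient computation can be sketched as follows. The tangent uses the contour pixels L positions before and after the current pixel with wrap-around indexing, and the gradient is obtained by rotating the tangent by 90 degrees. The direction of rotation (and hence whether the vector points into or out of the blob) is an assumption of this sketch, as is the function name.

```python
def contour_gradients(contour, L):
    """Return one gradient vector per contour pixel.

    contour is a closed sequence of (x, y) pixel coordinates. The tangent at
    pixel i is contour[i + L] - contour[i - L] with indices taken modulo the
    contour length.
    """
    n = len(contour)
    grads = []
    for i in range(n):
        x_next, y_next = contour[(i + L) % n]
        x_prev, y_prev = contour[(i - L) % n]
        tx, ty = x_next - x_prev, y_next - y_prev
        grads.append((-ty, tx))  # tangent rotated by 90 degrees
    return grads

# Boundary of a small square, listed in order.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
grads = contour_gradients(square, L=1)
```

The resulting vectors would then vote into the HOG cell histograms in the same way as image gradients.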
[Figure 6.12 labels: foreground blob; contour pixels; pixel under consideration; gradient vector; dx; dy]
Figure 6.12: Edge enhancement of blob boundaries
Cascade:
Instead of considering each sliding window as a candidate for part based inference, we
prune the number of hypotheses by using a two stage cascade. In [181], Felzenszwalb et
al. designed a DPM cascade for object detection. However, we use part filter scores to later
perform occlusion reasoning. Hence, we do not adopt the full cascade of [181]. Instead,
we use a two stage cascade shown in Fig. 6.13. The first stage of the cascade is based on
the most significant filter in the ordering determined by [181]. The score of the first stage
is given as follows:
$$ R(x, y, l) = F \cdot \phi(H, (x, y, l)) \tag{6.17} $$
Through visual inspection, we group the filters as ‘left head shoulder’, ‘right head shoul-
der’, ‘torso’ and ‘legs’ parts. We note that this grouping is only an approximate representa-
tion, i.e. the ‘left head shoulder’ filter group could also respond to gradients in the entire
head image. The part scores are determined using the deformation search procedure de-
scribed earlier in Eqn. 6.15. We repeat Eqn. 6.15 here for convenience.
$$ D_i(x, y, l) = \max_{dx, dy} \left[ R_i(x + dx, y + dy, l) - d_i \cdot \phi_d(dx, dy) \right] \tag{6.18} $$
The scores of filter group parts can be written as:
$$ s_{part} = \sum_{i \in G_{part}} D_i(x, y, l) \tag{6.19} $$
Here, ‘part’ can refer to (I) left head shoulder (II) right head shoulder (III) torso or (IV)
legs. Gpart refers to the set of filters in a part.
[Figure 6.13 block diagram: input image; output scores]
Figure 6.13: Proposed DPM cascade for pedestrian detection
The second stage is based on the responses of the filters that model the shape of the
head and shoulder regions, i.e. the left head shoulder & right head shoulder part scores.
Hypotheses whose part scores are less than a threshold are rejected. All the part scores are
computed for hypotheses that pass through both stages. Platt’s scaling [182] of these scores
is performed to obtain probability estimates p(part) as follows:
$$ p(part) = \frac{1}{1 + e^{A s_{part} + B}} \tag{6.20} $$
Here, A & B are constants determined using the algorithm proposed by Lin et al. in
[182]. The ‘Head’ bounding box is determined using regression based on head, shoulder
and torso filter locations. Fig. 6.14 shows the part filter locations and the bounding box of
the head region for a pedestrian.
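The two stage rejection and the conversion of part scores to probabilities can be sketched as below. The thresholds and the constants A and B are illustrative values, not the trained ones. The sketch also assumes that each of the two head shoulder part scores must reach the second stage threshold, which is one possible reading of the rejection rule.

```python
import math

def passes_cascade(stage1_score, head_shoulder_scores, t1, t2):
    """Stage 1 thresholds the response of the most significant filter.
    Stage 2 thresholds the left and right head shoulder part scores."""
    if stage1_score < t1:
        return False
    return all(s >= t2 for s in head_shoulder_scores)

def platt_probability(score, A, B):
    """Map a part score to a probability estimate as in Eqn. 6.20."""
    return 1.0 / (1.0 + math.exp(A * score + B))

accepted = passes_cascade(0.8, [0.4, 0.6], t1=0.5, t2=0.3)
rejected_stage1 = passes_cascade(0.2, [0.9, 0.9], t1=0.5, t2=0.3)
rejected_stage2 = passes_cascade(0.8, [0.1, 0.6], t1=0.5, t2=0.3)

# With a negative A, larger part scores map to larger probabilities.
p_mid = platt_probability(0.0, A=-2.0, B=0.0)
p_high = platt_probability(1.0, A=-2.0, B=0.0)
```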
Figure 6.14: Sample result of DPM cascade
6.7 Geometry
Fig. 6.16 shows a few sample surveillance video snapshots. Pictorial representations of
pedestrian hypotheses are also overlaid on the images. We can clearly observe that the
hypothesis overlaid on the image in Fig. 6.16c is infeasible. Such infeasible candidates
can be rejected by considering constraints imposed by the ground planes. This reduces the
computational cost and also improves the accuracy of the pedestrian detector. For a given
pivot location of the bounding box in the image, we need to determine the set of discrete
scales (denoted by S) for which we need to run the DPM detector. If the ground is a planar
surface, then the set of scales for a pivot S would be equal to {Smin, . . . , Smax}. Here, Smax
depends on the geometry of the scene (camera tilt & height, ground plane), the optics of
the imaging system and the physical size of the pedestrian. Smin is set equal to 1 (i.e. index
corresponding to the unscaled image).
However, if there are surfaces at different elevations, we will need to specify multiple
ranges, i.e. one for each elevation. This can be seen clearly in Fig. 6.16 where for a given
pivot position in the video frame, the pedestrians in Fig. 6.16a & Fig. 6.16b are supported
by two different ground planes. Fig. 6.15 shows that the ground plane at an elevated
position will require inclusion of scales that subtend angles in the range $[\theta_2^{min}, \theta_2^{max}]$. The
set S can be determined using the camera intrinsic & extrinsic matrices and the parameters
of the ground planes.
[Figure 6.15 labels: camera C; tallest pedestrian (6.5 feet); image of human maps to smallest pedestrian DPM filter; angles $\theta_1^{max}$, $\theta_2^{min}$, $\theta_2^{max}$]
Figure 6.15: Geometry of the surveillance camera system showing ground planes at differ-
ent elevations.
Automated camera calibration and ground plane estimation based on vanishing point
estimation have been studied by many researchers [183, 184, 185]. Sudowe et al. [185]
showed that the ground plane homography and normal vector projection are sufficient to
determine the set S. The intrinsic parameters of the camera are assumed to be known.
Commercial video analytics developers have also created simple calibration tools [186]
that help the user to calibrate the camera during setup. In this work, we divide the image
into non overlapping blocks of size equal to 32× 32 pixels. The set of scales for each block
is assumed to be known (determined using any of the techniques that we reviewed here).
Since the geometry of the scene is static, we store the set of scales S for each 32× 32 block
in a table which needs to be updated only when the camera location is changed. We run
the DPM detector at the scales specified by the set S. Note, however, that only
bounding boxes inside foreground blob regions are considered as valid hypotheses. Hence,
a large number of scales will already have been eliminated by foreground segmentation.
The geometry based scale selection is therefore useful only when there are large regions of foreground
objects.
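The per-block scale table described above can be sketched as follows. This is a minimal Python illustration and not the thesis implementation (which is in C++). The names `ScaleTable`, `set_scales`, `scales_for_pixel` and `scales_to_run` are hypothetical, and taking the union of scale sets over foreground pivot positions is an assumption about how the table would be consulted.

```python
BLOCK = 32  # block size in pixels, as stated in the text


class ScaleTable:
    """Hypothetical lookup table: one set of feasible scale indices per 32x32 block."""

    def __init__(self, width, height):
        self.cols = (width + BLOCK - 1) // BLOCK
        self.rows = (height + BLOCK - 1) // BLOCK
        # An empty set means no pedestrian scale is feasible in that block.
        self.table = [[frozenset() for _ in range(self.cols)]
                      for _ in range(self.rows)]

    def set_scales(self, bx, by, scales):
        """Store the feasible scale indices for block column bx, block row by."""
        self.table[by][bx] = frozenset(scales)

    def scales_for_pixel(self, x, y):
        """Return the scale set of the block that contains pixel (x, y)."""
        return self.table[y // BLOCK][x // BLOCK]


def scales_to_run(table, fg_pivots):
    """Union of the scale sets of all blocks that contain a foreground pivot."""
    needed = set()
    for x, y in fg_pivots:
        needed |= table.scales_for_pixel(x, y)
    return sorted(needed)
```

Because the scene geometry is static, such a table would be filled once at calibration time and only read during encoding.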
[Figure 6.16 panels: (a) Feasible hypothesis (feasible pivot); (b) Feasible hypothesis of pedestrian at a higher ground plane (feasible pivot); (c) Infeasible hypothesis (infeasible pivot)]
Figure 6.16: Sample surveillance video snapshots showing feasible and infeasible pedestrian hypotheses.
6.8 Detection by Tracking
Accuracy of pedestrian detection from a single image has been steadily increasing over
the past decade. However, significant improvement in performance can be obtained if the
temporal associations across multiple frames are also exploited. This is particularly useful
in real world surveillance scenarios where state-of-the-art detectors (including DPM) fail to
detect the same pedestrian in some frames of a video sequence. This is due to variations
in pose, occlusion and lighting as the pedestrian moves in the scene. A large number of
researchers have proposed tracker algorithms. An exhaustive survey of recent techniques
has been provided by Smeulders et al. in [187].
In this thesis, we use a tracker to associate pedestrian images across
frames. Detections obtained using the DPM part scores are stored in the state for tracking in
future frames. We find that this significantly reduces the miss rate. Also, as we discuss later,
the computational cost of the DPM detector is high. To reduce this cost, we avoid running
the DPM detector over image regions which are supported by tracked pedestrian detections.
Pedestrian detections present in the state are first associated with image regions in the
current frame by the tracker. Regions which are not supported by tracked pedestrians are
considered as candidates for DPM based detection.
6.8.1 Components of a tracker
The key components of a tracker typically include:
• Appearance model: Invariance of certain features in the image of the object across
frames is the key component of a tracker. Hence, a lot of attention has been given
to developing such invariant image representations. These image representations are
constructed and stored when the tracker is initialized using a detector. The appearance
models can be computed over different image structures such as blobs [188],
contours [189], patches [190] or super pixels [191]. Various features such as raw
intensity values, color histograms, HOG, 2D binary patterns, Haar wavelets, SIFT &
SURF have been used as visual cues for tracking.
• Target search: To determine the position of objects in a new frame, a search using the
appearance model is performed to determine the best match. A few popular techniques
such as the Lucas-Kanade tracker [192, 193] and the mean shift tracker [194] pose the
target search problem as an optimization task which is solved using gradient descent
methods. Uniform search around the location of the object in the previous frame
is also a popular technique (e.g. FragTrack [190]). A motion model based on the
Kalman filter is also commonly used to reduce the search space [195, 196]. Due to
scene clutter, occlusion & heavy tailed noise, real world tracking problems exhibit
non-Gaussian and multi-modal posterior and filtering distributions. Particle filtering
techniques have been adopted to handle this [197].
• Appearance model update: The appearance of objects changes over the video se-
quence due to variation in scale, pose, lighting and viewpoint. Hence, trackers update
the appearance model to avoid drift. The MIL (multiple instance learning) tracker by
Babenko et al. [198] updates the model with a bag of image patches. The ensemble
tracker by Avidan [199] uses a set of weak classifiers which is updated in an online
fashion.
6.8.2 FG blob based tracking
In the case of isolated pedestrians, we use the foreground blob geometry to determine the
tracked bounding box locations. The blob geometry based tracker has very low complexity
and accurately tracks isolated pedestrians. A Kalman filter is initialized for each pedestrian.
The Kalman filter prediction is used to mark a search region over the head-shoulder image.
This search region is shown in Fig. 6.17a. The y coordinate of the top of the target ‘head
rectangle’ is determined by vertically scanning the FG blob for the presence of FG pixels.
The scanning procedure is performed in a top-down fashion and it terminates when n consecutive
FG pixels are found in a row. We set n to 3. To determine the left and right bounds, we
use rectangular window filters similar to those used by Viola & Jones. Fig. 6.17b shows the
positive and negative filters applied on a sample FG blob. Let I+ & I− denote the number
of FG pixels in the positive & negative filter respectively. To determine the x coordinate of
the left edge of the ‘head rectangle’, the filters are moved over the FG image towards the
right. The x coordinate for which the difference I+ − λI− is maximum is considered as
the left boundary of the ‘head rectangle’. Here, λ is set to 10 to heavily penalize FG pixels
inside the negative filter. A similar procedure is followed on the right side of the pedestrian
search rectangle. I+ & I− are computed using the integral image of the FG mask. A fixed
aspect ratio of the head is used to determine the lower bound of the ‘head rectangle’. The
displacement vector of the ‘head rectangle’ is applied on the pedestrian bounding box to
mark the pedestrian in the frame. The percentage change in the width of the ‘head rectan-
gle’ is used by the inference engine to determine tracking errors. The difference between
the Kalman predicted displacement and that computed by the blob tracker is also used to
indicate errors.
[Figure 6.17 panels: (a) Search region, showing the search region and the target ‘head’ bounding box; (b) Positive and negative filters]
Figure 6.17: (a) Search region is initialized using the Kalman filter prediction. (b) Positive
and negative filters are applied on the FG blob to determine the left and right bounds of the
head region.
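The head rectangle search described in this section can be sketched as below. This is a minimal pure-Python illustration under stated assumptions, not the thesis implementation: the function names, the filter width `filt_w` and the filter height `filt_h` are hypothetical, the termination rule is interpreted as a horizontal run of n FG pixels within one image row, and for the left edge the negative filter is assumed to lie immediately to the left of the candidate x and the positive filter immediately to its right.

```python
def integral_image(mask):
    """Summed-area table of a binary FG mask, with a zero top row and left column."""
    h, w = len(mask), len(mask[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += mask[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Number of FG pixels in the half-open box [x0, x1) x [y0, y1)."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def head_top(mask, x0, x1, y0, y1, n=3):
    """First row (scanning top-down) of the search region with n consecutive FG pixels."""
    for y in range(y0, y1):
        run = 0
        for x in range(x0, x1):
            run = run + 1 if mask[y][x] else 0
            if run >= n:
                return y
    return None


def head_left_edge(ii, x0, x1, top, filt_w, filt_h, lam=10):
    """x coordinate in the search region that maximizes I+ - lam * I-."""
    best_x, best_score = None, None
    for x in range(x0 + filt_w, x1 - filt_w + 1):
        i_neg = box_sum(ii, x - filt_w, top, x, top + filt_h)  # outside the head
        i_pos = box_sum(ii, x, top, x + filt_w, top + filt_h)  # inside the head
        score = i_pos - lam * i_neg
        if best_score is None or score > best_score:
            best_x, best_score = x, score
    return best_x
```

The right edge would be found with the mirrored filter arrangement. Since each filter response is four lookups in the integral image, the cost per candidate position is constant.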
6.8.3 Optic flow based tracker
In the case of pedestrians whose bounding boxes overlap or are close to each other, we
cannot track them using only the blob geometry. Hence, we use a Lucas-Kanade optic
flow based tracker. Since the optic flow computation considers a patch around the tracked
positions, it can successfully handle image noise. Also, consecutive images of pedestrians in
surveillance videos typically satisfy the requirements of brightness constancy and constant
flow. The cost of computing Lucas-Kanade optic flow vectors is smaller than the cost of
detecting and matching expensive feature points such as SIFT. We do not use multi-modal
particle filtering since we do not need to maintain the identity of pedestrians during severe
occlusion. When the pedestrian exits from the occlusion region, the detector initializes
a new tracker. Also, we do not update the template model of the tracked pedestrian. Hence
the computational complexity of the proposed tracker is low.
Tracker initialization: Detections obtained by the DPM detector are used to initialize the
tracker. The tracker model of the pedestrian is composed of the image of the pedestrian and
its associated super pixels. The image pyramid required for optical flow is also computed
and stored. Uniformly sampled points inside the pedestrian bounding box are included
in the model. We denote this set of sampled points as S. The points are chosen only
in the regions that are not occluded by other pedestrians or objects. A Kalman filter is
also initialized using the first detection and its bounding box association in the consecutive
frame.
Target search:
We use a modified version of the median flow tracker (proposed by Kalal et al. in [200])
to associate pedestrian images with their templates. The iterative Lucas-Kanade algorithm with
four levels of image pyramid is used to compute the optical flow of the points sampled in the
template. The Kalman predicted correspondence vectors are used as initial solutions for the
Lucas-Kanade algorithm. The normalized correlation coefficient (NCC) is computed on
image patches centered on the sampled points. The points are arranged in increasing
order of their NCC values. Points in the lower half of this ordered set, i.e. points with
low NCC values, are discarded. Fig. 6.19 shows the correspondences between points in the
template and the current image. Points marked in red are those that were discarded based
on the NCC score. Here, the current image and the template are spaced apart in time by 10
frames.
Five overlapping part templates T1, T2, T3, T4 & T5 are defined as shown in Fig. 6.18.
Let the set of sampled points inside the part-template Tp be represented by Sp. Each part
template Tp is associated with a correspondence vector vp, computed as the median
of the optic flow vectors of the points in the set Sp. The median is computed independently
in the x and y dimensions. The likelihood of a part template is obtained by combining
feature matching scores of tracked points inside the part. We use NCC values and super pixel
histogram differences as the two sets of matching scores. The average of the NCC values of
points in the set Sp is denoted by $p^{NCC}_p$; it represents the similarity between
the part-template images in the model and the current frame. Occlusion of a part-template
causes the NCC value of the part to reduce. This can be observed in Fig. 6.19b where the
score for the right-upper-body part template is low (in comparison with other scores).
(a) Left-upper-body (b) Right-upper-body (c) Head-shoulder (d) Torso (e) Upper-body
Figure 6.18: The five part-templates are shown here. Feature matching scores are accu-
mulated over these part-templates. Correspondence vectors are computed for each part-
template.
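The NCC based point rejection and the per-part median correspondence vector can be sketched as below. This is a minimal illustration that assumes the optical flow vectors and NCC values have already been computed; the tuple layout `(dx, dy, ncc, part_ids)` and the function names are hypothetical. It also assumes that the part score is averaged over the retained points only, which the text does not state explicitly.

```python
from statistics import median


def filter_by_ncc(points):
    """Discard the lower half of the points when ordered by increasing NCC.

    Each point is a tuple (dx, dy, ncc, part_ids), where part_ids is the set of
    part-templates that contain the point (the templates overlap).
    """
    ordered = sorted(points, key=lambda p: p[2])
    return ordered[len(ordered) // 2:]


def part_vector(points, part_id):
    """Median flow vector (per axis) and mean NCC of the points in one part-template."""
    members = [p for p in points if part_id in p[3]]
    if not members:
        return None
    vx = median(p[0] for p in members)
    vy = median(p[1] for p in members)
    mean_ncc = sum(p[2] for p in members) / len(members)
    return (vx, vy), mean_ncc
```

Taking the median separately along x and y keeps a few badly tracked points from shifting the part vector, which is the main reason for preferring it over the mean.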
Let $SP_i^{temp}$ be the super pixel (or SP) in the template associated with the sampled pixel
$s_i$. Similarly, let $SP_i^{curr}$ be the super pixel in the current frame associated with the sampled
pixel $s_i$. The color histogram difference between $SP_i^{temp}$ & $SP_i^{curr}$ is computed for all SP’s
associated with pixels in a part template. The average value of these histogram differences is
considered as the second feature matching score.
Model update:
As we show in the next section, the inferencing procedure uses the NCC and SP based
feature matching scores of the part templates to select one vector v from the set of template
correspondence vectors vi. It could also reject all the vectors if the track is lost. If the
tracking is considered as successful, the DPM detector scores of the pedestrian image are
not computed. The model, i.e. the template, super pixel data and the sampled points are
not updated for successfully tracked pedestrians. However, to prevent tracker drift due to
scale and appearance change, the DPM detector is executed after 10 frames (even if the
feature matching scores indicate successful tracking). After associating the detection with
the tracked pedestrian, the existing model of the tracked pedestrian is discarded and the
new model is computed.
We note that in the current application of pedestrian detection to ROI video coding,
identity switches do not affect the performance of the system. Hence, we do not attempt
to obtain accurate correspondences between pedestrian images in the video sequence.
Figure 6.19: The template and the current frames (separated in time by 10 frames) are
shown. NCC scores of the five part-templates for the image in (a) are (0.77, 0.9, 0.87, 0.8,
0.81). The order of the scores is (left-upper-body, right-upper-body, head-shoulder, torso,
upper-body). NCC scores for the image in (b) are (0.93, 0.69, 0.88, 0.8, 0.83). Here, the score
of the right-upper-body template is lower due to occlusion.
6.9 Inference
Early work by John McCarthy and others attempted to use logic to solve artificial intelligence
tasks. However, logic based systems could not model the uncertainties of the real
world. Following the seminal book ‘Probabilistic Reasoning in Intelligent Systems’ by
Judea Pearl, Bayesian methods were developed and were very successfully applied to solve
AI problems. Probability could conveniently handle uncertainties, which logic was incapable
of doing. To overcome the limitations of logic, symbolic approaches have been extended to
incorporate uncertainties.
Statistical relational learning (SRL) is one such example which addresses issues of representation,
inference and learning. SRL allows statistical analysis over a set of relations.
Complex relations are modelled using first order logic. For example, in Markov logic networks,
the logic network serves as a template representing the relations. When the formulas
are grounded to form a Markov network, the distribution over the probabilities is defined
by the weights of the links. In [201], Antanas et al. apply SRL to hierarchical image understanding.
They define a language that consists of (I) visual entities, e.g. a window, (II) spatial
relations between visual entities, (III) composite units that consist of a set of visual entities and
(IV) membership relations between visual and composite entities. Composite entity selection
is formulated as a maximum weighted independent set problem. They successfully
recognize higher-level structures in street view images.
Another technique to combine probability and logic was introduced by Ginsberg in
[202]. Algebraic structures called bilattices were defined and were used to perform inference
under uncertainty. In this thesis, we use a bilattice based inference technique
to perform ROI detection. Bilattice logic based reasoning allows contradictory data.
For example, the shadow detector could wrongly assign a high score to a super pixel while
the tracker contradicts the hypothesis based on the pedestrian bounding box. The final inference
scores can be efficiently computed using the set of logic rules. Also, bilattice logic
reasoning allows us to combine inference relations with rules specified by the surveillance
operator. When extended to activity recognition, bilattice logic reasoning can be used to
mark non-face image regions as ROI. For example, when luggage is left behind, the MB’s over the
object can be marked as ROI. Another example is a dangerous activity such as a car entering
a lane in the wrong direction; all MB’s over the car image can then be encoded at high quality.
We provide a review of the bilattice logic approach in Appendix C.
6.9.1 Bilattice logic for ROI, RORI & RONI super pixel inference
The goal of the proposed region-of-interest encoder is to classify MB’s as ROI, RORI or
RONI. However, super pixels provide better representations compared to macroblocks since
they preserve natural boundaries. This property of super pixels has made them a popular
choice for segmentation applications such as [203]. In this section, we describe the proposed
inferencing technique to classify super pixels as ROI, RORI or RONI. In the next section, we
use the super pixel class labels to determine the coding mode and the QP parameter of the
macroblocks. The proposed super pixel labelling task is a three class classification problem
which we solve in a sequential manner. We first use the DPM part scores and the tracker
scores to infer pedestrian bounding boxes. Super pixels inside the ‘head’ regions in these
bounding boxes are marked as ROI. The detected & tracked pedestrian results are combined
with the skin & shadow scores to classify the remaining super pixels as RORI or RONI.
• Pedestrian bounding box inference: Shet et al. [173] proposed a partitioning of
object pattern grammar specifications into component based, geometry based and
context based rules. Following [173], we also adopt a similar design procedure.
However, we also include reasoning of shadows, skin detector and tracker outputs.
We now illustrate the reasoning process using a few representative rules.
For an isolated pedestrian, i.e. a pedestrian image which does not have any neigh-
bours, the filter scores are used to infer the score of the hypotheses. We show sample
rules for two filter scores here.
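\begin{align*}
\phi(\mathrm{ped}(X, Y, S) \leftarrow \mathrm{head\_left}(X, Y, S)) \tag{6.21} \\
\phi(\mathrm{ped}(X, Y, S) \leftarrow \mathrm{torso}(X, Y, S)) \tag{6.22} \\
\phi(\mathrm{ped}(X, Y, S) \leftarrow \mathrm{FG\_support}(X, Y, S)) \tag{6.23}
\end{align*}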
Here, ped(X, Y, S) denotes the existence of a pedestrian at position (X, Y) in the image.
S represents the scale of the pedestrian. These rules are combined with facts to
perform reasoning. Fig. 6.20 shows a representative example in which rules are
combined with the facts to obtain the final score of ped(X, Y, S). Here, the facts
represent the detector scores; for example, torso(X, Y, S) is equal to
$\langle p_{torso}, 1 - p_{torso} \rangle$, where $p_{torso}$ has been obtained earlier using the DPM detector. torso(X, Y, S)
represents the torso part score computed at the anchor position associated with the
pedestrian sliding window at location (X, Y, S) in the image pyramid.
Along with the rules that validate the hypothesis, we also add terms that negate it.
For example, a low head part filter score will result in the rejection of the hypothesis.
The corresponding rule for this is
\[ \phi(\neg\mathrm{ped}(X, Y, S) \leftarrow \neg\mathrm{head\_left}(X, Y, S)) \tag{6.24} \]
Geometry inconsistency rules are not required for DPM detected pedestrians since
they are applied before computing the part scores. Pedestrian hypotheses which are
not isolated require occlusion reasoning as shown in Eqn. 6.25. Here, the occlusion
term is computed as the overlap between the bounding boxes of the parts. Circular
dependencies are avoided by performing the inference of pedestrians in decreasing
order of their Y values (The top-left corner of the image is assumed to be the origin).
\[ \phi(\mathrm{ped}(X, Y, S) \leftarrow \mathrm{not}(\mathrm{head\_left}(X, Y, S)),\ \mathrm{head\_left\_occluded}(X, Y, S)) \tag{6.25} \]
[Figure 6.20 content: bilattice square with corners 〈1,0〉, 〈1,1〉, 〈0,0〉 and 〈0,1〉 and the belief axis. Facts plotted: left head part score = 〈0.8, 0.2〉; super pixel skin score = 〈0.65, 0.35〉]
Figure 6.20: Figure shows detector scores of a pedestrian on the Bilattice square
Fig. 6.20 shows a representative example of the reasoning procedure. Here, left-head
part score and super pixel skin score are shown for a pedestrian image. Let q represent
the hypothesis (i.e. ped(X,Y, S) where X, Y and S correspond to the location & scale
of the pedestrian shown in Fig. 6.20). Let us assume that the weight of the rule for the
left-head DPM score that entails q is 〈0.9, 0.1〉. Let the weight of the rule that indicates
the absence of the left-head part (i.e. $\phi(\neg\mathrm{ped}(X, Y, S) \leftarrow \neg\mathrm{head\_left}(X, Y, S))$) also
be equal to 〈0.9, 0.1〉. Also, let the weight of the rule for the skin score be 〈0.7, 0.3〉. We
now show how these rules are combined using the logic rules to perform reasoning.
The contribution of these rules that entail q can be computed as
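\begin{align*}
& \langle 0, 0 \rangle \vee [\langle 0.8, 0.2 \rangle \wedge \langle 0.9, 0.1 \rangle] \oplus \langle 0, 0 \rangle \vee [\langle 0.65, 0.35 \rangle \wedge \langle 0.7, 0.3 \rangle] \tag{6.26} \\
&= \langle 0.72, 0 \rangle \oplus \langle 0.455, 0 \rangle \tag{6.27} \\
&= \langle 0.8474, 0 \rangle \tag{6.28}
\end{align*}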
Similarly, the contribution of rules that entail ¬q is computed as
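\begin{align*}
& \langle 0, 0 \rangle \vee [\langle 0.2, 0.8 \rangle \wedge \langle 0.9, 0.1 \rangle] \tag{6.29} \\
&= \langle 0.18, 0 \rangle \tag{6.30}
\end{align*}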
These scores are combined using Eqn. 6.31. The final bilattice score is thresholded to
obtain the set of pedestrian detections.
\begin{align*}
cl(\phi)(q) &= \langle 0.8474, 0 \rangle \oplus \neg\langle 0.18, 0 \rangle \tag{6.31} \\
&= \langle 0.8474, 0.18 \rangle \tag{6.32}
\end{align*}
The DPM based inference is performed on newly detected pedestrians. Detections
obtained in past frames are tracked using the foreground blob and optic flow based
trackers that we described earlier. The feature matching scores of the tracker are
combined to obtain tracked pedestrian bounding boxes. Since the optic flow based
tracker computes 5 tracking vectors (based on the part templates), we need to select
one vector which is assigned to the tracked pedestrian. This is done by computing
the Bilattice values using the NCC and SP scores for all the 5 part-templates. The
vector associated with the highest inference score is considered as the pedestrian
displacement vector. Also, a threshold is applied on this score to determine lost tracks.
In the case of losing the track of a pedestrian, the DPM based detection is performed.
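\begin{align*}
\phi(\mathrm{ped}(n, v_p) \leftarrow \mathrm{NCCscore}(n, v_p)) \tag{6.33} \\
\phi(\mathrm{ped}(n, v_p) \leftarrow \mathrm{SPscore}(n, v_p)) \tag{6.34}
\end{align*}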
Here, n represents the index of the pedestrian in the state.
• Super pixel inference:
Fig. 6.21a shows a sample pedestrian detection and super pixels in a blob. The super
pixels are labelled using a sequential decision procedure. Super pixels overlapping
with the ‘head’ rectangle of pedestrian detections are marked as ROI. Super pixels
inside the torso regions of the pedestrian bounding box (that have not been marked as ROI
by other pedestrian detections) are marked as RORI. Discrimination between
the RORI and RONI super pixels in the leg region is slightly harder since the lower
bound of the pedestrian bounding box is not accurately marked by the DPM detector.
To accurately label these super pixels, we define a prior RORI term based on distance
from the torso. Fig. 6.21b shows one such super pixel which is assigned a prior RORI
score equal to $\max(0, c(1 - y_{SP}/h_{leg}))$. Here c is a constant that biases super pixels
closer to the torso to be marked as RORI. Rules based on the prior term, the physics based
shadow score $p_{phy}$ and the texture based shadow score $p_{tex}$ are defined as shown in Eqns.
6.35, 6.36 & 6.37. They are combined using the bilattice logic inference to determine
the set of RONI super pixels (i.e. shadow super pixels). Super pixels not marked as
shadow in the ‘leg’ rectangle are labelled as RORI.
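\begin{align*}
\phi(\neg\mathrm{RONI}_{SP}(x_{SP}, y_{SP}) \leftarrow \mathrm{prior}(x_{SP}, y_{SP}, h_{leg})) \tag{6.35} \\
\phi(\mathrm{RONI}_{SP}(x_{SP}, y_{SP}) \leftarrow \mathrm{shadow}_{phy}(x_{SP}, y_{SP})) \tag{6.36} \\
\phi(\mathrm{RONI}_{SP}(x_{SP}, y_{SP}) \leftarrow \mathrm{shadow}_{tex}(x_{SP}, y_{SP})) \tag{6.37}
\end{align*}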
The unmarked super pixels close to a pedestrian detection that do not have sufficient
FG support are marked as RORI. This is particularly effective in accurately marking
MB’s that cover articulated parts of the human body, i.e. the arms and the legs. The
super pixels which continue to remain unassigned are classified based on the skin
score, i.e. super pixels having very low probability of containing skin image regions
are marked as RORI. Also, super pixels which have a high probability of containing
shadow image regions are marked as RONI. All remaining super pixels
in the image which have not been assigned a label are marked as ROI.
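The worked example of Eqns. 6.26 to 6.32 can be reproduced with the short sketch below. The operator definitions used here (product t-norm, probabilistic sum s-norm, and negation as a swap of the two components) are an assumption chosen because they reproduce the numbers in the text; the definitions actually used in the thesis are given in Appendix C and should be taken as authoritative. All function names are hypothetical.

```python
def t_norm(a, b):
    return a * b            # product t-norm (assumed)


def s_norm(a, b):
    return a + b - a * b    # probabilistic sum (assumed)


def meet_t(x, y):
    """Conjunction (and) along the truth axis."""
    return (t_norm(x[0], y[0]), s_norm(x[1], y[1]))


def join_t(x, y):
    """Disjunction (or) along the truth axis."""
    return (s_norm(x[0], y[0]), t_norm(x[1], y[1]))


def join_k(x, y):
    """Combination (oplus) along the information axis."""
    return (s_norm(x[0], y[0]), s_norm(x[1], y[1]))


def neg(x):
    """Negation swaps evidence for and evidence against."""
    return (x[1], x[0])


def entail(fact, weight):
    """Contribution of one rule: <0,0> or [fact and weight]."""
    return join_t((0.0, 0.0), meet_t(fact, weight))


head = (0.8, 0.2)    # left-head part score from Fig. 6.20
skin = (0.65, 0.35)  # super pixel skin score from Fig. 6.20

pos = join_k(entail(head, (0.9, 0.1)), entail(skin, (0.7, 0.3)))  # Eqns. 6.26-6.28
negc = entail(neg(head), (0.9, 0.1))                               # Eqns. 6.29-6.30
closure = join_k(pos, neg(negc))                                   # Eqns. 6.31-6.32
```

Under these assumed operators, `pos`, `negc` and `closure` evaluate (up to floating point rounding) to the pairs 〈0.8474, 0〉, 〈0.18, 0〉 and 〈0.8474, 0.18〉 given in the text.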
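[Figure 6.21 panels: (a) pedestrian detection with a shadow super pixel and a skin super pixel labelled; (b) leg-region super pixel with the quantities $y_{SP}$ and $h_{leg}$ marked]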
Figure 6.21: (a) Figure shows a pedestrian detection and different super pixels in the blob
(b) The pedestrian bounding box is divided into face, torso and leg rectangles. The super
pixels in the leg region are assigned a prior RORI score based on the distance ySP .
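The prior RORI score for leg-region super pixels can be written as a one-line function. This is only a sketch: the value of the constant c is not given in the text, so the default used below is a placeholder, and the function name is hypothetical.

```python
def prior_rori_score(y_sp, h_leg, c=0.9):
    """Prior RORI score max(0, c * (1 - y_sp / h_leg)).

    y_sp  : distance of the super pixel from the torso (Fig. 6.21b)
    h_leg : height of the leg rectangle
    c     : bias constant; 0.9 is a placeholder, not a value from the thesis
    """
    if h_leg <= 0:
        raise ValueError("h_leg must be positive")
    return max(0.0, c * (1.0 - y_sp / h_leg))
```

The score decreases linearly from c at the torso boundary to zero at the bottom of the leg rectangle, and it stays at zero for super pixels below that point.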
6.10 Macroblock mode and quality parameter assignment
As mentioned in Chapter 2, video coding standards such as H.264 and HEVC allow the
encoder to perform block level ROI coding, i.e. the encoder can specify the slice level and
MB level QP parameters. In this thesis, we use a fixed QP assignment for ROI, RORI & RONI
MB’s. We denote the QP values assigned to ROI, RORI & RONI MB’s as $QP_{ROI}$, $QP_{RORI}$
& $QP_{RONI}$ respectively. MB’s that overlap with super pixels marked as ROI are assigned
a QP equal to $QP_{ROI}$. Similarly, MB’s that overlap with super pixels marked as RORI are
assigned a QP equal to $QP_{RORI}$. Coding of all the other MB’s is skipped. This is
performed by following the Skip signalling procedure discussed in Chapter 4.
In [204, 205], Gao et al. analyzed the impact of increasing quantization in video coding
on feature analysis, object detection, and face recognition algorithms. They showed that
increasing QP up to 34 - 36 does not impact object recognition performance. However,
the face recognition task exhibits a continuous reduction in performance with increasing
QP. This agrees with the intuition that tasks such as face recognition require finer details
of image features. Unlike face recognition, object recognition algorithms are based on
object shape and hence are more resilient to video compression noise. Based on these
observations, we set $QP_{ROI}$ to 20 and $QP_{RORI}$ to 34. We now discuss two important
components of practical surveillance encoders relevant in the present context of ROI video
coding:
• Rate control: Although we use a fixed assignment of QP values to ROI’s and RORI’s
in this thesis, rate control techniques can easily be incorporated into the proposed
architecture. As an example, when channel bandwidth reduces, $QP_{RORI}$ can be increased.
Under severe bandwidth loss, the RORI regions can be marked as skip.
• Error resilience: Along with reducing the bitrate, ROI inference can also be used to
increase network error protection for the regions of interest. Error protection requires
additional data to be embedded in the bit stream. This increases the bitrate of the
encoded surveillance video. Hence, providing protection to only the regions of interest
improves the coding efficiency of such error resilient encoder systems. For example, the
Intra mode can be chosen to code the ROI MB’s as proposed in [78].
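The fixed QP assignment described in this section can be sketched as follows. This is a minimal illustration, not the x264 integration used in the thesis: the per-pixel label map, the `SKIP` marker and the function name are hypothetical, and giving ROI precedence over RORI when a macroblock overlaps both is an assumption that the text does not state.

```python
MB = 16        # H.264 macroblock size in pixels
QP_ROI = 20    # values chosen in the text
QP_RORI = 34
SKIP = None    # hypothetical marker for MB's that are signalled as Skip

ROI, RORI, RONI = "ROI", "RORI", "RONI"


def assign_mb_qp(label_map):
    """Map a per-pixel super pixel label image to a per-MB grid of QP values."""
    h, w = len(label_map), len(label_map[0])
    rows, cols = (h + MB - 1) // MB, (w + MB - 1) // MB
    grid = [[SKIP] * cols for _ in range(rows)]
    for my in range(rows):
        for mx in range(cols):
            # Collect the labels of all pixels covered by this macroblock.
            labels = {
                label_map[y][x]
                for y in range(my * MB, min((my + 1) * MB, h))
                for x in range(mx * MB, min((mx + 1) * MB, w))
            }
            if ROI in labels:        # assumed: ROI wins over RORI
                grid[my][mx] = QP_ROI
            elif RORI in labels:
                grid[my][mx] = QP_RORI
    return grid
```

A rate control extension of the kind mentioned above would only need to change `QP_RORI`, or replace it with `SKIP`, based on the available bandwidth.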
6.11 ROI, RORI & RONI video compression results
6.11.1 Experimental Setup
The proposed ROI detection system has been implemented in C++. We modify the SLIC
implementation provided by the authors [150] to generate super pixels on foreground re-
gions. We have ported the Matlab code of the DPM detector released by Felzenszwalb et al.
to C++ [144]. We have modified the implementation for physics based shadow detection
provided by Sanin et al. [157]. We have developed the implementation for the FG blob
and optic flow based trackers in C++. We have used the OpenCV library for optical flow
and other low level image processing tasks. We have integrated the proposed ROI detector
into the highly optimized x264 H.264/AVC encoder software [11]. The encoded videos
have been published on the Internet¹. Main profile with P slices and context-adaptive binary
arithmetic coding (CABAC) entropy coding is used for all the experiments. Single pass
mode with IPPP coding structure is used for low delay and low-complexity encoding. RD
mode decision for all frames and fast skip detection on P-frames have been enabled. Single
threaded mode is chosen and the computation time is measured on a Core i7 processor
running at 2.4 GHz with 16 GB of system memory.
6.11.2 Bitrate reduction and accuracy
To validate the proposed technique, we have marked the face regions on 40 frames in
two videos, ‘Entrance road’ & ‘Porch’. Fig. 6.22 shows the bitcount and PSNR (computed
over the face image region) of the ‘Entrance road’ video compressed using the proposed
ROI encoder. For comparison, we also plot the data obtained using the FG skip detector
based encoder. The proposed technique reduces bitrate by 37.2% compared to the FG
skip detection based encoder. Also, the proposed technique accurately detects face image
regions. Hence, it maintains good quality of the face image regions. Fig. 6.23 shows the
frames encoded using the proposed ROI encoder and the FG skip detection based encoder.
We can clearly see that the image quality of the face region is unaffected. Similar results
have been obtained for the ‘Porch’ video and are shown in Figs. 6.24 & 6.25. The proposed
ROI encoder provides a bitrate reduction of 50.2% on this video.
Fig. 6.26 shows the enlarged image of a pedestrian in the ‘Porch’ video. We can clearly
see that the fine texture and cloth deformation features in the image have been removed by
the proposed ROI encoder. This reduces the bitrate of the compressed video stream. The
figure also shows that the quality of the face region image remains unaffected. We also
manually verified that none of the true RORI MB’s were marked as RONI by the proposed
encoder.
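The face region PSNR reported in these experiments restricts the error computation to the manually marked face pixels. A minimal sketch of such a masked PSNR is given below; it assumes 8-bit samples (peak value 255) on a single image plane, and the function name and argument layout are hypothetical rather than taken from the thesis code.

```python
import math


def region_psnr(original, decoded, mask, peak=255):
    """PSNR in dB computed only over the pixels where mask is non-zero."""
    sq_err, count = 0, 0
    for row_o, row_d, row_m in zip(original, decoded, mask):
        for o, d, m in zip(row_o, row_d, row_m):
            if m:
                sq_err += (o - d) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    if sq_err == 0:
        return math.inf  # identical regions
    # PSNR = 10 * log10(peak^2 / MSE), with MSE = sq_err / count
    return 10.0 * math.log10(peak * peak * count / sq_err)
```

Because pixels outside the mask are ignored, heavy quantization of RORI and RONI macroblocks does not lower this metric.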
¹ http://chips.ece.iisc.ernet.in/index.php/Pushkar G
[Figure 6.22 plots: (a) Bitcount; (b) Face region PSNR]
Figure 6.22: Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the ‘Entrance
road’ video. The comparison results have been obtained using (I) Proposed method
and (II) Only skip detection. The overall bitrate reduction using the proposed technique is
37.2%. The total face region distortion metrics using the proposed method and the FG skip
detection encoder were both measured as 40.8 dB.
(a) x264 + skip det.
(b) Proposed
Figure 6.23: Figure shows frames from the ‘Entrance road’ video compressed using (a) Only
skip detection (b) Proposed ROI encoder.
[Figure 6.24 plots: (a) Bitcount; (b) Face region PSNR]
Figure 6.24: Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the
‘Porch’ video. The comparison results have been obtained using (I) Proposed method and
(II) Only skip detection. The overall bitrate reduction using the proposed technique is
50.2%. The total face region distortion metrics using the proposed method and the FG skip
detection encoder were both measured as 40.9 dB.
(a) x264 + skip det.
(b) Proposed
Figure 6.25: Figure shows frames from the ‘Porch’ video compressed using (a) Only skip
detection (b) Proposed ROI encoder.
[Figure 6.26 panels: x264 + skip det.; Proposed ROI encoder. Annotations: finer details of clothing are removed by the ROI encoder; face image features are retained]
Figure 6.26: Figure shows that the proposed ROI encoder removes finer details in the RORI
MB’s but maintains image quality of the face region.
6.11.3 Impact of detector errors on ROI encoder performance
The MB labelling errors of the ROI encoder can be categorized based on their consequence
as follows:
• Errors resulting in quality degradation: As an example, MB’s over the face image
region could be incorrectly labeled as RORI. In such cases, identification of the person
in the encoded video would not be possible.
• Errors resulting in increased bitrate: An example of this is, a RONI region incorrectly
marked as ROI. In such cases, the bitrate increases without any increase in the utility
of the encoded video. If the bandwidth of the communication channel is insufficient,
the rate control unit will reduce the bitrate by lowering the quality of the encoded
video.
Fig. 6.27 shows a graphical representation of the different errors and their
consequences (cells are color coded to signify the severity of the consequence). The accuracy of
the MB labeling scheme is directly related to the performance of the individual components,
i.e. DPM detector, shadow detector, tracker and the inference algorithm. We now discuss
this in detail.
[Table graphic: MB labeling errors and their consequences, color coded by severity]
Figure 6.27: The table shows the different MB labeling errors and their consequences (cells
are color coded to signify the severity). Here, the rows correspond to true MB labels and
columns to the MB labels assigned by the ROI detector.
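The two error categories can be summarized with a small lookup. The following is only a sketch in Python, not the encoder's implementation: the importance ordering ROI > RORI > RONI is inferred from the discussion in this section, and treating a RONI MB labeled as RORI as a bitrate error is our assumption, since that case is not discussed explicitly.

```python
# Labels used by the ROI encoder.
ROI, RORI, RONI = "ROI", "RORI", "RONI"

# Importance order inferred from the text: ROI > RORI > RONI.
IMPORTANCE = {ROI: 2, RORI: 1, RONI: 0}


def consequence(true_label, assigned_label):
    """Classify a macroblock labeling outcome by its consequence."""
    t = IMPORTANCE[true_label]
    a = IMPORTANCE[assigned_label]
    if a == t:
        return "correct"
    if a < t:
        # The MB is treated as less important than it is, so bits are
        # withheld from a region that needed them.
        return "quality degradation"
    # The MB is treated as more important than it is, so bits are wasted.
    # (RONI labeled as RORI falls here by assumption.)
    return "increased bitrate"


# Rows are true labels, columns are assigned labels.
for t in (ROI, RORI, RONI):
    print(t, [consequence(t, a) for a in (ROI, RORI, RONI)])
```

The printed rows follow the layout described in the caption of Fig. 6.27, with true labels as rows and assigned labels as columns.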
Experimental results show that the DPM detector fails to detect pedestrians in a few
images. Low resolution, low contrast and occlusion are the main reasons for such missed
detections. Fig. 6.28 shows two pedestrians (A & B) in a single blob that are not detected due
to occlusion and low contrast (of the head region of ped. B). If they were not being tracked
from previous frames, the inference procedure will mark the RORI regions (associated
with these pedestrians) to be encoded at low QP, i.e. high quality. As a consequence, the
bitrate reduction achieved will be lower. For example, when the two pedestrians are not
detected, the image region occupied by pedestrian B requires 19.9 kbits in the compressed
video (ROI QP = 20). When the RORI MB’s are accurately detected, the bit consumption (for
the region covering the image of pedestrian B) reduces by 79% (RORI QP = 34). In contrast,
accurately detecting pedestrian A will provide a bit count savings of only 1 kbit. This
is because the number of RORI MB’s of pedestrian A that are visible is very small due to
occlusion by pedestrian B. These observations suggest that on a compute-limited platform,
an optimum scheduling of the computing resources to different target image regions can
improve performance. We introduce these ideas in Sec. 6.11.5.
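As a quick check of the figures quoted above for pedestrian B (the rounding is ours):

```latex
\[
19.9\ \text{kbits} \times 0.79 \approx 15.7\ \text{kbits saved},
\qquad
19.9 - 15.7 \approx 4.2\ \text{kbits remaining}
\]
```

This is of the same order as the 15 kbits quoted for pedestrian B in the caption of Fig. 6.28, and roughly fifteen times the 1 kbit obtained by detecting pedestrian A.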
Ped A
Ped B
Figure 6.28: Figure shows that the DPM detector has failed to detect pedestrians A & B.
Pedestrian A is severely occluded by B. The head region of pedestrian B has poor contrast.
Accurate detection of pedestrian B would have reduced bit cost of the frame by 15kbits. In
contrast, detection of pedestrian A would reduce bit cost by only 1kbit.
Also, we find that small pedestrian images are occasionally not detected. Fig. 6.29a
shows one such example. Detecting them requires inclusion of scene context and time
sequence analysis in the inference algorithm. However, in the current context of pedestrian
detection for ROI video coding, we show later in Sec. 6.11.5 that such missed detections do
not have a large impact on the performance of the system. Also, the tracker helps to reduce
the missed detection rate. Once a pedestrian has been detected, the tracker provides labeling
data for subsequent frames (until the tracker model requires an update). Hence, it reduces
the overall miss rate of the ROI encoder. Fig. 6.29b shows one such example in which the
pedestrian (enclosed in a green bounding box) has not been detected by the DPM detector
but the tracker has localized the position in the image.
(a) (b)
Figure 6.29: (a) A few more DPM detector failures on small pedestrians. (b)
Here, the tracker has tracked the pedestrian (bounded by the green box) based on a
previous detection. If only the DPM detector had been applied on the current frame, the pedestrian
would have been missed.
Along with missed detections, false positives also result in increased bitrate. However,
since the DPM scores are computed only over foreground regions, the false positive rate
is reduced. Incorrect detections mostly include two cases: (I) Torso regions being marked
as head shoulder and (II) False detections in large foreground blobs. Also, in some cases,
severe localization errors cause RORI regions to be marked as ROI. Fig. 6.30 shows a few
examples of such errors. Also, the false detections are tracked in future frames. In the case
where the torso image region is marked as the head-shoulder part, NCC scores will be high
in subsequent frames. Hence, the false detections will persist in future frames until the next
tracker model update. Periodic verifications of the tracked objects using the DPM detector
can be used to prune such false positives. However, this will increase the computational
complexity of the system. We leave the study of this tradeoff to future work.
(a) (b)
Figure 6.30: (a) Figure shows localization error of the DPM detector. The detector has
included the shadow regions below the pedestrian (due to incorrect shadow detection) in
the bounding box (b) The torso has been detected as a head shoulder region. Again, the
shadow region has been included in the bounding box due to incorrect detection.
The errors we have discussed so far cause an increase in bitrate. However, a more severe error
occurs when a ROI MB is encoded as RORI/RONI or a RORI MB is encoded as RONI. Since
such errors affect the quality of ROI/RORI image regions in the encoded video, it is
important to minimize them. In the proposed technique, we mark RORI and RONI MB’s only
when a pedestrian is detected. Hence, such errors (that affect quality) can occur only when
the ROI/RORI MB of an undetected pedestrian intersects with the RORI/RONI regions of
a detected pedestrian. Such errors are not common, but we show a few cases where they
appear. In Fig. 6.31b, the image of a small child appears adjacent to the image of a pedestrian.
The DPM detector has detected the larger pedestrian but missed the child. Hence,
the head/face region of the child is marked as RORI. In this case, however, the tracker
(that was initialized during a previous detection of the child) was tracking the image of the
child. If the tracker had not yet been initialized, the face region of the child would be
encoded using a high QP and hence would not be recognizable. Clearly, these errors are
particularly likely when small pedestrians emerge from occlusion. Detection of small and
occluded pedestrians remains a challenge that needs to be addressed by the
computer vision community. Since it is particularly important to encode surveillance
images of children with high resolution, a multi-camera network based system can be adopted
to ensure high accuracy. Multiple camera views can improve the detection performance of
occluded pedestrians. We leave this to future work.
(a) (b) (c)
Figure 6.31: Three temporally ordered frames (a), (b) & (c) in which the
DPM detector has detected the child before (i.e. in (a)) and after (i.e. in (c)) the occlusion.
However, the detector has failed during the occlusion (i.e. in (b)). If the child had not been
tracked by the tracker, his face image regions would be encoded in low quality.
Also, the tracker can sometimes drift as shown in Fig. 6.32. However, in such cases, the
NCC scores drop, which triggers the execution of the detector on the foreground regions. Since
we do not update the template in the tracker, we find that the NCC scores reliably measure
the similarity of the pedestrian image with the template.
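The drift check described above can be sketched as follows. This is a minimal pure-Python illustration, not the tracker used in this work: patches are assumed to be flat lists of grayscale intensities, and the threshold of 0.5 is a hypothetical value chosen only because it lies between the two NCC scores quoted in the caption of Fig. 6.32.

```python
import math


def ncc(template, patch):
    """Normalized cross-correlation of two equal-size grayscale patches,
    given as flat lists of pixel intensities. Result lies in [-1, 1]."""
    if len(template) != len(patch) or not template:
        raise ValueError("patches must be non-empty and of equal size")
    mean_t = sum(template) / len(template)
    mean_p = sum(patch) / len(patch)
    num = sum((a - mean_t) * (b - mean_p) for a, b in zip(template, patch))
    den = math.sqrt(sum((a - mean_t) ** 2 for a in template)
                    * sum((b - mean_p) ** 2 for b in patch))
    # A flat patch has zero variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0


NCC_THRESHOLD = 0.5  # hypothetical value, not taken from the thesis


def needs_redetection(template, patch, threshold=NCC_THRESHOLD):
    """True when the tracked patch no longer resembles the fixed template,
    i.e. when the detector should be run on the foreground regions."""
    return ncc(template, patch) < threshold


template = [10, 20, 30, 40]
print(ncc(template, [60, 70, 80, 90]))           # brightness shift only
print(needs_redetection(template, [40, 30, 20, 10]))
```

Because the means are subtracted and the result is normalized, a uniform brightness change leaves the score unchanged, which is one reason a fixed template remains usable across frames.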
(a) (b) (c)
Figure 6.32: Figure shows the tracker bounding box positions as the tracked pedestrian
gets occluded and reappears later. During the occlusion (i.e. in (b)), the NCC score of
the tracked pedestrian drops from 0.76 to 0.41. This would trigger the execution of the
DPM detector. However, since the template is not updated, the tracker reassigns the correct
bounding box when the pedestrian reappears from occlusion in (c). The NCC score also
increases to 0.7.
6.11.4 Computational complexity
Computational complexity of the proposed ROI, RORI & RONI detection technique depends
on the number of foreground objects and their image sizes. We have measured the
execution time of various processing components for the ‘Porch’ video. The sampler based
foreground segmentation and blob processing require about 10 ms per frame. Super pixel
detection on FG image regions takes 5 - 6 ms per k-means iteration (we use 4 iterations).
We note that the super pixel algorithm is easily parallelizable. Also, Benesova et al. [206]
have proposed a speeded up super pixel detector based on morphological processing that
is 6.6× faster than SLIC on 1MP images. The physics & texture based shadow detectors
require 5 - 6 ms & 2 - 3 ms per frame respectively. The pixel level skin detector requires 1
- 2 ms per frame. The computational cost of the system is dominated by the DPM detector.
On the ‘Porch’ video, feature computation on FG image regions in a frame takes 150 - 200
ms. DPM score computation also requires 150 - 200 ms. Since we run the DPM detector
only once in a few frames, its average cost per frame is lower. Also, several researchers
have significantly reduced the computational cost of DPM. Dollar et al. [207] propose
to generate HOG features at only octave spaced scale intervals. These features are used to
generate the HOG data at intermediate scales. Sadeghi et al. [208] combine hierarchical
vector quantization, hashing techniques, multi threading and cache optimization to obtain
a highly efficient DPM detector that runs at 30fps on a 6 core Intel Xeon processor. Optical
flow and super pixel matching, required to perform detection-by-tracking, take 15 - 20
ms per frame.
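To illustrate how running the DPM detector only once in a few frames lowers the average cost, the timings above can be combined in a rough Python estimate. This is only a sketch under our own assumptions: each range is replaced by its midpoint, all non-DPM components are assumed to run on every frame, and the DPM interval `dpm_interval` is a hypothetical parameter.

```python
# Per-frame costs in ms, using midpoints of the ranges reported above.
EVERY_FRAME_MS = {
    "foreground segmentation + blob processing": 10.0,
    "super pixels (4 k-means iterations)": 4 * 5.5,
    "physics based shadow detector": 5.5,
    "texture based shadow detector": 2.5,
    "skin detector": 1.5,
    "optical flow + super pixel matching": 17.5,
}

# DPM feature computation + DPM score computation (midpoints).
DPM_MS = 175.0 + 175.0


def average_cost_ms(dpm_interval):
    """Average per-frame cost when DPM runs once every dpm_interval frames."""
    if dpm_interval < 1:
        raise ValueError("dpm_interval must be >= 1")
    return sum(EVERY_FRAME_MS.values()) + DPM_MS / dpm_interval


for k in (1, 5, 10, 30):
    print(f"DPM every {k:2d} frames: {average_cost_ms(k):6.1f} ms/frame")
```

Under these assumptions the DPM term dominates when the detector runs on every frame and becomes comparable to the remaining components only at intervals of several frames.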
Along with the recent algorithmic techniques that have been developed to reduce de-
tector complexity, many hardware architectures are also actively being researched. These
developments clearly suggest that the proposed ROI coding techniques can be successfully
implemented on future camera platforms.
6.11.5 Complexity control for ROI encoding
Due to the cost sensitive nature of the surveillance camera market, manufacturers prefer
to minimize the computational capability of camera platforms. In this thesis, we have al-
ready proposed multiple computational complexity reduction techniques. We now propose
an orthogonal approach for ROI video encoders in this section.
Consider the image blob of a single pedestrian in a scene. The area of the FG blob
and the number of FG MB’s increase significantly as the pedestrian image height increases.
Hence, bitrate reduction obtained using ROI encoding is high when the pedestrian image
size is large, i.e. when the pedestrian is close to the camera. The bitrate reduction obtained
by marking shadow regions as RONI is dependent on light sources and the background
image texture. Fig. 6.33 shows the bit count savings ∆B (i.e. difference between (I) bit
count with ROI coding and (II) bit count without ROI coding) plotted against the height of
the pedestrian image (in pixels). This plot was obtained using a sample surveillance video
in which a pedestrian was walking towards the camera. QP has been set equal to 24 for
the video encoded without ROI coding. For the ROI encoded video, $QP_{ROI}$ is set to 24 and
$QP_{RORI}$ is set to 32.
[Plot: bit count savings versus height of the pedestrian image (pixels)]
Figure 6.33: Bit count savings plotted against the height of the pedestrian image in
pixels. QP = 24 for the video encoded without ROI coding. For the ROI encoded video,
$QP_{ROI} = 24$ and $QP_{RORI} = 32$.
Along with the pedestrian height and distance from the camera, bitrate reduction ob-
tained using the proposed ROI encoding technique depends on the position of the pedes-
trian in the scene. This is better explained using Fig. 6.34, where multiple pedestrians
are present in the scene. Pedestrians A & D are close to the camera and are minimally
occluded. Pedestrians B & C are highly occluded. Hence, marking ROI, RORI & RONI MB’s
for pedestrians A & D provides higher bit rate reduction.
Based on these observations, we can write the total bitrate savings ∆B obtained by
accurately identifying ROI, RORI & RONI MB’s as
\[
\Delta B = \sum_{i=1}^{N} \phi(h_i, o_i, s_i) \tag{6.38}
\]
Here, $\phi(\cdot)$ represents the bitrate model of a pedestrian image. $N$ is the total number of
pedestrians in the scene. $o_i$ represents the occlusion pattern of the pedestrian, i.e. the set
of MB’s of the $i$th pedestrian image occluded by background objects or other foreground
objects.
[Image labels: pedestrians A, B, C, D, E, F, G; Blob 1, Blob 2, Blob 3]
Figure 6.34: Multiple pedestrians in the scene. Pedestrians A & D cover a large
number of MB’s in the image. Hence, ROI detection on image regions of these pedestrians
provides higher bitrate savings.
$h_i$ is the image height and $s_i$ is the state of the $i$th pedestrian. The state $s_i$
includes lighting conditions, appearance & background texture data. Here, we have assumed
that the ROI detections have been performed accurately on all the pedestrians.
However, under computational and detector accuracy constraints, the achievable savings
will be lower. This can be written as
\[
\Delta B_{ach} = \sum_{i=1}^{N} C_i(R)\, D_i\, \phi(h_i, o_i, s_i) \tag{6.39}
\]
where $C_i(R) = 1$ indicates that the $i$th pedestrian was included in the hypotheses test
set. Here, $R$ represents the set of image regions which are processed by the ROI detector.
Let $R_{FG}$ represent the set of all foreground regions in the image. $D_i$ indicates whether the
detector successfully determined the ROI, RORI & RONI MB’s of the $i$th pedestrian. The
objective for a resource constrained ROI encoder is to determine the optimal set of image
regions $R$ over which the detector searches for pedestrians. The objective function can be
written as follows:
\[
R^{*} = \arg\max_{R} \left( \sum_{i=1}^{N} C_i(R)\, D_i\, \phi(h_i, o_i, s_i) \right) \tag{6.40}
\]
We leave the detailed analysis and design of the optimal system as future work. Here,
we only describe a few considerations of such a ‘compression aware’ ROI detector. A simple
strategy would be to select the image regions to be processed in a sequential manner. For
example, in Fig. 6.34, blob 1 can be chosen first. After detection is complete on this blob,
blob 2 can be considered. Arrival of a new frame terminates the detection process on the
current image. ROI, RORI & RONI signalling is performed based on the available pedestrian
detections.
In such a system, the order in which the regions are processed determines $\Delta B_{ach}$. We
now illustrate this using the image shown in Fig. 6.34. Let us assume that the entire
foreground region $R_{FG}$ cannot be processed due to computational constraints. Depending
on the set of foreground image regions $R$ chosen by the inference procedure, different sets
of pedestrians are detected. For example, if the blobs 2 & 3 are chosen, the pedestrians D,
E, F & G would be detected. However, if blobs 1 & 2 are chosen, pedestrians A, B, C & D
would be detected. Clearly, choosing blobs 1 & 2 results in greater bitrate reduction since
they contain a larger number of RORI & RONI MB’s. Similarly, pedestrian hypotheses that
are unoccluded (e.g. pedestrian D) can be prioritized over other regions. These observations
can be used to determine the optimal sequence of image regions to be processed by the ROI
detector.
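One simple way to order the blobs is a greedy rule that processes first the blob with the highest estimated savings per unit of detector time. The Python sketch below illustrates this idea only; it is not the optimal design, which is left to future work. The `Blob` class, the savings estimates, the run times and the budget are all hypothetical values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    name: str
    est_savings_kbits: float  # hypothetical estimate of the summed phi(.) terms
    cost_ms: float            # hypothetical detector run time on this blob


def schedule(blobs, budget_ms):
    """Greedy ordering: highest estimated savings per millisecond first.
    Blobs that do not fit in the remaining time budget are skipped."""
    order = sorted(blobs,
                   key=lambda b: b.est_savings_kbits / b.cost_ms,
                   reverse=True)
    chosen, spent = [], 0.0
    for b in order:
        if spent + b.cost_ms <= budget_ms:
            chosen.append(b)
            spent += b.cost_ms
    return chosen


blobs = [
    Blob("blob 1", est_savings_kbits=30.0, cost_ms=120.0),
    Blob("blob 2", est_savings_kbits=18.0, cost_ms=60.0),
    Blob("blob 3", est_savings_kbits=6.0, cost_ms=90.0),
]
picked = schedule(blobs, budget_ms=200.0)
print([b.name for b in picked], sum(b.est_savings_kbits for b in picked))
```

With these illustrative numbers the rule selects blobs 1 and 2, in line with the example above. A greedy ratio rule is not guaranteed to maximize Eqn. 6.40 in general, since the selection is a knapsack-type problem.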
6.12 Summary
In this chapter, we proposed a Region-of-Interest video encoder for pedestrian surveillance.
We showed that the Viola Jones face detector or the adaptive skin detector were not ca-
pable of accurately marking ROI’s in real world surveillance videos. Hence, we proposed
an architecture that combines multiple detector scores using bilattice logic reasoning. To
obtain compact representation of regions, super pixels were computed on the foreground
pixels. Shadow and skin probability scores were obtained for all the super pixels. We have
modified the DPM technique to obtain pedestrian part scores. Bilattice logic reasoning is
used to combine part scores and detect the pedestrians. Since the DPM based detector fails
occasionally, we use a tracker that uses optical flow and a Kalman filter to accurately detect
pedestrians in the video sequence. The tracker also reduced the computational complexity
by avoiding the need to run DPM on every video frame. We posed the ROI and RORI de-
tection task as a super pixel labelling problem. The bilattice reasoning framework was used
to mark ROI, RORI & RONI super pixels. QP assignment to the MB’s is performed using
the labels of the super pixels. The proposed techniques have been integrated into the x264
video encoder. Experiments show bitrate savings of up to 50.2%.
Chapter 7
Conclusion
High image detail is critical to recognize and identify miscreants in surveillance footage.
As we have shown in the introduction, to obtain high image detail and large surveillance
coverage, we need to use high resolution surveillance cameras. However, the communication
bandwidth requirements of HD cameras are very high. For example, the typical
optimized bitrate of a 12MP H.264 surveillance video stream is about 4 - 6Mbps [5, 6]. Such
a high bandwidth requirement increases the data communication and storage costs of the
system. Hence, it is very important to reduce the bitrate of HD camera videos to facilitate
faster market adoption of HD cameras. In this thesis, we have shown that this is achievable
by augmenting the H.264 video encoder with computer vision algorithms.
In Chapter 1, we partitioned the bit cost of a static-camera surveillance video as:
• Background MB cost
• Uncovered background MB cost
• Shadow MB cost
• Non face MB (clothing, arms) cost
• Face MB cost
In this thesis, we have addressed all these components. We proposed four techniques to
reduce the bitrate of surveillance videos:
1. Speeded up GMM based foreground segmentation: Reduces the computational com-
plexity of foreground segmentation which is required to perform skip detection.
2. Skip detection: Reduces the cost of coding Background MB’s by accurately detecting
and marking them as Skip.
3. Reference frame selection: Optimally selects reference frames to reduce the uncov-
ered background MB cost.
4. Face ROI coding for pedestrian surveillance: Detects shadow MB’s and marks them
as Skip. A detector and tracker framework has been developed to accurately detect
face and non-face regions. Non face MB’s are encoded in lower quality to reduce the
bitrate.
To perform accurate skip detection, we designed a multi stage sampler based back-
ground MB classifier. Stratification and adaptive sampling techniques have been combined
to reduce the complexity of the BG MB detector. The sampled pixels were classified using a
GMM based segmentation algorithm. We have proposed a modified weight update scheme
to reduce the computational complexity of the GMM based pixel level foreground segmenta-
tion algorithm. The proposed technique marks background MB’s as Skip and hence reduces
the bitrate and complexity of the encoder. Although foreground object detection might have
initially seemed to be very easily achievable, experimental results show that real world is-
sues such as environmental noise, poor lighting conditions and limited processing power
pose significant challenges. The proposed skip detector reduces bit rate by up to 94.5% and
computational complexity by up to 74.5% without affecting the foreground image quality. It
requires 1 - 3.6 ms on a single core and hence can be easily implemented on embedded
camera platforms. Also, experimental results of the modified GMM algorithm show a speedup
of up to 44% in scenes where a large fraction of the pixels require multimodal Gaussian
models.
The skip detector based encoder uses image content in the DPB to reconstruct the back-
ground image content. However, skip signaling of uncovered background MB’s in the H.264
standard is not possible if the decoded picture buffer does not contain the corresponding
background image. This reduces the achievable bit rate savings. We have shown that the
optimal selection of reference frames can maximize the number of BG MB’s in the DPB
and hence reduce the cost of coding uncovered background regions in the video frame. A
very low complexity technique has been proposed to determine the optimal set of reference
frames that need to be stored in the DPB. Experiments on real world datasets show that the
proposed reference frame selection method reduces bit rate by up to 24.7% and execution
time by up to 7.3%.
In the specific application of pedestrian surveillance video coding, the face of the pedes-
trian is the most important region in the image. Hence, we proposed to detect & encode the
MB’s that cover the image of the face in high quality. The non-face MB’s of the pedestrian
are considered as ‘Regions of reduced interest’ and are encoded in reduced quality. Shadow
regions are marked as skip. Face detection (based on facial features alone) in controlled
conditions has been very successful. However, we have shown that the accuracy of such
detectors is poor in real world scenarios. Hence, to accurately determine the ROI, RORI &
RONI MB’s, we have combined the outputs of multiple detectors. We pose the MB labelling
task as a super pixel classification problem. Shadow and skin detector scores of super pixels
have been computed. Pedestrians are detected using deformable part models. The face
region is determined using the deformed part locations. Detected pedestrians are tracked
using an optical flow based tracker combined with a Kalman filter. The tracker improves
the accuracy and also avoids the need to run the object detector on already detected pedes-
trians. Bilattice based logic inference has been used to combine multiple likelihood scores
and determine the labels of the super pixels. The coding mode and QP values of the MB’s
have been computed using the super pixel labels. Results show that the proposed face ROI
coding technique provides a further reduction in bitrate of up to 50.2%.
Although the results that we have shown in the thesis have been obtained by modifying
the H.264 encoder, we do note that the proposed techniques can be applied to the recently
finalized HEVC standard as well. All the techniques presented in this thesis assume a static
camera setup. This is the most common use case in video surveillance installations. How-
ever, the proposed ROI coding ideas can be developed further to support pan tilt cameras.
7.1 Future Challenges and Opportunities
In this thesis, we have seen that applying computer vision algorithms as a preprocessing
step to compression significantly reduces the bit rate, especially when perceptual aspects
of the video sequence are considered. Meticulous design of vision algorithms surprisingly
reduces overall computation. However, many challenges still exist in designing effective
surveillance video encoding systems. We describe a few of them here.
7.1.1 Coding for surveillance cameras on drones
Over the past few years, unmanned aerial vehicles or UAVs have become very popular
for asset monitoring, law enforcement, agriculture and also for recreation. Many futuristic
applications such as cargo transport are also being conceived. Video compression on
such airborne platforms poses enormous challenges. These systems have very tight power
budgets and computational capability constraints. Since data communication is over the
wireless channel, bandwidth is also limited. Region of interest video coding techniques on
such platforms could reduce the bitrate and hence the energy consumption of the radio
module. This would increase the battery life and hence the operational time of the UAV.
However, since the camera is in motion, foreground segmentation is not easily achievable.
Hence, the complexity of the computer vision algorithms required is also higher. Image
stabilization and rolling shutter correction are essential to reduce the residual energy.
ROI video coding for such low power, wireless platforms is a very exciting and challenging
research problem.
7.1.2 Power-Rate-Distortion optimization of ROI encoders
In Chapter 6, we have briefly discussed a technique to sequence the ROI detection opera-
tions for ROI coding on a compute-limited platform. We proposed to determine the order in
which FG blobs are processed based on the estimate of the bitrate reduction (which would
be obtained by processing the blobs). However, a full cross layer (i.e. video analytics engine,
compression engine, network layer & radio system) Power-Rate-Distortion optimization for-
mulation will improve the quality of service of wireless surveillance systems.
Also, with the emergence of wireless standards such as 5G, streaming encoded surveillance
videos over wireless networks will soon be realized. However, wireless networks are
unreliable and hence a good rate control mechanism is essential to avoid issues such
as buffer overload and frame drop. In Chapter 6, we briefly mentioned using ROI,
RORI and RONI MB labels to perform rate control. Extending this further, we could consider
completely skipping RORI regions during severe network packet loss events. Another
approach would be to compress face image regions in only a few frames of each pedestrian as
they move across the scene. We plan to explore multiple such rate control schemes for ROI
surveillance coding in the future.
Multi resolution coding applied to ROI coding of surveillance videos is another interest-
ing approach that needs to be studied in detail. Although QP based ROI coding is supported
by standards, using lower resolution for regions of interest can provide better RD perfor-
mance at low bitrate. Encoding a high resolution video at high QP will cause artifacts such
as blocking, contouring and ringing. Instead, encoding a down-sampled video will result
only in a blurred output. Also, since the resolution is lower, the complexity of the encoder
is reduced.
7.1.3 360° surveillance video coding
Many surveillance camera vendors have started offering high resolution 360° cameras. Very
wide angle fish eye lenses are used to capture the scene. The 360° image content can
be represented in different layouts, for example, equirectangular, raw fisheye output or
cubemap representation. The choice of representation affects the compression performance
of the video encoder as well as the accuracy of the algorithms used to perform analytics. For
example, in the equirectangular layout (i.e. a world map layout), the image is distorted
near the poles. Hence, a thorough study of these issues will help in defining the entire
processing pipeline for such cameras.
7.1.4 HDR surveillance video coding
While high dynamic range (HDR) imaging of static scenes has existed for many years,
commercial HDR cameras (e.g. HC-WXF990 by Panasonic) that capture two images at different
exposure settings have appeared recently. Surveillance in particular can benefit immensely
from HDR, for example, in surveillance footage which has mixed lighting conditions (bright
sunlight on one side and dark shadows on another region in the same image). Compressing
HDR video content requires a larger number of bits. Also, as the cost of thermal imagers
falls, they will be adopted in commercial surveillance systems. These imagers commonly
output 14 bits of data per pixel. Registering color pictures with thermal imagery and ef-
ficiently coding them is a very interesting challenge which will emerge soon. Backward
compatibility (i.e. with existing codecs) is another key challenge that needs to be addressed
when developing video coding techniques for HDR.
Appendix A
Alternate derivation of the Speeded
up GMM update
The Gaussian mixture distribution of the pixel $x$ (we have dropped the time index $t$ here only
for the sake of clarity) formulated in terms of discrete latent variables $z$ is shown in Eqn.
A.1 [128]. Here $z$ is a $K$ dimensional binary random variable having a 1-of-$K$ representation
($z = [z_1, z_2, \ldots, z_K]^T$). $z_k = 1$ indicates that the pixel $x$ was generated from the $k$th mode of
the mixture model.
\[
p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\left(x \mid \mu_k, \sigma_k^2\right) \tag{A.1}
\]
From the EM algorithm, the weight update at time instant $t$ is given by Eqn. A.2 [128].
$\gamma(z_k(i))$ is the posterior probability of $z_k(i) = 1$ ($i$ is the time index) once we have observed
the incoming pixel $x(i)$. $\gamma(z_k(i))$ can also be interpreted intuitively as the responsibility that
the mode $k$ takes to explain away the pixel data at time instant $i$.
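\begin{align}
w_k(t) &= \frac{\sum_{i=1}^{t} \gamma(z_k(i))}{t} \tag{A.2} \\
&= \frac{\sum_{i=1}^{t-1} \gamma(z_k(i))}{t} + \frac{\gamma(z_k(t))}{t} \tag{A.3} \\
&= \frac{(t-1)\, w_k(t-1)}{t} + \frac{\gamma(z_k(t))}{t} \tag{A.4}
\end{align}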
Stauffer et al. [27] proposed to set $\gamma(z_k(t))$ to 1 for the mode $k$ which matched the
incoming pixel. The responsibility for other modes is set to 0. Also, $(1/t)$ is set equal to
$\alpha$ as proposed in [27]. Hence, the weight update when pixel $x(t)$ matched the mode $k$ is
obtained as shown in Eqn. A.5:
\[
w_k(t) = (1-\alpha)\, w_k(t-1) + \alpha \tag{A.5}
\]
Similarly, when pixel $x(t)$ does not match the mode $k$, the weight update equation is:
\[
w_k(t) = (1-\alpha)\, w_k(t-1) \tag{A.6}
\]
Assume that all the past $T_w$ pixel samples, i.e. $x(t-T_w) \ldots x(t-1)$, matched the same
mode $k$. The cumulative weight update at the end of the current frame is:
\begin{align}
w_k(t) &= [[[w_k(t-T_w)(1-\alpha) + \alpha](1-\alpha) + \alpha] \ldots] \tag{A.7} \\
&\approx w_k(t-T_w)(1-\alpha)^{T_w} + T_w\alpha \tag{A.8} \\
&\approx w_k(t-T_w)(1-T_w\alpha) + T_w\alpha \tag{A.9}
\end{align}
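The step from Eqn. A.7 to Eqn. A.8 can be made explicit by summing the geometric series that the nested brackets produce. The following is a sketch using the same symbols; the condition $T_w\alpha \ll 1$ is the usual requirement for the first-order binomial approximation.

```latex
\begin{align*}
w_k(t) &= w_k(t-T_w)(1-\alpha)^{T_w} + \alpha \sum_{j=0}^{T_w-1} (1-\alpha)^{j} \\
       &= w_k(t-T_w)(1-\alpha)^{T_w} + 1 - (1-\alpha)^{T_w} \\
       &\approx w_k(t-T_w)(1-\alpha)^{T_w} + T_w\alpha
       \qquad \text{since } (1-\alpha)^{T_w} \approx 1 - T_w\alpha \text{ for } T_w\alpha \ll 1
\end{align*}
```

Applying the same first-order approximation to the factor multiplying $w_k(t-T_w)$ gives Eqn. A.9.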
Similarly, for the case where none of the $T_w$ pixel samples $x(t-T_w) \ldots x(t-1)$ matched
the mode $k$, the cumulative weight update at the end of the current frame is:
\begin{align}
w_k(t) &= w_k(t-T_w)\,(1-\alpha)^{T_w} \tag{A.10} \\
&\approx w_k(t-T_w)\,(1-T_w\alpha) \tag{A.11}
\end{align}
Now, we propose to ignore the order in which the pixel samples $x(t-T_w) \ldots x(t-1)$
have arrived in the $T_w$ frames. Hence, for the case when $N_k$ matches occur to a pixel mode
$k$ in $T_w$ frames, we can multiply the two Eqs. (A.9) & (A.11) with suitably modified count
values to obtain the final heuristic in Eq. (A.12). Here, $N_k$ is equal to weightCount$_k$, which
is the number of pixels in the $T_w$ window that matched the $k$th mode. We also append the
$\alpha c_T$ term from [3] to enable dynamic selection of the number of modes in the GMM.
\[
w_k(t) = [(1-N_k\alpha)\, w_k(t-T_w) + N_k\alpha]\,[1-(T_w-N_k)\alpha] - T_w\alpha c_T \tag{A.12}
\]
Ignoring higher powers of $\alpha$ in Eqn. A.12, we obtain Eqn. A.13, which is identical to the
update equation that was derived in Chapter 3.
\[
w_k(t) = (1-T_w\alpha)\, w_k(t-T_w) + N_k\alpha - T_w\alpha c_T \tag{A.13}
\]
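The quality of this approximation can be checked numerically by comparing the per-frame updates of Eqns. A.5 and A.6 with the single batched update of Eqn. A.13. The Python sketch below is illustrative only: the values of `alpha`, `t_w`, the initial weight and the match pattern are hypothetical, and subtracting `alpha * c_t` on every frame in the per-frame version is our reading of how the term from [3] is applied.

```python
def per_frame_update(w, matches, alpha, c_t=0.0):
    """Apply Eqn. A.5 (match) or A.6 (no match) once per frame.
    `matches` is a list of booleans, one per frame in the window."""
    for matched in matches:
        w = (1.0 - alpha) * w + (alpha if matched else 0.0) - alpha * c_t
    return w


def batched_update(w, n_k, t_w, alpha, c_t=0.0):
    """Eqn. A.13: one update per window of t_w frames with n_k matches."""
    return (1.0 - t_w * alpha) * w + n_k * alpha - t_w * alpha * c_t


alpha, t_w = 0.001, 8                     # hypothetical learning rate and window
matches = [True, False, True, True, False, True, True, True]
w0 = 0.4                                  # hypothetical initial weight

exact = per_frame_update(w0, matches, alpha)
approx = batched_update(w0, sum(matches), t_w, alpha)
print(exact, approx, abs(exact - approx))
```

The difference between the two results is second order in `t_w * alpha`, so it stays small as long as the window is short relative to `1 / alpha`.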
Appendix B
Sampler design
Fig. B.1 shows the sampler architecture proposed in Chapter 4. The sparse and dense
samplers are simple systematic samplers [131, 132]. We first study general systematic
samplers used in the context of skip detection. We then use these results to further motivate
the architecture we proposed.
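[Flow chart blocks: Sparse Sampler + BGS; Morphological dilation using a 3x3 element; Dense Sampler + BGS; Erosion using a 2x1 element; 1 frame delay. Stages: Salient MB detection; BG MB detection. Signals: Image, $F_{sparse}$, $F'_{sal}$, $F_{sal}$, $I_{FG}$, $B_{current}$, $B_{prev}$, $F_{prev} = (B_{prev})^C$. The sparse and dense samplers are marked as systematic samplers.]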
Figure B.1: GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-MD’)
flow chart
B.1 Analysis of a simple systematic sampler
For sake of simplicity, let us consider a set of 1D pixels shown in Fig. B.2. The set of pixels
in the 1D image is divided into blocks of length B. The skip detector has to determine the set
of blocks that contain foreground objects. In the context of estimation, we have noted (in
Chapter 4) that sampled locations which are spread apart reduce the correlation between
the chosen data and hence reduce the variation of population averages. Natural populations
have highly correlated properties and hence uniform systematic sampling patterns which
spread out sample locations perform well. For the current task of skip detection, we now
show that such uniform samplers minimize the size of the largest object that is missed.
[Figure B.2: 1D sampler. Labels: block of pixels (block size = B), sampled pixels
(stride length = dsys), object size = L.]
B.1.1 Uniform versus Non Uniform sampling patterns
Let NB denote the total number of samples chosen in a block. For simplicity, let us assume
that the points at the boundaries of each block are always sampled and that the block length
is a multiple of the systematic sampler stride. Let dsys be the stride length of the uniform
systematic sampler.
\[ B = N_B\, d_{sys} \tag{B.1} \]
Let us now consider a non uniform sampling pattern. Let the inter pixel spacing of this
pattern be represented by di,i+1. Here di,i+1 is the distance between the ith sampled pixel and its
next neighbour in raster scan order.
\[ B = \sum_{i=1}^{N_B - 1} d_{i,i+1} \tag{B.2} \]
Let Dsys & DnonSys be the size of the largest object which can be missed by sampling
using the uniform systematic and non-uniform samplers respectively.
\begin{align}
D_{sys} &= d_{sys} - 2 \tag{B.3} \\
D_{nonSys} &= \max_{1 \le i \le N_B - 1} \left( d_{i,i+1} - 2 \right) \tag{B.4}
\end{align}
We need to minimize the maximum separation between two sampled points. We need
to show that DnonSys ≥ Dsys, or that
\[ \max_{1 \le i \le N_B - 1} \left( d_{i,i+1} \right) \ge d_{sys} \tag{B.5} \]
We prove this by contradiction as follows:
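Assume that
\[ \max_{1 \le i \le N_B - 1} \left( d_{i,i+1} \right) < d_{sys} \tag{B.6} \]
From Eqns. B.1 & B.6, we obtain
\begin{align}
\sum_{i=1}^{N_B - 1} d_{i,i+1} &< N_B\, d_{sys} \tag{B.7} \\
&= B \tag{B.8}
\end{align}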
This is in contradiction with Eqn. B.2. Hence, uniform systematic samplers minimize
the size of the largest object that is missed.
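The result can also be checked by exhaustive search on a small example. The sketch below is illustrative only: the block length, the number of gaps and the helper names are hypothetical, and the stride is taken as the block length divided by the number of gaps. It enumerates every way of splitting the block into positive inter sample gaps and confirms that no pattern beats the uniform one under the measure of Eq. (B.4).

```python
def compositions(total, parts):
    """Yield every way to split `total` pixels into `parts` positive gaps."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def largest_missed_object(gaps):
    """Eq. (B.4): size of the largest object that fits between two samples."""
    return max(gaps) - 2


block_len, n_gaps = 24, 4            # hypothetical example values
stride = block_len // n_gaps         # uniform systematic stride
uniform = (stride,) * n_gaps

# Smallest achievable value of Eq. (B.4) over all sampling patterns.
best = min(largest_missed_object(g) for g in compositions(block_len, n_gaps))
print(largest_missed_object(uniform), best)
```

Both printed values coincide, i.e. the uniform pattern attains the minimum over all patterns with the same number of gaps.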
B.1.2 Uniform systematic sampler accuracy
Let hpix be the pixel level classifier used to determine presence of an object i.e. hpix = 1
if an object has been detected by the pixel and hpix = 0 otherwise. Let ypix denote the
pixel level, true class label. Similarly, let h & y be the classifier & true class label of a block
respectively. Let the false negative probability (or the ‘miss’ rate) of h be pmiss. Let the false
positive probability (or the ‘false alarm’ rate) of h be pFA.
Pixel level accuracy: In Chapter 4, the system noise (camera and environment noise)
is modelled using a Gaussian mixture distribution. The Mahalanobis distance between the
pixel value and the GMM mode is thresholded to classify the pixel as BG/FG (For simplicity,
we assume the background process to be unimodal). Let the true difference between the
foreground object and the background image be vdif . Here, we are assuming that the object
and the background are uniform. The imager output represented by vI is equal to the sum
of vdif and the system noise vnoise, where vnoise is a Gaussian distributed random variable with
zero mean and standard deviation σnoise. We assume that the system noise is stationary and has
been correctly estimated by the GMM model using the EM algorithm. Let T be the threshold used by
the GMM algorithm to decide whether the pixel is FG/BG, i.e. a pixel is marked as FG if |vI| is
greater than Tσnoise. The ‘miss’ probability can be written as
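\begin{align}
p(h_{pix} = 0 \mid y_{pix} = 1) &= p\left( \left| \frac{v_I}{\sigma_{noise}} \right| < T \right) \tag{B.9} \\
&= p\left( \left| \frac{v_{dif} + v_{noise}}{\sigma_{noise}} \right| < T \right) \tag{B.10} \\
&= p\left( -T < \frac{v_{dif} + v_{noise}}{\sigma_{noise}} < T \right) \tag{B.11} \\
&= p\left( -v - T < \frac{v_{noise}}{\sigma_{noise}} < -v + T \right) \tag{B.12} \\
&= \Phi(-v + T) - \Phi(-v - T) \tag{B.13}
\end{align}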
Here, vdif has been normalized with respect to σnoise to obtain v. Φ() is the cumulative
distribution function of the standard normal distribution. Similarly, we can obtain the false
alarm probability as follows
\begin{align}
p(h_{pix} = 1 \mid y_{pix} = 0) &= p\left( \left| \frac{v_{noise}}{\sigma_{noise}} \right| > T \right) \tag{B.15} \\
&= 2\,(1 - \Phi(T)) \tag{B.16}
\end{align}
By varying the threshold T , we can obtain the Receiver Operating Characteristic (ROC)
curve for the pixel level classifier hpix as shown in Fig. B.3.
[Figure B.3: ROC curve of the pixel level classifier for different values of v (or normalized
signal level). Axes: false positive rate vs. true positive rate; curves for v = 1, 2, 3 and 4.]
From Fig. B.3, we can observe that the accuracy of the pixel level classifier reduces
drastically when the normalized signal level v is less than 2. Hence, in images with high
noise, i.e. high σnoise, the contrast between the foreground and the background, i.e. vdif
needs to be very high in order to obtain good pixel level classification accuracy.
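The pixel level ROC points follow directly from Eqs. (B.13) and (B.16). The sketch below is a minimal illustration under the same assumptions as the text (unimodal background, zero mean Gaussian noise); the function names and the threshold grid are hypothetical.

```python
import math


def phi(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pixel_miss(v, t):
    """Eq. (B.13): probability that a foreground pixel is classified as BG."""
    return phi(-v + t) - phi(-v - t)


def pixel_false_alarm(t):
    """Eq. (B.16): probability that a background pixel is classified as FG."""
    return 2.0 * (1.0 - phi(t))


def roc_points(v, thresholds):
    """(false positive rate, true positive rate) pairs for one signal level v."""
    return [(pixel_false_alarm(t), 1.0 - pixel_miss(v, t)) for t in thresholds]


thresholds = [0.5 * i for i in range(11)]      # T from 0 to 5, hypothetical grid
for v in (1, 2, 3, 4):
    fpr, tpr = roc_points(v, thresholds)[4]    # operating point at T = 2
    print(f"v={v}: FPR={fpr:.4f}, TPR={tpr:.4f}")
```

At a fixed threshold the false positive rate does not depend on v, while the true positive rate rises with v, which is the behaviour summarized by the ROC curves.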
Block level accuracy: In a uniform systematic grid, let the number of points sampled
inside the object be represented by N . Let the object size (in pixels) be denoted by L. N
can take either of the two values n1 or n2, based on the location of the object relative to
the stride locations.
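\begin{align}
n_1 &= \left\lfloor \frac{L}{d_{sys}} \right\rfloor \tag{B.18} \\
n_2 &= \left\lfloor \frac{L}{d_{sys}} + 1 \right\rfloor \tag{B.19}
\end{align}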
The probability distribution of N is given by
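\[ p(N = n) = \begin{cases}
\dfrac{d_{sys} - (L \,\%\, d_{sys})}{d_{sys}}, & \text{if } n = n_1 \\
\dfrac{L \,\%\, d_{sys}}{d_{sys}}, & \text{if } n = n_2 \\
0, & \text{otherwise}
\end{cases} \tag{B.20} \]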
Similarly, let the number of points sampled inside a block be represented by M . The
probability distribution of M is given by
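\[ p(M = m) = \begin{cases}
\dfrac{d_{sys} - (B \,\%\, d_{sys})}{d_{sys}}, & \text{if } m = m_1 \\
\dfrac{B \,\%\, d_{sys}}{d_{sys}}, & \text{if } m = m_2 \\
0, & \text{otherwise}
\end{cases} \tag{B.21} \]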
where m1 & m2 are
\begin{align}
m_1 &= \left\lfloor \frac{B}{d_{sys}} \right\rfloor \tag{B.22} \\
m_2 &= \left\lfloor \frac{B}{d_{sys}} + 1 \right\rfloor \tag{B.23}
\end{align}
We can now compute the probability of the classifier ‘h’ missing a true FG block as
follows (we have assumed that the noise in different pixels is independent).
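\begin{align}
p_{miss} &= p(h = 0 \mid y = 1) \tag{B.24} \\
&= \sum_{n=0}^{L} \left[ p(h_{pix} = 0 \mid y_{pix} = 1) \right]^{n} p(N = n) \tag{B.25} \\
&= \left[ p(h_{pix} = 0 \mid y_{pix} = 1) \right]^{n_1} p(N = n_1) \tag{B.26} \\
&\quad + \left[ p(h_{pix} = 0 \mid y_{pix} = 1) \right]^{n_2} p(N = n_2) \tag{B.27}
\end{align}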
The probability of the classifier ‘h’ marking a BG block as foreground is
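\begin{align}
p_{FA} &= p(h = 1 \mid y = 0) \tag{B.28} \\
&= 1 - p(h = 0 \mid y = 0) \tag{B.29} \\
&= 1 - \sum_{m=0}^{B} \left[ p(h_{pix} = 0 \mid y_{pix} = 0) \right]^{m} p(M = m) \tag{B.30} \\
&= 1 - \left[ p(h_{pix} = 0 \mid y_{pix} = 0) \right]^{m_1} p(M = m_1) \tag{B.31} \\
&\quad - \left[ p(h_{pix} = 0 \mid y_{pix} = 0) \right]^{m_2} p(M = m_2) \tag{B.32}
\end{align}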
The miss rate and the false alarm rate are plotted against the stride parameter dsys in
Fig. B.4. Similarly, we have also plotted the miss rate and the false alarm rate against the
pixel level classifier threshold T in Fig. B.5. From these experiments, we can observe that
increasing the stride or the threshold increases the miss rate of the skip detector. Setting
dsys or T to a very low value increases the false alarm rate. In the next section, we use these
observations to motivate the proposed multi stage FG MB detector.
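The trend with respect to the stride can be reproduced with a short computation. The sketch below is illustrative rather than the code used for the plots: it evaluates the block level miss rate of Eqs. (B.24) to (B.27) and the false alarm rate of Eqs. (B.28) to (B.30) with v = 3, L = 20 and T = 2.5. The block length of 16 pixels and the function names are assumptions made here.

```python
import math


def phi(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_count_pmf(length, stride):
    """Eqs. (B.18)-(B.23): distribution of the number of samples in `length` pixels."""
    low = length // stride
    p_high = (length % stride) / stride
    return {low: 1.0 - p_high, low + 1: p_high}


def block_miss_rate(v, t, obj_len, stride):
    """Eqs. (B.24)-(B.27): every sample that lands inside the object is missed."""
    pix_miss = phi(-v + t) - phi(-v - t)
    pmf = sample_count_pmf(obj_len, stride)
    return sum(pix_miss ** n * p for n, p in pmf.items())


def block_false_alarm_rate(t, block_len, stride):
    """Eqs. (B.28)-(B.30): at least one background sample is marked as FG."""
    pix_bg = 1.0 - 2.0 * (1.0 - phi(t))        # p(h_pix = 0 | y_pix = 0)
    pmf = sample_count_pmf(block_len, stride)
    return 1.0 - sum(pix_bg ** m * p for m, p in pmf.items())


V, T, OBJ_LEN = 3.0, 2.5, 20
BLOCK_LEN = 16                                  # assumed block length in pixels
for stride in (5, 10, 20, 40):
    miss = block_miss_rate(V, T, OBJ_LEN, stride)
    false_alarm = block_false_alarm_rate(T, BLOCK_LEN, stride)
    print(f"stride={stride}: miss={miss:.4f}, false alarm={false_alarm:.4f}")
```

For these strides the miss rate grows and the false alarm rate shrinks as the stride increases, in line with the observations above.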
[Figure B.4: Sampler accuracy for different values of stride length (v = 3, L = 20, T = 2.5).
Horizontal axis: stride; curves: miss rate and false positive rate.]
[Figure B.5: Sampler accuracy for different values of pixel level classifier threshold (v = 3,
L = 20, dsys = 10). Horizontal axis: T; curves: miss rate and false positive rate.]
B.2 Analysis of the proposed sampler
Based on the analysis of the uniform systematic sampler in the previous section, we now
motivate the architecture of the proposed sampler shown in Fig. B.1. From Figs. B.4 &
B.5, we note that the tradeoff between the miss rate & the false alarm rate is determined
by the selection of the sampler stride & pixel classifier threshold parameters. Setting the
stride or threshold to a low value would reduce the miss rate but would also increase the
number of false alarms. This would result in an increase in the bitrate of the encoded video.
Reducing the stride value also increases the complexity of the detector since a larger number
of pixels will need to be classified.
To achieve high accuracy with low computational costs, we use a stratified adaptive
cluster sampling scheme. The sampler consists of two stages in which the stride of the
first stage is set to a relatively high value. Hence, a large number of BG MBs would be
filtered out by the first stage. However, the increased stride setting would result in a few
missed FG blocks. To mitigate this issue, we stratify the image. Consider FG blocks and
their neighbours in the previous frame. Such blocks are likely to contain foreground objects
in the current frame. Neighbours of blocks which were detected as FG in the current frame
are also likely to contain foreground objects. Hence, such blocks are sampled in the second
stage with a low stride setting. Although the false alarm rate of the second sampler is high,
this does not affect the overall system performance. This is because the first sampler would
have already filtered a large fraction of the BG blocks.
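The two stage idea can be sketched on the 1D image used in Section B.1. This is a simplified illustration and not the architecture of Fig. B.1 itself: the morphological dilation and erosion steps are omitted, the pixel classifier is replaced by a hypothetical noise free predicate `is_fg`, and all names and parameter values are assumptions.

```python
def detect_fg_blocks(is_fg, n_pixels, block, sparse_stride, dense_stride, prev_fg_blocks):
    """Return the set of block indices flagged as foreground in a 1D image."""
    n_blocks = n_pixels // block

    def block_hit(b, stride):
        # A block is flagged if any sampled pixel inside it is classified as FG.
        start = b * block
        return any(is_fg(x) for x in range(start, start + block, stride))

    # Stage 1: sparse sampling of every block filters out most BG blocks.
    fg_blocks = {b for b in range(n_blocks) if block_hit(b, sparse_stride)}

    # Stratum for stage 2: FG blocks of the previous frame, FG blocks found in
    # the current frame, and the neighbours of both.
    seeds = set(prev_fg_blocks) | fg_blocks
    candidates = {n for b in seeds for n in (b - 1, b, b + 1) if 0 <= n < n_blocks}

    # Stage 2: dense sampling restricted to the candidate blocks.
    fg_blocks |= {b for b in candidates - fg_blocks if block_hit(b, dense_stride)}
    return fg_blocks


object_pixels = {37, 38}                       # small object inside block 2
is_fg = object_pixels.__contains__
with_history = detect_fg_blocks(is_fg, 64, 16, 8, 2, prev_fg_blocks={1})
without_history = detect_fg_blocks(is_fg, 64, 16, 8, 2, prev_fg_blocks=set())
print(with_history, without_history)
```

In this example the sparse stage alone misses the two pixel object, while the dense stage recovers it because the neighbouring block was foreground in the previous frame.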
Appendix C
Bilattice logic based inference
In the influential paper [209] by Kripke, a partial truth assignment of atomic sentences is
performed. Kripke also introduced a method to extend partial truth assignments to non-
atomic formulas. A partial order on the valuations based on the information was also
defined. Fitting [210, 211] recognized that there are in fact two orderings, one involving
the information (described by Kripke) and the other for the truth. In the strong 3-Valued
logic (T, F,N) (where the third state N is considered as undefined) introduced by Kleene,
the ∧ & ∨ operators are definable using the ordering involving truth. As we show later in
this section, this idea of ordering involving truth and information is central to performing
inference using Bilattice logic.
Belnap in [212, 213] extended Kleene’s 3-Valued logic and introduced a four valued logic
system. Ofer Arieli et al. argue that four valued logic is useful in inference problems.
However, in the present context of applying multi valued logics to inference, the next important
development came when Ginsberg in [202] introduced the concept of Bilattices. Bilattices
are algebraic structures that generalize Belnap’s four valued logic. In fact, the lattice
formed by Belnap’s four valued logic is isomorphic to the simplest Bilattice. The theory of
applying bilattices to inference has been described in detail in [202, 173, 214]. Shet [215] in
his PhD thesis proposed to use Bilattice based reasoning to perform pedestrian detection,
aerial object detection and identity maintenance tasks. We sketch only the most important
parts of this inference technique and refer the reader to the original references for more
details.
Definition C.1. A poset (or partially ordered set) is an ordered pair P = (X,≤) where
X is a set and ≤ is a partial order (i.e. a binary relation which is reflexive, transitive &
antisymmetric).
Definition C.2. A lattice L is a poset in which every pair of elements x, y ∈ L has
• a least upper bound x ∨ y ∈ L (called join)
• a greatest lower bound x ∧ y ∈ L (called meet)
that is,
• x ∨ y ≤ z ⇐⇒ x ≤ z and y ≤ z
• z ≤ x ∧ y ⇐⇒ z ≤ x and z ≤ y
Here, ‘least upper bound’ and ‘greatest lower bound’ are idempotent, commutative and
associative binary operations.
Definition C.3. A lattice L is said to be complete iff there exists a unique lub and glb for
every nonempty subset M of L.
Definition C.4. A bilattice is a quadruple B = (B,≤t,≤k,¬) where B is a non empty set, ≤t
& ≤k are partial orderings and ¬ is a mapping from B to itself, such that:
• (B,≤t), (B,≤k) are complete lattices
• x ≤t y =⇒ ¬y ≤t ¬x
• x ≤k y =⇒ ¬x ≤k ¬y
• ¬¬x = x.
Before we provide a formal description of using bilattices for inference, we will first
provide some intuition. Fig. C.1 shows double Hasse diagrams of two valued, four valued
and continuous square bilattices. The y axis value represents the information available
about the formula and the x axis value represents the belief (i.e. whether the wff is true or
false). The two partial orders of the bilattice B, defined on elements in the set B, can be
interpreted as follows:
• The ≤t is a partial order based on belief, i.e. wff’s which have a larger probability of
being true are placed higher in the ordering.
• ≤k is a partial ordering on information content, i.e. logic values which contain more
information are placed higher in the ordering.
The logic represented by the trivial two valued bilattice shown in Fig. C.1a is identical to
that used in classical propositional calculus. Fig. C.1b shows Belnap’s four valued logic
bilattice which adds ⊥ & ⊤ to the classical two valued calculus. Here, ⊥ represents the truth
value of wff’s about which we do not have any information and ⊤ represents contradiction
in data. Although Belnap’s four valued logic accommodates contradictory sources of
data, it does not allow representation of continuous uncertainty values or probabilities.
The continuous square bilattice [202] solves this problem by extending Belnap’s four
valued logic to include continuous truth value assignments. Each wff is assigned a truth
value equal to 〈p, q〉 where p, q ∈ [0, 1], i.e. the elements of the set B of the bilattice are
ordered pairs. Here p represents the probability or confidence that the wff is true and
q represents the probability that it is false. Logic values are assigned to detector scores
and inference rules. Since detector scores and rule weights are normalized to 1, p & q are
allowed to take values in the range [0, 1]. One important point to note here is that no logical
consistency is imposed on the logical value 〈p, q〉, i.e. q need not be equal to 1− p. Hence, q
here represents the evidence against the logical statement and not merely the lack of belief
of a proposition. The square bilattice is shown in Fig. C.1c. Here, ⊥ = 〈0, 0〉 represents no
information about the wff and ⊤ = 〈1, 1〉 represents contradiction in the data (i.e. some
data suggests the proposition to be true and the rest claims it to be false).
Each detector score (e.g. head part filter score, torso part filter score, super pixel skin
score) is mapped to a point in the bilattice. For example, head(X,Y, S) used to represent the
detection score of the head part filter (at location X, Y and scale S) takes a value 〈0.8, 0.2〉
for the pedestrian image in Fig. C.1c. Likewise, logical weights are assigned to rules that
are used to perform reasoning. For example, φ(pedestrian(X,Y, S) ← head(X,Y, S)) rep-
resents the sentence: ‘detection of a head part at location X, Y and scale S indicates the
existence of a pedestrian’. The weight for this rule is assigned a value 〈0.7, 0.3〉. As we show
later in this section, all these detector scores and reasoning rules are combined to obtain
the final inferred truth values of an image region. Fig. C.1c shows representative scores of
pedestrian and non pedestrian image regions marked inside the bilattice. We observe that
the logical assignment to the pedestrian image is closer to the ‘true’ (i.e. 〈1, 0〉) value. In
contrast, the assignment to the background image is closer to the ‘false’ (i.e. 〈0, 1〉) value.
As already discussed for the case of a regular bilattice, the x & y axes values impose an
ordering of the wff’s. As an example, we can write 〈0.8, 0.2〉 <k 〈0.87, 0.27〉 and 〈0.8, 0.2〉 <t
〈0.87, 0.13〉. This ordering is illustrated in Fig. C.2. Here v is the value of a wff. The shaded
rectangle in Fig. C.2b contains the logic values which are placed lower than v by the ‘information’
order. Similarly, the shaded rectangle in Fig. C.2a contains the logic values which are placed lower
than v by the ‘belief’ order.
As we have already seen from Defn. C.4, the negation operator flips the logic elements
around the truth axis without altering the information ordering, i.e. ¬〈p, q〉 = 〈q, p〉. Along
with the negation operator, the bilattice is also associated with another operator called
‘conflation’ which is denoted by −. The conflation operator flips the logic elements around
the information (or k) axis without altering the belief ordering.
We now discuss the formal construction of the continuous valued square bilattice used
to perform reasoning.
Definition C.5. A square bilattice is a quadruple L² = (L× L,≤t,≤k,¬) where for every
〈p1, q1〉, 〈p2, q2〉 in L²,
• L = 〈L,≤L〉 is a complete lattice
• ¬〈p1, q1〉 = 〈q1, p1〉
• 〈p1, q1〉 ≤t 〈p2, q2〉 ⇐⇒ p1 ≤L p2 and q2 ≤L q1
• 〈p1, q1〉 ≤k 〈p2, q2〉 ⇐⇒ p1 ≤L p2 and q1 ≤L q2
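The two orderings and the negation of Defn. C.5 can be written down directly for the case L = [0, 1]. The sketch below is only an illustration: truth values are represented as Python tuples (p, q), and the function names are hypothetical.

```python
def leq_t(a, b):
    """Belief order: b has at least as much evidence for and no more evidence against."""
    return a[0] <= b[0] and b[1] <= a[1]


def leq_k(a, b):
    """Information order: b has at least as much evidence for and against."""
    return a[0] <= b[0] and a[1] <= b[1]


def neg(a):
    """Negation swaps the evidence for and the evidence against."""
    return (a[1], a[0])


v = (0.8, 0.2)
more_info = (0.87, 0.27)
more_true = (0.87, 0.13)

print(leq_k(v, more_info), leq_t(v, more_true))
# Negation reverses the belief order and preserves the information order.
print(leq_t(neg(more_true), neg(v)), leq_k(neg(v), neg(more_info)))
```

The same pair of values can be comparable in one order and incomparable in the other: here v and 〈0.87, 0.27〉 are related by ≤k but not by ≤t.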
Fig. C.3 shows the construction of the square bilattice that we use for reasoning under
uncertainty. The square bilattice L² = {[0, 1] ∗ [0, 1],≤t,≤k} can be decomposed into
two lattices: (A) lattice L1 = {[0, 1] ∗ [0, 1],≤t} based on the belief ordering and (B) lattice
L2 = {[0, 1] ∗ [0, 1],≤k} based on the information content ordering. The lub & glb operators
for the lattice L1, based on the belief ordering, are denoted by ∨ (disjunction) & ∧
(conjunction) respectively. The lub & glb operators for the lattice L2 (based on the information
ordering) are represented by ⊕ & ⊗ respectively. The ⊗ is also called the consensus
operator. Informally, we can consider p ⊗ q to be the extent to which p and q agree.
Likewise, the ⊕ is called the gullibility operator. Again, informally, it represents the
operator that combines any information from different sources.
Definition C.6. The lub & glb operators along the belief axis (∨, ∧) and the lub & glb
operators along the information axis (⊕, ⊗) are defined as:
• 〈p1, q1〉 ∧ 〈p2, q2〉 = 〈p1 ∧L p2, q1 ∨L q2〉
• 〈p1, q1〉 ∨ 〈p2, q2〉 = 〈p1 ∨L p2, q1 ∧L q2〉
• 〈p1, q1〉 ⊗ 〈p2, q2〉 = 〈p1 ∧L p2, q1 ∧L q2〉
• 〈p1, q1〉 ⊕ 〈p2, q2〉 = 〈p1 ∨L p2, q1 ∨L q2〉
In Defn. C.6, the lub and glb of the square bilattice have been defined based on the
glb and lub operators of L. We now define the lub and glb operators for the lattice L to
complete the construction of the square bilattice. As noted by Shet et al. in [173], triangular
norm and conorm functions introduced by Schweizer et al. [216] are popular for reasoning
in many-valued logics. Shet et al. adopted these to construct the lub and glb operators for
the lattice L. T-norms have also been used for rule weighting in fuzzy rule based methods.
Definition C.7. A function T : [0, 1] × [0, 1] → [0, 1] is a t-norm if it satisfies the following
properties:
• Commutativity: T (a, b) = T (b, a)
• Monotonicity: T (a, b) ≤ T (c, d) if a ≤ c and b ≤ d
• Associativity: T (a, T (b, c)) = T (T (a, b), c)
• Identity element: The number 1 is the identity element i.e. T (a, 1) = a
Definition C.8. A function S : [0, 1]× [0, 1]→ [0, 1] is a t-conorm if it satisfies the following
properties:
• Commutativity: S(a, b) = S(b, a)
• Monotonicity: S(a, b) ≤ S(c, d) if a ≤ c and b ≤ d
• Associativity: S(a,S(b, c)) = S(S(a, b), c)
• Identity element: The number 0 is the identity element i.e. S(a, 0) = a
Following [173], we use T (a, b) = ab & S(a, b) = a + b − ab as the glb & lub operators
for the lattice L respectively. With this, we have now completely specified the algebraic
structure of the square bilattice that we will use to perform inference.
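With the product t-norm and its t-conorm, the four operators of Defn. C.6 reduce to a few lines of arithmetic. The following is a minimal sketch under that choice: truth values are tuples (p, q), the function names are hypothetical, and combining a detector score with a rule weight through ∧ is shown only as an example of applying the operator.

```python
def t_norm(a, b):
    """glb on L = [0, 1]: T(a, b) = ab."""
    return a * b


def t_conorm(a, b):
    """lub on L = [0, 1]: S(a, b) = a + b - ab."""
    return a + b - a * b


def conj(x, y):
    """Belief glb: <p1, q1> AND <p2, q2>."""
    return (t_norm(x[0], y[0]), t_conorm(x[1], y[1]))


def disj(x, y):
    """Belief lub: <p1, q1> OR <p2, q2>."""
    return (t_conorm(x[0], y[0]), t_norm(x[1], y[1]))


def consensus(x, y):
    """Information glb: what both sources agree upon."""
    return (t_norm(x[0], y[0]), t_norm(x[1], y[1]))


def gullibility(x, y):
    """Information lub: accepts evidence from either source."""
    return (t_conorm(x[0], y[0]), t_conorm(x[1], y[1]))


head_score = (0.8, 0.2)       # detector score used as an example in the text
rule_weight = (0.7, 0.3)      # rule weight used as an example in the text
print(conj(head_score, rule_weight))
print(gullibility(head_score, rule_weight))
```

The identity elements follow from Defns. C.7 & C.8: 〈1, 0〉 is neutral for ∧ and 〈0, 0〉 is neutral for ⊕.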
We now show the application of the bilattice formulation to reasoning under uncertainty.
The input to the inference algorithm is a set of facts, which in our case are detector scores
(e.g. head shoulder part filter score, texture based shadow super pixel score). The query is
a wff that represents the label assigned to a bounding box or super pixel (e.g. presence of
a pedestrian in a bounding box or whether a super pixel is a shadow region). All the scores
are represented as logical values in the square bilattice.
Definition C.9. Let L be the formal language where the inference is carried out. Truth
assignment is a function that assigns some truth value to each wff in L. More formally, a
truth assignment is a function φ : L→ B where B is a Bilattice on truth values.
Definition C.10. Let KB = {s1, s2, . . . , sM} be the set of sentences in the knowledge base. Let φ be
a truth assignment that labels sentences i.e. if si is a sentence, φ(si) is the truth value of si
assigned by φ. Using this truth data, we can obtain information of other sentences logically
related to si. This logical consequence, also called entailment, is denoted by |=. The closure
cl(φ) is the truth assignment that labels sentences entailed by the knowledge base.
Let su be the sentence that has to be inferred from the knowledge base. Borrowing
notations from [173, 202], we write S as the subset of L from which it is possible to derive
su, i.e. S is a set of sentences that entail su. The conjunction of sentences in S is:
\[ \bigwedge_{s \in S} cl(\phi)(s) \tag{C.1} \]
There could be multiple such sets that entail su. For example, the presence of a pedes-
trian can be indicated by a high head-shoulder part filter score and a high skin super pixel
score. Let π+(su) denote the collection of such subsets of L that entail su. Similarly, let
π−(su) denote the collection of subsets of L that entail ¬su. Ginsberg [202] showed that
the closure can be written as follows:
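\[ cl(\phi)(s_u) = \bigoplus_{S \in \pi^{+}(s_u)} \bot \vee \left[ \bigwedge_{s \in S} cl(\phi)(s) \right] \;\oplus\; \bigoplus_{S \in \pi^{-}(s_u)} \bot \wedge \left[ \neg \bigwedge_{s \in S} cl(\phi)(s) \right] \tag{C.2} \]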
Using De Morgan’s law for bilattices [202], i.e. ¬(a ∧ b) = (¬a) ∨ (¬b), and since ¬⊥ = ⊥ and
¬(a ⊕ b) = (¬a ⊕ ¬b), we can rewrite Eqn. C.2 as:
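\[ cl(\phi)(s_u) = \bigoplus_{S \in \pi^{+}(s_u)} \bot \vee \left[ \bigwedge_{s \in S} cl(\phi)(s) \right] \;\oplus\; \neg \bigoplus_{S \in \pi^{-}(s_u)} \bot \vee \left[ \bigwedge_{s \in S} cl(\phi)(s) \right] \tag{C.3} \]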
Eqn. C.3 combines, through ⊕, disjunctions of conjunctions of bilattice values. The closure operation
explicitly uses logic terms that entail su and ¬su in separate conjunction terms. This allows us
to ‘only accept’ or ‘only reject’ hypotheses based on certain facts. For example, inconsistent
geometry can cause a rejection of a pedestrian hypothesis. However, if the geometry is
consistent, it does not increase the truth value of the pedestrian detection. The formulation
we have described to determine closure is identical to that used by Shet in [173]. Reasoning
using other bilattice structures (e.g. bilattice for default logic) can be found in Shet’s thesis
[215].
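The closure of Eqn. C.3 can be evaluated with the operators of Defn. C.6 and the product t-norm. The sketch below is a hedged illustration rather than the inference engine of [173]: each set S is represented simply as a non empty list of truth values (p, q), a rule weight is treated as one more sentence of the set (an assumption made here), and the values of the rejecting evidence are hypothetical.

```python
from functools import reduce

BOTTOM = (0.0, 0.0)                 # no information


def conj(x, y):
    """Belief glb with T(a, b) = ab and S(a, b) = a + b - ab."""
    return (x[0] * y[0], x[1] + y[1] - x[1] * y[1])


def disj(x, y):
    """Belief lub."""
    return (x[0] + y[0] - x[0] * y[0], x[1] * y[1])


def gullibility(x, y):
    """Information lub."""
    return (x[0] + y[0] - x[0] * y[0], x[1] + y[1] - x[1] * y[1])


def neg(x):
    return (x[1], x[0])


def support(sentence_sets):
    """Information lub over all sets S of ( BOTTOM OR conjunction of the values in S )."""
    total = BOTTOM
    for values in sentence_sets:            # each set must be non empty
        body = reduce(conj, values)
        total = gullibility(total, disj(BOTTOM, body))
    return total


def closure(positive_sets, negative_sets):
    """Eqn. C.3: evidence for s_u combined with negated evidence for NOT s_u."""
    return gullibility(support(positive_sets), neg(support(negative_sets)))


accepting = [[(0.7, 0.3), (0.8, 0.2)]]      # rule weight and head score from the text
rejecting = [[(0.9, 0.1), (0.5, 0.5)]]      # hypothetical rule weight and score
print(closure(accepting, rejecting))
print(closure([], rejecting))               # 'only reject' evidence
```

Because each set passes through ⊥ ∨ [·], a set that entails su can only contribute evidence for su, and a set that entails ¬su can only contribute evidence against it.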
[Figure C.1: Double Hasse diagrams of different bilattices: (a) two valued logic bilattice,
(b) four valued Belnap’s bilattice, (c) square bilattice with corners 〈0,0〉, 〈1,0〉, 〈0,1〉 and 〈1,1〉.
In (c), a surveillance video frame is shown. Also, the logic values of pedestrian and non
pedestrian image regions are shown in the double Hasse diagram. Axes: ≤t (belief axis) and ≤k.]
[Figure C.2: Double Hasse diagrams show partial ordering based on belief and information
in bilattices: (a) ordering based on belief, (b) ordering based on information. Each panel marks
a logic value v inside the square bilattice with corners 〈0,0〉, 〈1,0〉, 〈0,1〉 and 〈1,1〉.]
[Figure C.3: Construction of the square bilattice. The lattice { [0,1], ≤L } has the glb and lub
operators x ∧L y = xy and x ∨L y = x + y − xy. The square bilattice { [0,1]*[0,1], ≤t, ≤k } is built
on it with the orders 〈p1,q1〉 ≤t 〈p2,q2〉 ⇔ p1 ≤L p2 and q2 ≤L q1, and 〈p1,q1〉 ≤k 〈p2,q2〉 ⇔ p1 ≤L p2
and q1 ≤L q2. Its underlying lattices { [0,1]*[0,1], ≤t } and { [0,1]*[0,1], ≤k } carry the glb and
lub operators ∧, ∨ and ⊗, ⊕ of Defn. C.6 respectively.]
Bibliography
[1] “Planning, design, installation and operation of CCTV surveillance systems: code of
practice and associated guidance,” British Security Industry Association, 2014.
[2] D.-S. Lee, “Effective Gaussian mixture learning for video background subtraction,”
IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, pp. 827–832, 2005.
[3] Z. Zivkovic, “Improved adaptive Gaussian mixture model for background subtraction,”
Int’l Conf. on Pattern Recognition, vol. 2, pp. 28–31, 2004.
[4] “https://www.mobotix.com/eng au/support/planning-tools/mx-planning-tool-
optics,” MX Planning Tool Optics.
[5] “http://resource.boschsecurity.com/documents/nbn 80122 data sheet enus
14878683787.pdf,” DINION IP ultra 8000 MP datasheet.
[6] “http://resource.boschsecurity.com/documents/npc 2000 data sheet enus
11392811915.pdf,” TINYON IP 2000.
[7] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd ed. Wiley
Publishing, 2010.
[8] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian, “Statistical modeling of complex back-
grounds for foreground object detection,” IEEE Trans. on Image Processing, vol. 13,
no. 11, pp. 1459–1472, 2004.
[9] “http://imagelab.ing.unimore.it/vssn06/.”
[10] E. Martinec, “http://www.vuezone.com,” Noise, dynamic range and bit depth in digi-
tal SLRs, 2008.
[11] x264 encoder software. [Online]. Available:
http://www.videolan.org/developers/x264.html
[12] “http://www.proxicast.com/security/security-video.htm,” LAN-Cell 3G/4G Cellular
Router for Video Surveillance.
[13] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-
generation Multimedia. New York, NY, USA: John Wiley & Sons, Inc., 2003.
[14] D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE
Trans. on Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
[15] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side infor-
mation at the decoder,” IEEE Trans. on Information Theory, vol. 22, no. 1, pp. 1–10,
1976.
[16] L. Liu, Z. Li, and E. Delp, “Efficient and low-complexity surveillance video compression
using backward-channel aware Wyner-Ziv video coding,” IEEE Trans. Circuits
Syst. Video Technol., vol. 19, no. 4, pp. 453–465, April 2009.
[17] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, “Distributed video coding,”
Proceedings of the IEEE, vol. 93, no. 1, pp. 71–83, 2005.
[18] R. Puri, A. Majumdar, and K. Ramchandran, “Prism: A video coding paradigm with
motion estimation at the decoder,” Image Processing, IEEE Transactions on, vol. 16,
no. 10, pp. 2436–2448, Oct 2007.
[19] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC
video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology,
vol. 13, no. 7, pp. 560–576, July 2003.
[20] H.264 Advanced video coding for generic audiovisual services. [Online]. Available:
http://www.itu.int/rec/T-REC-H.264
[21] Y. Lee, J. Kim, and C.-M. Kyung, “Energy-aware video encoding for image quality
improvement in battery-operated surveillance camera,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 20, no. 2, pp. 310 –318, Feb. 2012.
[22] A. Vetro, T. Haga, K. Sumi, and H. Sun, “Object-based coding for long-term archive
of surveillance video,” IEEE Int. Conf. on Multimedia and Expo, vol. 2, pp. 417–420,
2003.
[23] S.-Y. Chien, S.-Y. Ma, and L.-G. Chen, “Efficient moving object segmentation algo-
rithm using background registration technique,” IEEE Trans. Circuits Syst. Video Tech-
nol., vol. 12, no. 7, pp. 577 –586, Jul 2002.
[24] X. Jin and S. Goto, “Encoder adaptable difference detection for low power video
compression in surveillance system,” Image Commun., vol. 26, no. 3, pp. 130–142,
Mar. 2011.
[25] Z. He and D. Wu, “Resource allocation and performance analysis of wireless video
sensors,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16,
no. 5, pp. 590–599, May 2006.
[26] L.-T. Cheok and N. Gagvani, “Analytics-modulated coding of surveillance video,” in
Multimedia and Expo (ICME), 2010 IEEE International Conference on, July 2010, pp.
127–132.
[27] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time
tracking,” IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, p. 2246,
1999.
[28] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts,
and shadows in video streams,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25,
no. 10, pp. 1337–1342, 2003.
[29] K. Kim, T. Chalidabhongse, D. Harwood, and L. Davis, “Background modeling and
subtraction by codebook construction,” in Image Processing, 2004. ICIP ’04. 2004
International Conference on, vol. 5, Oct 2004, pp. 3061–3064 Vol. 5.
[30] Y. Benezeth, P. M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, “Comparative
study of background subtraction algorithms,” J. Elec. Imaging, vol. 19, no. 3, 2010.
[31] G. Guo and C. Dyer, “Patch-based image correlation with rapid filtering,” IEEE Conf.
on Comput. Vision and Pattern Recognition, 2007. CVPR ’07., pp. 1–6, 2007.
[32] Y. Yu and D. Doermann, “Model of object-based coding for surveillance video,” in
Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP ’05). IEEE Inter-
national Conference on, vol. 2, March 2005, pp. 693–696.
[33] S.-C. Hsia, C. H. Hsiao, and C.-Y. Huang, “Single-object-based segmentation and cod-
ing technique for video surveillance system,” Journal of Electronic Imaging, vol. 18,
no. 3, pp. 033 007–033 007–10, 2009.
[34] R. Venkatesh Babu and A. Makur, “Object-based surveillance video compression us-
ing foreground motion compensation,” in Control, Automation, Robotics and Vision,
2006. ICARCV ’06. 9th International Conference on, Dec 2006, pp. 1–6.
[35] H. Song and C.-C. Kuo, “A region-based H.263+ codec and its rate control for low
VBR video,” Multimedia, IEEE Transactions on, vol. 6, no. 3, pp. 489–500, June 2004.
[36] P. Baccichet, X. Zhu, and B. Girod, “Network-aware H.264/AVC region-of-interest coding
for a multi-camera wireless surveillance network,” in Picture Coding Symposium,
2006.
[37] C.-Y. Wu and P.-C. Su, “A region of interest rate-control scheme for encoding traffic
surveillance videos,” in Intelligent Information Hiding and Multimedia Signal Process-
ing, 2009. IIH-MSP ’09. Fifth International Conference on, Sept 2009, pp. 194–197.
[38] Y. Liu, Z. Li, Y. Soh, and M. Loke, “Conversational video communication of H.264/AVC
with region-of-interest concern,” in Image Processing, 2006 IEEE International Conference
on, Oct 2006, pp. 3129–3132.
[39] T. Thomas, S. Emmanuel, P. Zhang, and M. Kankanhalli, “An authentication mechanism
using Chinese remainder theorem for efficient surveillance video transmission,”
in Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International
Conference on, 2010, pp. 567–573.
[40] C. S. Kannangara, I. E. G. Richardson, M. Bystrom, J. R. Solera, Y. Zhao, A. Maclen-
nan, and R. Cooney, “Low-complexity skip prediction for H.264 through Lagrangian
cost estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 2, pp. 202–208,
2006.
[41] H. Zeng, C. Cai, and K.-K. Ma, “Fast mode decision for H.264/AVC based on mac-
roblock motion activity,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 4, pp.
491–499, 2009.
[42] A. Saha, K. Mallick, J. Mukherjee, and S. Sural, “Skip prediction for fast rate dis-
tortion optimization in H.264,” IEEE Trans. Consum. Electron., vol. 53, no. 3, pp.
1153–1160, Aug 2007.
[43] A. Kannur and B. Li, “An enhanced rate control scheme with motion assisted slice
grouping for low bit rate coding in H.264,” in Image Processing, 2008. ICIP 2008.
15th IEEE International Conference on, Oct 2008, pp. 2100–2103.
[44] H. Li, Z. Wang, H. Cui, and K. Tang, “An improved ROI-based rate control algorithm
for H.264/AVC,” in Signal Processing, 2006 8th International Conference on, vol. 2,
2006.
[45] X. Zhang, L. Liang, Q. Huang, Y. Liu, T. Huang, and W. Gao, “An efficient coding
scheme for surveillance videos captured by stationary cameras,” Visual Communica-
tions and Image Processing, 2010.
[46] X. Zhang, T. Huang, Y. Tian, and W. Gao, “Background-modeling-based adaptive
prediction for surveillance video coding,” Image Processing, IEEE Transactions on,
vol. 23, no. 2, pp. 769–784, Feb 2014.
[47] M. Paul, W. Lin, C.-T. Lau, and B.-S. Lee, “Explore and model better i-frames for video
coding,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 21,
no. 9, pp. 1242–1254, Sept 2011.
[48] M. Paul, W. Lin, C. Lau, and B.-S. Lee, “Video coding using the most common frame
in scene,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International
Conference on, March 2010, pp. 734–737.
[49] T. Totozafiny, O. Patrouix, F. Luthon, and J.-M. Coutellier, “Dynamic background segmentation
for remote reference image updating within motion detection JPEG2000,”
in Industrial Electronics, 2006 IEEE International Symposium on, vol. 1, July 2006,
pp. 505–510.
[50] S. Han, X. Zhang, Y. Tian, and T. Huang, “An efficient background reconstruction
based coding method for surveillance videos captured by moving camera,” in Ad-
vanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International
Conference on, Sept 2012, pp. 160–165.
[51] “http://www.onvif.org/,” ONVIF standards.
[52] “http://www.onvif.org/,” PSIA: Physical Security Interoperability Alliance.
[53] V. Chellappa, P. Cosman, and G. Voelker, “Dual frame motion compensation with
uneven quality assignment,” in Proc. Data Compression Conference, DCC 2004, pp.
262–271.
[54] M. Tiwari and P. Cosman, “Selection of long-term reference frames in dual-frame
video coding using simulated annealing,” IEEE Signal Process. Lett., vol. 15, pp. 249–
252, 2008.
[55] D. Liu, D. Zhao, X. Ji, and W. Gao, “Dual frame motion compensation with optimal
long-term reference frame selection and bit allocation,” IEEE Trans. Circuits Syst.
Video Technol., vol. 20, no. 3, pp. 325–339, March 2010.
[56] B. Li, J. Xu, H. Li, and F. Wu, “Optimized reference frame selection for video coding
by cloud,” in IEEE Int. Workshop on Multimedia Signal Process. (MMSP). IEEE, 2011,
pp. 1–5.
[57] H. Li, B. Li, and J. Xu, “Rate-distortion optimized reference picture management for
high efficiency video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12,
pp. 1844–1857, 2012.
[58] X. Zhang, Y. Tian, T. Huang, S. Dong, and W. Gao, “Optimizing the hierarchical prediction
and coding in HEVC for surveillance and conference videos with background
modeling,” Image Processing, IEEE Transactions on, vol. 23, no. 10, pp. 4511–4526,
Oct 2014.
[59] D. Grois and O. Hadar, Recent Advances on Video Coding, D. J. D. S. Lorente, Ed.
InTech, 2011.
[60] I. Fernandez, P. Rondao Alface, T. Gan, R. Lauwereins, and C. De Vleeschouwer,
“Integrated H.264 region-of-interest detection, tracking and compression for surveillance
scenes,” in Packet Video Workshop (PV), 2010 18th International, Dec 2010, pp.
17–24.
[61] N. Doulamis, A. Doulamis, D. Kalogeras, and S. Kollias, “Low bit-rate coding of image
sequences using adaptive regions of interest,” Circuits and Systems for Video Technol-
ogy, IEEE Transactions on, vol. 8, no. 8, pp. 928–934, Dec 1998.
[62] Z. Bojkovic and D. Milovanovic, “Multimedia coding using adaptive regions of inter-
est,” in Neural Network Applications in Electrical Engineering, 2004. NEUREL 2004.
2004 7th Seminar on, Sept 2004, pp. 67–71.
[63] C. Bulla, A. Steiger, and P. Hosten, “Realtime object detection & tracking for ROI
encoding,” in International Workshop on Acoustic Signal Enhancement IWAENC’12,
Aachen, Germany, Sep. 2012.
[64] C. Bulla, C. Feldmann, and M. Schink, “Region of interest encoding in video confer-
ence systems,” in Proc. of International Conference on Advances in Multimedia MME-
DIA’13, Venice, Italy, Apr. 2013, pp. 119–124.
[65] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vision,
vol. 57, no. 2, pp. 137–154, May 2004.
[66] M.-C. Chi, M.-J. Chen, and C.-T. Hsu, “Region-of-interest video coding by fuzzy control
for H.263+ standard,” in Circuits and Systems, 2004. ISCAS ’04. Proceedings of
the 2004 International Symposium on, vol. 2, May 2004, pp. II–93–6 Vol.2.
[67] Y. Liu, Z. G. Li, and Y. C. Soh, “Region-of-interest based resource allocation for conversational
video communication of H.264/AVC,” Circuits and Systems for Video Technology,
IEEE Transactions on, vol. 18, no. 1, pp. 134–139, Jan 2008.
[68] S.-F. Huang, M.-J. Chen, K.-H. Tai, and M.-S. Li, “Region-of-interest determination
and bit-rate conversion for H.264 video transcoding,” EURASIP Journal on
Advances in Signal Processing, vol. 2013, no. 1, 2013. [Online]. Available:
http://dx.doi.org/10.1186/1687-6180-2013-112
[69] D. Chai and K. Ngan, “Face segmentation using skin-color map in videophone appli-
cations,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 9, no. 4,
pp. 551–564, Jun 1999.
[70] D. Wu, S. Ci, H. Luo, Y. Ye, and H. Wang, “Video surveillance over wireless sensor
and actuator networks using active cameras,” Automatic Control, IEEE Transactions
on, vol. 56, no. 10, pp. 2467–2472, Oct 2011.
[71] H. Cheng and J. Wus, “Adaptive region of interest estimation for aerial surveillance
video,” in Image Processing, 2005. ICIP 2005. IEEE International Conference on, vol. 3,
Sept 2005, pp. III–860–3.
[72] H. Meuel, M. Munderloh, and J. Ostermann, “Low bit rate ROI based video coding for
HDTV aerial surveillance video sequences,” in Computer Vision and Pattern Recognition
Workshops (CVPRW), 2011 IEEE Computer Society Conference on, June 2011, pp. 13–
20.
[73] H. Meuel, J. Schmidt, M. Munderloh, and J. Ostermann, Advanced Video Coding for
Next-Generation Multimedia Services, P. Y.-S. Ho, Ed. InTech, 2013.
[74] A. Mavlankar and B. Girod, “Video streaming with interactive pan/tilt/zoom,”
in High-Quality Visual Experience, ser. Signals and Communication Technology,
M. Mrak, M. Grgic, and M. Kunt, Eds. Springer Berlin Heidelberg, 2010, pp. 431–
455.
[75] ——, “Spatial-random-access-enabled video coding for interactive virtual
pan/tilt/zoom functionality,” Circuits and Systems for Video Technology, IEEE
Transactions on, vol. 21, no. 5, pp. 577–588, May 2011.
[76] ——, “Background extraction and long-term memory motion-compensated predic-
tion for spatial-random-access-enabled video coding,” in Picture Coding Symposium,
2009. PCS 2009, May 2009, pp. 1–4.
[77] A. Mavlankar, P. Baccichet, D. Varodayan, and B. Girod, “Optimal slice size for
streaming regions of high resolution video with virtual pan/tilt/zoom functionality,”
in Proc. of 15th European Signal Processing Conference (EUSIPCO), 2007.
[78] F. Boulos, W. Chen, B. Parrein, and P. Le Callet, “A new H.264/AVC error resilience
model based on regions of interest,” in Packet Video Workshop, 2009. PV 2009. 17th
International, May 2009, pp. 1–9.
[79] C. Koch and S. Ullman, Matters of Intelligence. Springer Netherlands, 1987, vol. 188,
ch. Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry, pp.
115–141.
[80] L. Itti, “Automatic foveation for video compression using a neurobiological model of
visual attention,” Image Processing, IEEE Transactions on, vol. 13, no. 10, pp. 1304–
1318, 2004.
[81] Z. Li, S. Qin, and L. Itti, “Visual attention guided bit allocation in video compression,”
Image and Vision Computing, vol. 29, no. 1, pp. 1–14, 2011.
[82] L. Itti, “Automatic attention-based prioritization of unconstrained video for compression,”
in Proc. SPIE Human Vision and Electronic Imaging IX (HVEI04), pp. 272–283.
[83] A. Unterweger and A. Uhl, “Slice groups for post-compression region of interest encryption
in H.264/AVC and its scalable extension,” Signal Processing: Image Communication,
vol. 29, no. 10, pp. 1158–1170, 2014.
[84] S. Khire, A. Rodriguez, S. Robertson, and N. Jayant, “Error-resilient delivery of re-
gion of interest video using multiple representation coding,” in Acoustics, Speech and
Signal Processing (ICASSP), 2013 IEEE International Conference on, May 2013, pp.
2055–2059.
[85] H.-M. Hu, B. Li, W. Lin, W. Li, and M.-T. Sun, “Region-based rate control for
H.264/AVC for low bit-rate applications,” Circuits and Systems for Video Technology,
IEEE Transactions on, vol. 22, no. 11, pp. 1564–1576, Nov 2012.
[86] X. Zhu, E. Setton, and B. Girod, “Content-adaptive coding and delay-aware rate
control for a multi-camera wireless surveillance network,” in Multimedia Signal Pro-
cessing, 2005 IEEE 7th Workshop on, Oct 2005, pp. 1–4.
[87] F. Licandro, A. Lombardo, and G. Schembra, “Multipath routing and rate-controlled
video encoding in wireless video surveillance networks,” Multimedia Systems, vol. 14,
no. 3, pp. 155–165, 2008.
[88] A. Zainaldin, I. Lambadaris, and B. Nandy, “Adaptive rate control low bit-rate video
transmission over wireless ZigBee networks,” in Communications, 2008. ICC ’08. IEEE
International Conference on, May 2008, pp. 52–58.
[89] Y. Sun, I. Ahmad, D. Li, and Y.-Q. Zhang, “Region-based rate control and bit alloca-
tion for wireless video transmission,” Multimedia, IEEE Transactions on, vol. 8, no. 1,
pp. 1–10, Feb 2006.
[90] C.-M. Huang and C.-W. Lin, “Multiple-priority region-of-interest H.264 video compression
using constraint variable bitrate control for video surveillance,” Optical Engineering,
vol. 48, no. 4, pp. 047004–047004-10, 2009.
[91] J. Chao, R. Huitl, E. Steinbach, and D. Schroeder, “A novel rate control framework for
SIFT/SURF feature preservation in H.264/AVC video compression,” Circuits and Systems
for Video Technology, IEEE Transactions on, vol. 25, no. 6, pp. 958–972, 2015.
[92] “http://resource.boschsecurity.com/documents/commercial brochure enus
9822241291.pdf,” DINION and FLEXIDOME HD 1080p High Dynamic Range cameras.
[93] “https://blog.sony.com/press/sonys-4k-security-camera-has-1-0-type-exmor-r-cmos-sensor-for-advanced-imaging-capabilities/,”
Sony’s 4k Security Camera: Advanced Imaging Capabilities.
[94] “http://www.axis.com/files/whitepaper/wp zipstream 64253 en 1506 lo.pdf,” Axis
Zipstream technology.
[95] Videobandit™ suite. General Dynamics, C4 Systems. [Online]. Available:
http://www.gdc4s.com/video-bandit
[96] T. Gan and P. Rondao Alface, “Fast mode decision for H.264/AVC encoding of tunnel
surveillance video,” in Advances in Multimedia (MMEDIA), 2010 Second International
Conferences on, June 2010, pp. 7–12.
[97] M. Akram and E. Izquierdo, “Fast multiframe motion estimation for surveillance
videos,” in Image Processing (ICIP), 2010 17th IEEE International Conference on, Sept
2010, pp. 753–756.
[98] ——, “Fast motion estimation for surveillance video compression,” Signal, Image and
Video Processing, vol. 7, no. 6, pp. 1103–1112, 2013.
[99] M. Akram, “Surveillance centric coding,” Ph.D. dissertation, Queen Mary, University
of London, 2011.
[100] G. Xu, M. Ding, Y. Cheng, and Y. Tian, “Global motion estimation based on Kalman
predictor,” in Imaging Systems and Techniques, 2009. IST ’09. IEEE International Work-
shop on, May 2009, pp. 395–398.
[101] M. Munderloh, H. Meuel, and J. Ostermann, “Mesh-based global motion compensa-
tion for robust mosaicking and detection of moving objects in aerial surveillance,” in
Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on, June 2011, pp. 1–6.
[102] H.-J. Stolberg, S. Moch, L. Friebe, A. Dehnhardt, M. Berekovic, and P. Pirsch, “An
SoC with two multimedia DSPs and a RISC core for video compression applications,”
in Solid-State Circuits Conference, 2004. Digest of Technical Papers. ISSCC. 2004 IEEE
International, Feb 2004, pp. 330–531 Vol.1.
[103] Y. Chi, R. Etienne-Cummings, and G. Cauwenberghs, “Image sensor with focal plane
change event driven video compression,” in Circuits and Systems, 2008. ISCAS 2008.
IEEE International Symposium on, May 2008, pp. 1862–1865.
[104] ——, “Image sensor with focal plane change event driven video compression,” in
Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, May 2008,
pp. 1862–1865.
[105] B. Zhao, X. Zhang, S. Chen, K.-S. Low, and H. Zhuang, “A 64 × 64 CMOS image sensor
with on-chip moving object detection and localization,” Circuits and Systems for Video
Technology, IEEE Transactions on, vol. 22, no. 4, pp. 581–588, April 2012.
[106] S. Mizuno, K. Fujita, H. Yamamoto, N. Mukozaka, and H. Toyoda, “A 256 × 256
compact CMOS image sensor with on-chip motion detection function,” Solid-State
Circuits, IEEE Journal of, vol. 38, no. 6, pp. 1072–1075, June 2003.
[107] M. Zhang, N. Llaser, H. Mathias, and A. Dupret, “Design and optimization of two mo-
tion detection circuits for video monitoring system,” in Circuits and Systems (ISCAS),
2012 IEEE International Symposium on, May 2012, pp. 1907–1910.
[108] N. Massari, M. Gottardi, L. Gonzo, D. Stoppa, and A. Simoni, “A CMOS image sensor
with programmable pixel-level analog processing,” Neural Networks, IEEE Transac-
tions on, vol. 16, no. 6, pp. 1673–1684, Nov 2005.
[109] W. Leon-Salas, S. Balkir, K. Sayood, N. Schemm, and M. Hoffman, “A CMOS imager
with focal plane compression using predictive coding,” Solid-State Circuits, IEEE Jour-
nal of, vol. 42, no. 11, pp. 2555–2572, Nov 2007.
[110] S. Kawahito, M. Yoshida, M. Sasaki, K. Umehara, D. Miyazaki, Y. Tadokoro, K. Murata,
S. Doushou, and A. Matsuzawa, “A CMOS image sensor with analog two-dimensional
DCT-based compression circuits for one-chip cameras,” Solid-State Circuits,
IEEE Journal of, vol. 32, no. 12, pp. 2030–2041, Dec 1997.
[111] Z. Lin, M. Hoffman, N. Schemm, W. Leon-Salas, and S. Balkir, “A CMOS image sensor
for multi-level focal plane image decomposition,” Circuits and Systems I: Regular
Papers, IEEE Transactions on, vol. 55, no. 9, pp. 2561–2572, Oct 2008.
[112] S. Chen, A. Bermak, and Y. Wang, “A CMOS image sensor with on-chip image compression
based on predictive boundary adaptation and memoryless QTD algorithm,”
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 19, no. 4, pp.
538–547, April 2011.
[113] M. Zhang and A. Bermak, “Compressive acquisition CMOS image sensor: From the
algorithm to hardware implementation,” Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, vol. 18, no. 3, pp. 490–500, March 2010.
[114] ——, “Cmos image sensor with on-chip image compression: A review and perfor-
mance analysis,” Journal of Sensors, 2010.
[115] C. Yeo and K. Ramchandran, “Robust distributed multiview video compression for
wireless camera networks,” Image Processing, IEEE Transactions on, vol. 19, no. 4,
pp. 995–1008, April 2010.
[116] “http://blinkforhome.com,” Blink wireless surveillance system.
[117] “http://www.vuezone.com,” Netgear VueZone remote video system.
[118] L. J. Song and Q. Fan, “The design and implementation of a video surveillance system
for large scale wind farm,” Advanced Materials Research, vol. 361-363, pp. 1257–
1262, 2011.
[119] C. Hartung, R. Han, C. Seielstad, and S. Holbrook, “FireWxNet: A multi-tiered
portable wireless system for monitoring weather conditions in wildland fire envi-
ronments,” in Proceedings of the 4th International Conference on Mobile Systems, Ap-
plications and Services, ser. MobiSys ’06. New York, NY, USA: ACM, 2006, pp. 28–41.
[120] Y. Ye, S. Ci, A. Katsaggelos, Y. Liu, and Y. Qian, “Wireless video surveillance: A
survey,” Access, IEEE, vol. 1, pp. 646–660, 2013.
[121] J. Jung, J. Lim, S. Lee, J. Lee, J. Yang, and C.-M. Kyung, “A low-energy video event
data recorder using dual image/video codec,” in Advanced Video and Signal Based
Surveillance (AVSS), 2014 11th IEEE International Conference on, Aug 2014, pp. 277–
282.
[122] C. Li, D. Wu, and H. Xiong, “Power-rate-distortion model for wireless video commu-
nication under delay and energy constraints,” Circuits and Systems for Video Technol-
ogy, IEEE Transactions on, vol. 24, no. 7, pp. 1170–1183, July 2014.
[123] Z. He, W. Cheng, and X. Chen, “Energy minimization of portable video communica-
tion devices based on power-rate-distortion optimization,” Circuits and Systems for
Video Technology, IEEE Transactions on, vol. 18, no. 5, pp. 596–608, May 2008.
[124] Z. He, Y. Liang, L. Chen, I. Ahmad, and D. Wu, “Power-rate-distortion analysis for
wireless video communication under energy constraints,” Circuits and Systems for
Video Technology, IEEE Transactions on, vol. 15, no. 5, pp. 645–658, May 2005.
[125] Z. He and D. Wu, “Resource allocation and performance analysis of wireless video
sensors,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16,
no. 5, pp. 590–599, May 2006.
[126] M. Marijan, I. Demirkol, D. Maricic, G. Sharma, and Z. Ignjatovic, “Adaptive sensing
and optimal power allocation for wireless video sensors with sigma-delta imager,”
Image Processing, IEEE Transactions on, vol. 19, no. 10, pp. 2540–2550, Oct 2010.
[127] P. Kaewtrakulpong and R. Bowden, “An Improved Adaptive Background Mixture
Model for Realtime Tracking with Shadow Detection,” in Proc. 2nd European Work-
shop on Advanced Video Based Surveillance Systems. Kluwer Academic Publishers,
September 2001.
[128] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and
Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[129] A. Prati, I. Mikic, M. M. Trivedi, and R. Cucchiara, “Detecting moving shadows:
algorithms and evaluation,” IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 25, no. 7, pp. 918–923, Jun. 2003.
[130] P. Gorur and B. Amrutur, “Speeded up Gaussian mixture model algorithm for background
subtraction,” IEEE Conf. on Advanced Video and Signal Based Surveillance
(AVSS), pp. 386–391, 2011.
[131] W. G. Cochran, Sampling Techniques, 3rd Edition. John Wiley, 1977.
[132] S. K. Thompson, Sampling. Wiley Series in Probability and Statistics, 2012.
[133] “Guidance on choosing a sampling design for environmental data collection,” Envi-
ronmental Protection Agency, United States, Tech. Rep., 2002.
[134] H. J. Chang, H. Jeong, and J. Y. Choi, “Active attentional sampling for speed-up
of background subtraction,” IEEE Conf. on Comput. Vision and Pattern Recognition
(CVPR), pp. 2088–2095, June 2012.
[135] J. Ferryman and A. Shahrokni, “PETS2009: Dataset and challenge,” in Performance
Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE Interna-
tional Workshop on, Dec 2009, pp. 1–6.
[136] N. Goyette, P. M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, “Changedetection.net: A
new change detection benchmark dataset,” in 2012 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition Workshops, June 2012, pp. 1–8.
[137] JVT JM Reference Software. [Online]. Available:
http://iphome.hhi.de/suehring/tml/
[138] J. Park, A. Tabb, and A. Kak, “Hierarchical data structure for real-time background
subtraction,” in IEEE Int. Conf. on Image Process. (ICIP), 2006, pp. 1849–1852.
[139] D.-Y. Lee, J.-K. Ahn, and C.-S. Kim, “Fast background subtraction algorithm using
two-level sampling and silhouette detection,” in IEEE Int. Conf. on Image Process.
(ICIP), 2009, pp. 3177–3180.
[140] J. M. Guo, Y.-F. Liu, C.-H. Hsia, M.-H. Shih, and C.-S. Hsu, “Hierarchical method
for foreground detection using codebook model,” IEEE Trans. Circuits Syst. Video
Technol., vol. 21, no. 6, pp. 804–815, June 2011.
[141] H.-H. Lin, J.-H. Chuang, and T.-L. Liu, “Regularized background adaptation: A novel
learning rate control scheme for Gaussian mixture modeling,” IEEE Trans. Image Process.,
vol. 20, no. 3, pp. 822–836, 2011.
[142] C. A. Rothkopf, D. H. Ballard, and M. M. Hayhoe, “Task and context determine where
you look,” J. of Vision, vol. 7, no. 14, 2007.
[143] J. W. Suchow and G. A. Alvarez, “Motion silences awareness of visual change,” Current
Biology, vol. 21, pp. 140–143, 2011.
[144] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with
discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 32, no. 9, pp. 1627–1645, 2010.
[145] F. Dadgostar, “Real-time vision-based hand and face tracking and recognition of ges-
ture,” Ph.D. dissertation, Massey University, 2006.
[146] F. Dadgostar and A. Sarrafzadeh, “An adaptive real-time skin detector based on hue
thresholding: A comparison on two motion tracking methods,” Pattern Recognition
Letters, vol. 27, no. 12, pp. 1342–1352, 2006.
[147] X. Ren and J. Malik, “Learning a classification model for segmentation,” in Computer
Vision, 2003. Proceedings. Ninth IEEE International Conference on, Oct 2003, pp. 10–
17 vol.1.
[148] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, “Layered object detection for
multi-class segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on, June 2010, pp. 3113–3120.
[149] H. Meuel, M. Reso, J. Jachalsky, and J. Ostermann, “Superpixel-based segmentation
of moving objects for low bitrate ROI coding systems,” in Advanced Video and Signal
Based Surveillance (AVSS), 2013 10th IEEE International Conference on, Aug 2013,
pp. 395–400.
[150] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “SLIC superpixels
compared to state-of-the-art superpixel methods,” Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 34, no. 11, pp. 2274–2282, Nov 2012.
[151] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts,
and shadows in video streams,” Pattern Analysis and Machine Intelligence, IEEE Trans-
actions on, vol. 25, no. 10, pp. 1337–1342, Oct 2003.
[152] A. Joshi and N. Papanikolopoulos, “Learning to detect moving shadows in dy-
namic environments,” Pattern Analysis and Machine Intelligence, IEEE Transactions
on, vol. 30, no. 11, pp. 2055–2063, Nov 2008.
[153] S. Nadimi and B. Bhanu, “Physical models for moving shadow and object detection
in video,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 26,
no. 8, pp. 1079–1087, Aug 2004.
[154] N. Martel-Brisson and A. Zaccarin, “Learning and removing cast shadows through a
multidistribution approach,” Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, vol. 29, no. 7, pp. 1133–1146, July 2007.
[155] F. Porikli and J. Thornton, “Shadow flow: a recursive method to learn moving cast
shadows,” in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference
on, vol. 1, Oct 2005, pp. 891–898 Vol. 1.
[156] T. Horprasert, D. Harwood, and L. S. Davis, “A statistical approach for real-time
robust background subtraction and shadow detection,” in Proc. IEEE ICCV, vol. 99,
pp. 1–19.
[157] A. Sanin, C. Sanderson, and B. C. Lovell, “Shadow detection: A survey and comparative
evaluation of recent methods,” Pattern Recognition, vol. 45, no. 4, pp. 1684–1695,
2012.
[158] J.-B. Huang and C.-S. Chen, “Moving cast shadow detection using physics-based fea-
tures,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on, June 2009, pp. 2310–2317.
[159] A. Leone, C. Distante, and F. Buccolieri, “A texture-based approach for shadow de-
tection,” in Advanced Video and Signal Based Surveillance, 2005. AVSS 2005. IEEE
Conference on, Sept 2005, pp. 371–376.
[160] R. Qin, S. Liao, Z. Lei, and S. Li, “Moving cast shadow removal based on local de-
scriptors,” in Pattern Recognition (ICPR), 2010 20th International Conference on, Aug
2010, pp. 1377–1380.
[161] A. Sanin, C. Sanderson, and B. Lovell, “Improved shadow removal for robust person
tracking in surveillance scenarios,” in Pattern Recognition (ICPR), 2010 20th Interna-
tional Conference on, Aug 2010, pp. 141–144.
[162] A. Elgammal, C. Muang, and D. Hu, “Skin detection - a short tutorial.”
[163] M. Jones and J. Rehg, “Statistical color models with application to skin detection,”
in Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference
on., vol. 1, 1999, p. 280 Vol. 1.
[164] H. Greenspan, J. Goldberger, and I. Eshet, “Mixture model for face-color modeling
and segmentation,” Pattern Recognition Letters, vol. 22, no. 14, pp. 1525–1536,
2001.
[165] S. Phung, A. Bouzerdoum, and D. Chai, “Skin segmentation using color pixel
classification: analysis and comparison,” Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 27, no. 1, pp. 148–154, Jan 2005.
[166] Lti-lib: Image processing and computer vision library. [Online]. Available:
http://ltilib.sourceforge.net/doc/homepage/index.shtml
[167] C. Papageorgiou and T. Poggio, “A trainable system for object detection,” Interna-
tional Journal of Computer Vision, vol. 38, no. 1, pp. 15–33, 2000.
[168] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on, June 2005, vol. 1, pp. 886–893 vol. 1.
[169] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” in BMVC.
British Machine Vision Association, 2009.
[170] B. Wu and R. Nevatia, “Detection of multiple, partially occluded humans in a single
image by Bayesian combination of edgelet part detectors,” in Computer Vision, 2005.
ICCV 2005. Tenth IEEE International Conference on, vol. 1, Oct 2005, pp. 90–97 Vol.
1.
[171] S. Maji, A. Berg, and J. Malik, “Classification using intersection kernel support vector
machines is efficient,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008.
IEEE Conference on, June 2008, pp. 1–8.
[172] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, “Fast human detection using a cascade
of histograms of oriented gradients,” in Computer Vision and Pattern Recognition,
2006 IEEE Computer Society Conference on, vol. 2, 2006, pp. 1491–1498.
[173] V. Shet, M. Singh, C. Bahlmann, V. Ramesh, J. Neumann, and L. Davis, “Predicate
logic based image grammars for complex pattern recognition,” International Journal
of Computer Vision, vol. 93, no. 2, pp. 141–161, 2011.
[174] P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns of motion
and appearance,” in Computer Vision, 2003. Proceedings. Ninth IEEE International
Conference on, Oct 2003, pp. 734–741 vol.2.
[175] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-
detection-by-tracking,” in Computer Vision and Pattern Recognition, 2008. CVPR
2008. IEEE Conference on, June 2008, pp. 1–8.
[176] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for ac-
curate object detection and semantic segmentation,” in Computer Vision and Pattern
Recognition (CVPR), 2014 IEEE Conference on, June 2014, pp. 580–587.
[177] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol.
abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[178] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang,
C.-C. Loy, and X. Tang, “DeepID-Net: Deformable deep convolutional neural networks
for object detection,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE
Conference on, June 2015, pp. 2403–2412.
[179] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time
object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.
[Online]. Available: http://arxiv.org/abs/1506.01497
[180] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation
of the state of the art,” Pattern Analysis and Machine Intelligence, IEEE Transactions
on, vol. 34, no. 4, pp. 743–761, 2012.
[181] P. Felzenszwalb, R. Girshick, and D. McAllester, “Cascade object detection with de-
formable part models,” in Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on, June 2010, pp. 2241–2248.
[182] H.-T. Lin, C.-J. Lin, and R. Weng, “A note on Platt’s probabilistic outputs for support
vector machines,” Machine Learning, vol. 68, no. 3, pp. 267–276, 2007.
[183] J. Liu, R. Collins, and Y. Liu, in Automatic Surveillance Camera Calibration without
Pedestrian Tracking, 2011, pp. 117.1–117.11.
[184] S. C. Lee and R. Nevatia, “Robust camera calibration tool for video surveillance cam-
era in urban environment,” in Computer Vision and Pattern Recognition Workshops
(CVPRW), 2011 IEEE Computer Society Conference on, June 2011, pp. 62–67.
[185] P. Sudowe and B. Leibe, “Efficient use of geometric constraints for sliding-window
object detection in video,” in Computer Vision Systems, ser. Lecture Notes in Com-
puter Science, J. Crowley, B. Draper, and M. Thonnat, Eds. Springer Berlin Heidel-
berg, 2011, vol. 6962, pp. 11–20.
[186] 3d calibration of riva ip cameras with integrated video analytics. [Online]. Available:
http://www.rivatech.de/en/vca/vca-installation
[187] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual
tracking: An experimental survey,” Pattern Analysis and Machine Intelligence, IEEE
Transactions on, vol. 36, no. 7, pp. 1442–1468, July 2014.
[188] M. Godec, P. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in
Computer Vision (ICCV), 2011 IEEE International Conference on, Nov 2011, pp. 81–
88.
[189] D. Mitzel, E. Horbert, A. Ess, and B. Leibe, “Multi-person tracking with sparse detection
and continuous segmentation,” in Computer Vision – ECCV 2010, ser. Lecture
Notes in Computer Science, K. Daniilidis, P. Maragos, and N. Paragios, Eds. Springer
Berlin Heidelberg, 2010, vol. 6311, pp. 397–410.
[190] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the
integral histogram,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer
Society Conference on, vol. 1, June 2006, pp. 798–805.
[191] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, “Locally orderless tracking,” in Com-
puter Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, June 2012,
pp. 1940–1947.
[192] C. Tomasi and T. Kanade, “Detection and tracking of point features,” International
Journal of Computer Vision, Tech. Rep., 1991.
[193] J. Shi and C. Tomasi, “Good features to track,” in Computer Vision and Pattern Recog-
nition, 1994. Proceedings CVPR ’94., 1994 IEEE Computer Society Conference on, Jun
1994, pp. 593–600.
[194] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” Pattern Anal-
ysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 5, pp. 564–577, May
2003.
[195] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans
by Bayesian combination of edgelet based part detectors,” International Journal of
Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
[196] T. Zhao, R. Nevatia, and B. Wu, “Segmentation and tracking of multiple humans in
crowded environments,” Pattern Analysis and Machine Intelligence, IEEE Transactions
on, vol. 30, no. 7, pp. 1198–1211, July 2008.
[197] K. Smith, D. Gatica-Perez, and J. Odobez, “Using particles to track varying numbers
of interacting people,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005.
IEEE Computer Society Conference on, vol. 1, June 2005, pp. 962–969 vol. 1.
[198] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online mul-
tiple instance learning,” Pattern Analysis and Machine Intelligence, IEEE Transactions
on, vol. 33, no. 8, pp. 1619–1632, Aug 2011.
[199] S. Avidan, “Ensemble tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2,
pp. 261–271, Feb. 2007.
[200] Z. Kalal, K. Mikolajczyk, and J. Matas, “Forward-backward error: Automatic de-
tection of tracking failures,” in Pattern Recognition (ICPR), 2010 20th International
Conference on, Aug 2010, pp. 2756–2759.
[201] L. Antanas, M. van Otterlo, J. O. Mogrovejo, T. Tuytelaars, and L. D. Raedt, “There
are plenty of places like home: Using relational representations in hierarchies for
distance-based image understanding,” Neurocomputing, vol. 123, pp. 75–85, 2014.
[202] M. Ginsberg, “Multivalued logics: A uniform approach to reasoning in AI,” Computational
Intelligence, vol. 4, no. 1, pp. 256–316, 1988.
[203] B. Fulkerson, A. Vedaldi, and S. Soatto, “Class segmentation and object localization
with superpixel neighborhoods,” in Computer Vision, 2009 IEEE 12th International
Conference on, Sept 2009, pp. 670–677.
[204] W. Gao, Y. Tian, T. Huang, S. Ma, and X. Zhang, “The IEEE 1857 standard: Empow-
ering smart video surveillance systems,” Intelligent Systems, IEEE, vol. 29, no. 5, pp.
30–39, Sept 2014.
[205] W. Gao and S. Ma, Advanced Video Coding Systems. Springer International Publish-
ing, 2014.
[206] W. Benesova and M. Kottman, “Fast superpixel segmentation using morphological
processing,” in MVML, 2014, pp. 1–9.
[207] P. Dollar, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object
detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36,
no. 8, pp. 1532–1545, 2014.
[208] M. Sadeghi and D. Forsyth, “30Hz object detection with DPM V5,” in Computer Vision –
ECCV 2014, ser. Lecture Notes in Computer Science, D. Fleet, T. Pajdla, B. Schiele,
and T. Tuytelaars, Eds. Springer International Publishing, 2014, vol. 8689, pp.
65–79.
[209] S. A. Kripke, “Outline of a theory of truth,” Journal of Philosophy, vol. 72, no. 19, pp.
690–716, 1975.
[210] M. Fitting, “Notes on the mathematical aspects of Kripke’s theory of truth,” Notre
Dame J. Formal Logic, vol. 27, no. 1, pp. 75–88, Jan. 1986.
[211] ——, “Bilattices are nice things,” in Self-Reference, T. Bolander, V. Hendricks, and
S. A. Pedersen, Eds. CSLI Publications, 2006.
[212] N. Belnap, “How a computer should think,” in Contemporary Aspects of Philosophy,
G. Ryle, Ed. Oriel Press Ltd., 1977.
[213] N. D. Belnap, “A useful four-valued logic,” in Modern Uses of Multiple-Valued Logic,
J. M. Dunn and G. Epstein, Eds. D. Reidel, 1977.
[214] C. Cornelis, O. Arieli, G. Deschrijver, and E. Kerre, “Uncertainty modeling by bilattice-
based squares and triangles,” Fuzzy Systems, IEEE Transactions on, vol. 15, no. 2, pp.
161–175, April 2007.
[215] V. D. Shet, “Bilattice based logical reasoning for automated visual surveillance and
other applications,” Ph.D. dissertation, University of Maryland, College Park, 2007.
[216] B. Schweizer and A. Sklar, “Associative functions and abstract semi-groups,” 1963.