On the Role of Representation in the Analysis of Visual...
Transcript of On the Role of Representation in the Analysis of Visual...
![Page 1: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/1.jpg)
On the Role of Representationin the Analysis of Visual Spacetime
Konstantinos George Derpanis
A dissertation submitted to the Faculty of Graduate Studiesin partial fulfilment of the requirements
for the degree of
Doctor of Philosophy
Graduate Programme in Computer Science and EngineeringYork UniversityToronto, Ontario
July 2010
![Page 2: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/2.jpg)
Abstract
The problems under consideration in this dissertation centre around the rep-resentation of visual spacetime, i.e., (visual) image intensity (irradiance) as afunction of two-dimensional spatial position and time. In particular, the over-arching goal is to establish a unified approach to representation and analysis oftemporal image dynamics that is broadly applicable to the diverse phenomenain the natural world as captured in two-dimensional intensity images. Previousresearch largely has approached the analysis of visual dynamics by appealing torepresentations based on image motion. Although of obvious importance, motionrepresents a particular instance of the myriad spatiotemporal patterns observedin image data. A generative model centred on the concept of spacetime orienta-tion is proposed. This model provides a unified framework for understanding abroad set of important spacetime patterns. As a consequence of this analysis, twonew classes of patterns are distinguished that have previously not been considereddirectly in terms of their constituent spacetime oriented structure, namely mul-tiplicative motions (e.g., translucency) and stochastic-related phenomena (e.g.,wavy water and windblown vegetation). Motivated by this analysis, a represen-tation is proposed that systematically exposes the structure of visual spacetimein terms of local, spacetime orientation. The power of this representation isdemonstrated in the context of the following four fundamental challenges in com-puter vision: (i) spacetime texture recognition, (ii) spacetime grouping, (iii) localboundary detection and (iv) the detection and spatiotemporal localization of anaction in a video stream.
iv
![Page 3: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/3.jpg)
Acknowledgements
I owe a great intellectual debt to my supervisor Rick Wildes, only glowingwords can describe his mentorship and patience. Rick’s principled approach ofstripping a problem down to its essence and adopting the warranted analyticaltools (no less and no more) will hopefully be a lasting legacy in my own research.Through Rick’s generous financial support I was shielded from the often distract-ing aspects of graduate student life, giving me more time to focus on what’simportant, the research. I appreciate the great amount of his precious time hemade available through our weekly meetings, email, and random meetings in hisoffice and hallway. I’m honored to be his first PhD, in a line of many more tofollow.
I am also grateful to John Tsotsos for his support over the past twelve years,both intellectual and personal. John first sparked my interest in computer visionresearch while I was a senior undergrad at the University of Toronto. John hasbeen my guardian angel throughout my grad school journey; I will sadly miss ouralmost daily interactions. I thank the members of my examining committee, JohnTsotsos, Minas Spetsakis, Michael Jenkin, Maz Fallah and Aaron Bobick (exter-nal examiner), for their comments and suggestions concerning the dissertationmanuscript.
I’ve had the great privilege of interacting with an eclectic group of graduatestudents at York University. I thank my labmates in the Computer Vision Labat York, Mikhail Sizintsev, Kevin Cannons and Jack Gryn, for their valuablediscussions concerning my work on visual spacetime and their patience with mypaper sprawl! I also thank Bill Kapralos, Erich Leung, Dan Zikovitz, AlbertRothenstein, Nathan Mekuz, Andrew Hogue, Andrei Zaharescu, Neil Bruce andEugene Simine, for our many enjoyable discussions, both research and otherwise,and the “occasional” coffee break.
I acknowledge the financial support from the National Sciences and Engineer-ing Research Council (NSERC) and Ontario Graduate Scholarship (OGS).
v
![Page 4: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/4.jpg)
Finally, I thank my family. Eυχαριστω τoυς γoνεις µoυ, o Γιωργoς καιη Kατεριυα, για την αγαπη και την υπoστηριξη, µoυ εδειξαν την αξια τηςσκληρης δoυλειας. (Greek-to-English translation: I thank my parents, Georgeand Kathy, for their love and support, they showed me the value of hard work.)My wife, Antonia, for providing me her unending love, patience, companionshipand encouragement to keep pushing ahead in my research when doubt crept in.My last thanks go to my 20 month old son, George, my most important contribu-tion of all, you’ve brought our family tremendous joy and given me much neededfocus.
vi
![Page 5: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/5.jpg)
From henceforth, space by itself, and time by itself, have vanished into the merestshadows and only a kind of blend of the two exists in its own right.
HERMANN MINKOWSKI (quoted in J.R. Newman, The World ofMathematics)
vii
![Page 6: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/6.jpg)
Table of Contents
Abstract iv
Acknowledgements v
Table of Contents viii
List of Tables xii
List of Figures xiv
1 Introduction 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Image motion versus image dynamics . . . . . . . . . . . . 21.2 The role of representation . . . . . . . . . . . . . . . . . . . . . . 71.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.4 Overview of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Modeling of Visual Spacetime 132.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . 142.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.2.1 Relevance of frequency analysis . . . . . . . . . . . . . . . 182.2.2 Generative model . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.3.1 Dominant orientation . . . . . . . . . . . . . . . . . . . . . 242.3.2 Degenerate orientation . . . . . . . . . . . . . . . . . . . . 262.3.3 Multi-dominant orientation . . . . . . . . . . . . . . . . . 28
viii
![Page 7: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/7.jpg)
2.3.3.1 Additive case: Semi-transparency . . . . . . . . . 302.3.3.2 Multiplicative case: Translucency . . . . . . . . . 312.3.3.3 Multiplicative case: Occlusion . . . . . . . . . . . 382.3.3.4 Multiplicative case: Pseudo-transparency . . . . . 41
2.3.4 Heterogeneous orientation and isotropic structure . . . . . 442.3.5 Recapitulation . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . 45
3 Representation of Visual Spacetime 473.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . 493.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Technical approach . . . . . . . . . . . . . . . . . . . . . . . . . . 533.2.1 Distributed spacetime orientation representation . . . . . . 533.2.2 Representation properties . . . . . . . . . . . . . . . . . . 61
3.3 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . 62
4 Visual Spacetime Texture Recognition 644.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 644.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . 684.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Technical approach . . . . . . . . . . . . . . . . . . . . . . . . . . 724.2.1 What is dynamic texture? . . . . . . . . . . . . . . . . . . 724.2.2 Representation: Distributed spacetime orientation . . . . . 764.2.3 Recognition: Spacetime orientation distribution similarity 79
4.3 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 834.3.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.1.1 UCLA dynamic texture data set . . . . . . . . . 834.3.1.2 York University Vision Lab (YUVL) spacetime
texture data set . . . . . . . . . . . . . . . . . . . 854.3.1.3 UCSD traffic data set . . . . . . . . . . . . . . . 91
4.3.2 Dynamic texture classification with the UCLA data set . . 934.3.2.1 Viewpoint specific classification . . . . . . . . . . 934.3.2.2 Shift-invariant classification . . . . . . . . . . . . 954.3.2.3 Semantic category classification . . . . . . . . . . 103
4.3.3 Spacetime texture classification with the YUVL data set . 106
ix
![Page 8: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/8.jpg)
4.3.3.1 Basic-level classification . . . . . . . . . . . . . . 1064.3.3.2 Subordinate-level classification . . . . . . . . . . 113
4.3.4 Traffic congestion classification with the UCSD data set . . 1204.3.5 Recapitulation . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.4 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . 125
5 Spacetime Grouping and Local Boundary Detection 1285.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1285.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.1.2.1 Spacetime grouping . . . . . . . . . . . . . . . . 1315.1.2.2 Spacetime structure boundary detection . . . . . 135
5.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 1375.2 Technical approach . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.2.1 Representation: Distributed spacetime orientation . . . . . 1425.2.2 Anisotropic smoothing: Mean-shift . . . . . . . . . . . . . 1445.2.3 Spacetime grouping . . . . . . . . . . . . . . . . . . . . . . 1465.2.4 Spacetime structure boundaries . . . . . . . . . . . . . . . 147
5.3 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 1505.3.1 York University Vision Lab (YUVL) Spacetime Structure
Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1505.3.2 Spacetime grouping . . . . . . . . . . . . . . . . . . . . . . 155
5.3.2.1 Representation comparisons . . . . . . . . . . . . 1555.3.2.2 Quantitative results on natural imagery . . . . . 1585.3.2.3 Additional examples . . . . . . . . . . . . . . . . 164
5.3.3 Spacetime structure boundary detection . . . . . . . . . . 1685.4 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . 175
6 Action Spotting 1786.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1786.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . 1836.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.2 Technical approach . . . . . . . . . . . . . . . . . . . . . . . . . . 1876.2.1 Features: Spacetime orientation . . . . . . . . . . . . . . . 1886.2.2 Spacetime template matching . . . . . . . . . . . . . . . . 191
6.2.2.1 Weighting template contributions . . . . . . . . . 1936.2.2.2 Efficient matching . . . . . . . . . . . . . . . . . 194
x
![Page 9: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/9.jpg)
6.2.3 Computational complexity analysis . . . . . . . . . . . . . 1966.3 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 1976.4 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . 208
7 Summary and Conclusions 2117.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2117.2 Suggestions for further research . . . . . . . . . . . . . . . . . . . 216
A Frequency Domain Analysis of Spacetime Structure 223A.1 Image translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 223A.2 Aperture problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
B 3D Separable Steerable Filters 226B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226B.2 Technical approach . . . . . . . . . . . . . . . . . . . . . . . . . . 229
B.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 229B.2.2 Basis functions separable in x-y-z . . . . . . . . . . . . . . 232B.2.3 Example: G2/H2 steerable quadrature filter pairs . . . . . 234
B.3 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 236B.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240B.5 x-y-z Separable, Steerable Quadrature Pair Basis Filters . . . . . 240
C Proof of the Smoothness of the Ring-Shaped Function 252
D Summary of Alternative Approaches Compared 255D.1 The state-space model of dynamic texture . . . . . . . . . . . . . 255D.2 Motion boundary detection using optical flow . . . . . . . . . . . 257D.3 Motion boundary detection using the structure tensor . . . . . . . 258D.4 Flow-based action spotting . . . . . . . . . . . . . . . . . . . . . . 260D.5 Parts-based shape plus flow action spotting . . . . . . . . . . . . . 262
E Crop Windows for Shift-Invariant Recognition Experiment 265
Bibliography 266
xi
![Page 10: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/10.jpg)
List of Tables
4.1 YUVL spacetime texture data set summary. . . . . . . . . . . . . 854.2 UCSD traffic data set summary. . . . . . . . . . . . . . . . . . . . 924.3 Summary of semantic reorganization of UCLA dynamic texture
data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1034.4 Dynamic texture confusion matrix. . . . . . . . . . . . . . . . . . 1044.5 Spacetime texture confusion matrix. . . . . . . . . . . . . . . . . . 1064.6 Traffic congestion confusion matrix. . . . . . . . . . . . . . . . . . 120
B.1 Several Gaussian derivatives and polynomial fits to their Hilberttransforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
B.2 x-y-z separable basis set and interpolation functions for the secondderivative of the Gaussian. . . . . . . . . . . . . . . . . . . . . . . 241
B.3 9-tap filters for x-y-z separable basis set for G2. . . . . . . . . . . 241B.4 G2 basis filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241B.5 x-y-z separable basis set and interpolation functions for fit to
Hilbert transform of the second derivative of the Gaussian. . . . . 242B.6 9-tap filters for x-y-z separable basis set for H2. . . . . . . . . . . 242B.7 H2 basis filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243B.8 x-y-z separable basis set and interpolation functions for the third
derivative of the Gaussian, G3. . . . . . . . . . . . . . . . . . . . . 243B.9 13-tap filters for x-y-z separable basis set for G3. . . . . . . . . . 244B.10 G3 basis filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245B.11 x-y-z separable basis set and interpolation functions for the (ap-
proximate) Hilbert transform of the third derivative of the Gaus-sian, H3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
B.12 13-tap filters for x-y-z separable basis set for H3. . . . . . . . . . 247B.13 H3 basis filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248B.14 x-y-z separable basis set and interpolation functions for the fourth
derivative of the Gaussian, G4. . . . . . . . . . . . . . . . . . . . . 249
xii
![Page 11: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/11.jpg)
B.15 13-tap filters for x-y-z separable basis set for G4. . . . . . . . . . 250B.16 G4 basis filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
E.1 Special crop windows for non-stationary cases in UCLA data set. 266
xiii
![Page 12: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/12.jpg)
List of Figures
1.1 Visual spacetime. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Inadequacies of optical flow in natural scenes. . . . . . . . . . . . 6
2.1 Range of spacetime oriented structure. . . . . . . . . . . . . . . . 212.2 Synthetic multiplicative transparency image sequence. . . . . . . . 332.3 Real translucency example. . . . . . . . . . . . . . . . . . . . . . . 362.4 Real structured light example. . . . . . . . . . . . . . . . . . . . . 372.5 Real pseudo-transparency example. . . . . . . . . . . . . . . . . . 43
3.1 Representation spectrum. . . . . . . . . . . . . . . . . . . . . . . 513.2 Oriented energy filters for spatiotemporal analysis. . . . . . . . . . 543.3 Illustration of the ring operator in the frequency domain. . . . . . 58
4.1 Example image frames of spacetime textures in the real-world. . . 664.2 Spatial texture spectrum. . . . . . . . . . . . . . . . . . . . . . . . 734.3 Spacetime texture spectrum. . . . . . . . . . . . . . . . . . . . . . 744.4 Overview of spacetime texture classification approach. . . . . . . . 824.5 Sample frames from the UCLA dynamic texture data set. . . . . . 844.6 Organization of the YUVL spacetime texture data set. . . . . . . 884.7 Example image sequences from the YUVL spacetime texture data
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 894.8 Example image sequences from the YUVL spacetime texture data
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 904.9 Example frames from the UCSD traffic video data set. . . . . . . 914.10 Viewpoint specific recognition results based on the UCLA data set. 934.11 Shift-invariant recognition results based on the UCLA data set. . 964.12 Example correct classifications for shift-invariant dynamic texture
categorization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
xiv
![Page 13: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/13.jpg)
4.13 Example correct classifications for shift-invariant dynamic texturecategorization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.14 Examples of several misclassifications from the shift-invariant recog-nition experiment. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.15 Spacetime texture basic-level recognition results. . . . . . . . . . . 1094.16 Example correct classifications for basic-level categorization. . . . 1104.17 Example correct classifications for basic-level categorization. . . . 1114.18 Example misclassifications for basic-level categorization. . . . . . 1124.19 Spacetime texture subordinate-level recognition results. . . . . . . 1144.20 Example correct classifications for subordinate-level categorization. 1164.21 Example correct classifications for subordinate-level categorization. 1174.22 Example correct classifications for subordinate-level categorization. 1184.23 Example misclassifications for subordinate-level categorization. . . 1194.24 Example correct classifications for traffic congestion categorization. 1224.25 Example misclassifications for traffic congestion categorization. . . 123
5.1 Oriented energy decomposition maps structural differences to in-tensity differences. . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2 Sample frames and corresponding ground truth boundaries fromthe YUVL Spacetime Structure Data Set. . . . . . . . . . . . . . 153
5.3 Synthetic juxtaposed spacetime structures. . . . . . . . . . . . . . 1545.4 Grouping with various levels of data abstraction. . . . . . . . . . . 1565.5 Successful grouping results on the YUVL Spacetime Structure
Data Set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1595.6 Spacetime grouping precision-recall curves. . . . . . . . . . . . . . 1615.7 Grouping on a (synthetic) smoothly varying spacetime structure. . 1635.8 Grouping of a target initially at rest, then in motion. . . . . . . . 1655.9 Grouping result on a set of (synthetically combined) dynamic tex-
tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1665.10 Grouping result on a set of (synthetically combined) dynamic tex-
tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1675.11 Successful boundary detection results on the YUVL Spacetime
Structure Data Set. . . . . . . . . . . . . . . . . . . . . . . . . . . 1725.12 Local boundary detection precision-recall curves. . . . . . . . . . . 1735.13 Comparison of results for flow, rank, level-set and proposed ap-
proaches to boundary detection. . . . . . . . . . . . . . . . . . . . 174
6.1 Overview of approach to action spotting. . . . . . . . . . . . . . . 182
xv
![Page 14: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/14.jpg)
6.2 Aerobics routine example. . . . . . . . . . . . . . . . . . . . . . . 2036.3 Multiple actions in the outdoors. . . . . . . . . . . . . . . . . . . 2046.4 Foreground clutter example. . . . . . . . . . . . . . . . . . . . . . 2056.5 CMU action data set templates. . . . . . . . . . . . . . . . . . . . 2066.6 Precision-Recall curves for CMU action data set. . . . . . . . . . . 207
B.1 Three-dimensional steerable filter architecture. . . . . . . . . . . . 228B.2 Zone plate slices. . . . . . . . . . . . . . . . . . . . . . . . . . . . 238B.3 G2/H2 RMS error plots. . . . . . . . . . . . . . . . . . . . . . . . 239
xvi
![Page 15: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/15.jpg)
Chapter 1
Introduction
Curiosity demands that we ask questions, that we try toput things together and try to understand this multitudeof aspects as perhaps resulting from the action of a rela-tively small number of elemental things and forces actingin an infinite variety of combinations.
RICHARD FEYNMAN (The Feynman Lectures onPhysics)
1.1 Motivation
It has long been recognized that the temporal variation across a sequence of im-
ages provides a wealth of information about a perceiver’s surroundings in the
world. This type of visual cue can be used to recover a variety of quantitative
and qualitative information, including three-dimensional scene structure, egomo-
tion and higher-level image understanding. The problems under consideration
1
![Page 16: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/16.jpg)
in the present research centre around the representation and analysis of image
dynamics. In particular, this research is concerned with efficiently establishing
an initial representation that is suitable for capturing the wide range of image
dynamics (i.e., image variation over time that is not strictly motion-related) that
one may encounter, with image motion as a special case, albeit an important one.
Ultimately, the efficacy of a representation depends on the visual tasks that it
can address. With the representation defined, its power is demonstrated in the
context of several important visual tasks.
In this dissertation, image sequences are analyzed in the joint multidimen-
sional space spanned by the two-dimensional spatial coordinates and time; this
joint space is termed visual spacetime. This space can be considered as a three-
dimensional, (x, y, t), volume, constructed by stacking consecutive frames along
the temporal axis (see Fig. 1.1).
1.1.1 Image motion versus image dynamics
The recovery of image motion or optical flow, the apparent motion of the image
brightness pattern expressed as a vector field (Horn, 1986), has generally been
presumed to be the initial goal of processing image dynamics. Ideally, optical flow
corresponds to the motion field, which represents the projection of the three-
2
![Page 17: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/17.jpg)
. . .
x
y
t
Figure 1.1: Visual spacetime. (left) An image sequence and (right) its spatiotem-poral volume constructed by stacking the images along the temporal axis.
dimensional velocity of surfaces in a scene onto the image plane; however, in
general, optical flow does not equal the motion field (Horn, 1986; Verri & Poggio,
1989). The reason for the lack of strict equivalence between the two flow fields
is that the motion field is a purely geometric construct, whereas optical flow
depends on assumptions placed on the photometric structure of the image, put
another way, the measurement of visual motion is a visual construction not merely
a passive process (Hoffman, 2000).
A more critical issue with image motion as a representational substrate is
its general inadequacy in describing general image dynamics (Simoncelli, 1993).
Figure 1.2 provides an illustration of several of these inadequacies in the real-
world. Figure 1.2 (a) shows a boat sailing in wavy water with the coast at
3
![Page 18: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/18.jpg)
a distance. In the case of the clear sky region, the lack of both spatial and
temporal variation makes the flow estimation indeterminate (i.e., there are an
infinite number of velocities that are consistent with this type of spatiotemporal
region). In the case of the wavy water, the assumptions typically associated
with defining optical flow are violated (e.g., brightness conservation and local
smoothness). Figure 1.2 (b) shows a campfire scene. In the case of the fire
region, it generally consists of pure temporal variation (flicker), which violates
the assumption that the spatiotemporal variation is due to motion. In the case of
the rising smoke region, (assuming that the smoke is rigid) there is more that one
motion (pointwise) present within this semi-transparency region: (i) the rising
smoke and (ii) the stationary backdrop. This violates the assumption that a
single flow field can describe the region.
A common approach to dealing with the flow inadequacies described above is
simply to ignore them when encountered, e.g., using confidence measures (Beau-
chemin & Barron, 1995) and robust estimators (Meer et al., 1991; Black & Anan-
dan, 1996). Alternatively, if one could encompass the broad range of image
dynamics in a stable manner, the conveyed information would be useful to sub-
sequent visual processes. For instance, the ability to recognize smoke or fire or
both would be useful to detect emergency situations. Motivated by this possibil-
4
![Page 19: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/19.jpg)
ity, the goal of this dissertation is to revisit the issue of the initial representation
of dynamics and its subsequent role in visual tasks.
5
![Page 20: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/20.jpg)
time
clear sky
wavy water
buildings
moving sail boat
(a)
time
fire
rising smoke
backdrop
(b)
Figure 1.2: Inadequacies of optical flow in natural scenes. (a) An image sequencewith a boat sailing on wavy water with the coast at a distance. (b) An imagesequence of a campfire. Checkmarks denote regions where flow recovery is ex-pected to be successful, regions marked with an “x” are inappropriate for flow.Images courtesy of the LabelMe video project (Yuen et al., 2009).
6
![Page 21: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/21.jpg)
1.2 The role of representation
A common theme in this dissertation is the role of representation in analyzing
dynamic imagery. The term “representation” refers to a formal system for making
explicit certain types of information, together with a specification of how to
compute it (Marr, 1982).
A fundamental issue in problem solving, irrespective of the problem domain,
is the selection of a representation of the information. This is important because
it can affect the degree of ease with which a problem can be solved. For instance,
arithmetic operations can be performed with ease when numbers are represented
in the Arabic and binary representations; however, these same operations are
relatively difficult with numbers represented in Roman numerals.
In this dissertation, the following criteria for a “good” representation are
adopted (cf. (Winston, 1994)):
• Significant structure (i.e., set of regularities) for a particular problem(s) are
made explicit while suppressing unnecessary detail.
• Concise in that it expresses things efficiently.
• Facilitates further computation.
• Computable by an existing procedure.
7
![Page 22: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/22.jpg)
• Applicable to useful tasks.
One representation for the analysis of visual spacetime that clearly does not
satisfy all the criteria of a “good representation” is the initial intensity imagery
(the pixel data). In this representation, although all the information is preserved,
no structure is made explicit, thus violating the first criteria. In the context of
problems related to dynamic imagery, the use of a single input image to represent
the temporal data is clearly inappropriate because it suppresses the dynamic
information (i.e., the significant structure).
This dissertation focuses on a particular representation of temporal image
dynamics that is broadly applicable to the diverse phenomena in the natural
world as captured in a temporal sequence of two-dimensional intensity images. In
particular, a distributed representation of visual spacetime along the dimension
of spacetime orientation is considered, where the relative contribution of each
orientation to the region of analysis is made explicit. The significance of spacetime
orientation structure is demonstrated by way of an analysis that shows that a
wide range of important dynamic patterns, both motion and non-motion, can be
modeled in a unified manner simply through consideration of their constituent
spacetime oriented structure in (x, y, t). Finally, the power of this particular
representation will be demonstrated in the context of several important visual
8
![Page 23: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/23.jpg)
tasks.
1.3 Contributions
The most significant contributions of this dissertation are the following:
• This dissertation emphasizes the representational issues regarding general
image dynamics and adheres to the criteria of a “good” representation out-
lined in the previous section. In particular, the proposed representation
is concerned with efficiently exposing constituent spacetime oriented struc-
ture in the data and forms the foundation for novel approaches to several
visual tasks considered in this dissertation.
• A generative model centred on the concept of spacetime orientation is pro-
posed. In this work, a generative model is defined as a set of primitive pat-
terns (i.e., spacetime orientation) that are combined based on a set of rules
to realize the space of patterns of interest. The proposed model provides
a unified framework for understanding a broad set of important spacetime
patterns. The discussion of the various spacetime patterns encompassed by
the model culminates with the introduction of two new classes of patterns
that have not been previously considered directly in terms of their con-
9
![Page 24: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/24.jpg)
stituent spacetime oriented structure, namely multiplicative motions (e.g.,
translucency) and stochastic-related phenomena (e.g., fire and windblown
vegetation). The insights gained from this analysis are used to motivate a
particular representation of visual spacetime.
• In the context of dynamic pattern recognition, a novel approach to repre-
sentation and analysis is proposed. Similar to spatial textures, the dynamic
patterns of concern range from structured patterns (e.g., motion) to more
stochastic ones (e.g., wavy water); these patterns collectively are termed
“spacetime textures”. Aggregated measures of spacetime orientation over
visual spacetime are proposed for characterizing a region’s dynamic content
and used for spacetime texture classification.
• The first unified approach to representing and segregating (i.e., grouping
and detecting local boundaries) a wide range of juxtaposed spacetime pat-
terns (motion, static, flicker, (pseudo-)transparency, translucency, space-
time texture, unstructured) is presented. Previous analyses of spacetime
patterns operate on a case-by-case basis, with image motion predominant.
• A novel approach to the challenge of spatiotemporally detecting and lo-
calizing an action in a video stream is proposed. The approach does not
10
![Page 25: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/25.jpg)
require preprocessing in the form of actor localization, tracking, motion es-
timation, figure-ground segmentation or extensive learning. The approach
can accommodate variable appearance of the same action, rapid dynamics,
multiple actions in the field-of-view, cluttered backgrounds and is resilient
to the addition of distracting foreground clutter. While others have dealt
with background clutter, it appears that the present work is the first to
address directly the foreground clutter challenge.
• As an additional contribution of this work, the video data sets and corre-
sponding ground truth used in evaluating the spacetime texture recognition
and segregation approaches have been made publicly available online for use
by other researchers.
1.4 Overview of thesis
Chapter 1 has served to motivate the issue of visual spacetime representation.
Chapter 2 presents a theoretical analysis of the spacetime oriented structure un-
derlying a wide range of important dynamic patterns, spanning motion and non-
motion dynamics. Patterns of concern are studied within a unified framework and
eschews ad hoc case-by-case analysis. Motivated by the findings in Chapter 2,
Chapter 3 proposes a representation suitable for characterizing a broad set of dy-
11
![Page 26: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/26.jpg)
namic patterns. This representation indicates the relative presence of a given set
of spacetime orientations in a spatiotemporal image region. With the represen-
tation established, Chapters 4 to 6 demonstrate the power of the representation
in the context of three diverse visual tasks. Chapter 4 begins by introducing the
concept of spacetime texture; this concept can be considered the spatiotemporal
analogue of spatial texture. Next, an approach to spacetime texture recognition
is proposed. Chapter 5 proposes a novel grouping mechanism and local boundary
detection scheme that segregates video based on differences in image dynamics
as captured by the proposed representation. Chapter 6 considers the problem of
detecting and spatiotemporally localizing an action defined by a single video tem-
plate in a larger video stream; this problem is termed “action spotting”. Finally,
Chapter 7 provides a summary of the dissertation with suggestions for future
investigations. Relevant related work is discussed in the individual chapters.
12
![Page 27: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/27.jpg)
Chapter 2
Modeling of Visual Spacetime
If you understand something in only one way, then youdon’t really understand it at all.
MARVIN MINSKY (Society of the Mind)
2.1 Introduction
2.1.1 Motivation
The goal of this chapter is to describe a broad set of spatiotemporal signals in
terms of their constituent spacetime oriented structure, both in the spacetime and
frequency domains. This will be done via a generative model that systematically
superimposes spacetime oriented structure. It is shown that this model captures
a wide range of spatiotemporal patterns that typically have been treated non-
13
![Page 28: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/28.jpg)
uniformly on a case-by-case basis. Furthermore, this unified perspective will
prove important in motivating the representation of spatiotemporal signals based
on measurements of spacetime oriented structure discussed in Chapter 3 and its
subsequent application in the visual tasks discussed in Chapters 4 to 6. Portions
of the work described in this chapter appear in (Derpanis & Wildes, 2010b).
2.1.2 Related work
The study of image dynamics in vision has been and continues to be dominated
by the optical flow model (Fennema & Thompson, 1979; Ullman, 1979; Lucas &
Kanade, 1981; Horn & Schunck, 1981). This model seeks to assign a vector field to
the input data indicative of apparent motion (i.e., image velocity/displacement).
A fundamental assumption underlying this model is that a single velocity is
present within a finite image region of analysis. This single motion assumption
commonly appears in the form of the conservation of some image feature prop-
erty over time (e.g., brightness (Horn & Schunck, 1981), phase (Fleet & Jepson,
1990), etc.).
The assumption of a single pointwise velocity is often too restrictive for model-
ing natural scenes. For instance, in many situations multiple image velocities are
present at an image point, including image neighbourhoods of semi-transparency
14
![Page 29: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/29.jpg)
and dynamic occlusion. Towards this end, several models have been proposed
that relax the single motion assumption and admit the superposition of two or
more translating signals, e.g., (Shizawa & Mase, 1991; Langley et al., 1992; Bergen
et al., 1992).
Others have considered the correspondence between image velocity and three-
dimensional orientation in (x, y, t) space (Adelson & Bergen, 1985), or equiva-
lently, modeled the image motion signal in the frequency domain, e.g., (Fahle
& Poggio, 1981; Adelson & Bergen, 1985; Watson & Ahumada, 1983; Heeger,
1988; Simoncelli, 1993). This latter approach has its intellectual antecedents in
earlier work in spatial vision, e.g., (Campbell & Robson, 1968), where it was
shown that perplexing aspects of spatial vision could be better understood in the
frequency domain. This viewpoint served as motivation for optical flow estima-
tion methods based on direction selective, spatiotemporal linear filters, e.g., (Wat-
son & Ahumada, 1985; Adelson & Bergen, 1985; Fleet & Jepson, 1985; Heeger,
1988).
It appears that Fleet (Fleet, 1992) presented the earliest computational anal-
ysis of multiple motions in the frequency domain. Subsequent research con-
cerned with multiplicative motion and dynamic occlusion proposed computa-
tional schemes for recovering the image velocity of constituent components within
15
![Page 30: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/30.jpg)
this framework (Fleet & Langley, 1994; Beauchemin & Barron, 2000b). These
early approaches focused on a world where constituent patterns were composed
of a few spectral components (e.g., sinusoids and plaids). In the real-world, the
spectra of image patterns is typically of a broadband nature (Simoncelli & Ol-
shausen, 2001). For the case of dynamic occlusion with broadband signals, (Yu
et al., 2002) demonstrated that the spectral features relied on in earlier work are
not reliable. Key to their analysis is understanding multiple motion phenomena
in a more natural domain rather than in a highly contrived one. A special case of
multiple superimposed motions considered in (Langer & Mann, 2003) consists of
image patterns where the constituent elements share the same direction of motion
but differ in speed (e.g., panning across a stationary cluttered scene with varying
depth, falling snow and traffic scenes), such patterns are termed “optical snow”.
More generally, (Wildes & Bergen, 2000) considered a broader scope of dynamic
image patterns than those discussed above, including both motion-related (e.g.,
coherent motion and semi-transparency) and non-motion-related signals (e.g.,
flicker and noise-like patterns).
Recently, attention has turned towards modeling visual processes composed of
ensembles of particles exhibiting stochastic dynamics, such as plumes of smoke,
fire, dynamic water and windblown vegetation; these processes are commonly
16
![Page 31: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/31.jpg)
referred to as “dynamic textures”. In these cases, optical flow recovery proves
challenging since assumptions underlying these methods are often violated (e.g.,
image brightness conservation and motion smoothness). To address these cases,
statistical generative models have been proposed to jointly capture the spatial ap-
pearance and dynamics of the input, e.g., (Szummer & Picard, 1996; Fitzgibbon,
2001; Doretto et al., 2003a).
Overall, while these various models have considered similar dynamic image
patterns to those considered in the sequel, none of this work has uniformly con-
sidered the complete set of patterns that are the main subject of the current
chapter.
2.1.3 Contributions
The major contributions in this chapter are two-fold. (i) A generative model
centred on the concept of spacetime orientation is proposed. This model pro-
vides a unified framework for understanding a broad set of important spacetime
patterns. (ii) The discussion of the various spacetime patterns encompassed by
the model culminates with the introduction of two new classes of patterns that
have not been previously modeled directly in terms of their constituent space-
time oriented structure, namely multiplicative motions (e.g., translucency) and
17
![Page 32: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/32.jpg)
stochastic-related phenomena (e.g., dynamic water and windblown vegetation).
2.2 Preliminaries
2.2.1 Relevance of frequency analysis
The Fourier transform is a global transform and as such care must be taken
in extrapolating results to local phenomena. A common property among the
phenomena to be studied is that they are characterized by linear structures in
the spectral domain. These structures represent idealizations. In practice, as
a consequence of the “uncertainty principle” (Bracewell, 2000), these structures
are subject to blurring by the window of analysis. The use of smooth windows,
such as a Kaiser window (Harris, 1978), can ameliorate this problem to some
degree but will not remove it completely. The use of larger windows can also
reduce this problem; however, this increases the possibility of mixing simple local
structures. This dilemma represents an instance of the “generalized aperture
problem” (Jepson & Black, 1993).
In order to apply the Fourier transform, the signal must conform to the Dirich-
let conditions (Bracewell, 2000). These conditions require that over any interval
the signal is absolutely integrable, of bounded variation and that it has a finite
18
![Page 33: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/33.jpg)
number of discontinuities, each of which is finite. Since any measured physi-
cal signal satisfies these conditions, the analysis is ensured to hold for arbitrary
natural image sequences.
2.2.2 Generative model
For the spacetime patterns of concern in this chapter, the following recursive
procedure is used as the generative model for obtaining the final image from
component layers (Adelson & Anandan, 1990). Assume that the depth ordering
of N layers relative to the viewer is given, where the layer composition results
are denoted I0(x), . . . , IN−1(x), with x = (x, y, t)> denoting the spatial, (x, y),
and temporal, t, coordinates. At each pixel, each layer, n, may partially transmit
the total amount of light from the layers beneath it by a transmittance factor of
Tn(x), where 0 ≤ Tn(x) ≤ 1, and may contribute its own emission of quantity
En(x), where En(x) ≥ 0. The non-negativity of the emission term, En(x), follows
from the fact that it represents power per unit foreshortened area per unit solid
angle (radiance) and thereby cannot take on negative values. The boundary cases
of the transmittance factor consisting of zero and one, indicate that the light from
the previous layers is fully attenuated and fully transmitted, respectively. The
final composite image is the result of applying this process recursively from back-
19
![Page 34: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/34.jpg)
to-front, formally,
In(x) = Tn(x)In−1(x) + En(x), (2.1)
where n ≥ 0 and I0(x) ≡ E0(x). Strictly speaking, the image signal is given
in terms of irradiance, while In(x) is given as scene radiance in the generative
model, (2.1); however, since image irradiance is proportional to scene radiance
(Horn, 1986), this distinction is neglected here, as it has been in developing
other applicable analyses of semi-transparency, e.g., (Adelson & Anandan, 1990;
Szeliski et al., 2000).
In the sequel, the generative model, (2.1), is used as a basis for understand-
ing the spacetime oriented structure of various spacetime patterns in a system-
atic fashion. A novel aspect of the presented model in comparison to previ-
ous formulations used to model dynamic imagery (Fleet, 1992; Fleet & Langley,
1994; Beauchemin & Barron, 2000b; Yu et al., 2002), is the explicit introduc-
tion of the non-negativity constraint of the image signal and transmittance. It
will be demonstrated in the context of multiplicative motion patterns that en-
forcing the non-negativity constraint yields oriented structure in the frequency
domain that was conjectured in earlier work to be annihilated in the composi-
tion process (Fleet, 1992; Fleet & Langley, 1994). Interestingly, the constraint of
non-negativity of the image signal has previously appeared in work concerning
20
![Page 35: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/35.jpg)
multi-dominantorientation
unconstrainedorientation(limit case)
x
dominantorientation
y
t
ωy
ωt
ωx
spat
iote
mpo
ral
dom
ain
frequ
ency
dom
ain
heterogeneousorientation
(typically referred to as dynamic texture)
breakdownin characteristic orientation
superposition of spacetime orientation
underconstrainedorientation
degenerate
isotropicstructure
(limit case)
Figure 2.1: Range of spacetime oriented structure. The bottom and top rows ofimages depict prototypical patterns of dynamic structure in the spacetime andfrequency domains, respectively. The horizontal axis indicates the amount ofspacetime oriented structure superimposed in a pattern, with increasing amountsgiven along the rightward direction.
the simultaneous reconstruction of component images and recovery of motions in
imagery containing reflections and transparency (Szeliski et al., 2000); however,
the authors did not pursue the implication of the non-negativity constraint on
the explicit structure of the signal.
21
![Page 36: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/36.jpg)
2.3 Analysis
The local orientation (or lack thereof) of a pattern is a salient characteristic. Ge-
ometrically, spacetime orientation captures the local first-order correlation struc-
ture of a pattern. For vision, local spacetime orientation can have additional
interpretations. Figure 2.1 illustrates the significance of this structure in terms
of describing a range of dynamic patterns in video data, cf. (Wildes & Bergen,
2000). With reference to Fig. 2.1 and as discussed in Section 2.1.2, image velocity
corresponds to a three-dimensional orientation in (x, y, t); indeed, motion repre-
sents a prominent instance of the depicted dominant oriented patterns. To the
left of the dominant oriented pattern reside two degenerate cases corresponding
to patterns where the recovery of spacetime orientation is underconstrained (e.g.,
the aperture problem and pure temporal luminance flicker) and at the limit fully
unconstrained (e.g., blank wall).
Starting again from a dominant oriented pattern and superimposing an addi-
tional spacetime orientation yields a multi-dominant oriented pattern (e.g., semi-
transparency). Here, two spacetime orientations dominate the pattern description
(Fleet, 1992). Continuing the superposition process to the limit, yields the special
case of isotropic structure (e.g., “television snow”), where no discernable orienta-
tion dominates the local region; nevertheless, significant spatiotemporal contrast
22
![Page 37: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/37.jpg)
is present (Wildes & Bergen, 2000). In between the cases of multi-dominant
and isotropic structure lie various complicated phenomena that arise as multiple
spacetime oriented structures (e.g., motions) are composited; these patterns are
commonly termed “dynamic textures” (Doretto et al., 2003a). Occurrences in the
world that give rise to such visual phenomena include those governed by turbu-
lence and other stochastic processes (e.g., dynamic water, windblown vegetation,
smoke and fire).
The remainder of this chapter is concerned with providing formal models for
the range of patterns depicted in Fig. 2.1. The first part of the discussion is
concerned with several extant models (i.e., translation, aperture problem and
additive transparency). In conjunction with the generative model, these models
will serve to introduce two new classes of patterns (i.e., multiplicative motion and
dynamic texture) that have not previously been modeled directly in terms of their
constituent spacetime oriented structure. (For the purpose of keeping this dis-
cussion self-contained, non-original model derivations are relegated to Appendix
A.)
23
![Page 38: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/38.jpg)
2.3.1 Dominant orientation
Spacetime patterns consisting of a single spacetime orientation have traditionally
been associated with textured spatial patterns moving with a uniform velocity
(Fahle & Poggio, 1981; Adelson & Bergen, 1985; Watson & Ahumada, 1985;
Heeger, 1988); static patterns correspond to the special case of zero velocity. In
the frequency domain, the energy of these patterns correspond to a plane through
the origin, with the planar surface slant indicative of velocity.
More formally, consider an image signal, I(x), defined by a 2D spatial intensity
profile I(x, y) moving with velocity v = (u, v)>. In the spatiotemporal domain,
the model for a constant translating image signal is given by,
I(x) = I(x− ut, y − vt). (2.2)
The corresponding spectrum is given by (Fahle & Poggio, 1981; Watson & Ahu-
mada, 1985; Adelson & Bergen, 1985; Heeger, 1988) (see Appendix A.1 for deriva-
tion),
I(k) = I(ωx, ωy)δ(ωxu+ ωyv + ωt), (2.3)
where k = (ωx, ωy, ωt)> denotes the spatiotemporal frequency vector, I denotes
the Fourier transform of the corresponding signal, I, and δ(·) is the Dirac delta
function. Geometrically, this can be interpreted as the spectrum being restricted
24
![Page 39: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/39.jpg)
to a plane through the origin with normal (v>, 1)>. In the 2D case, consisting
of a single spatial dimension, x or y, and time, t, the planar spectra reduces to a
line through the origin. This motion is often referred to as first-order motion or
Fourier motion in the literature.
A lesser known instance of a (approximately) single oriented pattern are im-
age sequences of rain and snow streaks. In layman’s terms, this instance is some-
times described as “it’s raining pins and needles”. Here, spacetime orientation
lies perpendicular to the temporal axis with the spatial orientation component
corresponding to the streaks’ spatial orientation. In the frequency domain, the
energy of these patterns lies approximately on a plane through the temporal fre-
quency axis, where the orientation of the spectral plane is related to the spatial
orientation of the streaks; for a derivation of this model, see (Barnum et al.,
2010).
Beginning with a single spacetime oriented pattern and superimposing space-
time orientations spanning a narrow range about the initial oriented pattern
yields “optical snow” (Langer & Mann, 2003) and “nowhere-static” (Fitzgibbon,
2001) patterns. Optical snow arises in many natural situations where the im-
aged scene elements are restricted to a single direction of motion but vary in
speed (e.g., camera motion across a static scene containing a range of depths
25
![Page 40: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/40.jpg)
and vehicular traffic scenes). In the frequency domain, the energy approximately
corresponds to a “bow tie” signature formed by the superposition of planes. In
contrast, “nowhere-static” patterns do not impose such local directionality con-
straints. Examples of such patterns include, scenarios where a camera pans over
a scene exhibiting stochastic movement (e.g., windblown flowers and cheering
crowds). In the frequency domain, one can think of the energy as corresponding
to the superposition of several “bow ties”, each sharing a common central space-
time orientation. These two patterns and single oriented patterns are collectively
referred to as “dominant oriented” patterns herein.
2.3.2 Degenerate orientation
To the left of the single oriented pattern in Fig. 2.1 reside two degenerate cases
corresponding to spacetime orientation that is partially specified and at the limit
completely unspecified, respectively. In the frequency domain, the energy of the
partially specified case corresponds to a line through the origin; in the frequency
domain the spacetime orientation ambiguity is due to an infinite number of planes
containing the given spectral line through the origin. In the limit, a region
can totally lack any spatiotemporal contrast or vary slowly. Correspondingly,
the frequency domain correlate is isolated in the low-frequency portion of the
26
![Page 41: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/41.jpg)
spectrum (ideally consisting of only a zero frequency (DC) component). This
case is referred to as “unstructured”.
A commonly cited example of an underconstrained spacetime oriented signal
is the “aperture problem” (Horn, 1986). This problem refers to the situation
where the spatial structure of a translating pattern is one-dimensional (e.g., an
edge) within the region of analysis (i.e., the aperture). In this case only the
velocity component in the direction of the normal to the spatial intensity profile
can be recovered.
More formally, consider an image signal, I(x), defined by a 1D spatial intensity
profile I(x) with n a 2D unit vector normal to the spatial oriented structure and
translating with a velocity vn in the direction of n (i.e., the normal velocity). In
the spatiotemporal domain, the aperture problem can be written as
I(x) = I[(n>,−‖vn‖)x]. (2.4)
The corresponding spectrum is given by (Fleet, 1992; Bigun, 2006) (see Appendix
A.2 for derivation),
I(k) = I[(ωx, ωy)n]δ[(ωx, ωy)u]δ[(v>n , 1)k], (2.5)
where u is orthogonal to n. Geometrically, this can be interpreted as the spec-
trum being restricted to a line through the origin. This follows simply by the
27
![Page 42: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/42.jpg)
intersection of the two spectral planes, (ωx, ωy)u = 0 and (v>n , 1)k = 0, defined
by the two Dirac delta components.
A second example of an underconstrained spacetime oriented signal are in-
stances of pure temporal luminance variation or simply flicker (e.g., lightning
illuminating the sky). Such signals are devoid of spatial appearance structure
but exhibit pure temporal variation.
More formally, consider an image signal, I(x), defined by a 1D temporal
intensity profile I(t). In the spatiotemporal domain, flicker can be written as
I(x) = I(t). (2.6)
The corresponding spectrum is simply given by,
I(k) = I(ωt)δ(ωx)δ(ωy). (2.7)
Geometrically, this can be interpreted as the spectrum restricted to a line through
the origin that lies strictly on the temporal frequency axis.
2.3.3 Multi-dominant orientation
To the immediate right of the dominant oriented pattern in Fig. 2.1 reside multi-
dominant patterns where two or more distinct spacetime orientations dominate
the pattern description (e.g., semi-transparency and translucency). In the fre-
28
![Page 43: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/43.jpg)
quency domain, the energy corresponds to two or more planes, each representative
of its respective spacetime orientation. Interestingly, several psychophysical stud-
ies (Mulligan, 1992; Edwards & Greenwood, 2005) have concluded that human
observers can simultaneously perceive two coherently moving patterns (layers)
of white noise that have been combined additively; however, beyond two layers,
humans are no longer able to perceive all layers simultaneously.
In the analysis to follow, the assumption of broadband signals, cf. (Yu et al.,
2002), is augmented with an additional natural domain constraint, that of non-
negativity of the image signal and attenuation/transmittance in the signal com-
position stage, as encompassed by the generative model, (2.1). It will be demon-
strated that the addition of this simple constraint imposes oriented spatiotempo-
ral structure that was previously claimed to be lost in the context of multiplicative
compositions of multiple motions (Fleet, 1992). Without loss of generality, the
number of layers under consideration will be restricted to two.
In the discussion to follow, patterns are distinguished in terms of whether
components layers in the generative model, (2.1), are combined additively or mul-
tiplicatively. Section 2.3.3.1 considers additive dynamic patterns, most notably
semi-transparency. Sections 2.3.3.2 to 2.3.3.3 discuss models of multiplicative
dynamic patterns that arise in situations of translucency, dynamic occlusion and
29
![Page 44: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/44.jpg)
pseudo-transparency.
2.3.3.1 Additive case: Semi-transparency
Reflections are a common source of additive multi-dominant oriented patterns. In
many cases, resulting image patterns contain mixtures of reflected and transmit-
ted light. Common examples include, glass-like and water surfaces that transmit
a reflected image of the environment and a partially uniformly attenuated image
of the scene behind it.
Formally, consider an image signal, I0(x, y), viewed through a non-refractive
translucent layer with a spatially homogeneous transmittance denoted by the con-
stant α. In addition, the translucent layer contributes its own emission, E1(x, y),
due to reflection. If both the image signal and translucent material are moving
with velocities v0 = (u0, v0)> and v1 = (u1, v1)>, respectively, using the genera-
tive model, (2.1), the image sequence signal can be written as
I1(x) = αI0(x) + E1(x) (2.8)
= αI0(x− u0t, y − v0t) + E1(x− u1t, y − v1t). (2.9)
By the superposition property of the Fourier transform (Bracewell, 2000), its
spectrum consists of the sum of translational spectra of the individual layers
30
![Page 45: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/45.jpg)
(Fleet, 1992)
I1(k) = αI0(k) + E1(k) (2.10)
= αI0(ωx, ωy)δ(ωxu0 + ωyv0 + ωt) + E1(ωx, ωy)δ(ωxu1 + ωyv1 + ωt).
(2.11)
2.3.3.2 Multiplicative case: Translucency
Common situations that give rise to multiplicative patterns of concern in this sec-
tion are: structured lighting/shadows effects and translucency (i.e., non-uniform
transmission of light through a surface).
Assume that an image signal, I0(x, y), is viewed through a non-refractive
translucent layer with transmittance T1(x, y). If components I0(x, y) and T1(x, y)
are moving with velocities v0 = (u0, v0)> and v1 = (u1, v1)>, respectively, using
the generative model, (2.1), the image sequence signal can be written as
I1(x) = T1(x− u1t, y − v1t)I0(x− u0t, y − v0t). (2.12)
From the generative model, the transmittance factor is strictly non-negative.
Consequently, T1(x, y) can be reexpressed as the sum of a constant/DC term α
and a zero mean signal, T (x, y) = T1(x, y)− α. Furthermore, to reflect the non-
negative nature of image signals, I0(x, y) can be reexpressed as a constant/DC
31
![Page 46: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/46.jpg)
term β plus a zero mean signal, I(x, y) = I0(x, y)−β. Including these constraints
in (2.12), yields,
I1(x) =
(α + T (x− u1t, y − v1t)
)(β + I(x− u0t, y − v0t)
)(2.13)
= αβ
+ αI(x− u0t, y − v0t) + βT (x− u1t, y − v1t)
+ T (x− u1t, y − v1t)I(x− u0t, y − v0t). (2.14)
From the standard Fourier motion result (Section 2.3.1), the superposition
property and the convolution theorem of the Fourier transform (Bracewell, 2000),
it can be easily shown that the Fourier transform of (2.14) is
I1(k) = αβδ(k)
+ αI(ωx, ωy)δ(ωxu0 + ωyv0 + ωt) + βT (ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
+
(T (ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
)∗(I(ωx, ωy)δ(ωxu0 + ωyv0 + ωt)
),
(2.15)
where ∗ symbolizes the convolution operator. Assuming broadband component
signals, the first term corresponds to a zero frequency (DC) term. The second
and third terms correspond to two oriented spectral planes. Their normal vectors
(v>0 , 1)> and (v>1 , 1)> denote their respective layer velocities. The final term
corresponds to the convolution between two 3D planes that yields a non-oriented
32
![Page 47: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/47.jpg)
*
(T (k)δ(kv1 + ωt))
(I(k)δ(kv0 + ωt))
αI(k)δ(kv0 + ωt) βT (k)δ(kv1 + ωt)
+ ++
ωt
ωx
αβδ(k, ωt)
=
Figure 2.2: Synthetic multiplicative transparency image sequence. The term-by-term composition of a multiplicative transparency is illustrated in the 2D, ωx−ωt,frequency domain (magnitude terms displayed), as given by (2.15). The whitenoise structures are moving in opposite directions with a speed of 1 pixel/frame.A Kaiser window in the spatiotemporal domain was used to reduce windowingdistortions for all terms, except the DC term. For display purposes, the logarithmof the spectrum is displayed.
structure in the case of broadband signals. Finally, one can include the emission
term, E1(x), that will result in the strengthening of the planar spectral structure
of the translucent layer.
Figure 2.2 illustrates the frequency spectra for the various terms of (2.15)
and their compositional result. For illustrative purposes attention is restricted
to the 2D case (x-t). The constituent image signals are both white noise moving
in opposite directions with a speed of 1 pixel/frame. As can be seen, enforcing
33
![Page 48: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/48.jpg)
the non-negativity constraint on the image signal by way of introducing DC
components reveals the oriented structure of the constituent surfaces with the
convolution (distortion) term acting as a non-oriented noise-like backdrop. In
the case where both DC terms are zero, a clear violation of the non-negativity
constraint, the translucency reduces to the non-oriented term,(T (ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
)∗(I(ωx, ωy)δ(ωxu0 + ωyv0 + ωt)
). (2.16)
In Fig. 2.3, a real translucency example is presented. This example consists
of a painting moving behind a spatially varying translucent material, that is also
in motion, and captured by a stationary video camcorder. In the epipolar slice
image, two symmetric diagonal oriented structures are clearly evident. Corre-
spondingly, the main power in the spectral domain is dominated by two lines
through the origin. These structures are consistent with the constant leftward
and rightward motions present within the analysis window.
Structured lighting and shadows can also be modeled as multiplicative mo-
tions as the local surface albedo determines the proportion of impinging light
that is reflected. In Fig. 2.4, a real structured light example is presented. This
example consists of a moving structured light illuminating a textured surface
moving in the opposite direction, captured by a stationary video camcorder. In
the epipolar slice image, two symmetric diagonal oriented structures are clearly
34
![Page 49: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/49.jpg)
evident. Correspondingly, the main power in the spectral domain is dominated
by two lines through the origin. These structures are consistent with the constant
leftward and rightward motions present within the analysis window.
As an example application of the novel theoretical analysis of multiplicative
motions presented in this section, (Derpanis & Wildes, 2010b) considers the task
of multiple motion recovery with multiplicatively combined image motion signals.
35
![Page 50: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/50.jpg)
(a)
gogh.starry-night.jpg (JPEG Image, 1013x834 pixels) - Scaled (63%) http://www.ibiblio.org/wm/paint/auth/gogh/starry-night/gogh.starry-night.jpg
1 of 1 1/8/2009 12:00 PM
(b) (c) (d)
(e) (f)
Figure 2.3: Real translucency example. (a) and (b) depict the “raffia weave” tex-ture from the Brodatz database (Brodatz, 1966) and van Gogh’s “Starry Night”painting, respectively, used to form the constituent layers of the translucencyexample. (c) and (d) represent the first and last frames of a 32 frame imagesequence depicting the “Starry Night” painting (i.e., opaque surface) moving be-hind an acetate (i.e, translucent material) depicting the “raffia weave” texture,captured by a stationary video camcorder. The surfaces are moving in oppositedirections (leftward and rightward) with approximately equal speed. The move-ments were generated using computer-controlled translating stages. The whitesolid lines overlayed for display purposes only in (c) and (d) denote the 32 pixel(horizontal) spatial extent of the analysis window; the temporal extent of theanalysis window is 32 frames. (e) The epipolar slice of the sequence along theanalysis window; the spatial and temporal axes point rightward and downward,respectively. (f) The 2D power spectrum of the Kaiser windowed analysis region;the origin of the spectrum lies in the middle of the image. For display purposes,the DC component has been removed.
36
![Page 51: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/51.jpg)
(a) (b) (c) (d)
(e) (f)
Figure 2.4: Real structured light example. (a) and (b) depict the “wood grain”and “pigskin” textures, respectively, from the Brodatz database (Brodatz, 1966),used to form the constituent layers of the structured light example. (c) and (d)represent the first and last frames of a 32 frame image sequence depicting a mov-ing structured light pattern projected (using an LCD projector) onto an opaquemoving surface, captured by a stationary video camcorder. The structured lightand opaque surface depict the “wood grain” and “pigskin” textures, respectively.The surfaces are moving in opposite directions (leftward and rightward) with ap-proximately equal speed. The movement of the opaque surface was generatedusing a computer-controlled translating stage. The white solid lines overlayed fordisplay purposes only in (c) and (d) denote the 32 pixel (horizontal) spatial extentof the analysis window; the temporal extent of the analysis window is 32 frames.(e) The epipolar slice of the sequence along the analysis window; the spatial andtemporal axes point rightward and downward, respectively. (f) The 2D powerspectrum of the Kaiser windowed analysis region; the origin of the spectrum liesin the middle of the image. For display purposes, the DC component has beenremoved.
37
![Page 52: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/52.jpg)
2.3.3.3 Multiplicative case: Occlusion
In this section, the analysis of dynamic occlusion given in (Yu et al., 2002) is
extended by enforcing the non-negativity constraint on image signals. Note that
unlike the previous multiplicative dynamic patterns discussed, where two motions
were defined at each image point, the following model describes a situation where
two motions are spatially juxtaposed (i.e., only a single velocity is present at each
point but over a given neighbourhood two motions are present).
Assume that an image signal, I0(x, y), moves with a velocity v0 = (u0, v0)>
behind an opaque surface with transmittance T1(x, y) and emission E1(x, y) mov-
ing with velocity v1 = (u1, v1)>. Unlike the translucency case in Section 2.3.3.2,
the transmittance function is now binary (i.e., T1(x, y) ∈ 0, 1). The occlusion
relationship between the two surfaces can be modeled as follow,
I1(x) =
(1− T1(x− u1t, y − v1t)
)E1(x− u1t, y − v1t)
+ T1(x− u1t, y − v1t)I0(x− u0t, y − v0t). (2.17)
Next, let the non-negativity constraints be introduced to both the transmittance
and emission terms. The transmittance T1(x, y) can be reexpressed as the sum of
a constant/DC term α and a zero mean signal, T (x, y) = T1(x, y)−α. While the
emission terms, E1(x, y) and I0(x, y) can be reexpressed as constant/DC terms,
38
![Page 53: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/53.jpg)
β and γ, plus zero mean signals: E(x, y) = E1(x, y)−β and I(x, y) = I0(x, y)−γ.
Introducing these constraints in (2.17) yields,
I1(x) =
(1− α− T (x− u1t, y − v1t)
)(β + E(x− u1t, y − v1t)
)+
(α + T (x− u1t, y − v1t)
)(γ + I(x− u0t, y − v0t)
). (2.18)
The corresponding Fourier transform can be written as,
I1(k) =
((1− α)β + αγ
)δ(k)
+ αI(ωx, ωy)δ(ωxu0 + ωyv0 + ωt)
+ (1− α)E(ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
+ (γ − β)T (ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
−(T (ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
)∗(E(ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
)+
(T (ωx, ωy)δ(ωxu1 + ωyv1 + ωt)
)∗(I(ωx, ωy)δ(ωxu0 + ωyv0 + ωt)
).
(2.19)
The final step consists of defining a transmittance function. Following (Fleet,
1992; Yu et al., 2002), the two-dimensional Heaviside function (i.e., a unit step)
is used for the support of the occluder,
T1(x, y) =
1, (x, y)n ≥ 0
0, otherwise
, (2.20)
39
![Page 54: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/54.jpg)
where n denotes the 2D unit normal vector to the occluding boundary. Note that
the DC term of (2.20) is given by α = 1/2. The assumption of a linear occluding
boundary can be justified on the grounds that the region of analysis that straddles
the boundary is generally much smaller than the constituent surfaces.
With the occlusion model, (2.19), fully specified it now can be interpreted
with broadband signal components. The first term corresponds to a DC com-
ponent. The second and third terms correspond to the scaled spectral planes of
the occluded and occluder signals, respectively. Their normal vectors (v>0 , 1)>
and (v>1 , 1)> denote their respective layer velocities. Here, it is interesting to
point out that in the present derivation both the occluder and occluded signals
explicitly appear as separate terms (ignoring scale and bias); whereas, in the orig-
inal derivation of Eq. (12) in (Yu et al., 2002) only the occluder signal appears
undistorted. The oriented structure of the occluder signal is reinforced by the
fourth and fifth terms. Observing that the motion of the Heaviside function is
an instance of the aperture problem, the final term corresponds to a convolution
between a 3D line and a 3D plane. This lone term contributes to a distortion
from the ideal case of superposition between two oriented planes. As pointed out
in (Yu et al., 2002), the influence of the convolution of the line corresponds to a
hyperbolic distortion that can be assumed negligible as compared to noise. Im-
40
![Page 55: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/55.jpg)
portantly, the main energy of the spectrum lies on the two spectral planes given
by the motion of the occluder and occluded signals.
2.3.3.4 Multiplicative case: Pseudo-transparency
Pseudo-transparency (also commonly referred to as diaphanous or gauzy/sheer
transparency) can also be accommodated by the model, (2.19). This spacetime
structure corresponds to the case where the holes in a perforated occluder are be-
low the observer’s spatial resolution limit (Kersten, 1991). In other words, across
the region of concern each analysis window contains both the foreground and
background. A prime example of this case in the real-world is viewing a moving
object through a fragmented surface, such as a fence, leafless bush, grassy field,
etc. Assuming that the binary transmittance function of the occluder is broad-
band, reflecting its typically “complex” nature, (2.19) can again be interpreted
as two oriented spectral planes through the origin reflecting the velocities of the
two surfaces, where the distortion in the last term, as in the case of translucency,
corresponds to a non-oriented noise backdrop.
In Fig. 2.5, a real pseudo-transparency example is presented. This example
consists of a person moving rightward behind a stationary chain linked fence,
captured by a stationary video camcorder. In the epipolar slice image, diagonal
41
![Page 56: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/56.jpg)
and vertical oriented structures are clearly evident. Correspondingly, the main
power in the spectral domain is dominated by two lines through the origin. These
structures are consistent with the constant rightward motion and static structures
present within the analysis window.
42
![Page 57: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/57.jpg)
(a) (b)
(c) (d)
Figure 2.5: Real pseudo-transparency example. (a) and (b) represent the firstand last frames of a 32 frame image sequence depicting a person moving rightwardbehind a stationary chain linked fence, captured by a stationary video camcorder.The white horizontal lines overlayed for display purposes only in (a) and (b) de-note the 32 pixel (horizontal) spatial extent of the analysis window; the temporalextent of the analysis window is 32 frames. (c) The epipolar slice of the sequencealong the analysis window; the spatial and temporal axes point rightward anddownward, respectively. (d) The 2D power spectrum of the Kaiser windowedanalysis region; the origin of the spectrum lies in the middle of the image. Fordisplay purposes, the DC component has been removed.
43
![Page 58: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/58.jpg)
2.3.4 Heterogeneous orientation and isotropic structure
Continuing the superposition process in Fig. 2.1 to the limit, yields the special
case of isotropic structure (e.g., “television snow”), where no discernable orienta-
tion dominates the local region; nevertheless, significant spatiotemporal contrast
is present. Such dynamic patterns are referred to as “scintillation” in (Wildes &
Bergen, 2000). In the frequency domain, the energy of this pattern corresponds
to an isotropic response throughout.
In between the cases of multi-dominant oriented and isotropic structure pat-
terns lie various complicated phenomena that arise as multiple spacetime oriented
structures (e.g., motions); here lie the patterns that have been often termed “dy-
namic textures” (Doretto et al., 2003a). In this dissertation, this broad set of
patterns that are formed by diverse mixtures of spacetime oriented structures are
collectively referred to as heterogeneous oriented. Occurrences in the world that
give rise to such visual phenomena include those governed by turbulence and
other stochastic processes (e.g., dynamic water, windblown vegetation, smoke
and fire). The key underlying assumption to distinguishing these patterns is that
they exhibit a characteristic distribution of spacetime oriented structure.
44
![Page 59: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/59.jpg)
2.3.5 Recapitulation
This section has shown that the local spacetime orientation of a visual pattern
captures significant, meaningful aspects of its dynamic structure. As illustrated
in Fig. 2.1, the key unifying idea to this analysis is understanding the various
diverse patterns in terms of superimposed spacetime oriented structure, as cap-
tured by the proposed generative model. The first part of the discussion was
concerned with several well-known models of dynamic patterns (i.e., translation,
aperture problem and additive transparency). In conjunction with the gener-
ative model, these models were used to introduce two new classes of patterns
(i.e., multiplicative motion and dynamic texture) that have not previously been
considered directly in terms of their constituent spacetime oriented structure.
2.4 Discussion and summary
The implications of this chapter are both theoretical and practical. From a
theoretical point of view, it is shown that spacetime oriented structure is indeed
present in multiplicative motion signals (e.g., translucency) by considering the
physical constraint that natural image signals cannot take on negative values.
More generally, the presented analysis shows that a wide range of important
45
![Page 60: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/60.jpg)
dynamic patterns (depicted in Fig. 2.1), both motion and non-motion, can be
modeled in a unified manner simply through consideration of their constituent
spacetime oriented structure in (x, y, t).
From a practical viewpoint, with the common structure of various dynamic
patterns revealed, correspondingly unified image processing and inference mech-
anisms can be developed. Such developments can remove the need for operations
that proceed on a case-by-case basis, including the need for potentially compli-
cated integration mechanisms. In particular, the theoretical developments can
serve to motivate further techniques for image sequence processing and interpre-
tation based on spatiotemporal orientation measurements, irrespective of whether
multiple motions are present or not, irrespective of whether multiple motions are
combined additively or multiplicatively and irrespective of whether the pattern
dynamics is best described as motion in the first place. In the remainder of this
dissertation, this unified perspective will prove important in motivating the repre-
sentation of spatiotemporal signals based on measurements of spacetime oriented
structure discussed next in Chapter 3 and its subsequent application in the visual
tasks discussed in Chapters 4 to 6.
46
![Page 61: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/61.jpg)
Chapter 3
Representation of Visual
Spacetime
How complex or simple a structure is depends criticallyupon the way we describe it. Most of the complex struc-tures found in the world are enormously redundant, andwe can use this redundancy to simplify their description.But to use it, to achieve the simplification, we must findthe right representation.
HERBERT A. SIMON
47
![Page 62: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/62.jpg)
3.1 Introduction
3.1.1 Motivation
Chapter 2 demonstrated that the local spacetime orientation description of a
visual dynamic pattern captures significant, meaningful aspects of its temporal
variation. In particular, these patterns can be viewed as (multi-)oriented intensity
structures in spacetime. This view suggests a signal decomposition using filters
tuned to 3D spacetime oriented structure in the spatiotemporal domain, (x, y, t),
or equivalently, in terms of the set of constituent planes through the origin in the
frequency domain, (ωx, ωy, ωt).
The goal of the present chapter is the development of a representation of the
input data at the earliest stages of processing that is broadly applicable to the
diverse phenomena one may encounter in the natural world. It is proposed that
by selecting an ill-suited image representation of dynamics, even trivial visual
tasks in one representation may be rendered impossible to solve in others. In the
next section it is argued that extant representations of dynamics are generally
ill-suited for natural imagery. This is followed by a discussion of the proposed
representation that is concerned with exposing constituent spacetime oriented
structure in the data. The power of the proposed representation is demonstrated
48
![Page 63: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/63.jpg)
in the context of several visual tasks that follow in Chapters 4 to 6.
3.1.2 Related work
Figure 3.1 illustrates an organization of several extant representations of image
dynamics. The cited representations are sorted based on their commitment to an
underlying model (i.e., level of abstraction). At one extreme, no commitment to
an abstraction is made, the raw signal is used directly (e.g., pointwise intensity
and colour). This representation fails to leverage the rich underlying structure in
the data.
Moving away from the raw image signal, one could abstract the data in terms
of a (normalized) spectral decomposition (i.e., filters localized in different por-
tions of the frequency domain), e.g., (Chomat & Crowley, 1999). This approach
may be likened to a generalized texture analysis, cf. (Rubner & Tomasi, 1996). A
drawback of this representation is that the dynamics information is confounded
with appearance information. Consequently, regions sharing similar dynamics
but differing in spatial appearance will map differently in this space. Represen-
tations based on spacetime gradients, e.g., (Zelnik-Manor & Irani, 2006), and
frame differencing, e.g., (Mounts, 1969; Jain & Nagel, 1979), form special cases
of the generalized texture representation. Similar to the generalized texture rep-
49
![Page 64: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/64.jpg)
resentation, the spacetime gradient measure is problematic since it is confounded
with appearance. In terms of frame differencing, beyond the difficulty of setting
the level of change deemed significant, a major drawback is that it does not pro-
vide direct means for interpreting the source of the change since it confounds
all observed temporal variation together, irrespective of its origins (e.g., moving
surface vs. scene illumination change).
At the other extreme of Fig. 3.1, one assumes that a single or some spec-
ified number of superimposed vector fields (denoting pointwise image velocity
estimates) can represent the input signal. This represents the dominant focus
in the literature, as exemplified by the multitude of review papers (Huang &
Tsai, 1981; Aggarwal, 1986; Nagel, 1986; Hildreth & Koch, 1987; Aggarwal &
Nandhakumar, 1988; Vega-Riveros & Jabbour, 1989; Barron et al., 1994a; Beau-
chemin & Barron, 1995; DuFaux & Moscheni, 1995; Mitiche & Bouthemy,
1996; Haußecker & Spies, 1999; Stiller & Konrad, 1999; Fleet & Weiss, 2005) and
comparative evaluation papers (Burt et al., 1982; Little & Verri, 1989; Willick &
Yang, 1991; Barron et al., 1994b; Otte & Nagel, 1994; Liu et al., 1996; Bainbridge-
Smith & Lane, 1997; Galvin et al., 1998; McCane et al., 2001; Baker et al.,
2007; Liu et al., 2008) dedicated to the topic. A related representation is nor-
mal flow (Horn, 1986), which measures the velocity component parallel to the
50
![Page 65: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/65.jpg)
optical flow
normal flow
over commitment
raw signal
difference image or
spacetime gradient
no commitment
proposed distributedmeasurements of spacetime orientation
level of commitment
spectral decomposition
Figure 3.1: Representation spectrum. Common dynamic imagery abstractionsand their respective level of commitment to an underlying model.
local image gradient. Both abstractions represent an over-commitment to lo-
cal translational motion (or equivalently, motion-related dominant orientation),
which cannot describe the general spacetime structure one may encounter in the
natural world. Examples of inappropriate situations, include, unstructured re-
gions (e.g., blank wall and clear sky), the aperture problem, multiple motions per
point (i.e., semi-transparency), flicker, scintillation, etc. In addition, similar to
the representations cited above, a drawback of normal flow as a representation of
dynamics is that the measure is dependent on appearance.
51
![Page 66: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/66.jpg)
3.1.3 Contributions
In the light of previous representations, the contribution of the present work
is an efficiently computable novel representation that serves as an approximate
midpoint to the two extremes depicted in Fig. 3.1. The representation eschews
the semantically loaded question “What motion is present in the data?” that
underlies all flow-based analyses and instead poses the geometric-based question
“What spacetime orientations are present in the data?”. This restatement al-
lows for the uniform description of the various spacetime structures, both motion
and non-motion, discussed in Chapter 2; whereas, the previous surveyed repre-
sentations lack this flexibility. The representation yields a set of measurements
(a distribution) that indicates the relative presence of a given set of spacetime
orientations in the input signal.
The components of the proposed representation bears resemblance to several
previous representations. Most closely related is the representation in (Simon-
celli, 1993; Simoncelli & Heeger, 1998) that provides a computational model of
primate neurons in Middle Temporal (MT) area, which are considered to be ve-
locity selective. Apart from variations on the representation instantiation, the
key differentiating aspect between the two representations lies in their applica-
tion. The representation in (Simoncelli, 1993; Simoncelli & Heeger, 1998) is used
52
![Page 67: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/67.jpg)
in analyzing patterns with motion as their origins; whereas, the proposed repre-
sentation is targeted at the spacetime orientation structure of general dynamic
imagery, including but not exclusive to motion.
3.2 Technical approach
3.2.1 Distributed spacetime orientation representation
This section discusses the general framework underlying the construction of the
spacetime orientation representation central to the visual tasks considered in this
dissertation. The unspecified particulars of the representation in this section
are task specific parameters to be defined in later chapters in conjunction with
particular visual tasks (e.g., the set of spacetime orientations and filter-order).
The desired spacetime orientation decomposition is realized using a set of
broadly tuned quadrature pair filters based on 3D Gaussian Nth derivative filters,
GNθ(x), and their Hilbert transforms, HNθ
(x), with the unit vector θ capturing
the 3D direction of the filter symmetry axis and x = (x, y, t)> spacetime position;
various other choices of oriented filters are also possible. A pair of filters is
considered to be in quadrature if they share the same frequency tuning but differ
in phase by 90o, i.e., are Hilbert transforms of each other (Jahne, 2005). (The
53
![Page 68: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/68.jpg)
(a) (b) (c)
Figure 3.2: Oriented energy filters for spatiotemporal analysis. (a) The plotillustrates in the spatiotemporal domain an iso-surface profile for the secondderivative of Gaussian filter, G2θ
(x, y, t), oriented along the x-axis (i.e., θ =
(1, 0, 0)>); red surface portion denotes positive and black negative. (b) Thecorresponding Hilbert transform, H2θ
(x, y, t). (c) The frequency response for thecorresponding quadrature pair filters.
filtering considered in this dissertation is restricted to a single spacetime scale
of analysis, i.e., radial scale of spacetime structure. Potential for multi-scale
extension is described further below.) The filter responses to the image data are
pointwise rectified (squared), pointwise summed, and integrated (summed) over a
spacetime neighbourhood, Ω, to yield the following aggregated pointwise energy
measurement, cf. (Knutsson & Granlund, 1983; Adelson & Bergen, 1985; Heeger,
1988; Wildes & Bergen, 2000):
Eθ(x) =∑x∈Ω
[GNθ(x) ∗ I(x)]2 + [HNθ
(x) ∗ I(x)]2, (3.1)
where I(x) denotes the input imagery and ∗ convolution. This filtering step can
be realized efficiently via separable steerable filtering; see Appendix B for details.
54
![Page 69: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/69.jpg)
Notice that while the employed Gaussian derivative filter and its Hilbert trans-
form are phase-sensitive individually, their quadrature pair combination (i.e.,
pointwise summation of squared responses) removes this sensitivity to yield a
measurement of local signal energy at orientation θ. Alternatively, summation
over a region of support about either individual squared component (i.e., G2Nθ
or H2Nθ
) also ameliorates phase sensitivities. More specifically, this follows from
Rayleigh’s-Parseval’s theorem (Jahne, 2005) that specifies the phase-independent
signal energy in the frequency passband of the Gaussian derivative:
Eθ(x) ∝∑
ωx,ωy ,ωt
|FGNθ(x) ∗ I(x)(ωx, ωy, ωt)|2 (3.2)
=∑
ωx,ωy ,ωt
|FHNθ(x) ∗ I(x)(ωx, ωy, ωt)|2, (3.3)
where (ωx, ωy) denote the spatial frequency, ωt the temporal frequency and F the
Fourier transform1. Comparing (3.1) - (3.3) it is seen that the sum of the squared
components of the individual quadrature pairs (theoretically) yield a proportional
result to the individual component. This follows from the fact that the GN and
HN filters differ only in their respective phase characteristics. Alternatively, it
can be shown (Aach et al., 1995) that the pointwise sum of the squared quadra-
ture filter responses is proportional to the low-pass filtered squared response of
1Strictly, Rayleigh’s theorem is stated with infinite frequency domain support on summation.
55
![Page 70: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/70.jpg)
GN or HN . Thus, including both squared filter responses in (3.1) is in theory
not strictly necessary; alternatively, integration over a support region, Ω, is not
necessary when both squared filter responses are combined pointwise. The choice
of including both quadrature components and support of integration represents
an implementation tradeoff between local signal fidelity versus computational ex-
pense and ultimately depends on the application. Inclusion of the Hilbert portion
improves the signal-to-noise ratio (SNR) of the energy estimates while minimiz-
ing the region of aggregation, Ω. Such considerations are particularly important
in the segregation tasks discussed in Chapter 5, where a compact region of aggre-
gation minimizes blurring over boundaries between coherent dynamic structures.
The major drawback in using both quadrature pair components is that additional
costly filtering operations are necessary.
Each oriented energy measurement, (3.1), is confounded with spatial orien-
tation; equivalently, in the case of motion-related structure, each oriented filter
is maximally selective to the normal flow of the pattern. Consequently, in cases
where the spatial structure varies widely about an otherwise coherent dynamic
region (e.g., single motion of a surface with varying spatial texture), the responses
of the ensemble of oriented energies will reflect this behaviour and thereby are
appearance dependent; whereas, a description of pure pattern dynamics is sought.
56
![Page 71: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/71.jpg)
As discussed in Chapter 2, a pattern exhibiting a single spacetime orientation
(e.g., image velocity) manifests itself as a plane through the origin. Correspond-
ingly, summation across a set of x-y-t-oriented energy measurements consistent
with a single frequency domain plane through the origin is indicative of energy
along the associated spacetime orientation, independent of purely spatial ori-
entation. Since Gaussian derivative filters of order N are used in the oriented
filtering, (3.1), it is appropriate to consider N +1 equally spaced directions along
each frequency domain plane of interest, as N + 1 directions are needed to span
orientation in a plane with Gaussian derivative filters of order N (Freeman &
Adelson, 1991). Let each plane be parameterized by its unit normal, n; a set of
equally spaced N + 1 directions within the plane are given as
θi = cos
(2πi
N + 1
)θa(n) + sin
(2πi
N + 1
)θb(n), (3.4)
with θa(n) = n× ex/‖n× ex‖, θb(n) = n× θa(n), ex denotes the unit vector along
the ωx-axis2 and 0 ≤ i ≤ N . In the case where the spacetime orientation is defined
by image velocity (u, v)>, the normal vector is given by n = (u, v, 1)>/‖(u, v, 1)>‖.
Now, energy along a frequency domain plane with normal n and spatial ori-
entation discounted through marginalization (summation), is given by summing
2Depending on the spacetime orientation sought, ex can be replaced with another axis toavoid an undefined vector.
57
![Page 72: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/72.jpg)
ωx
ωy
ωt ωt
ωy
ωx
∑
Figure 3.3: Illustration of the ring operator in the frequency domain. (left) Aset of N + 1 equally spaced Nth derivative of Gaussian filters lying on a planeis shown; the depicted filters correspond to the second derivative of Gaussian(N = 2). (right) The corresponding level surface of the sum of power spectra ofthese filters is shown. The result is a smooth torus that is maximally responsiveto a given spacetime orientation.
across the set of measurements, Eθi , cf. (Simoncelli, 1993; Simoncelli & Heeger,
1998):
En(x) =N∑i=0
Eθi(x), (3.5)
with θi one of N+1 specified directions, (3.4), and each Eθi calculated via the ori-
ented energy filtering, (3.1). In the frequency domain, the marginalization step,
(3.5), can be interpreted as computing the energy of the signal along weighted
level surfaces of a smooth ring-shaped function about a particular plane through
the origin in the frequency domain (see Fig. 3.3); see Appendix C for a proof of
the smoothness of the ring-shaped function. This observation can provide the
basis for a multi-scale extension via systematically altering the inner and outer
58
![Page 73: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/73.jpg)
extents of the ring and thus selectively covering the frequency domain.
Finally, the marginalized energy measurements, (3.5), are confounded by the
local contrast of the signal and as a result increase monotonically with contrast.
This makes it impossible to determine whether a high response for a particular
spacetime orientation is indicative of its presence or is indeed a low match that
yields a high response due to significant contrast in the signal. To arrive at a
purer measure of spacetime orientation, the energy measures are normalized by
the sum of consort planar energy responses at each point, cf. (Landy & Bergen,
1991; Lubin, 1992; Heeger, 1992; Simoncelli, 1993; Simoncelli & Heeger, 1998;
Wildes & Bergen, 2000):
Eni(x) =Eni(x)
M∑j=1
Enj(x) + ε
, (3.6)
where M denotes the number of spacetime orientations considered and ε is a
scalar constant introduced as a noise floor and to avoid instabilities at points
where the overall energy is small. To this set an additional measurement may be
considered that explicitly captures lack of structure via normalized ε,
Eε(x) =ε
M∑j=1
Enj(x) + ε
. (3.7)
(Note that for loci where oriented structure is less apparent, the summation in
59
![Page 74: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/74.jpg)
Algorithm 1: Distributed spacetime orientation representation.Input: I: Input image sequence, Ω: spacetime integration neighbourhood, ni: Set of
spacetime orientations, ε: noise floor scalar, N : order of the Gaussian derivativefilters
Output: Eni∪ Eε: Set of normalized energy responses tuned to spacetime orientation
Step 1: Measure signal energy tuned to both spatial and temporal
orientation
1. Decompose the input signal by the set of linear filters, GN/HN .2. Square the linear filter outputs and sum over a neighbourhood of spacetime,
(Eq. 3.1).Step 2: Remove dependence on spatial orientation
3. Marginalize (integrate) the energy measurements (Eq. 3.5).Step 3: Remove dependence on signal contrast
4. Divisively normalize the marginalized outputs (Eq. 3.6).5. (Optional) Divisively normalize the unstructured response (Eq. 3.7).
(3.7) will tend to 0; hence, Eε approaches 1 and thereby indicates relative lack of
structure.)
Conceptually, (3.1) - (3.7) takes an image sequence and carves its (local)
power spectrum into a set of planes, with each plane corresponding to a particular
spacetime orientation, to provide a relative indication of the presence of structure
along each plane or lack thereof in the case of a uniform intensity region as
captured by the normalized ε, (3.7).
To recapitulate, the general construction of the proposed representation is
given in algorithmic terms in Algorithm 1.
60
![Page 75: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/75.jpg)
3.2.2 Representation properties
The constructed representation enjoys a number of general attributes that are
worth emphasizing:
1. Owing to the bandpass nature of the Gaussian derivative filters, (3.1), the
representation is invariant to additive photometric bias in the input signal.
2. Owing to the divisive normalization, (3.6), the representation is invariant
to multiplicative photometric bias.
3. Owing to the marginalization of spatial orientation, (3.5), the representa-
tion is invariant to changes in appearance manifest as spatial orientation
variation.
Overall, these three invariances result in a robust pattern description that is
invariant to changes that do not correspond to dynamic variation (i.e., spatial
appearance), even while making explicit local orientation structure that arises
with temporal variation (single motion, multiple motion, temporal flicker, etc.).
In addition, the representation is efficiently realized via linear (separable convolu-
tion, pointwise addition) and pointwise non-linear (squaring, division) operations.
61
![Page 76: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/76.jpg)
More specifically, the convolutions are realized via a set of separable steerable
filters, which are discussed in Appendix B. Moreover, GPU implementation of
Algorithm 1 can execute in real-time (Zaharescu & Wildes, 2010).
3.3 Discussion and summary
Motivated by the importance of spacetime oriented structure of dynamic patterns
discussed in Chapter 2, the current chapter builds upon previous work on space-
time oriented filtering but differs markedly in that filtering is executed to capture
the (possibly multi-) oriented structure of raw dynamic visual data rather than
the recovery of an assumed flow field (i.e., motion-related dominant spacetime
orientation). The final decomposition, (3.6) and (3.7), constitutes a distributed
representation of visual spacetime along the dimension of spacetime orientation.
The relative contribution of each orientation to the local signal is thereby made
explicit.
The proposed representation satisfies all the criteria of a “good” representa-
tion established in Section 1.2. Specifically, the significant structure (i.e., space-
time orientation as established in Chapter 2) is made explicit while suppressing
unnecessary detail (e.g., spatial appearance), the information is represented con-
cisely as a distribution/histogram, it facilitates further computation, a procedure
62
![Page 77: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/77.jpg)
for realizing the representation has been established and its applicability to useful
(visual) tasks is confirmed in the following chapters.
In contrast to the main theme of this chapter regarding the representation
of spacetime structure, the specific details of our implementation are less impor-
tant. In terms of oriented filtering, a variety of filters could be substituted, such as
higher-order directional cosine, Gabor, lognormal and causal-time filters (Jahne,
2005). Also, one could design a ring-shaped filter directly rather than rely on
integrating spacetime oriented components as outlined in this chapter. The com-
ponentwise construction presented here is advantageous in integrated systems,
where component measurements can be reused. For example, the same compo-
nent (differential) measurements can be used to recover optical flow (Adelson &
Bergen, 1986).
In summary, this chapter has presented a unified approach to representing a
wide range of spacetime patterns. The approach is based on a distributed charac-
terization of visual spacetime in terms of 3D, (x, y, t), spatiotemporal orientation.
The utility of the representation is demonstrated in the context of several visual
tasks that follow in Chapters 4 to 6.
63
![Page 78: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/78.jpg)
Chapter 4
Visual Spacetime Texture
Recognition
Everything should be made as simple as possible, but notsimpler.
ALBERT EINSTEIN
4.1 Introduction
4.1.1 Motivation
Many commonly encountered visual patterns are best characterized in terms of
the aggregate dynamics of a set of constituent elements, rather than in terms of
the dynamics of the individuals. Several examples of such patterns are shown in
64
![Page 79: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/79.jpg)
Fig. 4.1. In the computer vision literature, these patterns have appeared collec-
tively under various names, including, turbulent flow/motion (Heeger & Pentland,
1986), temporal textures (Nelson & Polana, 1992), time-varying textures (Bar-
Joseph et al., 2001), dynamic textures (Saisan et al., 2001), and textured motion
(Wang & Zhu, 2003). Most typically, the term “dynamic texture” has been used
with reference to images of natural processes that exhibit stochastic dynamics
(e.g., fire, turbulent water and windblown vegetation). In the present work, the
broader class that includes stochastic as well as simpler phenomena (e.g., orderly
pedestrian crowds, vehicular traffic, and even scenes containing purely translating
surfaces) when viewed on a regional basis will be considered in a unified fashion.
The term “spacetime texture” will be used herein in reference to this broad set
to avoid confusion with previous terms that focused on various subsets and thus
grouped dynamic patterns (e.g., motion and dynamic texture) in an artificially
disjoint manner.
The ability to discern dynamic patterns based on visual processing is of sig-
nificance to a number of applications. In the context of surveillance, the ability
to recognize dynamic patterns can serve to isolate activities of interest (e.g., bio-
logical movement and fire) from distracting background clutter (e.g., windblown
vegetation and changes in scene illumination). Further, dynamic patterns can
65
![Page 80: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/80.jpg)
Figure 4.1: Example image frames of spacetime textures in the real world. (left-to-right, top-to-bottom) Forest fire, crowd of people running, vehicular traffic,waterfall, dynamic water and flock of birds in flight.
serve as complementary cues to spatial appearance-based ones to support the
indexing and retrieval of video. Also, in the context of guiding the decisions of
intelligent agents, the ability to discern certain critical dynamic patterns may
serve to trigger corresponding reactive behaviours (e.g., flight and pursuit).
The goal of the present chapter is the development of a unified approach to
representing and recognizing a diverse set of dynamic patterns with robustness
to viewpoint and with ability to encompass recognition in terms of semantic
categories (e.g., recognition of fluttering vegetation without being tied to a spe-
cific view of a specific bush). Toward that end, an approach is developed that
is based solely on observed dynamics (i.e., excluding purely spatial appearance
66
![Page 81: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/81.jpg)
cues); with the dynamics factored-out and recognized separately, state-of-the-art
appearance methods can be combined for recognition. This approach follows the
classic divide-and-conquer principle whereby a complicated task (i.e., recognizing
dynamic imagery) is decomposed into simpler sub-tasks (i.e., separately recogniz-
ing spatial texture appearance and image dynamics), each of which is addressed
separately.
As discussed in Chapter 2, local spatiotemporal orientation is of fundamen-
tal descriptive power, as it captures the first-order correlation structure of the
data irrespective of its origin (i.e., irrespective of the underlying visual phenom-
ena), even while distinguishing a wide range of dynamic patterns of interest (e.g.,
flicker, single motion, multiple motions and scintillation). Correspondingly, each
dynamic pattern will be associated with a distribution (histogram) of measure-
ments that indicates the relative presence of a particular set of 3D orientations
in visual spacetime, (x, y, t), as captured by the approach to representation de-
veloped earlier in Chapter 3, and recognition will be performed by matching such
distributions. It is assumed that the spacetime region of support of the patterns
under analysis is provided; this can be achieved for example by the grouping
approach detailed in Chapter 5. Interestingly, the distribution of oriented space-
time structure has been shown to be an important discriminating factor in human
67
![Page 82: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/82.jpg)
perception studies of stochastic dynamics (Williams & Sekuler, 1984; Williams
et al., 1991). Portions of the work described in this chapter appear in (Derpanis
& Wildes, 2010a).
4.1.2 Related work
Over the past three decades various representations have been proposed for char-
acterizing dynamic textures for the purpose of recognition (Chetverikov & Peteri,
2005). In this section, several representative strands of research are reviewed.
One strand of research explores physics-based approaches, e.g., (Kung &
Richards, 1988). These methods derive models for specific dynamic textures
(e.g., water) based on a first-principles analysis of the generating process. With
the model recovered from input imagery, the underlying model parameters can
be used to drive inference. Beyond computational issues, the main disadvantage
of this class of approaches is that the derived models are highly focused on spe-
cific dynamic textures, and thus lack generalization to other classes of dynamic
textures.
Motivated by successes in spatial texture-related research, an early seminal
approach to uniform analysis of a diverse set of dynamic textures was based on
extracting first- and second-order statistics of motion flow field-based features,
68
![Page 83: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/83.jpg)
assumed to be captured by estimated normal flow (Nelson & Polana, 1992).
This work was followed-up by numerous proposed variations of normal flow, e.g.,
(Bouthemy & Fablet, 1998), and optical flow-based features, e.g., (Lu et al.,
2005). There are two main drawbacks related to this strand of research. First,
normal flow is highly correlated with dynamic texture spatial appearance (Polana
& Nelson, 1997). Thus, in contrast to the goal of the present work, recognition
is highly tuned to a particular spatial appearance. Second, optical flow and its
normal flow component are predicated on assumptions like brightness constancy
and local smoothness, which are generally difficult to justify in the context of
dynamic textures.
A recent trend in dynamic texture research is the use of statistical genera-
tive models to jointly capture the spatial appearance and dynamics of a pattern.
Recognition is realized by comparing the similarity between the estimated model
parameters. Several variants of this approach have appeared, including: autore-
gressive (AR) models (Szummer & Picard, 1996; Doretto et al., 2003a; Fitzgib-
bon, 2001; Wang & Zhu, 2003) and multi-resolution schemes (Heeger & Pentland,
1986; Bar-Joseph et al., 2001). By far the most popular of these approaches for
recognition is the joint photometric-dynamic, AR-based Linear Dynamic Sys-
tem (LDS) model, proposed in (Doretto et al., 2003a), which has formed the
69
![Page 84: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/84.jpg)
basis for several recognition schemes (Saisan et al., 2001; Chan & Vasconcelos,
2005b; Woolfe & Fitzgibbon, 2006; Vishwanathan et al., 2007); see Appendix
D.1 for a general description of this model. Although impressive recognition
rates have been reported (∼90%), most previous efforts have limited experimen-
tation to cases where the dynamic texture samples are taken from the exact
same viewpoint. As a result, much of the performance is highly tied to the
spatial appearance captured by these models rather than the underlying dynam-
ics (Chan & Vasconcelos, 2005b; Woolfe & Fitzgibbon, 2006). In a few vari-
ants, the cited approaches have considered only the dynamics portion of the LDS
model for recognition (Chan & Vasconcelos, 2005b; Woolfe & Fitzgibbon, 2006).
Significantly, a comparative study of many of the proposed approaches showed
that when applied to image sequences with non-overlapping views of the same
scene (“shift-invariant” recognition), all yield significantly lower recognition rates
(∼20%), whether using joint spatial-dynamic or only the dynamic portion of the
LDS model (Woolfe & Fitzgibbon, 2006).
In the current work, spatiotemporal oriented energy filters (as described in
Chapter 3) serve in defining the representation employed. Significantly, it appears
that no previous work has used the filter outputs to support dynamic texture
recognition, as shown here.
70
![Page 85: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/85.jpg)
4.1.3 Contributions
In the light of previous research, the contributions of the present work are as
follows. (i) A broader set of dynamic patterns are considered that subsume those
generally considered disjointly under various banners, including motion and dy-
namic texture. This broader set is termed “spacetime texture”. The key unifying
element of these patterns is that they can be distinguished by their underlying
spacetime oriented structure. (ii) The representation developed in Chapter 3
provides a unified foundation for representing and recognizing spacetime tex-
tures based solely on their underlying dynamics. While spacetime filters have
been used before for analyzing image sequences, they have not been applied to
the recognition of spacetime textures in the manner proposed. (iii) Empirical
evaluation on a standard dynamic texture data set and a traffic congestion data
set demonstrates that the proposed approach achieves superior performance over
state-of-the-art methods. (iv) In order to evaluate the proposed approach on the
wider set of patterns encompassed by spacetime textures, a new data set has been
assembled containing 610 challenging natural image sequences. The experiments
on this data set demonstrates the efficacy of the proposed approach to modeling
and recognition.
71
![Page 86: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/86.jpg)
4.2 Technical approach
There are three key parts to the proposed approach to texture recognition:
First, the definition of “dynamic texture” is broadened to encompass both
deterministic- and stochastic-based dynamic patterns; to avoid confusion with
previous terms in the literature, this broader definition is termed “spacetime tex-
ture”. Second, the representation developed in Chapter 3 is used as a basis to
discriminate between spacetime textures. Third, a match measure between any
two samples under consideration is proposed. This section begins by motivating
the significance of visual spacetime orientation in the context of spacetime tex-
ture analysis. Subsequently, the particulars of the proposed representation and
match measure are detailed.
4.2.1 What is dynamic texture?
Dynamic texture is an integral part of our daily visual experience and yet a pre-
cise, generally accepted definition has alluded us. Loosely speaking, dynamic tex-
tures have generally been characterized as consisting of homogeneous, spatiotem-
porally repeating patterns that exhibit a degree of temporal coherence (Doretto
et al., 2003a). The impediment to providing a precise definition is that its defini-
tion is a perceptual one, hence the common description “you know it, when you
72
![Page 87: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/87.jpg)
deterministic stochasticnear-deterministic near-stochastic
Figure 4.2: Spatial texture spectrum; adapted from (Lin et al., 2006). The imagesamples depict a range of spatial textures varying from completely deterministicto completely stochastic, with samples in between comprising a mixture of thesetwo categorical extremes.
see it”. Several formal definitions of dynamic textures have been proposed that
are centered on stochastic generative models, e.g., (Fitzgibbon, 2001; Doretto
et al., 2003a). Although precise, these definitions limit consideration to patterns
that are stochastic in nature (and are largely guided by mathematical convenience
considerations). This raises the obvious question of whether deterministic pat-
terns should also be considered. Before addressing this question, it is instructive
to first consider the classification of the related concept of spatial texture.
Spatial textures are often classified into two categories, deterministic and
stochastic textures (Haralick, 1979). Deterministic textures are characterized by
a set of identifiable texture elements (texels) that are spatially repeating based on
73
![Page 88: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/88.jpg)
multi-dominantorientation
unconstrainedorientation(limit case)
x
dominantorientation
y
t
spat
iote
mpo
ral
dom
ain
heterogeneousorientation
(typically referredto as
dynamic texture)
breakdownin characteristic orientation
superposition of spacetime orientation
underconstrainedorientation
degenerate
isotropicstructure
(limit case)
Figure 4.3: Spacetime texture spectrum. The images depict prototypical patternsof spacetime textures in the spacetime domain. The horizontal axis indicates theamount of spacetime oriented structure superimposed in a pattern, with increas-ing amounts given along the rightward direction.
a deterministic placement rule (e.g., wallpaper). In contrast, stochastic textures
consist of unidentifiable elements whose placement is modeled by a stochastic
process (e.g., sand). In the natural world, textures tend to consist of mixtures
of these two categories (e.g., wood grain). Figure 4.2 provides an organization
of spatial textures in terms of a continuous spectrum, where deterministic and
stochastic patterns represent the two categorical extremes.
Similarly, in the current work it is proposed that one can view dynamic pat-
terns in an analogous fashion. At one extreme lie deterministic dynamic patterns.
A prime example of such patterns is image motion, where strands of spacetime
74
![Page 89: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/89.jpg)
oriented lines (the texels) are placed based on pattern velocity (the deterministic
rule). At the other extreme lie pure stochastic patterns, such as specularities
off water. Notice that contrary to the qualitative definition of dynamic textures
given above, these stochastic patterns do not necessarily exhibit any temporal
coherence. In between these two categorical extremes lie mixtures of these two
pattern classes (e.g., dynamic textures). To avoid confusion with “dynamic tex-
ture” patterns, the patterns considered herein are collectively termed “spacetime
texture”, where traditional motion and dynamic textures are subsumed. Similar
to the organization of spatial texture depicted in Fig. 4.2, Fig. 4.3 illustrates the
spectrum of spacetime texture patterns spanning purely deterministic and purely
stochastic patterns. These patterns also coincide with the dynamic patterns of
focus in Chapter 2, where it was shown that the key discriminating attribute is
their constituent spacetime oriented structure.
Analogous to spatial texture discrimination work, e.g., (Knutsson &
Granlund, 1983; Bergen & Adelson, 1988; Fogel & Sagi, 1989; Malik & Per-
ona, 1990; Landy & Bergen, 1991), the texture discrimination model proposed
here assumes that two spacetime textures that produce similar spacetime orien-
tation distributions are elements of an equivalence class (i.e., same visual cate-
gory). This approach hinges on the principle that all of the spacetime information
75
![Page 90: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/90.jpg)
necessary for discriminating spacetime texture patterns can be captured by the
first-order correlation structure (spacetime orientation) of the data. Although a
simplification, this model captures an interesting set of spacetime textures.
As discussed in Chapter 2, the local spacetime orientation of a visual pat-
tern captures significant, meaningful aspects of its dynamic structure; therefore,
a spatiotemporal oriented decomposition of an input pattern is an appropriate
basis for local representation. By extension, aggregated measures of orientation
over a region of visual spacetime may be of use in characterizing the region’s
dynamic texture for recognition. Interestingly, distributions of spatially oriented
measurements have played a prominent role in the analysis of static visual texture
(see, e.g., (Bergen, 1991; Tuceryan & Jain, 1998) for reviews) and form the ba-
sis of state-of-the-art recognition methods (Leung & Malik, 2001; Cula & Dana,
2004; Varma & Zisserman, 2005); however, it appears that the present chapter
documents the first application of such an approach to dynamic visual texture.
4.2.2 Representation: Distributed spacetime orientation
This section presents details of the particular instantiation of the representation
used for spacetime texture recognition. For the motivations of the various steps,
see the discussion of the general framework in Section 3.2.1 of Chapter 3.
76
![Page 91: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/91.jpg)
The desired spacetime orientation decomposition is realized using a set of
broadly tuned 3D Gaussian third derivative filters (i.e., N = 3), G3θ(x), with
the unit vector θ capturing the 3D direction of the filter symmetry axis and
x = (x, y, t) spacetime position. The local oriented energy measurements are
given as follows (cf. Eq. (3.1))
Eθ =∑x∈Ω
[G3θ(x) ∗ I(x)]2, (4.1)
where I(x) denotes the input video and ∗ convolution. The spacetime aggregation
region, Ω, is defined over the spatiotemporal extent of the entire input pattern.
Here, only a single quadrature pair component is used. Given that the aggregation
support spatiotemporally covers the entire texture, reliable estimates are expected
from a single quadrature component.
Now, energy along a frequency domain plane with normal n and spatial ori-
entation discounted through marginalization, is given by summing across the set
of measurements, Eθi , as (cf. Eq. (3.5))
En =N∑i=0
Eθi , (4.2)
with θi one of N+1 = 4 specified directions, (3.4), and each Eθi calculated via the
oriented energy filtering, (4.1). In the present implementation, 27 different space-
time orientations, as specified by n, are made explicit, corresponding to static (no
77
![Page 92: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/92.jpg)
motion/orientation orthogonal to the image plane), slow (half pixel/frame move-
ment), medium (one pixel/frame movement) and fast (two pixel/frame move-
ment) motion in the directions leftward, rightward, upward, downward and diag-
onal, and flicker/infinite vertical and horizontal motion (orientation orthogonal
to the temporal axis); although, due to the relatively broad tuning of the fil-
ters employed, responses arise to a range of orientations about the peak tunings.
Here, a wide range of spacetime orientations were considered in order to support
granular distinctions between the various spacetime textures.
Finally, the energy measures are normalized by the sum of consort planar
energy responses at each point (cf. Eq. (3.6)),
Eni =Eni
M∑j=1
Enj + ε
, (4.3)
where M = 27 denotes the number of spacetime orientations considered and ε is
the scalar constant noise floor. Empirically, ε has been set to ∼1% of the maxi-
mum expected response. As applied to the 27 oriented, appearance marginalized
energy measurements, (4.2), Eq. (4.3) produces a corresponding set of 27 nor-
malized, marginalized oriented energy measurements. To this set an additional
measurement is included that explicitly captures lack of structure via normalized
78
![Page 93: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/93.jpg)
ε (cf. Eq. (3.7)),
Eε =ε
M∑j=1
Enj + ε
, (4.4)
to yield a 28 dimensional feature vector per spacetime texture.
4.2.3 Recognition: Spacetime orientation distribution similarity
An ensemble of normalized energy measurements, Eni ∪ Eε, is taken as a dis-
tribution with spatiotemporal orientation, ni, as variable. (In practice, these
measurements are maintained as histograms.) Given the spacetime oriented en-
ergy distributions of an input query and database with entries represented in like
fashion, the final step of the approach is recognition. In general, to compare two
distributions, denoted x and y, there are several standard similarity measures
in the literature that can be used (Rubner et al., 2001). In evaluation, the fol-
lowing measures were considered. (Individual entries in the employed histogram
representation of the distributions are specified via subscripting, e.g., xi, and
summations are taken across all bins.)
Minkowski-Form distance (Duda et al., 2001):
dLp(x,y) =
(∑i
|xi − yi|p)1/p
, where p = 1, 2 (4.5)
Bhattacharyya coefficient (similarity on the unit hyper-sphere) (Bhattacharyya,
79
![Page 94: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/94.jpg)
1943):
sB(x,y) =∑i
√xiyi (4.6)
Earth Mover’s Distance (EMD) (Rubner et al., 2000):
dEMD(x,y) =∑i
∑j
ci,jfi,j (4.7)
where fi,j, is the set of flows that minimizes the overall distance, (4.7), subject
to the following set of constraints,
∑i
fi,j = yj,∑j
fi,j = xi, and fi,j ≥ 0. (4.8)
To complete the definition of the EMD, the ground distance, ci,j, between his-
togram bins must be defined. In the empirical evaluation, L1 (Manhattan) and
L2 (Euclidean) distances, (4.5), were applied to the spacetime orientation space
with Eε considered a unit distance from each spacetime orientation bin. The flow
values, fi,j, are determined by solving a linear programming problem.
Finally, for any given distance measure, a method must be defined to deter-
mine the classification of a given texture sample relative to the database entries.
In order to make the results between the proposed approach and the various
recognition results reported elsewhere (Woolfe & Fitzgibbon, 2006) comparable,
the same Nearest-Neighbour (NN) classifier (Duda et al., 2001) was used in the
experiments to be presented. The NN classifier seeks the closest model to the
80
![Page 95: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/95.jpg)
Algorithm 2: Spacetime texture recognition.Input: Q: Query spacetime texture, D: Database containing labeled spacetime texturesOutput: c: Classification label
Step 1: Compute spacetime oriented energy representation (Sec. 4.2.2)
1. Initialize 3D G3 steerable basis.2. Compute normalized spacetime oriented energies for Q and D, Eq. (4.1) - (4.4).
Step 2: Recognition (Sec. 4.2.3)
3. Compute nearest-neighbour of Q in D using a similarity measure, Eq. (4.5) - (4.7).4. Assign label of nearest-neighbour in D to c.
input in a database based on the chosen similarity measure and declares the
input as belonging to the associated class of the closest model. Although not
state-of-the-art, the NN classifier has been shown to yield competitive results
relative to the state-of-the-art Support Vector Machine (SVM) classifier (Vapnik,
1995) for dynamic texture classification (Chan & Vasconcelos, 2007) and thus
provides a useful lower-bound on performance. Another motivation for the clas-
sifier choice is the desire to evaluate the utility of the proposed representational
substrate without confounding performance with classifier sophistication. With
the performance of the representation understood one can then turn attention to
optimizing recognition performance via choice of classifier (e.g., SVM).
To recapitulate, the proposed approach to spacetime texture recognition is
given in algorithmic terms in Algorithm 2 and illustrated in Fig. 4.4.
81
![Page 96: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/96.jpg)
Distributed Spacetime Orientation Decomposition
spacetime orientation
response
. . .
. . . . . .
. . . . . .
. . .
. . .
=input model
models of “wavy fluid”
. . .
models of “windblown vegetation”
histogram
input video
database
nearest match
Figure 4.4: Overview of spacetime texture classification approach. A novel in-put video is classified by forming its histogram of spacetime oriented energies,indicative of the relative presence of a given set of spacetime orientations in thepattern (while being independent of spatial appearance), and followed by nearestneighbour classification. The class of the input video is defined as the associatedclass of the nearest model in the database (in the sense of the chosen similaritymeasure).
82
![Page 97: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/97.jpg)
4.3 Empirical evaluation
4.3.1 Data sets
The performance of the proposed approach to spacetime texture recognition is
evaluated on three data sets capturing various subsets of spacetime textures.
The first is a standard data set that captures the subset of spacetime textures
commonly referred to as dynamic texture (i.e., heterogeneous spacetime oriented
patterns). The second is a new data set that captures a diverse set of space-
time textures containing both motion and non-motion related dynamic patterns,
including dynamic textures. The final data set contains a set of automobile traf-
fic videos depicting various traffic congestion scenarios (i.e., dominant oriented
patterns), which is used to illustrate a particular practical application.
4.3.1.1 UCLA dynamic texture data set
For the purpose of evaluating the proposed approach on the subset of spacetime
texture commonly referred to as dynamic textures, recognition performance was
tested on the standard UCLA dynamic texture data set (Saisan et al., 2001). The
data set is comprised of 50 dynamic texture scenes, including, boiling water, fire,
fountains, rippling water and windblown vegetation. Each scene is given in terms
83
![Page 98: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/98.jpg)
Figure 4.5: Sample frames from the UCLA dynamic texture data set. (left-to-right, top-to-bottom) Candle, fire, rising smoke, boiling water, fountain, spurtingspray style fountain, waterfall and the remaining samples depict a variety ofwindblown vegetation examples.
of four greyscale image sequences; for each scene, all four example sequences are
captured with the same viewing parameters (e.g., identical viewpoint). In total
there are 200 sequences. Each sequence consists of 75 frames of size 110 × 160.
Figure 4.5 shows sample frames from the data set.
84
![Page 99: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/99.jpg)
Basic-level Multi-level
unconstrained (16) N/A
underconstrained (77)flicker (45)
aperture problem (32)
dominant (293)single oriented (229)
non-single oriented (64)
multi-dominant (85) N/A
heterogeneous and isotropic (139)wavy fluid (35)stochastic (104)
Table 4.1: YUVL spacetime texture data set summary. The number of samplesper category are given in parentheses. N/A denotes not applicable.
4.3.1.2 York University Vision Lab (YUVL) spacetime texture data
set
For the purpose of evaluating the proposed approach on spacetime textures, a
new data set3 was collected for each of the categories of spacetime texture de-
scribed in Sec. 4.2.1 and illustrated in Fig. 4.3 (or equivalently in Fig. 2.1).
The data set contains a total of 610 spacetime texture samples. The videos
were obtained from various sources, including a Canon HF10 camcorder and
the “BBC Motion Gallery” (www.bbcmotiongallery.com) and “Getty Images”
(www.gettyimages.com) online video repositories; the videos vary widely in their
resolution and temporal extents. Owing to the diversity within and across the
video sources, the videos generally contain significant differences in scene appear-
3Available at: www.cse.yorku.ca/vision/spacetime-texture
85
![Page 100: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/100.jpg)
ance, scale, illumination conditions and camera viewpoint. Also, in the context of
the more stochastic phenomena, variations in the underlying physical processes
ensure large intra-class variations.
The data set is partitioned in two ways: (i) basic-level and (ii) subordinate-
level, by analogy to terminology for capturing the hierarchical nature of human
categorical perception (Rosch & Mervis, 1975). The basic-level partition, sum-
marized in Table 4.1 (left column), is based on the number of spacetime orien-
tations present in a given pattern; for a detailed description of these categories,
see Chapter 2. For the multi-dominant category, samples were limited to two
superimposed structures (e.g., rain over a stationary background).
To demonstrate the proposed approach’s ability to make finer categorical
distinctions, several of the basic-level categories were further partitioned into
subordinate-levels. This partition, summarized in Table 4.1 (right column), is
based on the particular spacetime orientations present in a given pattern. Beyond
the basic-level categorization of an unconstrained oriented pattern (i.e., unstruc-
tured), no further subdivision is possible. Underconstrained cases arise naturally
as the aperture problem and pure temporal variation (i.e., flicker). Dominant
oriented patterns can be distinguished by whether there is a single orientation
that describes a pattern (e.g., motion) or a narrow range of spacetime orienta-
86
![Page 101: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/101.jpg)
tions distributed about a given orientation (e.g., “nowhere-static” and “optical
snow”). Note that further distinctions can be made based, for example, on the
velocity of motion (e.g., stationary vs. rightward motion vs. leftward motion). In
the case of multi-dominant oriented patterns, the choice of restricting patterns
to two components to populate the database precludes further meaningful pars-
ing. The heterogeneous and isotropic basic category was partitioned into wavy
fluid and those more generally stochastic in nature. Note that further parsing
within the heterogeneous and isotropic basic-level category is possible akin to
the semantic categorization experiment based on the UCLA dynamic textures
data set discussed later (see Section 4.3.2.3). Since this more granular partition
is considered in detail elsewhere (i.e., the UCLA data set), in the YUVL data
set only a two-way subdivision of the heterogeneous and isotropic basic-level is
considered.
Figure 4.6 illustrates the overall organization of the YUVL spacetime texture
data set in terms of the basic- and subordinate-level categories. Further, Figs.
4.7 and 4.8 show sample image frames corresponding to the examples given in
Fig. 4.6.
87
![Page 102: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/102.jpg)
unconstrained
underconstrained dominant
multi-dominant
heterogeneous and
isotropic
spacetime
x
t
y
Basic-Level
Subordinate-Level
flicker
aperture problem
single oriented
non-single orientedwavy fluid
stochastic
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
clear skysnow over stationary backdrop
rising smoke
.
.
.
lightning
emergency light
stationarywooden fence
stationarystriped pillow
marathonrunning
rising bubbles
suburbanflyover
waterfall
televisionnoise
wavy ocean
moonlit wavy ocean
cheering crowd of people
Figure 4.6: Organization of the YUVL spacetime texture data set. A sampleframe is provided for each basic- and subordinate-level category; see Figs. 4.7and 4.8 for the corresponding image sequences of the various spacetime textureexamples.
88
![Page 103: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/103.jpg)
flicker
apertureproblem
unconstrained
underconstrained
Basic-Level Subordinate-Level
single oriented
dominant
non-singleoriented
Figure 4.7: Example image sequences from the YUVL spacetime texture dataset. Each row contains an image sequence labeled with its basic-level (left) andsubordinate-level (right) category. (top-to-bottom) clear sky, lightning, emer-gency light, stationary wooden fence, stationary striped pillow, suburban flyover,waterfall, marathon running and rising bubbles.
89
![Page 104: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/104.jpg)
multi-dominant
Basic-Level Subordinate-Level
wavy fluid
heterogeneousand
isotropic
stochastic
Figure 4.8: Example image sequences from the YUVL spacetime texture dataset. Each row contains an image sequence labeled with its basic-level (left) andsubordinate-level (right) category. (top-to-bottom) snow over stationary back-drop, rising smoke over stationary backdrop, wavy ocean, moonlit wavy ocean,cheering crowd of people and television noise.
90
![Page 105: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/105.jpg)
Figure 4.9: Example frames from the UCSD traffic video data set. The sampleframes depict various traffic congestion conditions, coarsely categorized as light(top row), medium (middle row) and heavy traffic (bottom row).
4.3.1.3 UCSD traffic data set
Most extant approaches to classifying vehicular traffic sequences use a combina-
tion of segmentation and tracking, e.g., (Cucchiara et al., 2000; Jung et al., 2001).
Problems associated with these approaches include, (i) segmentation issues due
to varying environmental conditions (e.g., lighting, such as shadows, overcast,
glare and night), occlusions and low-resolution imagery, and (ii) tracking issues
related to correspondence problems and occlusions. Alternatively, several meth-
ods have used statistics of computed optic flow (Yu et al., 2002; Li & Porikli,
91
![Page 106: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/106.jpg)
traffic condition description number of videos
light traffic around the speed limit 165medium reduced speed 45heavy slow or stop and go speeds 44
Table 4.2: UCSD traffic data set summary.
2004). Extracting reliable measurements of flow is difficult in traffic scenarios
due to environmental conditions. Recently, (Chan & Vasconcelos, 2005a; Chan
& Vasconcelos, 2005b) proposed that traffic flow could be treated directly as dy-
namic textures, thus forgoing the problems associated with the cited approaches.
In the terminology established in this dissertation, traffic induced patterns are
described as dominant oriented.
The UCSD traffic video data set4 (Chan & Vasconcelos, 2005a; Chan & Vas-
concelos, 2005b) consists of video sequences of daytime highway traffic, totalling
20 minutes of video footage. The videos contain a variety of traffic congestion
patterns and weather conditions (e.g., raining, overcast and sunny). Each video
has a resolution of 320 × 240 pixels with 42 to 52 frames captured at 10 frames
per second. The data set provides representative video patches for training and
testing, which were created by reducing the video resolution by a factor of four,
and manually selecting a 48× 48 window over the area with the “most activity”.
4Available at: www.svcl.ucsd.edu/projects/traffic
92
![Page 107: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/107.jpg)
1 2 3 4 50
50
100re
cogn
ition
rate
within top
EMD L1EMD L2L1L2BhattacharyyaChanceMartin (Saisan et al.)
Figure 4.10: Viewpoint specific recognition results based on the UCLA data set.EMD L1,2, L1,2 and Bhattacharyya correspond to the results of the proposedapproach with respective distance measures. Chance refers to performance basedon random guessing. The previous state-of-the-art result is denoted by Martinas reported in (Saisan et al., 2001); this result is based on a NN classifier, SVMclassifier-based results, as reported in (Chan & Vasconcelos, 2007), are slightlyhigher. The reported result in (Saisan et al., 2001) does not provide matchingbeyond top 1.
Also, hand-labeled ground truth is provided that describes the amount of traffic
congestion in each sequence. In total there are 254 video sequences, grouped into
three classes of traffic congestion, light, medium and heavy; see Table 4.2 for
summary. Example frames from the data set are shown in Fig. 4.9.
4.3.2 Dynamic texture classification with the UCLA data set
4.3.2.1 Viewpoint specific classification
The first experiment largely followed the standard protocol set forth in con-
junction with the original investigation of the UCLA data set (Saisan et al.,
2001). The only difference is that unlike (Saisan et al., 2001), where careful
93
![Page 108: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/108.jpg)
manual (spatial) cropping was necessary to reduce computational load in pro-
cessing, such issues are not a concern in the proposed approach and thus crop-
ping was avoided altogether. (Note that the actual windows used in the original
experiments (Saisan et al., 2001) were not reported other than to say that they
were selected to, “include key statistical and dynamic features”.) As in (Saisan
et al., 2001), a leave-one-out classification procedure (Theodoridis & Koutroum-
bas, 2006) was used, where a correct classification for a given dynamic texture
sequence was defined as having one of the three other dynamic texture sequences
of its scene as its nearest-neighbour. Thus, the recognition that is tested is view-
point specific in that the correct answer arises as a match between two acquired
sequences of the same scene from the same view.
Results are presented in Fig. 4.10. The highest recognition rate achieved
using the proposed spatiotemporal oriented energy approach to representing dy-
namic texture was 81% with the L2 and Bhattacharyya measures. Considering
the closest five matches, classification improved to 92.5%. Although, below the
state-of-the-art NN benchmark of 89.5% using cropped input imagery (Saisan
et al., 2001) (and higher rate reported using a SVM classifier, 97.5% (Chan
& Vasconcelos, 2007), again with cropped input), the current results are com-
petitive given that the benchmark setting AR-LDS approaches are based on a
94
![Page 109: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/109.jpg)
joint photometric-dynamic model, with the photometric portion playing a piv-
otal role (Chan & Vasconcelos, 2005b; Woolfe & Fitzgibbon, 2006)5; whereas, the
proposed approach focuses strictly on pattern dynamics due to the spatial ap-
pearance marginalization step in the construction of the representation, (4.2). In
subsequent experiments, it will be shown that there are distinct advantages to es-
chewing the purely spatial appearance attributes as one moves beyond viewpoint
specific recognition.
4.3.2.2 Shift-invariant classification
To remove the effect of identical viewpoint, and thus the appearance bias in the
data set, it was proposed in (Woolfe & Fitzgibbon, 2006) that each sequence in
the data set be cropped into non-overlapping pairs, with subsequent comparisons
only performed between different crop locations. Recognition rates under this
evaluation protocol showed dramatic reduction in the state-of-the-art LDS-based
approaches from approximately 90% to 15% (Woolfe & Fitzgibbon, 2006); chance
performance was ∼1%. Further, introduction of several novel distance measures
5Given that the image sequences of each scene in the UCLA database were captured fromthe exact same viewpoint and that the scenes are visually distinctive based on image stills alone,it has been conjectured that much of the early reported recognition performance was drivenmainly by spatial appearance (Chan & Vasconcelos, 2005b). Subsequently, this conjecture wassupported by showing that using the mean frame of each sequence in combination with a NNclassifier yielded a 60% classification rate (Woolfe & Fitzgibbon, 2006).
95
![Page 110: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/110.jpg)
1 2 3 4 50
20
40
60
80
100re
cogn
ition
rate
within top
EMD L
1EMD L
2L
1L
2Bhattacharyya
Chance
Martin (Saisan et al.)
Cepstral univariate (Woolfe and Fitzgibbon)
Chernoff (Woolfe and Fitzgibbon)
KL divergence (Chan and Vasconcelos)
Figure 4.11: Shift-invariant recognition results based on the UCLA data set.EMD L1,2, L1,2 and Bhattacharyya correspond to the results of the proposedapproach with respective distance measures. Chance refers to performance basedon random guessing. Martin, cepstral univariate, Chernoff and KL divergenceresults are taken from (Woolfe & Fitzgibbon, 2006). The reported results in(Woolfe & Fitzgibbon, 2006) do not provide results for matching beyond top 1.
yielded slightly improved recognition rates of ∼20% (Woolfe & Fitzgibbon, 2006).
Restricting comparisons between non-overlapping portions of the original im-
age sequence data tests shift-invariant recognition in that the “view” between
instances is spatially shifted. As a practical point, shift-invariant recognition
arguably is of more importance than viewpoint specific, as real-world imaging
scenarios are unlikely to capture a scene from exactly the same view across two
different acquisitions.
The second experiment reported here closely follows previous shift-invariant
experiments using the UCLA data set, as described above (Woolfe & Fitzgibbon,
2006). Each sequence was spatially partitioned into left and right halves (win-
96
![Page 111: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/111.jpg)
dow pairs), with a few exceptions. (In contrast, (Woolfe & Fitzgibbon, 2006)
manually cropped sequences into 48×48 subsequences; again, the location of the
crop windows were not reported.) The exceptions arise as several of the imaged
dynamic textures are not spatially stationary; therefore, the cropping regimen
described above would result in left and right views of different dynamic textures
for these cases. For instance, in several of the fire and candle samples, one view
would capture a static background, while the other would capture the flame.
Previous shift-invariant experiments elected to neglect these cases, resulting in
a total of 39 scenes (Woolfe & Fitzgibbon, 2006). In the present evaluation, all
cases in the data set were retained with special manual cropping introduced to
the non-stationary cases to include their key dynamic features; see Appendix E
for the documentation of these special crop windows. (In experimentation, it
was found that dropping these special cases entirely had negligible impact on the
overall result.)
Overall, the current experimental design yielded a total of 400 sequences, as
each of the original 200 sequences were divided into two non-overlapping portions
(views). Comparisons were performed only between non-overlapping views. A
correct detection for a given dynamic texture sequence was defined as having one
of the four dynamic texture sequences from the other views of its scene as its
97
![Page 112: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/112.jpg)
nearest-neighbour.
The results for the second experiment are presented in Fig. 4.11. In this sce-
nario the proposed approach achieved a 42.3% classification rate, significantly
outperforming the best result of 20% reported in (Woolfe & Fitzgibbon, 2006).
Considering the closest five matches, classification improved to ∼60%. This
strong performance owes to the proposed spatiotemporal oriented energy rep-
resentation’s ability to capture dynamic properties of visual spacetime without
being tied to the specifics of spatial appearance. Figures 4.12 and 4.13 provide
several successful classification examples.
Interestingly, close inspection of the results shows that many of the misclassi-
fications for the proposed approach arise between different scenes of semantically
the same material, especially from the perspective of visual dynamics. Figure
4.14 shows several illustrative cases. For example, the most common “confusion”
arises from strong matches between two different scenes of fluttering vegetation.
Indeed, vegetation dominates the data set and consequently has a great impact
on the overall classification rate.
Finally, recall that the results in (Woolfe & Fitzgibbon, 2006) used carefully
chosen windows of spatial size 48× 48; whereas, the results reported here for the
proposed approach are based on simply splitting the full size dynamic textures
98
![Page 113: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/113.jpg)
in half. To control against the impact of additional spatiotemporal support, the
proposed approach was also evaluated on cropped windows of similar size to
(Woolfe & Fitzgibbon, 2006). This manipulation was found to have negligible
impact on the recognition results.
99
![Page 114: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/114.jpg)
(a) fire
(b) smoke
(c) boiling-b-near
Figure 4.12: Example correct classifications for shift-invariant dynamic texturecategorization. In each subfigure, the first row shows several frames from aninput sequence and the second row shows the corresponding nearest match inthe database. The text below each figure, indicating the dynamic texture scene,refers to the filename prefix used in the UCLA data set.
100
![Page 115: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/115.jpg)
(a) plant-d-near
(b) sea-a-mid
(c) wfalls-d-near
Figure 4.13: Example correct classifications for shift-invariant dynamic texturecategorization. In each subfigure, the first row shows several frames from aninput sequence and the second row shows the corresponding nearest match inthe database. The text below each figure, indicating the dynamic texture scene,refers to the filename prefix used in the UCLA data set.
101
![Page 116: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/116.jpg)
input nearest match
plant-m-mid plant-n-mid
sea-e-near sea-a-mid
candle fire
wfalls-b-near wfalls-c-near
Figure 4.14: Examples of several misclassifications from the shift-invariant recog-nition experiment. From a semantic perspective, the inputs and their respectivenearest match are equivalent. The text below each figure, indicating the dynamictexture scene, refers to the filename prefix used in the UCLA data set.
102
![Page 117: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/117.jpg)
category filename prefix descriptionnumberof videos
flames candle, fire flames 16fountain fountain-c spurting spray style fountain 8smoke smoke smoke 8
water turbulence boiling, water turbulent water dynamics 40water waves sea wave dynamics 24
waterfalls fountain-a,b, wfalls water flowing down surfaces 64windblown vegetation flower, plant fluttering vegetation 240
Table 4.3: Summary of semantic reorganization of UCLA dynamic texture dataset. “filename prefix” refers to the filename prefix of the original scenes in theUCLA data set.
4.3.2.3 Semantic category classification
Examining the UCLA data set, one finds that many of the scenes (50 in total)
are capturing semantically equivalent categories. As examples, different scenes
of fluttering vegetation share fundamental dynamic texture similarities, as do
different scenes of water waves vs. fire, etc; indeed, these similarities are readily
apparent during visual inspection of the data set as well as the shift-invariant con-
fusions shown in Fig. 4.14. In contrast, the usual experimental use of the UCLA
data set relies on distinctions made on the basis of particular scenes, emphasizing
their spatial appearance attributes (e.g., flower-c vs. plant-c vs. plant-s) and the
video capture viewpoint (i.e., near, medium and far). This parceling of the data
set overlooks the fact that there are fundamental similarities between different
scenes and views of the same semantic category.
103
![Page 118: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/118.jpg)
Classified
flam
es
fou
nta
in
smok
e
wat
ertu
rbu
len
ce
wat
erw
aves
wat
erfa
ll
win
dblo
wn
vege
tati
on
Act
ual
flames (total 16) 12 1 2 1fountain (8) 8
smoke (8) 2 6water turbulence (40) 34 6
water waves (24) 24waterfall (64) 2 51 11
windblown vegetation (240) 3 1 2 234
Table 4.4: Dynamic texture confusion matrix. Confusion matrix for seven se-mantic categories of dynamic texture. Results are based on the Bhattacharyyasimilarity measure.
In response to the observations above, the final reported experiment reorga-
nizes the UCLA data set into the seven semantic categories (reorganization done
by author) summarized in Table 4.3. Although alternative categorical organi-
zations might be considered, the present one is reasonably consistent with the
semantics of the depicted patterns. Evaluation on this data set was conducted
using the same procedure outlined for the shift-invariant experiment to yield
semantic category recognition.
The semantic category recognition results based on the closest match are
104
![Page 119: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/119.jpg)
shown as a confusion table in Table 4.4. The overall classification rate in this
scenario is 92.3%. As with the previous experiment, inspection of the confusions
reveals that they typically are consistent with their apparent dynamic similarities
(e.g., waterfall and turbulence confusions, smoke and flames confusions). These
results provide strong evidence that the proposed approach is extracting infor-
mation relevant for delineating dynamic textures along semantically meaningful
lines; moreover, that such distinctions can be made based on dynamic information
without inclusion of spatial appearance.
105
![Page 120: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/120.jpg)
Classified
un
con
stra
ined
un
derc
onst
rain
ed
dom
inan
t
mu
lti-
dom
inan
t
hete
roge
neo
us
and
isot
ropi
c
Act
ual
unconstrained (total 16) 16underconstrained (77) 76 1
dominant (293) 1 278 5 9multi-dominant (85) 2 6 71 6
heterogeneous and isotropic (139) 1 3 3 132
Table 4.5: Spacetime texture confusion matrix. Confusion matrix for the fivebasic-level categories of spacetime texture. Results are based on the Bhat-tacharyya coefficient measure.
4.3.3 Spacetime texture classification with the YUVL data set
4.3.3.1 Basic-level classification
As with the previous evaluation on dynamic texture classification performance,
the same leave-one-out classification procedure was used to evaluate the perfor-
mance of the proposed spacetime texture recognition approach using the YUVL
spacetime texture data set. Overall results are presented in Fig. 4.15 (a). The
highest recognition rate achieved using the proposed spacetime oriented energy
106
![Page 121: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/121.jpg)
approach was 94% with the L1, L2 and Bhattacharyya similarity measures. In the
remainder of this section, discussion will be limited to results based on the Bhat-
tacharyya coefficient measure; results based on the alternative distance measures
are generally slightly lower. Considering the closest three matches, classification
improved to 98.9%. Class-by-class results are presented in Fig. 4.15 (b) and Table
4.5. In the case of the class-by-class results, nearest-neighbour recognition rates
ranged between 83.5% to 100%. Considering the closest three matches, recog-
nition rates improved, ranging between 96.5% to 100%. Figures 4.16 and 4.17
provide an example correct classification for each basic-level category.
Figure 4.18 provides several representative examples of common misclassifi-
cations. In Fig. 4.18 (a), the corn field was classified as underconstrained, rather
than dominant oriented, as labeled in the ground truth. From the figure it is clear
that the rows of corn form a nearly vertical spatial pattern (similar to the wooden
fence in the nearest match) and thus could also be considered an instance of the
aperture problem (underconstrained). In Fig. 4.18 (b), the scene of the people
walking down the street was classified as heterogeneous oriented, rather than
dominant oriented, as labeled. In the input, although there is a small amount of
vertical motion downward due to the walking motion, there is also a significant
component of bobbing which is reminiscent of the nearest match. Both Fig. 4.18
107
![Page 122: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/122.jpg)
(a) and (b) highlight the often ambiguous nature of providing ground truth class
labels. In Fig. 4.18 (c), the scene consisting of falling leaves with a stationary
backdrop was classified as dominant oriented rather than multi-oriented, as la-
beled. In this example, the orientation of the nearest sample matched one of the
orientation components of the misclassified input. In this case, one of the orien-
tation components of the multi-oriented sequence may lie beyond the resolution
(scale) of the filters used; recall that the current analysis is restricted to a single
spatiotemporal scale. Such confusions may be addressed via the use of multiple
scales of analysis, which is an interesting direction for future research.
108
![Page 123: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/123.jpg)
1 2 30
20
40
60
80
100
within top
reco
gniti
on ra
te
EMD L1EMD L2L1L2Bhattacharyya
(a)
1 2 30
20
40
60
80
100
within top
reco
gniti
on ra
te
UnconstrainedUnderconstrainedDominantMulti-DominantHeterogeneous
(b)
Figure 4.15: Spacetime texture basic-level recognition results. (a) Overall basic-level category results. Results correspond to the proposed representational ap-proach under various distance measures. (b) Class-by-class basic-level categoryresults. Results for (b) correspond to the proposed representational approachunder the Bhattacharyya measure.
109
![Page 124: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/124.jpg)
(a) unconstrained
(b) underconstrained
(c) dominant orientation
Figure 4.16: Example correct classifications for basic-level categorization. Ineach subfigure, the first row shows several frames from an input sequence and thesecond row shows the corresponding nearest match in the database. (a) Night skymatched to clear day sky. (b) Stationary striped pillow matched to a stationarycar grille. (c) Camera panning down building matched to camera panning downcityscape.
110
![Page 125: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/125.jpg)
(a) multi-dominant orientation
(b) heterogeneous orientation and isotropic structure
Figure 4.17: Example correct classifications for basic-level categorization. Ineach subfigure, the first row shows several frames from an input sequence and thesecond row shows the corresponding nearest match in the database. (a) Heavysnow over city backdrop matched to heavy snow over Kremlin. (b) Silhouette ofcheering crowd matched to cheering crowd.
111
![Page 126: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/126.jpg)
(a) dominant oriented misclassified as underconstrained
(b) dominant oriented misclassified as heterogeneous oriented
(c) multi-oriented misclassified as dominant oriented
Figure 4.18: Example misclassifications for basic-level categorization. In eachsubfigure, the first row shows several frames from an input sequence and the sec-ond row shows the corresponding nearest match in the database. (a) Stationaryview of corn field matched to stationary view of wooden fence. (b) Pedestrianswalking down the street matched to crowd jogging in place. (c) Leaves fallingover stationary park backdrop matched to stationary cityscape.
112
![Page 127: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/127.jpg)
4.3.3.2 Subordinate-level classification
The previous experiment demonstrated strong recognition performance in the
context of fairly broad structural categories. This experiment considered finer
categorical distinctions using the subordinate-level partition of the YUVL space-
time texture data set. Evaluation on this data (including the unpartitioned
categories of “unconstrained” and “multi-dominant”) was conducted using the
same leave-one-out procedure outlined for the basic-level experiment above. The
subordinate-level recognition results are shown in Fig. 4.19. Considering only
the first nearest-neighbour, class-by-class results ranged between 81.3% to 100%.
Considering the closest three matches, recognition rates improved, ranging be-
tween 95.3% to 100%. Figures 4.20 to 4.22 provide an example correct classifica-
tion for each subordinate-level category.
Figure 4.23 provides two representative examples of common misclassifica-
tions. In Fig. 4.23 (a), both patterns contain roughly the same dominant ori-
entation (same velocity) corresponding to upward motion yet the rising bubbles
contain additional deviations. Two possible sources for this misclassification are:
(i) the orientation deviations in the rising bubbles sequence are beyond the res-
olution of the current instantiation of the representation and (ii) the orientation
deviations in the bubble sequence are significant, yet the data set does not con-
113
![Page 128: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/128.jpg)
1 2 30
20
40
60
80
100
within top
reco
gniti
on ra
te
FlickerAperture ProblemSingle OrientedNon-Single OrientedWavyStochastic
Figure 4.19: Spacetime texture subordinate-level recognition results. Class-by-class subordinate-level category results. Results correspond to the proposed rep-resentational approach under the Bhattacharyya measure.
tain a single-oriented pattern that matches closely with the input (i.e., has the
same dominant image velocity component). In Fig. 4.23 (b), the traffic scene
was classified as non-single oriented, rather than single oriented, as labeled. One
could argue that the traffic scene should also have been labeled as non-single
oriented in the ground truth. More generally, most of the misclassifications are
of this type (i.e., correspond to matches to “neighbouring” categories). Again,
this demonstrates the ambiguous nature of the ground truth labeling task.
Taken together with the results from the broad-level category experiment,
the results provide strong evidence that the proposed approach is extracting
relevant structural dynamic information to delineate the spectrum of spacetime
textures. In particular, the approach is able to represent and recognize patterns
114
![Page 129: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/129.jpg)
encompassing those that traditionally have been treated separately (e.g., motion,
dynamic texture, as well as other image dynamics) when considered as aggregate
measurements over a region.
115
![Page 130: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/130.jpg)
(a) aperture problem
(b) flicker
Figure 4.20: Example correct classifications for subordinate-level categorization.In each subfigure, the first row shows several frames from an input sequenceand the second row shows the corresponding nearest match in the database.(a) Stationary picket fence matched to stationary striped wall. (b) Flashingemergency light matched to stationary scene lit by lightning.
116
![Page 131: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/131.jpg)
(a) single oriented
(b) non-single oriented
Figure 4.21: Example correct classifications for subordinate-level categorization.In each subfigure, the first row shows several frames from an input sequence andthe second row shows the corresponding nearest match in the database. (a) Textscrolling upward matched to camera panning down park scene. (b) People ridingbicycles matched to crowd walking.
117
![Page 132: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/132.jpg)
(a) wavy fluid
(b) stochastic
Figure 4.22: Example correct classifications for subordinate-level categorization.In each subfigure, the first row shows several frames from an input sequence andthe second row shows the corresponding nearest match in the database. (a) Wavyocean matched to wavy fluid. (b) Busy trading floor matched to a commotion ofpeople.
118
![Page 133: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/133.jpg)
(a) single oriented misclassified as non-single oriented
(b) single oriented misclassified as non-single oriented
Figure 4.23: Example misclassifications for subordinate-level categorization. Ineach subfigure, the first row shows several frames from an input sequence and thesecond row shows the corresponding nearest match in the database. (a) Suburbflyover matched to rising bubbles. (b) Car traffic moving downward matched tomarathon moving downward.
119
![Page 134: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/134.jpg)
Classifiedlight medium heavy
Act
ual
light (total 165) 163 1 1medium (45) 1 40 4
heavy (44) 1 4 39
Table 4.6: Traffic congestion confusion matrix. Cumulative confusion matrixfor traffic congestion classification for all four testing trials using the proposedapproach. Results are based on the Bhattacharyya distance measure.
4.3.4 Traffic congestion classification with the UCSD data set
Reported traffic congestion classification results in (Chan & Vasconcelos, 2005a;
Chan & Vasconcelos, 2005b) are based on the average classification rate taken
over four trial runs. Each trial consisted of splitting the data differently with
75% of the video samples reserved for training and 25% for testing. The training
and test data splits for each trial are provided with the data set. The empirical
results described next are based on the aforementioned testing protocol.
The proposed approach based on the Bhattacharyya measure has an overall
classification rate of 95.28% versus the best classifier reported in (Chan & Vascon-
celos, 2005a; Chan & Vasconcelos, 2005b) that has an overall classification rate
of 94.5%. Table 4.6 provides the cumulative confusion matrix for all four testing
trials using the proposed approach. Figure 4.24 shows a correct classification
result for each of the three traffic congestion conditions.
120
![Page 135: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/135.jpg)
As shown in Table 4.6, the majority of misclassifications (ten cases in to-
tal) occur between neighbouring classes (i.e., light vs. medium and medium vs.
heavy). Such matches are reasonable, given the indeterminate nature of cate-
gory boundaries and the corresponding ambiguities of generating ground truth,
as previously encountered in Section 4.3.3. Figure 4.25 (a) shows an example
of this ambiguity where the input depicting heavy traffic is matched closest to
an instance of medium traffic. The remaining two misclassifications are related
to confusions between light and heavy traffic (i.e., non-neighbouring classes). In
both instances, the light traffic sequences largely depict the background (i.e., few
cars are present), while the matched heavy traffic sequences depict cars that are
virtually at a standstill (see Fig. 4.25 (b) for an example). From the viewpoint
of dynamics, both scenes are similar and thus the matches are reasonable.
121
![Page 136: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/136.jpg)
(a) light congestion
(b) medium congestion
(c) heavy congestion
Figure 4.24: Example correct classifications for traffic congestion categorization.In each subfigure, the first row shows several frames from an input sequence andthe second row shows the corresponding nearest match in the database.
122
![Page 137: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/137.jpg)
(a) heavy congestion misclassified as medium
(b) heavy congestion misclassified as light
Figure 4.25: Example misclassifications for traffic congestion categorization. Ineach subfigure, the first row shows several frames from an input sequence and thesecond row shows the corresponding nearest match in the database.
123
![Page 138: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/138.jpg)
4.3.5 Recapitulation
This section has provided several sets of evaluations of classification performance
of the proposed approach on a wide range of spacetime textures. The first set of
evaluations focused on the narrow set of patterns that have been of primary con-
cern in the literature, namely, dynamic textures. It was concluded that the pro-
posed approach significantly outperforms extant approaches once controls were
introduced in the standard evaluation to remove the effect of identical viewpoint.
Further, the standard data set was reorganized in terms of semantically meaning-
ful dynamic categories. The results in this context provide compelling evidence
that the proposed approach is indeed extracting dynamic information relevant
for delineating dynamic textures along semantically meaningful lines of inquiry.
The second set of experiments considered the broader set of spacetime textures
that encompass both motion and non-motion-related dynamic patterns. Typ-
ically these patterns have been treated non-uniformly on a case-by-case basis.
It was concluded that the proposed approach can make both broad structural
pattern distinctions (e.g., dominant vs. multi-dominant oriented), as well as finer
ones (e.g., wavy fluid vs. general stochastic cases). Finally, the proposed approach
was evaluated on the practical application of traffic congestion classification. This
last result demonstrated state-of-the-art performance of the proposed approach.
124
![Page 139: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/139.jpg)
Note that the same set of parameters for the proposed approach were used for
all evaluations. Significantly, these results collectively indicate that the represen-
tation of spacetime orientation information in a distributed manner as proposed
provides a powerful substrate for classifying a wide-range of dynamic patterns.
4.4 Discussion and summary
There are two main contributions in this chapter. First, the definition of texture
in the context of dynamics has been broadened to uniformly capture a wide range
of dynamic phenomena ranging from deterministic patterns such as motion to
more stochastic patterns typically associated with dynamic texture. The key
unifying principle is their underlying first-order spacetime correlation structure.
Second, although the application of spacetime oriented filters is well documented
in the literature for patterns readily characterized as single motion (Fahle &
Poggio, 1981; Adelson & Bergen, 1985; Watson & Ahumada, 1985; Heeger, 1988;
Simoncelli, 1993) and semi-transparent motion (Fleet, 1992; Simoncelli, 1993; Yu
et al., 1999; Beauchemin & Barron, 2000a), its application to analyzing more
complicated phenomena as manifest in dynamic texture patterns, where dominant
oriented structure can break down, has received no previous attention. Through
empirical evaluation it has been shown that this tack yields a strong approach
125
![Page 140: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/140.jpg)
to shift-invariant, basic/subordinate and semantic category-based recognition of
spacetime textures.
In this work, the dynamic portion of a given texture pattern has been factored
out from the purely spatial appearance portion for subsequent recognition. In
contrast, LDS-based recognition approaches generally have considered the spatial
appearance and dynamic components jointly, which appears to limit performance
in significant ways (e.g., weak performance on shift-invariant recognition relative
to the proposed approach). Further, the spatial appearance component of these
methods is based primarily on a Principal Components Analysis (PCA) that does
not represent the current state-of-the-art in spatial texture analysis, e.g., (Varma
& Zisserman, 2009). These observations motivate the future investigation of com-
bining a state-of-the-art appearance-based scheme with the proposed approach
to recognizing pattern dynamics.
Although the proposed representation has been presented in terms of oriented
filters at a single spatiotemporal scale (i.e., radial frequency), it is an obvious
candidate for multi-scale treatment (Lindeberg, 1997). This extension may serve
to support finer categorical distinctions due to characteristic signatures mani-
festing across scale. Another possible direction of research concerns the use of
second- and higher-order structure measurements (e.g., curvature) and statistics
126
![Page 141: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/141.jpg)
in combination with more sophisticated distance measures and classifiers. Recall,
currently only first-order measurements of spacetime structure and the zeroth-
order moment (i.e., mean energy) are considered. These higher-order extensions
would allow for the approach to encompass further patterns of interest beyond
the scope of the current first-order analysis (e.g., parametric motion such as affine
motion and acceleration); nonetheless, it has been demonstrated that the current
approach based on first-order statistics has successfully captured a broad and
important set of dynamic patterns.
In summary, this chapter has presented a unified approach to representing and
recognizing spacetime textures based on the underlying pattern dynamics. The
approach is based on a distributed characterization of visual spacetime in terms
of 3D, (x, y, t), spatiotemporal orientation. Empirical evaluation on a variety of
challenging data sets, including quantitative comparisons with state-of-the-art
methods, demonstrates the strong potential of the proposed approach.
127
![Page 142: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/142.jpg)
Chapter 5
Spacetime Grouping and Local
Boundary Detection
That the problem (organization) is difficult to frame atthe functional level makes it no less important or inter-esting. In avoiding the problem, we have been, as thestory goes, looking for our keys where the light is.
ANDREW WITKIN and JAY TENENBAUM (On theRole of Structure in Vision)
5.1 Introduction
5.1.1 Motivation
The grouping of coherent regions of dynamics and the detection of their bound-
aries in (temporal) image sequences have been and remain longstanding challenges
128
![Page 143: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/143.jpg)
in computer vision. Reasons for their continued interest include providing bound-
ary conditions for any process that requires knowledge of the spacetime support
of coherent data for recovering reliable local estimates (e.g., optical flow), visual
tasks can benefit from the reduction in complexity achieved by such an initial
organization of the data (Zucker et al., 1975; Marr, 1982; Lowe, 1985; Grimson,
1991) and boundaries provide useful information about the 3D structure of the
imaged scene (Koenderink, 1984).
What facilitates such an organization? The answer lies in the fact that our sur-
rounding environment is not arbitrary nor random, rather it is a very structured
place organized by the physical, biological, etc. forces in the world (Thompson,
1952). As a result, the apparent complexity in our visual world may be ex-
plained by a limited vocabulary of basic patterns combined in a vast number of
ways (Stevens, 1974; Pentland, 1986). It is this structure that facilitates reliable
inferences of our environment from raw image signals (Witkin & Tenenbaum,
1983; Jepson et al., 1996). In particular, the appearance of spatiotemporal co-
herence is so unlikely to arise by chance interaction of independent entities that
such regular structure, when observed, almost certainly represents some under-
lying unified cause (Witkin & Tenenbaum, 1983), i.e., unlikely to co-occur by
accident (Lowe, 1985). As such, they represent “semantic precursors” that de-
129
![Page 144: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/144.jpg)
serve and demand further explanation and elaboration (Witkin & Tenenbaum,
1983).
As discussed in Chapter 2, image motion, which forms the predominate focus
in the literature, represents a particular instance of the myriad spatiotemporal
patterns encountered in image sequences. Examples of non-motion-related pat-
terns of significance include, unstructured (e.g., “blank wall”), flicker (i.e., pure
temporal intensity change), and dynamic texture (e.g., as typically associated
with stochastic phenomena, such as windblown vegetation and turbulent water).
These types of dynamic patterns have received far less attention than motion in
the literature.
The goal of the work described in this chapter is the development of unified
approaches to spatiotemporal grouping and local boundary detection that are
broadly applicable to the diverse phenomena encountered in the natural world,
including but not limited to motion. It is proposed that the choice of represen-
tation is key to meeting this challenge: If the representation cannot adequately
characterize and distinguish the patterns of interest, then no subsequent algo-
rithm will make the appropriate delineations. As discussed in Chapter 2, local
spatiotemporal orientation is of fundamental descriptive power, as it captures
the structure of a broad and important set of dynamic patterns. Correspond-
130
![Page 145: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/145.jpg)
ingly, visual spacetime will be represented according to its local 3D, (x, y, t),
orientation structure, as captured by the proposed approach to representation
developed earlier in Chapter 3. With visual spacetime represented according to
its local orientation structure, distinct regions are defined as groups of (x, y, t)
pixels coalesced according to similarities of their orientation distributions, while
boundaries will be extracted via detection of spatiotemporal change in the local
orientation structure. Portions of the work described in this chapter appear in
(Derpanis & Wildes, 2009a; Derpanis & Wildes, 2009b).
5.1.2 Related work
In this chapter, two main lines of image sequence analysis are considered: space-
time grouping and local spacetime structure boundary detection. In this section,
representative strands of these lines of research are surveyed.
5.1.2.1 Spacetime grouping
Generic grouping entails associating together tokens that respect a given measure
of coherency. Spacetime grouping attempts to extract coherent regions in a video
sequence across both space and time. There are two key parts to any grouping
process: (i) the initial representation of the data and (ii) the actual grouping
131
![Page 146: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/146.jpg)
mechanism. For a survey on extant representations of dynamics, the reader is
referred to Section 3.1.2 of Chapter 3.
Spacetime grouping mechanisms can be classified into one of two basic com-
putational paradigms (Megret & DeMenthon, 2002): sequential and multidimen-
sional methods. Sequential approaches make distinctions between the spatial
and temporal dimensions which is reflected in their interleaving of spatial and
temporal grouping processes. These methods can further be subdivided into two
categories based on the priority given to the spatial and temporal grouping pro-
cesses.
Approaches that give priority to spatial grouping rely on temporally extending
recovered spatial segments across time (i.e., temporal tracking). This class has
historically been by far the most studied, as it subsumes the wealth of work related
to single frame image and motion segmentation. This class includes methods
based on spatial grouping of optical flow (Fennema & Thompson, 1979; Wang &
Adelson, 1994), parametric motion model fittings (via sequential application of
dominant motion estimators, e.g., (Burt et al., 1991; Wang & Adelson, 1994; Irani
et al., 1994; Bober & Kittler, 1994; Duc et al., 1995; Darrell & Pentland, 1995;
Odobez & Bouthemy, 1995; Sawhney & Ayer, 1996; Black & Anandan, 1996)
and simultaneous fitting of mixture models, e.g., (Jepson & Black, 1993; Ayer
132
![Page 147: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/147.jpg)
& Sawhney, 1995; Weiss, 1997; Jojic & Frey, 2001; Cremers & Soatto, 2005),
colour/texture-based features (Deng & Manjunath, 1998; del Bimbo et al., 2000;
Deng & Manjunath, 2001) and stochastic models of dynamic textures (Doretto
et al., 2003b; Ghoreyshi & Vidal, 2006; Chan & Vasconcelos, 2008; Chan &
Vasconcelos, 2009).
Approaches that give priority to temporal grouping seek to cluster temporal
trajectories of discrete features according to the different scene motions these tra-
jectories belong to. An early example of this approach was presented in (Allmen
& Dyer, 1991). Trajectories were mapped to high-dimensional feature vectors
based on local estimates of slope and curvature that were then grouped. Follow-
up work largely incorporated a camera model into the modeling process and
recast the grouping problem of feature trajectories as seeking the underlying lin-
ear subspaces that comprise the data, e.g., (Torr & Zisserman, 1998; Costeira &
Kanade, 1998; Gear, 1998; Ichimura, 2000). A drawback among these approaches
is that they preclude grouping regions where trajectories cannot be extracted
(e.g., dynamic water) or trajectories are induced by non-rigid objects/surfaces
but nonetheless correspond to an underlying coherent process (e.g., fluttering
vegetation).
A general drawback with prioritizing grouping along spatial and temporal di-
133
![Page 148: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/148.jpg)
mensions, irrespective of the ordering of these processes, is that they are prone
to propagating errors across the grouping stages. Recently, attempts have been
made to process spatial and temporal information simultaneously, thus ignor-
ing the question of prioritizing dimensions altogether. These multidimensional
approaches treat the image sequence and its extracted local attributes as a higher-
dimensional feature-space, where grouping is performed. Examples of these meth-
ods include, volume growing (Porikli & Wang, 2001), variational (Cremers &
Soatto, 2005; Brox et al., 2006), voting (Nicolescu & Medioni, 2003; Min &
Medioni, 2008), graph partitioning (Shi & Malik, 1998; Fowlkes et al., 2001), sta-
tistical parametric (Greenspan et al., 2004) and non-parametric, e.g., mean-shift
(DeMenthon, 2002; Wang et al., 2004).
Common to the cited approaches above is their neglect of the rich underly-
ing local structure of the image data: Grouping is based on overly restrictive
measurements, e.g., optical flow and models of stochastic processes, or the rich
temporal information is ignored entirely, e.g., colour. In contrast, the proposed
approach to grouping leverages the more descriptive spatiotemporal orientation
structure of the data. For spacetime grouping, the multidimensional mean-shift
algorithm is employed; note that the emphasis in this work is on the choice of rep-
resentation, rather than the particulars of the grouping scheme, other grouping
134
![Page 149: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/149.jpg)
schemes may be considered.
5.1.2.2 Spacetime structure boundary detection
Previous dynamic boundary detection methods can be categorized as either local
or global. Local methods restrict analysis to limited neighbourhoods around each
point. In contrast, global methods generally attempt to simultaneously estimate
a consistent flow field and its discontinuities across the image.
Early efforts focused on the local detection of motion discontinuities in dense
optical flow fields through the use of edge operators, e.g., (Nakayama & Loomis,
1974; Potter, 1977; Thompson et al., 1985). Closely related is work that used
changes in the sign/magnitude of the normal flow components to detect motion
boundaries (Marr & Ullman, 1981; Hildreth, 1984). Alternatively, regions ex-
hibiting a high percentage of unmatched features on a frame-to-frame basis are
identified as motion boundaries (Mutch & Thompson, 1985). Other methods
have detected boundaries from the shape of the local template match surface,
e.g., (Anandan, 1984; Black & Anandan, 1990). Boundary detection also has
been performed using a detector over basis flows for simple events (e.g., motion
of occluding edge or bar) (Fleet et al., 1998). In follow-up work, motion dis-
continuity regions were captured using a non-linear generative model (Black &
135
![Page 150: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/150.jpg)
Fleet, 2000). Recently, hand-labeled motion boundaries have been used to train
discriminative classifiers (Apostoloff & Fitzgibbon, 2005; Stein & Hebert, 2009a).
Further, motion boundary detection has been based on analysis of patch-based
distributions of image features (e.g., intensity, colour, flow) (Spoerri & Ullman,
1987; Stein & Hebert, 2007). Perhaps most closely related to the approach pro-
posed here are methods that detect motion boundaries from the structure of spa-
tiotemporal brightness patterns as captured by local estimates of spatiotemporal
orientation (Feldman & Weinshall, 2008) or, more generally, oriented bandpass
filters (Niyogi, 1995; Wildes & Bergen, 2000). Also related are previous efforts us-
ing oriented energy measurements for boundary detection in 2D intensity images,
e.g., (Morrone & Owens, 1987; Freeman & Adelson, 1991).
Typically, the focus of global methods has been the recovery of regional flows,
with inter-region boundaries made explicit to various degrees (Heitz & Bouthemy,
1993; Cremers & Soatto, 2005). The particular formulations developed in these
cases are limited to motion boundary detection and not more generally applicable
to additional classes of spatiotemporal structure boundaries. Alternatively, global
methods have been developed that indicate regions of dynamic texture and their
boundaries, e.g., (Doretto et al., 2003b; Ghoreyshi & Vidal, 2006); however, it
does not appear that such methods are directly applicable to motion boundaries.
136
![Page 151: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/151.jpg)
Overall, similar to extant grouping approaches, it appears that no single pre-
vious method for spatiotemporal boundary detection is capable of capturing the
wide range of juxtaposed spacetime patterns encountered in the real-world. Fur-
thermore, the emphasis of most previous work has been on the special case of
motion boundaries.
5.1.3 Contributions
In the light of previous research, the major contributions of the present work are
as follows. (i) The first unified approach to representing and segregating (i.e.,
grouping and detecting local boundaries) a wide range of juxtaposed spacetime
patterns (motion, static, flicker, (pseudo-)transparency, translucency, spacetime
texture, unstructured) is proposed. Previous analyses of spacetime patterns op-
erate on a case-by-case basis, with image motion predominant. (ii) The rep-
resentation developed in Chapter 3 provides a unified basis to representing and
delineating coherent spacetime regions of spacetime structure. The representation
converts dynamic structural differences to spatiotemporal contrast; correspond-
ingly, standard grouping and simple contrast detection mechanisms (e.g., local
differential operators) can delineate regions and mark boundaries, respectively.
(iii) The approach’s ability to group and detect boundaries in terms of meaningful
137
![Page 152: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/152.jpg)
spatiotemporal structure is demonstrated both qualitatively and quantitatively
on a wide range of synthetic and challenging natural image sequences.
5.2 Technical approach
In this section, the proposed approach to spacetime representation and analysis
for grouping and local boundary detection is presented. At an abstract level, the
approach consists of initially exposing the spacetime oriented structure of the
input video using the representation proposed in Chapter 3. This is followed by
an adaptive spacetime smoothing step to suppress noise and enhance structural
boundaries. The analysis culminates in grouping coherent dynamics as captured
by the representation or identifying boundaries of significant spacetime structure
contrast. The major assumption exercised in the proposed grouping and bound-
ary detection approaches is that of smoothness of spacetime structure surfaces
in a higher-dimensional attribute space for each region of coherent structure,
cf. (Nicolescu & Medioni, 2003). Invoking this assumption avoids the burden of
explicitly modeling the set of spatiotemporal structures that may be encountered.
The proposed approach to representation is motivated by the fact that such
an orientation-based decomposition captures significant, meaningful aspects of
temporal pattern variation, as discussed in Chapter 2. As examples: A signifi-
138
![Page 153: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/153.jpg)
cant response in a single component of the decomposition is indicative of motion;
significant responses in multiple components of the decomposition are indicative
of transparency-based superposition; more uniform, yet still significant responses
across the entire decomposition are indicative of dynamic texture (e.g., scintilla-
tion); lack of response in any component of the decomposition is indicative of un-
structured regions (e.g., uniform intensity). Under this representation, coherency
of spacetime is defined in terms of consistent patterns across the decomposition,
while inconsistencies indicate spacetime structural boundaries. Figure 5.1 pro-
vides a simple illustration of this point. In the orientation decomposition, it is
seen that the foreground tree yields relatively large and small intensities in the
“static” and “rightward” components, respectively; whereas, the moving back-
ground yields the opposite behaviour. Therefore, coherency in the decomposition
corresponds to regions exhibiting coherent dynamics while spatiotemporal change
(i.e., contrast) in the decomposition is indicative of the boundary between the
tree and background.
It should be noted that the proposed approach delineates unstructured regions
from neighbouring structured ones. This is in contrast with standard variational
approaches that attempt to fill-in structure in regions where none is present, e.g.,
(Horn & Schunck, 1981). As argued in (Fleet, 1992; Shi & Malik, 1998), the
139
![Page 154: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/154.jpg)
fabrication of structure in such cases may add little reliable information.
Integration of purely spatial cues, such as colour and texture (Martin et al.,
2004; Stein & Hebert, 2007), although of obvious benefit and a potential direction
for future research, is beyond the scope of the current effort.
140
![Page 155: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/155.jpg)
Spacetime Oriented Energy Decomposition
Boundary Salience
. . .
x
t
y
Input Video
static rightward
Grouping Contrast Computation
Spacetime Groupings
Adaptive Smoothing
Figure 5.1: Oriented energy decomposition maps structural differences to inten-sity differences. (top) Input image sequence of a foreground tree tracked (stabi-lized) by a moving camera with background in relative motion. (top-middle) Ori-ented energy decomposition of input shows marked differences in intensity corre-sponding to dynamic pattern differences of foreground vs. background. (bottom-middle) Oriented energy decomposition anisotropically smoothed. (bottom-left)Groupings indicated according to coherency within the anisotropically smoothedenergy decomposition. (bottom-right) Boundaries marked according to spa-tiotemporal contrast across the anisotropically smoothed energy decomposition;here displayed with their corresponding saliency value.
141
![Page 156: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/156.jpg)
5.2.1 Representation: Distributed spacetime orientation
This section presents details of the particular instantiation of the representation
used for the segregation tasks. For the motivations of the various steps, the reader
is referred to the discussion of the general framework in Section 3.2.1 of Chapter
3.
The desired spacetime orientation decomposition is realized using a set of
broadly tuned 3D Gaussian second derivative filters (i.e., N = 2), G2θ(x), and
their Hilbert transforms, H2θ(x), with the unit vector θ capturing the 3D direction
of the filter symmetry axis and x = (x, y, t) spacetime position. The local oriented
energy measurements are given as follows (cf. Eq. (3.1))
Eθ(x) =∑x∈Ω
[G2θ(x) ∗ I(x)]2 + [H2θ
(x) ∗ I(x)]2, (5.1)
where I(x) denotes the input video and ∗ convolution. The spacetime aggre-
gation within the spacetime region Ω is implemented by lowpass filtering the
data using the nine-tap binomial filter,
[1 8 28 56 70 56 28 8 1
]/256,
in spacetime. In application to grouping and boundary detection, it is advanta-
geous to use quadrature pair filtering, (5.1), so that the aggregation region can
be kept small to yield precise localization between juxtaposed regions.
Now, energy along a frequency domain plane with normal n and spatial ori-
142
![Page 157: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/157.jpg)
entation discounted through marginalization, is given by summing across the set
of measurements, Eθi , as (cf. Eq. (3.5))
En(x) =N∑i=0
Eθi(x), (5.2)
with θi one of N + 1 = 3 specified directions, (3.4), and each Eθi calculated
via the oriented energy filtering, (5.1). In the present implementation, six dif-
ferent spacetime orientations are made explicit, corresponding to static (no mo-
tion/orientation orthogonal to the image plane), leftward, rightward, upward,
downward motion (one pixel/frame movement), and flicker/infinite motion (ori-
entation orthogonal to the temporal axis); although, due to the broad tuning of
the filters employed, responses arise to a range of orientations about the peak
tunings.
Finally, the energy measures are normalized by the sum of consort planar
energy responses at each point (cf. Eq. (3.6)),
Eni(x) =Eni(x)
M∑j=1
Enj(x) + ε
, (5.3)
where M = 6 denotes the number of spacetime orientations considered and ε
is the scalar constant noise floor. Empirically, ε has been set to ∼1% of the
maximum expected response.
143
![Page 158: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/158.jpg)
5.2.2 Anisotropic smoothing: Mean-shift
Prior to attempting to group segments (Section 5.2.3) or mark loci of significant
spatiotemporal boundaries (Section 5.2.4) in the oriented energy decomposition,
it is appropriate to smooth the derived representation to suppress noise. For
this purpose, an anisotropic smoothing is performed as it serves to attenuate
noise while enhancing structural boundaries. Also, anisotropic smoothing is es-
pecially helpful around segment boundaries where the initial filter outputs may
mix structural measurements from adjoining regions and thus result in the hal-
lucination of regions (Galun et al., 2003). In the current implementation, mean-
shift is employed as the anisotropic smoothing operation (Fukunaga & Hostetler,
1975; Cheng, 1995; Comaniciu & Meer, 2002); however, other vector-valued meth-
ods are also applicable, e.g., (Tschumperle & Deriche, 2005).
To promote spatiotemporal coherence at the smoothing stage, the spacetime
orientation feature-space, (5.3), is augmented with positional information in the
form of spacetime coordinates, (x, y, t). Putting the above features together yields
a 9D feature vector (six oriented energies plus three for spacetime location), per
image point.
Conceptually, mean-shift regards the feature-space as an empirical distribu-
tion. Each feature-point is associated with a mode (local maximum) of the distri-
144
![Page 159: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/159.jpg)
bution via a gradient-ascent-like process and thereby all points associated with a
particular mode share a common feature value. In its simplest formulation (i.e.,
based on the Epanechnikov kernel), the mean-shift property can be written as
(see (Comaniciu & Meer, 2002), for details)
∇f(xc) ∝(
meanxi∈Sh,xc
xi − xc
), (5.4)
where f(x) denotes the underlying probability density function of a n-dimensional
space, ∇ the gradient estimate, x ∈ Rn, xi the given set of samples, and Sh,xc a
n-dimensional hyper-ball with radius h (the so-called kernel density bandwidth)
centered at xc. Repeated application of (5.4) converges to a local mode of the
distribution (i.e., ∇f(x) = 0). Specifically, modes are realized by moving the
window Sh,xc at each iteration by the mean-shift vector, (5.4), until the shift
magnitude falls below a given threshold. In the present case, modes arise as
particular values across 9D spatiotemporal feature vectors, x.
For the case of spacetime grouping, the converged 9D spatiotemporal feature
vectors serve as input for grouping. While for boundary analysis, the smoothed
input energy representation is realized by assigning the converged oriented en-
ergy portion of the feature vectors to their respective initial spacetime positions.
For both visual tasks, three bandwidth parameters serve as input, hspace, htime
and hrange, which determine the resolution of mode detection along the spatial,
145
![Page 160: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/160.jpg)
temporal and range (here, spacetime orientation) dimensions, respectively.
By far the most expensive operation of the mean-shift approach is finding the
points within the neighbourhood defined by the hyper-ball Sh,xc . This problem is
commonly referred to as multi-dimensional range searching. In implementation,
computational time reductions in range searching are realized by employing a
kD-tree (de Berg et al., 2008) to represent the data.
5.2.3 Spacetime grouping
To group the oriented energy feature vectors into coherent regions of structure,
mean-shift clustering (Comaniciu & Meer, 2002) is employed. Note, however,
the intent of this work is to demonstrate the power of the proposed representa-
tion in the context of spacetime grouping, rather than advocate any one group-
ing/clustering scheme. Implicitly this grouping strategy enforces spatial and
temporal smoothness simultaneously in the 9D space, while avoiding additional
assumptions on the camera model, parametric motion model (e.g., translation,
affine, etc.) or the number of distinct coherent spacetime structure regions.
Given the smoothed 9D joint feature space from the output of mean-shift
smoothing (Section 5.2.2), groupings are realized through a final transitive closure
step (i.e., connected-components). This step merges modes that lie within a
146
![Page 161: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/161.jpg)
distance τ from each other in the 9D space. More formally, let xi and xj ∈ Rn
denote two feature points. These points will be grouped together if there exists a
path, x1 = xi, . . . ,xm,xm+1, . . . ,xM = xj and m = 1, . . . ,M − 1, between them
such that neighbouring elements, xm,xm+1 in this path lie within a Euclidean
distance τ from each other (i.e., ‖xm−xm+1‖2 ≤ τ). In implementation, this step
is realized via a union-find algorithm (Cormen et al., 1990). Optionally, spacetime
groupings containing less than a specified number of pixels can be removed in a
postprocessing step via merging with the nearest-neighbour in the joint feature
space. The combined steps of mean-shift filtering followed by transitive closure
constitute the mean-shift clustering algorithm (Comaniciu & Meer, 2002).
To recapitulate, the proposed approach to grouping raw image data into a set
of coherent spacetime regions is given in algorithmic terms in Algorithm 3.
5.2.4 Spacetime structure boundaries
In essence, the oriented energy representation converts spacetime structure differ-
ences to intensity differences across its decomposition. Correspondingly, bound-
aries simply correspond to image loci exhibiting significant spatiotemporal con-
trast in the representation. Specifically, the orientation decomposition is a multi-
valued image, with spatiotemporal contrast indicative of spacetime boundaries in
147
![Page 162: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/162.jpg)
Algorithm 3: Spacetime grouping.Input: Greyscale image sequence: I(x, y, t), Means-shift bandwidth parameters: hspace,
htime, and hrange, Mean-shift merging threshold: τ , M : minimum number ofpixels per group
Output: Labeled image sequence denoting the group of each point: L(x, y, t)
Step 1: Compute spacetime oriented energy representation (Sec. 5.2.1)
1. Initialize 3D G2/H2 steerable basis.2. Compute normalized spacetime oriented energy measure, Eqs. (5.1)-(5.3).Step 2: Anisotropic smoothing: Mean-shift (Sec. 5.2.2)
3. Augment each normalized spacetime oriented energy measure, (5.3), with itsspacetime coordinate (x, y, t).
4. Apply mean-shift smoothing iterations, (5.4), to each point in the joint featurespace.
Step 3: Compute spacetime groupings (Sec. 5.2.3)
5. Merge via transitive closure all modes that lie within a distance τ from each otherin the joint feature space.
6. (Optional) Remove via merging spacetime regions containing less than M pixels.
the underlying data. Here, it is interesting to note the difference in the behaviour
of flow estimates and the proposed distributed representation across boundaries.
In the former, the results are unpredictable due to a total failure of its intrinsic
assumptions (e.g., brightness conservation). In the latter, due to the consider-
able overlap in spacetime and orientation tuning of the filters, the representa-
tion changes smoothly across structure boundaries reflecting the shift of energies
among channels.
To capture the spatiotemporal contrast in the smoothed oriented energy repre-
sentation, (5.3), a generalized gradient formulation (Kreyszig, 1959) is employed,
as it captures change in a uniform manner across the multiple components of the
decomposition. Let Ek be the kth band of the oriented energy representation,
148
![Page 163: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/163.jpg)
(5.3), following smoothing, (5.4), and ξi = x, y, t for i = 1, 2, 3, respectively, de-
fine the directions that partial derivatives are taken, then the generalized gradient
is a 3× 3 matrix S where
Sij ≡n∑k=1
∂Enk
∂ξi
∂Enk
∂ξj. (5.5)
Notice that S amounts to the summation of the more standard gradient structure
tensor (Jahne, 2005) of each energy band6. The eigenvector of (5.5) associated
with the greatest eigenvalue, λ1, denoted e1, points in the direction of greatest
change in the feature-space. For multivalued images (i.e., n > 1), a boundary
is not indicated simply by a large value for λ1; instead, it must be large relative
to the other eigenvalues of S (Sapiro & Ringach, 1996). Correspondingly, a
normalized measure of spacetime structure boundary salience is employed in the
present context
boundarysalience =λ1 − λ2
λ1 + λ2 + φ, (5.6)
where λ1 > λ2 denote the two largest eigenvalues of S and φ is a constant in-
troduced as a noise floor. High values of the boundary salience measure, (5.6),
(i.e, values close to one), are indicative of the presence of a spacetime struc-
ture boundary. Next, similar to the non-maximum suppression principle used
6Other adaptations of the generalized gradient to multiband image boundary detection in-clude application to colour (Di Zenzo, 1986; Sapiro & Ringach, 1996) and spatial texture(Rubner & Tomasi, 1996).
149
![Page 164: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/164.jpg)
in intensity-based edge detection (Canny, 1986), a candidate boundary point is
defined as a point that achieves a maximum in boundary salience, (5.6), in the
direction of the eigenvector e1, as follow,∂boundarysaliency
∂e1
= 0
∂2boundarysaliency
∂e21
< 0
. (5.7)
Finally, candidate loci having a saliency value greater than a certain threshold,
τ , are marked as boundary points.
To recapitulate, the proposed approach to detecting spacetime structure
boundaries is summarized in algorithmic terms in Algorithm 4.
5.3 Empirical evaluation
5.3.1 York University Vision Lab (YUVL) Spacetime Structure Data
Set
Current available data sets, e.g., (Stein & Hebert, 2009a), are limited by their
focus on motion boundaries at the expense of more general spacetime struc-
tural groupings and boundaries. For the purpose of quantitatively evaluating
the proposed grouping and boundary detection approaches, a new data set with
hand-labeled ground truth was collected containing a broad range of spacetime
150
![Page 165: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/165.jpg)
Algorithm 4: Spacetime structure boundary detection.Input: Greyscale image sequence: I(x, y, t), Means-shift bandwidth parameters: hspace,
htime, and hrange, Boundary detection threshold: τOutput: Binary image sequence marking spatiotemporal structure boundaries:
B(x, y, t)
Step 1: Compute spacetime oriented energy representation (Sec. 5.2.1)
1. Initialize 3D G2/H2 steerable basis.2. Compute normalized spacetime oriented energy measure, Eqs. (5.1)-(5.3).Step 2: Anisotropic smoothing: Mean-shift (Sec. 5.2.2)
3. Augment each normalized spacetime oriented energy measure, (5.3), with itsspacetime coordinate (x, y, t).
4. Apply mean-shift smoothing iterations, (5.4), to each point in the joint featurespace.
5. Replace each energy measure in (5.3) with the final converged energy measure.Step 3: Compute spatiotemporal structure boundary salience (Sec. 5.2.4)
for each spacetime point6. Construct generalized gradient, (5.5), from (smoothed) oriented energy
representation, (5.3).7. Compute the eigenvector/eigenvalues of the generalized gradient, (5.5).8. Compute boundary salience, (5.6).
Step 4: Non-maximum suppression (Sec. 5.2.4)
for each spacetime point9. Apply non-maximum suppression, (5.7).
10. Retain candidate boundaries that have a saliency value, (5.6), greater than τ .
structures,7 including but not restricted to motion. Each sequence spans 10
frames. Example frames from the data set are shown in Fig. 5.2 (see caption for
description of inputs). The challenging aspects of this data set include, regions
that are unstructured, exhibit significant temporal aliasing due to fast motion,
contain superimposed motion and non-motion structure (e.g., transparency and
scintillation). These sequences, consisting of juxtaposed natural and man-made
structures, were obtained from a variety of sources: a Canon HF10 camcorder,
7Available at: http://www.cse.yorku.ca/vision/research/spacetime-grouping/
151
![Page 166: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/166.jpg)
the BBC documentary “Planet Earth” (Fothergill, 2007) and the “BBC Motion
Gallery” online video repository (www.bbcmotiongallery.com).
In addition, a synthetic image sequence was constructed that contains three
distinct juxtaposed spacetime structures consisting of upward motion, a scintil-
lating pattern in the form of pink noise and transparency in the form of left/right
superimposed motion (see Fig. 5.3). The upward moving region contains three
subregions that differ only in spatial appearance and therefore should be grouped
together given their common dynamic structure (i.e., upward motion).
152
![Page 167: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/167.jpg)
Ground TruthInput
(a)
x
y
t
(b)
Input
(f)
(g) (h) (i)
(j)
Ground Truth
(c)
(d) (e)
Input Ground Truth
Figure 5.2: Sample frames and corresponding ground truth boundaries from theYUVL Spacetime Structure Data Set. In each example, the input sequence anda frame from the human-labeled ground truth, respectively, are given. (a) Apanning sequence consisting of a clear sky (i.e., unstructured) and a building(source: HF10). (b) Motion parallax sequence consisting of two mountain faces,where the foreground surface moves rapidly revealing a slower moving surface(source: “Planet Earth”). (c) Tree in foreground being coarsely stabilized bymoving camera operator with resulting background motion (source: HF10). Thebackground consisting of the ground plane is not fronto-parallel with respect tothe camera, as a result the motion varies across the surface. (d) A leopard rapidlymoving leftward behind a static tree (source: “Planet Earth”). (e) A flying birdcrudely tracked by the camera operator to yield a slow moving target and arapidly moving background (source: “Planet Earth”). (f) A ship moving overa scintillating water surface (source: “BBC Motion Gallery”). (g) A paintinghanging on an unstructured wall with a light flickering in an adjacent hallway(source: HF10). (h) A translucency sequence realized by projecting (using anLCD projector) a walking person over a static painting (source: HF10). (i)A pseudo-transparency sequence consisting of a person walking behind a fence(source: HF10). (j) A juxtaposed motion and pseudo-transparency sequenceconsisting of two people moving rightward, one moving in front of a fence whilethe second is moving behind it (source: HF10).
153
![Page 168: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/168.jpg)
(a)
x
y
t
(b)
(a.1) (a.2) (a.3) (a.4) (a.5)
Figure 5.3: Synthetic juxtaposed spacetime structures. (a) Synthetic input imagesequence. (a.1) Bandpass white noise with spatial horizontal/vertical structureenhanced moving upward, (a.2) same pattern as previous with contrast increasedby 25%, (a.3) bandpass white noise with spatial diagonal structures enhancedmoving upward, (a.4) spacetime pink noise (scintillation) and (a.5) superim-posed pink noise patterns moving rightward/leftward (transparency). All motion-related structures move with a speed of 1 pixel/frame. (b) Spacetime groupingground truth with shadings denoting coherent groupings.
154
![Page 169: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/169.jpg)
5.3.2 Spacetime grouping
A variety of grouping experiments have been performed on a set of synthetic
and real-world image sequences. Algorithm parameter settings are as follows.
Mean-shift clustering has four input parameters: the bandwidths, hspace, htime
and hrange, which determine the resolution of mode detection along the spatial,
temporal and range (here, spacetime orientation) dimensions, respectively, and
the merging threshold, τ , which determines the distance between points that are
grouped via transitive closure in the 9D feature space. Unless otherwise stated,
the mean-shift bandwidth parameters for the space, time and range dimensions
are set to hspace = 32, htime = 10, and hrange = 0.12, respectively. Groupings
containing less than 40 pixels/frame are considered outliers and removed via
merging. Videos of all grouping results are available at: http://www.cse.yorku.
ca/vision/research/spacetime-grouping.
5.3.2.1 Representation comparisons
Figure 5.4 shows a comparison of grouping results on the YUVL synthetic se-
quence based on the proposed representation and the following alternative rep-
resentations: raw-pixelwise intensity, thresholded frame-to-frame intensity differ-
ence, spacetime gradient, normalized oriented energy (5.3) without appearance
155
![Page 170: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/170.jpg)
(e)
(g)
(b)
(f)
(c) (d)
(a)
(h)
Figure 5.4: Grouping with various levels of data abstraction. (a) Ideal groupingresult with shadings denoting coherent groupings. (b)-(h) Grouping results basedon: (b) proposed representation, (c) optical flow, (d) normal flow (e) normalizedoriented energy (5.3) without appearance marginalization (5.2), (f) spacetimegradient, (g) thresholded frame-to-frame intensity difference, and (h) raw pixel-wise intensity; white denotes outlier components. For each representation, themean-shift parameters were empirically optimized.
marginalization (5.2), normal flow and optical flow estimated as in (Granlund
& Knutsson, 1995). All representations are grouped using the mean-shift proce-
dure outlined in Sections 5.2.2 and 5.2.3, by exchanging the energy-based range
component of the feature vector with the given representation. Note that these
representations constitute commonly considered sample points along the level of
commitment axis shown in Fig. 3.1.
Representations that do not adequately discount pure spatial appearance
(normalized oriented energy (5.3) without appearance marginalization (5.2), spa-
156
![Page 171: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/171.jpg)
tiotemporal gradient, normal flow and raw pixel-wise intensity) correspondingly
fail to support grouping of the coherently moving region as such. Successive
frame differencing fails in making the pattern distinctions sought because it is
limited to making binary distinctions between dynamic and stationary patterns.
All of these approaches also fail to group successfully the transparency and scin-
tillation cases. While optical flow proves capable of supporting correct grouping
of the coherently upward moving regions, it is incapable of dealing with scintil-
lating and transparency patterns due to violations of the underlying assumption
of brightness conservation. Among these abstractions, only the proposed dis-
tributed representation successfully supports the partition of the input into its
three constituent structures, coherent motion, scintillation and superimposed mo-
tion. Owing to the superior performance of the proposed approach relative to the
considered alternatives, only the proposed approach will be considered further in
the subsequent spacetime grouping experiment.
157
![Page 172: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/172.jpg)
5.3.2.2 Quantitative results on natural imagery
Figure 5.5 shows a set of successful grouping results on the YUVL data set using
the proposed grouping approach. Beginning with (a), the clear sky (unstructured)
is correctly delineated from the building (moving structure). The unstructured
region represents a situation where there is insufficient information to determine
flow. In (b), although the spatial appearance alone is insufficient to make the
surface distinctions, the two surfaces are successfully partitioned based on their
spacetime structure signatures. In (c), the stabilized foreground (static) is clearly
delineated from the rightward moving background. In (d), although significant
temporal aliasing is present due to the rapid motion of the leopard, the portion
attributed to the leopard is successfully delineated from the static tree. An op-
tical flow abstraction, although feasible, would require preprocessing the input
with a stabilization procedure, e.g., pyramid processing (Jahne, 2005), to bring
the speed of the leopard within its operating speed range (∼1 pixel/frame). In
(e), similar to (d), correct delineation is achieved in the presence of significant
temporal aliasing. In (f), the sequence is successfully partitioned into two co-
herent regions, one consisting of a leftward motion while the second consisting
of a scintillating region. Inspection of the oriented energy representation reveals
a fairly uniform energy distribution at each point consistent with scintillation.
158
![Page 173: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/173.jpg)
Spacetime Grouping
Ground TruthInput
(a)
(b)
(c)
Input
(f)
(g)
(h)
(d)
(e)
(i)
(j)
x
y
t
Spacetime Grouping
Ground Truth
Figure 5.5: Successful grouping results on the YUVL Spacetime Structure DataSet. In each example, the input sequence, a frame from the human-labeled groundtruth and grouping result, respectively, are given.
Since the assumption of brightness conservation is clearly violated, optical flow
would be inappropriate here. From a semantic viewpoint, the water group is
consistent with one’s impression that some coherent process is being imaged.
In (g), the recovered dominant coherent structures consist of unstructured (the
wall), static (the painting) and flicker (the hallway). The painting is partially
159
![Page 174: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/174.jpg)
over-segmented along the boundaries due to the strong one-dimensional structure
along the painting’s boundaries that induce an oriented energy signature consis-
tent with the aperture problem. At this level of analysis, this is a reasonable
interpretation/grouping: The regions near the border of the painting against the
unstructured wall are uniformly consistent as being characterized by the aperture
problem. In (h), the translucent region (static background with superimposed
moving person) is successfully distinguished from the purely static background,
although the static background is slightly over-segmented due to low texture re-
gions. Standard optical flow is inappropriate in the translucency region since it
cannot be described by a single motion. In (i), the sequence is partitioned into
the two constituent coherent structures: static structure vs. superimposed static
and leftward motion (i.e., pseudo-transparency). Finally in (j), the sequence is
partitioned into the two constituent coherent structures: rightward motion vs.
superimposed static and rightward motion (i.e., pseudo-transparency). Notice,
even with an oriented structure mutually shared in both patterns (rightward
motion), successful segregation is achieved on the basis of the additional static
orientation in one of the patterns due to the fence.
The grouping results in Fig. 5.5 provide compelling qualitative evidence that
the proposed approach to grouping performs well on image sequences contain-
160
![Page 175: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/175.jpg)
Figure 5.6: Spacetime grouping precision-recall curves. Points along the curvecorrespond to variations of the range bandwidth within [0.04, 0.18].
ing a variety of spatiotemporal structures. To quantify performance, results are
compared with frame-by-frame human labeled (spatial) boundary ground truth
on the YUVL data set. Following a standard evaluation methodology for image
segmentation (Estrada & Jepson, 2009), mean precision-recall scores were cal-
culated across all the image sequences illustrated in Fig. 5.2 and are shown as
tuning curves in Fig. 5.6. These curves characterize the grouping performance
over a range of the input parameters. Precision is defined as the percentage of
detected boundary pixels in the recovered grouping8 that correspond to ground
8Boundaries in the labeled groupings are identified as any pixel that has at least a single
161
![Page 176: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/176.jpg)
truth boundary pixels:
Precision =Matched(Sr, Sg)
|Sr|, (5.8)
where Sr and Sg represent the grouping result and ground truth, respectively,
Matched(Sr, Sg) denotes the number of boundary pixels in Sr that were matched
in Sg within a neighbourhood of radius ρ and | · | denotes the cardinality of the set
of boundary pixels. Recall is defined as the percentage of ground truth boundary
pixels that were detected in the recovered grouping:
Recall =Matched(Sg, Sr)
|Sg|. (5.9)
Over-segmentation is characterized in the curves by high recall but low precision,
and the converse holds for under-segmented images.
For mean-shift, there are four input parameters, the bandwidths,
hspace, htime, hrange, and the merging threshold, τ . Through preliminary experi-
mentation it was found that the largest differences between grouping results are
realized when varying the range bandwidth and merging threshold parameters.
In Fig. 5.6, points along the curve correspond to varying the range bandwidth,
hrange, within [0.04, 0.18] while fixing the spatial and temporal bandwidths to
(hspace, htime) = (32, 10) and merging threshold to τ ∈ 0.05, 0.10. Matching was
neighbour with a differing label. This is followed by a thinning procedure to realize 1 pixelwidth boundaries.
162
![Page 177: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/177.jpg)
Figure 5.7: Grouping on a (synthetic) smoothly varying spacetime structure.(left) Input consisting of a counter-clockwise rotating pink noise disc over astationary pink noise background. (right) A sample frame from the successfulbipartite grouping result.
carried out using a distance threshold ρ = 8, which is reasonable given that the
oriented energy filtering support also spans eight pixels. The consistently high re-
call indicates that the recovered groupings show little tendency to under-segment
and thus one can conclude that the combination of representation and grouping
mechanism capture salient image structure. At the same time, a relatively high
precision is attained, which indicates that significant over-segmentation is not
prevalent. Note that dips in the precision-recall curves, such as the prominent
one in the curve related to τ = 0.10, are to be expected, given the relatively small
amount of test data used.
163
![Page 178: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/178.jpg)
5.3.2.3 Additional examples
Figure 5.7 illustrates a successful grouping result on a synthetic input sequence
consisting of a counter-clockwise rotating pink noise disc over a stationary pink
noise background. This example indicates the approach’s ability to recover group-
ings in cases where spacetime structure varies smoothly across spacetime. A sim-
ilar example has been successfully demonstrated using the tensor voting frame-
work (Nicolescu & Medioni, 2003); however, that approach relies on flow, while
the examples of Fig. 5.4 show that flow yields difficulties in situations involving
more complicated patterns (e.g., transparency and scintillation). Significantly,
in the present framework the cases of grouping smoothly varying flow as well as
non-motion dynamic structure can be uniformly handled.
Figure 5.8 shows a successful grouping result on a natural input sequence,
where spacetime structure changes over time, as a foreground target initially at
rest goes into motion against a static background. This example demonstrates the
approach’s ability to deal with both spatial and temporally juxtaposed structure
in a uniform manner.
Figures 5.9 and 5.10 show successful grouping results on two dynamic texture
examples previously used in (Chan & Vasconcelos, 2008) to evaluate their pro-
posed approach to dynamic texture grouping. In the case of the dynamic texture
164
![Page 179: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/179.jpg)
Figure 5.8: Grouping of a target initially at rest, then in motion. (left-to-right,top-to-bottom) Input frame of initially static scene; input frame taken after theonset of motion; grouping result of initial frame shows no foreground/backgrounddelineation as person and surround are both static; grouping result as person is inmotion. Correct grouping is maintained throughout person’s subsequent motion.
shown in Fig. 5.9, the two juxtaposed dynamic textures vary in both their spatial
appearance and dynamics, while in Fig. 5.10, the juxtaposed textures share the
same spatial appearance but differ in dynamics. Previously reported results on
these examples require knowledge of the number of textures as well as hand con-
tour initialization (see Fig. 5.10 (c)) and show significant under-segmentation on
the second example (Chan & Vasconcelos, 2008). The present approach does not
require similar a priori knowledge, yet yields a reasonable grouping result for the
165
![Page 180: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/180.jpg)
(a) (b)
Figure 5.9: Grouping result on a set of (synthetically combined) dynamic texturesused in (Chan & Vasconcelos, 2008). (a) Input frame of two widely differing dy-namic textures (turbulent water and rising bubbles). (b) Corresponding groupingresults using the proposed approach.
first example and a superior result in the second example as compared to (Chan
& Vasconcelos, 2008) (see Fig. 5.10 (d)).
166
![Page 181: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/181.jpg)
(a) (b)
(c) (d)
Figure 5.10: Grouping on a set of (synthetically combined) dynamic textures usedin (Chan & Vasconcelos, 2008). (a) Input frame consisting of two instances ofturbulent water moving left and right, respectively, both textures share the samespatial appearance but differ in image dynamics. (b) Corresponding groupingresult using the proposed approach. (c) A video frame and the initial contourfor the approach reported in (Chan & Vasconcelos, 2008). (d) Correspondinggrouping result reported in (Chan & Vasconcelos, 2008). (c) and (d) are reprintedfrom (Chan & Vasconcelos, 2008) with permission (© 2008 IEEE).
167
![Page 182: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/182.jpg)
5.3.3 Spacetime structure boundary detection
In the spacetime boundary detection evaluation, parameter settings for the pro-
posed detector are as follows. The noise floor, φ, for boundary salience, (5.6),
empirically has been set to φ = 0.01. Mean-shift (anisotropic) smoothing includes
three bandwidth parameters, hspace, htime and hrange, which determine the resolu-
tion of detail along the spatial, temporal and range (here, spacetime orientation)
dimensions, respectively. Unless otherwise stated, the mean-shift bandwidths are
set to: hspace = 32, htime = 10, and hrange = 0.12. Videos of all boundary de-
tection results are available at: http://www.cse.yorku.ca/vision/research/
spacetime-grouping.
Figure 5.11 shows a set of successful boundary detection results on the YUVL
data set using the proposed boundary detection approach. To quantify per-
formance, results of the proposed detector are compared with the hand-labeled
ground truth as well as alternative approaches on the YUVL data set. Mean
precision-recall scores were calculated and are shown as tuning curves in Fig.
5.12 as detection parameters are varied. Again, over-partitioning is character-
ized in the curves by high recall but low precision, and the converse holds for
under-partitioned image sequences.
Figure 5.12 (a) shows several different curves for the proposed method, with
168
![Page 183: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/183.jpg)
each curve corresponding to a different value of the smoothing parameter, hrange;
all curves are swept as the detection threshold varies from 0 − 1. Matching
between ground truth and identified boundary points was carried out using a
distance threshold of eight, which is reasonable given that the support of the
various compared detectors span approximately eight pixels. The consistently
high recall indicates that ground truth boundaries are accurately marked. At the
same time, a relatively high precision is attained, which indicates false boundaries
are not prevalent. Further, the approach is seen to be stable with respect to
variation of the smoothing parameter.
Figure 5.12 (b) compares the best curve of the proposed approach, hrange =
0.14, with two alternative methods: (i) edge detection on dense optical flow fields
(Thompson et al., 1985) (implemented as a 3D Canny edge operator (Canny,
1986) applied to flow recovered using the Lucas-Kanade algorithm (Lucas &
Kanade, 1981)) and (ii) the rank-based method that analyzes the gradient struc-
ture tensor over a neighbourhood (Feldman & Weinshall, 2008); the reader is
referred to Appendix D.2 and D.3 for details of these approaches. These meth-
ods are selected for comparison as they are local (like the proposed method) and
edge detection in flow fields is a long standing approach, while the rank-based
analysis is a recent proposal that has shown strong results for certain boundary
169
![Page 184: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/184.jpg)
types. Tuning curves were swept for the flow- and rank-based detectors by vary-
ing their detection thresholds from 0 − 10 and 0 − 1, respectively. Curves for
all three methods have the expected shape; however, flow- and rank-based are
translated along the precision axis, which indicates significant over-partitioning
relative to the proposed approach. Along these lines, rank-based outperforms
flow, but is still noticeably worse than the proposed method.
To scrutinize the results in Fig. 5.12 further, Fig. 5.13 shows a comparison
of the various boundary detectors on selected examples from Fig. 5.11 (c), (f)
and (i). Also to compare against global methods, results from a recent level-set
formulation are shown (Cremers & Soatto, 2005). Note that the global method
must be supplied with a priori knowledge of the number of regions and hand
initialization of its boundary9. For the motion parallax example, all of the al-
ternative methods yield reasonable results. This is to be expected, as they are
designed for motion boundaries. In the other two examples, the flow and rank
methods yield spurious boundaries in the transparency and scintillation regions.
This shortcoming arises from the inability of these methods to recover coherent
measurements in non-coherent motion regions, as the assumption that coherency
9Due to the dependence of the extracted boundaries on hand contour initialization, numberof regions and various scaling parameters in the level-set approach, it is not customary to sweepprecision-recall curves for level sets; hence, only qualitative comparisons are provided.
170
![Page 185: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/185.jpg)
is well characterized by a single smoothly varying flow is violated. These spurious
boundaries are the source of the low precision yet high recall rates indicated for
the flow- and rank-based detectors in Fig. 5.12. For the transparency case using
level sets, the part of the initial contour that is outside the moving target evolves
correctly; however, the part initialized within the moving region converges incor-
rectly, as the target interior does not conform to the method’s assumption of a
single smooth flow. In the scintillation case, the level-set collapses to a single
region. Here, the failure is due to the relative lack of spatial structure in the ship
interior, which allows the approach to fit a flow across the ship that is consis-
tent with whatever flow it (erroneously) recovers for the scintillating water. This
highlights the difficulties (discussed earlier in Section 5.2) with approaches that
attempt to fill-in structure in regions devoid of it. The relative lack of structure
in the ship interior also accounts for the apparent difference in performance of
the flow and rank methods in such regions: Flow recovers highly variable vector
fields that are interpreted as boundaries; whereas, unstructured regions are rank
consistent and thus do not yield spurious boundaries. In contrast to the alterna-
tives, the proposed detector naturally handles all three cases highlighted in Fig.
5.13.
171
![Page 186: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/186.jpg)
Detected Boundaries
Ground TruthInput
(a)
(b)
(c)
Input
(f)
(g)
(h)
(d)
(e)
(i)
(j)
x
y
t
Detected Boundaries
Ground Truth
Figure 5.11: Successful boundary detection results on the YUVL SpacetimeStructure Data Set. In each example, the input sequence and a frame from thehuman-labeled ground truth and the boundary detection result, respectively, aregiven.
172
![Page 187: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/187.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Precision
Rec
all
hrange = 0.08hrange = 0.10hrange = 0.12hrange = 0.14
(a)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Precision
Rec
all
rank-based detectoroptical flow edgesproposed detector (hrange = 0.14)
(b)
Figure 5.12: Local boundary detection precision-recall curves. (a) Precision-recall of the proposed detector, each curve corresponds to a different setting ofthe range bandwidth used for smoothing. (b) Comparison of precision-recall withthe proposed, flow- and rank-based detectors.
173
![Page 188: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/188.jpg)
scin
tilla
tion
mot
ion
para
llax
trans
pare
ncy
ground truthreference frame optical flow result
rank-basedresult
proposed detectorresult
level setresult
Figure 5.13: Comparison of results for flow, rank, level-set and proposed ap-proaches to boundary detection applied to Fig. 5.11 (c), (f) and (i). For levelsets, red and black curves show hand initialized and converged result, respectively.For the scintillation example the level set collapses to yield a single region.
174
![Page 189: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/189.jpg)
5.4 Discussion and summary
In contrast to the main theme of representation with a distributed representation
of spatiotemporal orientation and subsequent segregation of spacetime informa-
tion, the specific details of the implementations are less important. In terms
of grouping, mean-shift can be replaced by one of the alternatives cited in Sec-
tion 5.1.2.1; whereas, for boundary analysis, a local patch-based formulation, cf.
(Spoerri & Ullman, 1987; Stein & Hebert, 2009a), or a global method, cf. (Rous-
son et al., 2003), operating over the proposed representation, may prove useful.
Nevertheless, it is promising that the straightforward use of a popular grouping
mechanism and the standard generalized gradient operating over the proposed
representation have worked on a variety of natural image sequences.
Most previous methods for spatiotemporal grouping and boundary detection
are concerned with regions of contrasting optical flow. Others are focused on
juxtaposed dynamic textures. Improvements to these various methods might be
realized via introduction of thresholds (e.g., confidence measures), multi-scale
analyses (e.g., pyramid schemes for accommodating rapid motion), contour com-
pletion, cf. (Feldman & Weinshall, 2008), a more sophisticated flow estimator
than considered here, etc. These approaches, however, fundamentally are lim-
ited by their underlying assumptions regarding the classes of visual phenomena
175
![Page 190: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/190.jpg)
that are to be encountered, which in turn limit their applicability to detecting
a very circumscribed class of boundaries (e.g., motion). In comparison, it has
been demonstrated that the proposed approach can naturally deal with the wide
variety of natural world scenarios presented.
A major weakness of the proposed segregation approaches are their behaviour
along segment boundaries. Specifically, since the initial filter measurements are
taken over a region, the filter outputs over the neighbourhood of segment bound-
aries can significantly differ from the outputs of the adjoining regions due to the
mixing of structures; this is an issue shared by all segregation approaches rely-
ing on filter outputs, be it spatial or spatiotemporal. Similarly, pure intensity
boundaries may lead to significant filter responses along the boundary (Malik
et al., 2001). In both cases, the result may be inaccuracies in localizing bound-
aries and at worst hallucinating segments altogether. This issue is ameliorated
to some degree through the application of mean-shift smoothing to the initial
filter outputs, a form of adaptive scale selection; see (Malik et al., 2001; Konishi
et al., 2003; Martin et al., 2004; Zalesny et al., 2005; Sharon et al., 2006; Stein &
Hebert, 2007) for other approaches that rely on cue integration (e.g., colour) in
the segregation process.
In summary, this chapter has presented the first unified approach to repre-
176
![Page 191: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/191.jpg)
senting and segregating a wide range of juxtaposed spacetime patterns (motion,
static, flicker, (pseudo-) transparency, translucency, scintillation/temporal tex-
ture, unstructured). The approach is founded on the distributed characterization
of visual spacetime in terms of 3D, (x, y, t), spatiotemporal orientation, discussed
in Chapter 3. This representation converts many salient spacetime structure
differences into explicit intensity differences within components of the derived
oriented energy volumes. Thus, the grouping and local boundary detection anal-
yses simply reduce to determining significant intensity differences in spacetime.
This approach avoids the necessity of introducing sophisticated specialized pro-
cesses for the analysis of spacetime structure and unifies the operations dedicated
to early visual segregation processing (e.g., intensity, colour, texture, motion and
spatiotemporal structure). Empirical evaluation on a wide variety of imagery
demonstrates the approaches’ ability to parse spacetime imagery into coherently
structured regions. The delineated coherent regions, in conjunction with the un-
derlying representation of spatiotemporal orientation, can serve as building blocks
for further organization and interpretation of visual spatiotemporal information.
177
![Page 192: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/192.jpg)
Chapter 6
Action Spotting
I have always thought the actions of men the best inter-preters of their thoughts.
JOHN LOCKE
6.1 Introduction
6.1.1 Motivation
Over the past decade, a rapid proliferation of digital video content has occurred
and continues unabated. This is due to the availability of low cost capture de-
vices (e.g., camcorders and camera-equipped cell phones), increases in processing
power and large amounts of local and practically infinite network-based storage.
A key bottleneck in the use of video is that current retrieval systems organize
178
![Page 193: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/193.jpg)
data largely around text descriptions provided by a human annotator. Beyond
the labor-intensive, time-consuming and error-prone nature of this process when
applied to ever growing amounts of video data, another major issue is that the
annotator cannot anticipate all queries users will pose. These observations moti-
vate the need for automated organization tools that use the information inherent
in the visual data stream, rather than rely solely on manual annotations. This
chapter addresses the challenge of detecting and localizing spacetime patterns,
as represented by a single query video, in a reference video database. Specif-
ically, patterns of current concern are those induced by human actions. This
problem is referred to as “action spotting”(cf. keyword spotting in speech recog-
nition). The term “action” refers to a simple dynamic pattern executed by an
actor over a short duration of time (e.g., walking and hand waving). Action
spotting attempts to detect and spatiotemporally localize a given set of actions
within a video that may contain a large corpus of unknown actions. Most current
action-related approaches have targeted the problem of classification, where each
video segment is assigned to a small fixed set of predefined actions (typically
less than 10). Localization in these approaches, if addressed at all, has been
generally limited to fairly contrived scenarios. In addition to video indexing and
browsing, potential applications of the presented approach include surveillance,
179
![Page 194: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/194.jpg)
visually-guided interfaces and tracking initialization.
A key challenge in action spotting arises from the fact that the same under-
lying pattern dynamics can yield very different image intensities due to spatial
appearance differences, as with changes in clothing and live action versus ani-
mated cartoon content. Another challenge arises in natural imaging conditions
where scene clutter requires the ability to distinguish relevant pattern informa-
tion from distractions. Clutter can be of two types: (i) background clutter arises
when actions are depicted in front of complicated, possibly dynamic, backdrops
and (ii) foreground clutter arises when actions are depicted with distractions su-
perimposed, as with dynamic lighting, pseudo-transparency (e.g., walking behind
a chain-link fence), temporal aliasing and weather effects (e.g., rain and snow). It
is proposed that the choice of representation is key to meeting these challenges:
A representation that is invariant to purely spatial pattern allows actions to be
recognized independent of actor appearance; a representation that supports fine
delineations of spacetime structure makes it possible to tease action information
from clutter.
For present purposes, local spatiotemporal orientation is of fundamental de-
scriptive power, as it captures the first-order correlation structure of the data irre-
spective of its origin (i.e., irrespective of the underlying visual phenomena), even
180
![Page 195: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/195.jpg)
while distinguishing a wide range of image dynamics (e.g., single motion, multiple
superimposed motions and temporal flicker). Correspondingly, visual spacetime
will be represented according to its local 3D, (x, y, t), orientation structure: Each
point of spacetime will be associated with a distribution of measurements indi-
cating the relative presence of a particular set of spatiotemporal orientations.
Comparisons in searching are made between these distributions. The spacetime
orientation measurements are provided by the approach to representation devel-
oped earlier in Chapter 3. Figure 6.1 provides an overview of the approach. The
work described in this chapter appears in (Derpanis et al., 2010).
181
![Page 196: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/196.jpg)
(a)Input Videos
……
template (query)
oriented energy volumes
…
(b)Distributed
Spacetime OrientationDecomposition
(c) Sliding Window
Matching: Pointwise Distribution Matching
(d) Similarity Volume
Output
search (database)x
t
y
Figure 6.1: Overview of approach to action spotting. (a) A template (query)and search (database) video serve as input; both the template and search videosdepict the action of “jumping jacks” taken from the Weizmann action data set(Gorelick et al., 2007). (b) Application of spacetime oriented energy filters de-composes the input videos into a distributed representation according to 3D,(x, y, t), spatiotemporal orientation. (c) In a sliding window manner, the distri-bution of oriented energies of the template is compared to the search distributionat corresponding positions to yield the similarity volume given in (d). Finally,significant local maxima in the similarity volume are identified.
182
![Page 197: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/197.jpg)
6.1.2 Related work
A wealth of work has considered the analysis of human actions from visual data
(Gavrila, 1999; Aggarwal & Cai, 1999; Moeslund & Granum, 2001; Wang & Zhu,
2003; Moeslund et al., 2006; Turaga et al., 2008; Poppe, 2010). A brief survey of
representative approaches follows.
Tracking-based methods begin by tracking body parts or joints or both and
classify actions based on features extracted from the motion trajectories, e.g.,
(Yacoob & Black, 1999; Ramanan, 2003; Fanti et al., 2005; Ali et al., 2007).
General impediments to fully automated operation include tracker initialization
and robustness. Consequently, much of this work has been realized with some
degree of human intervention.
Other methods have classified actions based on features extracted from 3D
spacetime body shapes as represented by contours or silhouettes, with the moti-
vation that such representations are robust to spatial appearance details (Bobick
& Davis, 2001; Weinland et al., 2006; Gorelick et al., 2007; Ke et al., 2007; Yil-
maz & Shah, 2008). This class of approach relies on figure-ground segmentation
across spacetime, with the drawback that robust segmentation remains elusive
in uncontrolled settings. Further, silhouettes do not provide information on the
human body limbs when they are in front of the body (i.e., inside silhouette) and
183
![Page 198: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/198.jpg)
thus yield ambiguous information.
Recently, spacetime interest points (Laptev, 2005; Dollar et al., 2005;
Oikonomopoulos et al., 2006; Wong et al., 2007; Willems et al., 2008) have
emerged as a popular means for action classification (Schuldt et al., 2004; Niebles
et al., 2008; Laptev et al., 2008; Liu et al., 2009). Interest points typically are
taken as spacetime loci that exhibit variation along all spatiotemporal dimen-
sions and provide locations where a sparse set of descriptors are extracted to
guide action recognition. Sparsity is appealing as it yields significant reduction
in computational effort; however, interest point detectors often fire erratically
on shadows and highlights (Ke et al., 2005; Apostoloff & Fitzgibbon, 2005), and
along object occluding boundaries, which casts doubt on their applicability to
cluttered natural imagery. Additionally, for actions substantially comprised of
smooth motion, important information is ignored in favour of a small number of
possibly insufficient interest points.
Most closely related to the approach proposed in the present chapter are others
that have considered dense templates of image-based measurements to represent
actions (e.g., optical flow, spatiotemporal gradients and other filter responses se-
lective to both spatial and temporal orientation), typically matched to a video
of interest in a sliding window formulation. Chief advantages of this framework
184
![Page 199: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/199.jpg)
include avoidance of problematic localization, tracking and segmentation prepro-
cessing of the input video; however, such approaches can be computationally
intensive. Further limitations are tied to the particulars of the image measure-
ment used to define the template.
In general, optical flow-based methods, e.g., (Efros et al., 2003; Ke et al.,
2005; Rodriguez et al., 2008), suffer as dense flow estimates are unreliable where
their local single flow assumption does not hold (e.g., along occluding boundaries
and in the presence of foreground clutter). Work using spatiotemporal gradients
has encapsulated the measurements in the gradient structure tensor (Shechtman
& Irani, 2007; Ke et al., 2007; Seo & Milanfar, 2009a). This tack yields a com-
pact way to characterize visual spacetime locally, with template video matches
computed via dimensionality comparisons. However, the compactness also lim-
its its descriptive power: Areas containing two or more orientations in a region
are not readily discriminated, as their dimensionality will be the same; further,
the presence of foreground clutter in a video of interest will contaminate dimen-
sionality measurements to yield match failures. Finally, methods based on filter
responses selective for both spatial and temporal orientation, e.g., (Chomat &
Crowley, 1999; Jhuang et al., 2007; Ning et al., 2009), suffer from their inability
to generalize across differences in spatial appearance (e.g., different clothing) of
185
![Page 200: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/200.jpg)
the same action.
6.1.3 Contributions
In the light of previous work, the major contributions of the present chapter are as
follows. (i) The representation developed in Chapter 3 serves as a novel compact
feature set for action spotting. The representation supports fine delineations of
visual spacetime structure to capture the rich underlying dynamics of an action
from a single query video. (ii) An associated computationally efficient similarity
measure and search method are proposed that leverage the structure of the rep-
resentation. The approach does not require preprocessing in the form of actor
localization, tracking, motion estimation, figure-ground segmentation or learn-
ing. (iii) The approach can accommodate variable appearance of the same action,
rapid dynamics, multiple actions in the field-of-view, cluttered backgrounds and
is resilient to the addition of distracting foreground clutter. While others have
dealt with background clutter, it appears that the present work is the first to
address directly the foreground clutter challenge. (iv) The proposed approach
is demonstrated on a set of challenging natural videos with quantitative results
showing superior performance as compared to two recent approaches.
186
![Page 201: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/201.jpg)
6.2 Technical approach
As discussed in Chapter 2, in visual spacetime the local 3D, (x, y, t), orientation
structure of a pattern captures significant, meaningful aspects of its dynamics.
For action spotting, single motion at a point, e.g., motion of an isolated body
part, is captured as orientation along a particular spacetime direction. Signif-
icantly, more complicated scenarios still give rise to well defined spacetime ori-
entation distributions: Occlusions and multiple motions (e.g., as limbs cross or
foreground clutter intrudes) correspond to multiple orientations; high velocity
and temporal flicker (e.g., as encountered during rapid action executions) cor-
respond to orientations that become orthogonal to the temporal axis. Further,
appropriate definition of local spacetime oriented energy measurements can yield
invariance to purely spatial pattern characteristics and support action spotting
as an actor changes spatial appearance. Based on these observations, the devel-
oped action spotting approach makes use of such measurements as local features
that are combined into spacetime templates that maintain their relative geomet-
ric spacetime positions. In this work, it is assumed that the camera is stationary
in order for the proposed spacetime measurements to capture the dynamics of
the action rather than the camera movement. Empirically the proposed features
have been found to be robust to small amounts of camera movement, such as
187
![Page 202: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/202.jpg)
camera jitter from a handheld video camcorder; however, they are not invariant
to large camera movements. In order to accommodate large camera movements,
a camera stabilization procedure may be introduced as a preprocessing step, e.g.,
(Hansen et al., 1994).
6.2.1 Features: Spacetime orientation
This section presents details of the particular instantiation of the representation
used for action spotting. For the motivations of the various steps, the reader is
referred to the discussion of the general framework in Section 3.2.1 of Chapter 3.
The desired spacetime orientation decomposition is realized using a set of
broadly tuned 3D Gaussian third derivative filters (i.e., N = 3), G3θ(x), with
the unit vector θ capturing the 3D direction of the filter symmetry axis and
x = (x, y, t) spacetime position. The local oriented energy measurements are
given as follows (cf. Eq. (3.1))
Eθ(x) =∑x∈Ω
[G3θ(x) ∗ I(x)]2, (6.1)
where I(x) denotes the input video and ∗ convolution. The space-
time aggregation within the spacetime region, Ω, is implemented
by lowpass filtering the data using the thirteen-tap binomial filter,
188
![Page 203: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/203.jpg)
[1 12 66 220 495 792 924 792 · · · 1
]/4096, in spacetime. With
an eye towards near-to-real-time implementation, here only the G3 filter is used
in order to avoid redundant filtering computations.
In tracking applications it is vital to preserve both the spatial appearance
and dynamic properties of a region of interest, cf. (Cannons et al., 2010). In
contrast, in action spotting one wants to be invariant to appearance, while being
sensitive to dynamic properties. This is necessary in order to detect different
people wearing a variety of clothing as they perform the same action. This
motivates the next step of discounting the spatial orientation component (i.e.,
spatial appearance) through “marginalization”.
Now, energy along a frequency domain plane with normal n and spatial ori-
entation discounted through marginalization, is given by summing across the set
of measurements, Eθi , as (cf. Eq. (3.5))
En(x) =N∑i=0
Eθi(x), (6.2)
with θi one of N + 1 = 4 specified directions, (3.4), and each Eθi calculated
via the oriented energy filtering, (6.1). In the present implementation, six dif-
ferent spacetime orientations are made explicit, corresponding to static (no mo-
tion/orientation orthogonal to the image plane), leftward, rightward, upward,
189
![Page 204: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/204.jpg)
downward motion (one pixel/frame movement), and flicker/infinite motion (ori-
entation orthogonal to the temporal axis); although, due to the broad tuning of
the filters employed, responses arise to a range of orientations about the peak
tunings. Here, only six spacetime orientations were considered to reduce com-
putational overhead. Informally, empirical evaluation was also conducted with
the same 27 spacetime orientations used for spacetime texture recognition (see
Section 4.2.2 in Chapter 4) and found to yield negligible action spotting improve-
ment.
Finally, the energy measures are normalized by the sum of consort planar
energy responses at each point (cf. Eq. (3.6)),
Eni(x) =Eni(x)
M∑j=1
Enj(x) + ε
, (6.3)
where M = 6 denotes the number of spacetime orientations considered and ε is
the scalar constant noise floor. Empirically, ε has been set to ∼1% of the maxi-
mum expected response. As applied to the six oriented, appearance marginalized
energy measurements, (6.2), Eq. (6.3) produces a corresponding set of six normal-
ized, marginalized oriented energy measurements. To this set is added a seventh
measurement that explicitly captures lack of structure via normalized ε (cf. Eq.
190
![Page 205: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/205.jpg)
(3.7)),
Eε(x) =ε
M∑j=1
Enj(x) + ε
, (6.4)
to yield a seven dimensional feature vector at each point in the image data.
In addition, to the general representation attributes discussed in Section 3.2.2
of Chapter 3, the representation enjoys an additional two attributes in its current
usage worth emphasizing:
1. Owing to the oriented energies being defined over a spatiotemporal support
region, (6.1), the representation can deal with input data that are not
exactly spatiotemporally aligned.
2. Owing to the distributed nature of the representation, foreground clutter
can be accommodated: Both the desirable action pattern structure and the
undesirable clutter structure can be captured jointly so that the desirable
components remain available for matching even in the presence of clutter.
6.2.2 Spacetime template matching
To detect actions (as defined by a small template video) in a larger search video,
the search video is scanned over all spacetime positions by sliding a 3D tem-
plate over every spacetime position. At each position, the similarity between the
191
![Page 206: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/206.jpg)
oriented energy distributions (histograms) at the corresponding positions of the
template and search volumes are computed.
To obtain a global match measure, Γ(x), between the template and search
videos at each image position, x, of the search volume, the individual histogram
similarity measurements are summed across the template:
Γ(x) =∑u
γ[S(u),T(u− x)], (6.5)
where u = (u, v, w) ranges over the spacetime support of the template volume
and γ[S(u),T(u−x)] is the similarity between local distributions of the template,
T, and the search, S, volumes. The peaks of the global similarity measure across
the search volume represent potential match locations.
There are several histogram similarity measures that could be used (Rub-
ner et al., 2001). Here, the Bhattacharyya coefficient (Bhattacharyya, 1943) is
used, as it takes into account the summed unity structure of distributions (unlike
Lp-based match measures) and yields to efficient implementation (see Section
6.2.2.2). The Bhattacharyya coefficient for two histograms P and Q, each with
B bins, is defined as
γ(P,Q) =B∑b=1
√PbQb, (6.6)
with b the bin index. This measure is bounded below by zero and above by one
(Comaniciu et al., 2003), with zero indicating a complete mismatch, intermediate
192
![Page 207: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/207.jpg)
values indicating greater similarity and one complete agreement. Significantly,
the bounded nature of the Bhattacharyya coefficient makes it robust to small
outliers (e.g., as might arise during occlusion in the present application).
The final step consists of identifying peaks in the similarity volume, Γ, where
peaks correspond to volumetric regions in the search volume that match closely
with the template dynamics. The local maxima in the volume are identified in
an iterative manner to avoid multiple detections around peaks: In the spacetime
volumetric region about each peak the match score is suppressed (set to zero);
this non-maxima suppression process repeats until all remaining match scores are
below a threshold, τ . In the experiments, the volumetric region of the template
centered at the peak is used for suppression.
6.2.2.1 Weighting template contributions
Depending on the task, it may be desirable to weight the contribution of various
regions in the template differently. For example, one may want to emphasize
certain spatial regions and/or frames in the template. This can be accommodated
with the following modification to the global match measure:
Γ(x) =∑u
w(u)γ[S(u),T(u− x)], (6.7)
193
![Page 208: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/208.jpg)
where w denotes the weighting function. In some scenarios, it may also be desired
to emphasize the contribution of certain dynamics in the template over others.
For example, one may want to emphasize the dynamic over the unstructured and
static information. This can be done by setting the weight in the match mea-
sure, (6.7), to w = 1 − (Eε + Estatic), with Estatic the oriented energy measure,
(6.3), corresponding to static, i.e., non-moving/zero-velocity, structure and Eε
capturing local lack of structure, (6.4). An advantage of the developed represen-
tation is that it makes these types of semantically meaningful dynamics directly
accessible.
6.2.2.2 Efficient matching
For efficient search, one could resort to: (i) spatiotemporal coarse-to-fine search
using spacetime pyramids (Shechtman & Irani, 2007), (ii) evaluation of the tem-
plate on a coarser sampling of positions in the search volume, (iii) evaluation of a
subset of distributions in the template and (iv) early termination of match com-
putation (Barnea & Silverman, 1972). A drawback of these optimizations is that
the target may be missed entirely. In this section, it is shown that exhaustive
computation of the search measure, (6.5), can be realized in a computationally
efficient manner.
194
![Page 209: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/209.jpg)
Inserting the Bhattacharyya coefficient, (6.6), into the global match measure,
(6.5), and reorganizing by swapping the spacetime and bin summation orders
reveals that the expression is equivalent to the sum of cross-correlations between
the individual bin volumes:
Γ(x) =∑b
∑u
√Sb(u)
√Tb(u− x) =
∑b
√Sb ?
√Tb, (6.8)
with ? denoting cross-correlation, b indexing histogram bins and u = (u, v, w)
ranging over template support.
Consequently, the correlation surface can be computed efficiently in the
frequency domain using the convolution theorem of the Fourier transform
(Bracewell, 2000), where the expensive correlation operations in spacetime are
exchanged for relatively inexpensive pointwise multiplications in the frequency
domain:
Γ(x) = F−1
∑b
F√SbF
√T ′b
, (6.9)
with F· and F−1· denoting the Fourier transform and its inverse, respectively,
and T ′b the reflected template. In implementation, the Fourier transforms are
realized efficiently by the fast Fourier transform (FFT).
195
![Page 210: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/210.jpg)
6.2.3 Computational complexity analysis
Let WT,S, HT,S, DT,S be the width, height and temporal duration, respec-
tively, of the template, T, and the search video, S, and B denote the number of
spacetime orientation histogram bins. The complexity of the correlation-based
scheme in the spacetime domain, (6.8), is O(B∏
i∈T,SWiHiDi). In the case
of the frequency domain-based correlation, (6.9), the 3D FFT can be realized
efficiently by a set of 1D FFTs due to the separability of the kernel (Jahne,
2005). The computational complexity of the frequency domain-based correlation
is O[BWSHSDS(log2DS + log2WS + log2HS)].
In practice, the overall runtime to compute the entire match volume between
a 50 × 25 × 20 template and a 144 × 180 × 200 search video with six spacetime
orientations and ε is 20 seconds (i.e., 10 frames/sec) when computed using the
frequency-based scheme, (6.9), with the computation of the representation (Sec.
6.2.1) taking up 16 seconds of the total time. In contrast, the search in the space-
time domain, (6.8), takes 26 minutes. These timings are based on unoptimized
Matlab code executing on a 2.3 GHz processor. In comparison, using the same
sized input and a Pentium 3.0 GHz processor, (Shechtman & Irani, 2007) report
that their approach takes 30 minutes for exhaustive search.
Depending on the target application, additional savings of the proposed ap-
196
![Page 211: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/211.jpg)
Algorithm 5: Action spotting.Input: T : Query video, S: Search video, τ : Similarity thresholdOutput: Γ: Similarity volume, d: Set of bounding volumes of detected action
Step 1: Compute spacetime oriented energy representation (Sec. 6.2.1)
1. Initialize 3D G3 steerable basis.2. Compute normalized spacetime oriented energies for T and S, Eq. (6.1) - (6.4).
Step 2: Spacetime template matching (Sec. 6.2.2, 6.2.2.1 and 6.2.2.2)
3. (Optional) Weight template representation.4. Compute similarity volume, Γ, between T and S (Eq. 6.9).
Step 3: Find similarity peaks (Sec. 6.2.2)
5. Find global maximum in Γ.6. Suppress values around the maximum (set to zero within template volume).7. Repeat 5. and 6. until remaining match scores are below τ .8. Centre bounding boxes, d, with size of template at identified maxima.
proach can be achieved by precomputing the search target representation off-line.
Also, since the representation construction and matching are highly paralleliz-
able, real to near-real-time performance is anticipated through the use of widely
available hardware and instruction sets, e.g., multicore CPUs, GPUs and SIMD
instruction sets.
To recapitulate, the proposed approach to action spotting is given in algorith-
mic terms in Algorithm 5.
6.3 Empirical evaluation
The performance of the proposed action spotting algorithm has been evaluated
on an illustrative set of test sequences. Matches are represented as a series
of (spatial) bounding boxes that spatiotemporally outline the action. Unless
197
![Page 212: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/212.jpg)
otherwise stated, the templates are weighted by 1 − (Eε + Estatic), see (6.7),
to emphasize the dynamics of the action pattern over the background por-
tion of the template. Video results and additional examples are available at:
www.cse.yorku.ca/vision/action-spotting.
Figure 6.2 shows results of the proposed approach on an aerobics routine
consisting of 806 frames and a resolution of 361×241 (courtesy of www.fitmoves.
com). The instructor performs four cycles of the routine composed of directional
jumping, spinning and squatting. The first cycle consists of jumping left, spinning
right and squatting while the instructor is moving to her left, then a mirrored
version while moving to her right, then again moving to her left and concludes
moving to her right. Challenging aspects of this example include, fast (realistic)
human movements and the instructor introducing nuances into each performance
of the same action; thus, no two examples of the same action are exactly alike.
The templates are all based on the first cycle of the routine. The jumping right
and spinning left templates are defined by mirrored versions of the jumping left
and spinning right, respectively. Each template is matched with the search target,
yielding five similarity volumes in total. Matches in each volume are combined
to yield the final detections. All actions are correctly spotted, with no false
positives.
198
![Page 213: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/213.jpg)
Figure 6.3 shows results of the proposed approach on an outdoor scene con-
sisting of three actions, namely, walking left, walking right and two-handed wave.
In addition, there are several instances of distracting background clutter, most
notably the water flowing from the fountain. The video consists of 672 frames
with a resolution of 321× 185. The walk right template is a mirrored version of
the walk left template. Similar to the previous example, the matches for each
template are combined to yield the final detections. All actions are spotted cor-
rectly, except one walking left action that differs significantly in spatial scale by
a factor of 2.5 from the template; there are no false positives.
Figure 6.4 shows results of the proposed approach on two outdoor scenes
containing distinct forms of foreground clutter, which superimpose significant
unmodeled patterning over the depicted actions and thereby test robustness to
irrelevant structure in matching. The first example contains foreground clutter
in the form of dappled sunlight with the query actions of walking left and one-
handed wave. The second example contains foreground clutter in the form of
superimposed static structure (i.e., pseudo-transparency) caused by the chain-
linked fence and the query action of two-handed wave. The first and second
examples contain 365 and 699 frames, respectively, with the same spatial reso-
lution of 321 × 185 pixels. All actions are spotted correctly; there are no false
199
![Page 214: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/214.jpg)
positives.
For the purpose of quantitatively evaluating the proposed approach with di-
rect comparison to alternatives, action spotting performance was tested on the
publicly available CMU action data set (Ke et al., 2007). The data set is com-
prised of five action categories, namely “pick-up”, “jumping jacks”, “push ele-
vator button”, “one-handed wave” and “two-handed wave”. The total data set
consists of 20 minutes of video containing 109 actions of interest with 14 to 34
testing instances per category performed by three to six subjects. The videos
are 160× 120 pixels in resolution. In contrast to the widely used KTH (Schuldt
et al., 2004) and Weizmann (Gorelick et al., 2007) data sets which contain rel-
atively uniform intensity backgrounds, the CMU data set was captured with a
handheld video camcorder in crowded environments with moving people and cars
in the background. Also, there are large variations in the performance of the tar-
get actions, including their distance with respect to the camera.
Results are compared with ground truth labels included with the CMU data
set. The labels define the spacetime positions and extents of each action. For
each action, a Precision-Recall (P-R) curve is generated by varying the similarity
threshold between 0.6 to 0.97:
Precision =TP
TP + FP(6.10)
200
![Page 215: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/215.jpg)
and
Recall =TP
nP, (6.11)
where TP is the number of true positives, FP is the number of false positives
and nP is the total number of positives in the data set. In evaluation, the same
testing protocol as in (Ke et al., 2007) is used. A detected action is considered a
true positive if it has a (spacetime) volumetric overlap greater than 50% with a
ground truth label. The same action templates from (Ke et al., 2007) are used,
including the provided spacetime binary masks to emphasize the action over the
background (see Fig. 6.5 for example video frames).
To gauge performance, the results of the proposed approach were compared to
two recent action spotting approaches: (i) holistic flow (Shechtman & Irani, 2007)
and (ii) parts-based shape plus flow (Ke et al., 2007). Similar to the proposed
approach, the holistic flow approach only considers image dynamics; whereas, the
parts-based shape plus flow approach combines the same holistic flow approach
in (Shechtman & Irani, 2007) with additional shape information recovered from
a spatiotemporal over-segmentation procedure and embeds the matching process
in a parts-based framework (Felzenszwalb & Huttenlocher, 2005) to improve gen-
eralization. For further details of the holistic flow and parts-based shape plus
flow action spotting approaches, see Appendices D.4 and D.5, respectively.
201
![Page 216: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/216.jpg)
Figure 6.6 shows P-R curves for each action with representative action spot-
ting results; the spotting results for the baseline approaches are taken from (Ke
et al., 2007). In comparison to the holistic flow approach, the proposed approach
generally achieves superior spotting results, except in the case of two-handed
wave; nevertheless, the proposed approach still outperforms holistic flow over
most of the P-R plot in this case. Furthermore, as discussed in Section 6.2.3
the proposed approach is significantly superior from a computational perspec-
tive. In comparison to the parts-based shape plus flow approach, the spotting
performance of the proposed approach once again is generally superior, except
in the case of two-handed wave. In both cases, two-handed wave is primarily
confused with one-handed wave, resulting in a higher false positive rate and thus
lower precision. Such confusions are not unreasonable given that the one-handed
wave action is consistent with a significant portion of two-handed wave. In con-
trast, the parts-based decomposition allows for greater flexibility in distinguishing
one-handed vs. two-handed wave. It should also be noted that the parts-based
approach is computationally more expensive than the proposed approach, since
it combines the computationally expensive flow approach with a costly spacetime
over-segmentation approach.
202
![Page 217: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/217.jpg)
Jump Left/Right Spin Left/Right Squat
Figure 6.2: Aerobics routine example. The instructor performs four cycles of adance routine composed of directional jumping, spinning and squatting; each rowcorresponds to a cycle.
203
![Page 218: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/218.jpg)
Figure 6.3: Multiple actions in the outdoors. (top, middle) Sample frames fortwo query templates depicting walking left and two-handed wave actions. Athird template for a walking right action is derived from mirroring the walkingleft template. (bottom) Several example detection results from an outdoor scenecontaining the query actions.
204
![Page 219: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/219.jpg)
Figure 6.4: Foreground clutter example. (top) Sample frames for a one-handedwave query template. Templates for other actions in this figure (two-handedwave, walking left) are shown in Fig. 6.3. (middle) Sample walking left andone-handed wave detection results with foreground clutter in the form of locallighting variation caused by overhead dappled sunlight. (bottom) Sample two-handed wave detection results as action is performed beside and behind chain-linked fence. Foreground clutter takes the form of superimposed static structurewhen action is behind fence.
205
![Page 220: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/220.jpg)
(a) jumping jacks (b) pick-up
(c) push elevator button (d) two-handed wave
(e) one-handed wave
Figure 6.5: CMU action data set templates. (left) Sample frames from theintensity-based templates for each action. (right) Corresponding binary masks.
206
![Page 221: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/221.jpg)
0 0.2 0.4 0.6 0.8 10
0.5
1
Recall
Pre
cisi
on
ST−Structure (proposed)Shape+Flow (baseline)Flow (baseline) 0 0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
(a) legend (b) jumping jacks
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
(c) pick-up (d) push elevator button
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
(e) one-handed wave (f) two-handed wave
Figure 6.6: Precision-Recall curves for CMU action data set. (left) Precision-Recall plots; blue curves correspond to the proposed approach, and red andgreen to the baselines proposed in (Ke et al., 2007) (shape+flow parts-based) and(Shechtman & Irani, 2007) (holistic flow), respectively. The spotting results forthe baseline approaches are taken from (Ke et al., 2007). (right) Correspondingexample action spotting results recovered by the proposed approach.
207
![Page 222: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/222.jpg)
6.4 Discussion and summary
The main contribution of this chapter is the representation of visual spacetime
via spatiotemporal orientation distributions for the purpose of action spotting.
It has been shown that this tack can accommodate variable appearance of the
same action, rapid dynamics, multiple actions in the field-of-view and is robust to
scene clutter (both foreground and background), while being amenable to efficient
computation.
A current limitation of the proposed approach is related to the use of a sin-
gle monolithic template for any given action. This choice limits the ability to
generalize across the range of observed variability as an action is depicted in nat-
ural conditions. For instance, large spatial and temporal deformations, such as
anthropomorphic attributes (e.g., height and shape) and action execution speed,
respectively, are not currently handled but may be addressed partly through the
use of a multi-scale (spacetime) framework. Response to these action variations
motivates future investigation of deformable action templates (e.g., parts-based),
which allows for flexibility in matching template (sub)components, cf. (Ke et al.,
2007). A related limitation arises when the motion of a body part is not specified
during an action (i.e., performance nuances). For example, consider the action
of throwing a ball. The salient part of this action is related to the motion of the
208
![Page 223: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/223.jpg)
upper body, rather than the legs which may be exhibiting a wide range of mo-
tions. Possible ways for addressing this limitation include, the use of a deformable
model, as suggested above, to capture the wide variability and automatic means
for weighting the relative importance of the spatiotemporal parts of the action
template and thus masking away nonessential regions. Another possible man-
ner to reduce the number of false detections is by intergrating complementary
cues into the action template description, e.g., shape. Along these lines, it is
interesting to note that the presented empirical results attest that the proposed
approach already is robust to a range of deformations between the template and
search target (e.g., modest changes in spatial scale, rotation and execution speed
as well as individual performance nuances). Such robustness owes to the rela-
tively broad tuning of the oriented energy filters, which discount minor differences
in spacetime orientations between template and target.
In summary, this chapter has presented an efficient approach to action spot-
ting using a single query template; there is no need for extensive training. The
approach is founded on a distributed characterization of visual spacetime in terms
of 3D, (x, y, t), spatiotemporal orientation that captures underlying pattern dy-
namics. Empirical evaluation on a broad set of image sequences, including a
quantitative comparison with two baseline approaches on a challenging public
209
![Page 224: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/224.jpg)
data set, demonstrates the potential of the proposed approach.
210
![Page 225: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/225.jpg)
Chapter 7
Summary and Conclusions
7.1 Summary
This dissertation has developed a unified understanding of visual spacetime pat-
terns. More specifically, a representation of dynamics suitable for a wide range of
visual tasks, both low- and high-level, has been developed. Key to this disserta-
tion has been the analysis and representation of spacetime patterns (dynamics)
in terms of their first-order structure.
Chapter 2 began the developments by introducing a generative model for
the purpose of describing spatiotemporal signals (e.g., video) in terms of their
constituent spacetime orientation structure. A novel aspect of this model is that
211
![Page 226: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/226.jpg)
it embodies the physical constraint that natural motion signals cannot take on
negative values. In the context of multiplicative motion patterns, it was shown
that this physical constraint yields spacetime oriented structure that previously
was conjectured to be annihilated in the image formation process. More generally,
it was shown that a wide range of important dynamic patterns, both motion and
non-motion, can be modeled in a unified manner simply through consideration
of their constituent spacetime oriented structure in the spatiotemporal domain,
(x, y, t), or equivalently, in terms of the set of constituent planes through the
origin in the frequency domain, (ωx, ωy, ωt).
Chapter 3 developed a representation of visual spacetime to be applied at
the earliest stages of processing that exploits the findings of the developed gen-
erative model and is broadly applicable to the diverse phenomena that may be
encountered in the world. The chapter began by briefly reviewing existing repre-
sentations of image dynamics and noting their various shortcomings. Motivated
by the findings in Chapter 2, the second part of the chapter presented a novel rep-
resentation concerned with exposing constituent spacetime orientation structure
present in video data. The approach is based on a distributed characterization
of visual spacetime in terms of 3D, (x, y, t), spatiotemporal orientation. This
representation satisfies all the criteria of a “good” representation established in
212
![Page 227: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/227.jpg)
Section 1.2. Specifically, the significant structure (i.e., spacetime orientation as
established in Chapter 2) is made explicit while suppressing unnecessary detail
(e.g., spatial appearance), the information is represented concisely as a distribu-
tion/histogram, it facilitates further computation, a procedure for realizing the
representation has been established and its applicability to useful (visual) tasks
has been established.
Chapter 4 presented an approach to spacetime texture classification. The
chapter began by introducing the concept spacetime texture. These dynamic pat-
terns represent a broader set than previously considered in the dynamic texture
literature, subsuming both motion and non-motion-related phenomena. Analo-
gous to previous spatial texture discrimination work, it was proposed that the key
discriminating element between spacetime textures is their underlying spacetime
oriented structure; two spacetime textures that contain similar spacetime oriented
structure are assumed to be in the same category of visual dynamics. Next, a
unified approach to representing and classifying spacetime textures was proposed.
The proposed representation developed in Chapter 3 provided a unified basis for
representing and recognizing spacetime textures based solely on their underlying
dynamics. The chapter concluded by providing several empirical evaluations of
classification performance of the proposed approach and showed its superiority
213
![Page 228: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/228.jpg)
over the state-of-the-art.
Chapter 5 presented an approach to spacetime grouping and local boundary
detection. Motivated by the analysis in Chapter 2, both approaches are founded
on the characterization of visual spacetime in terms of 3D, (x, y, t), spatiotem-
poral orientation, as captured by the proposed representation in Chapter 3. The
representation converts many salient spacetime structure differences into explicit
intensity differences. This conversion reduces the grouping and local boundary
detection approaches to simply determining significant intensity differences in the
representation. The chapter concluded by providing empirical evaluations of the
proposed segregation approaches on a wide variety of synthetic and challenging
natural image sequences.
Chapter 6 presented an approach that detects and spatiotemporally localizes
an action defined by a single video template in a larger video stream (i.e., action
spotting). The proposed approach is founded on the representation proposed in
Chapter 3. In this context, the representation supports fine delineations of visual
spacetime structure to capture the rich underlying dynamics of an action. In
addition, an associated computationally efficient similarity measure and search
method were proposed that leverage the structure of the representation. The
chapter concluded by evaluating the approach on a set of challenging natural
214
![Page 229: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/229.jpg)
videos.
At this point it is instructive to emphasize several points regarding the sig-
nificance of the presented work:
• The theoretical work in Chapter 2 has presented an in-depth analysis of
the structure of a wide range of spacetime patterns. The relations between
image dynamics and spacetime oriented structure have been precisely de-
fined.
• The theoretical work led to a method for representing and analyzing video
in a uniform manner rather than on a case-by-case basis which is currently
the norm. The power of the proposed representation was demonstrated in
several visual tasks described in Chapters 4 to 6.
• The approach for recognizing spacetime textures uniformly considers the
broadest range of dynamic patterns to date. This can serve as a basis for
dynamic scene understanding.
• The first unified approach to representing and segregating (i.e., grouping
and detecting local boundaries) a wide range of juxtaposed spacetime pat-
terns (motion, static, flicker, (pseudo-)transparency, translucency, space-
time texture, unstructured) was proposed. Previous analyses of spacetime
215
![Page 230: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/230.jpg)
patterns operate on a case-by-case basis, with image motion predominant.
The approach for the early recovery of groupings and local boundaries be-
tween regions differing in spacetime structure can serve as building blocks
for further organization and interpretation of visual spacetime information.
• The representation supports fine delineations of visual spacetime structure
to capture the rich underlying dynamics of an action and is resilient to the
addition of distracting foreground clutter. While others have dealt with
background clutter, it appears that the proposed approach is the first to
directly address the issue of foreground clutter.
7.2 Suggestions for further research
To conclude, several directions for further research are discussed below.
From a representational point of view, attention in this dissertation has been
limited to the analysis of first-order spacetime pattern structure. Having under-
stood the tangent space, one possible extension is to consider the representation of
higher-order contact structure (e.g., curvature). This can follow on a firm founda-
tion from the first-order spacetime orientation measurements (Parent & Zucker,
1989; Koenderink & Richards, 1988; Koenderink, 1991). Making higher-order
structure explicit would increase the discrimination power of the representation
216
![Page 231: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/231.jpg)
in visual tasks where deemed necessary.
The role of scale in the proposed representation warrants further investigation.
In this dissertation, the the representation has been limited to a single spatiotem-
poral scale. Future work might consider multi-scale extensions that may support
finer distinctions between spacetime structures, as characteristic signatures could
be manifest across scale. The proposed representation consists of two distinct
scale parameters. The first is related to the scale of the spatiotemporal region of
analysis that is used in integrating the initial energy measurements. The second
scale parameter is related to the frequency tuning (i.e., resolution of detail) of the
underlying oriented filters. In both cases the characterization of the underlying
structure may vary with scale. As a simple example of the scaling behaviour
related to the region of analysis, consider a region containing the aperture prob-
lem. Over larger spatiotemporal extents of analysis, the interpretation of the
structure may shift to one of a coherent motion structure. Now consider a cloud
forming a moving shadow over (stationary) grass. The shadows spatiotemporal
energy is concentrated in the low frequencies, while the grass contains low and
high frequencies. At high frequencies of the underlying filters, the interpretation
will be of a static region due to the grass. At low frequencies that encompass
the shadow itself, the interpretation will be of transparency due to the coherent
217
![Page 232: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/232.jpg)
motion of the shadow over the static grass. Taken together these two types of
orthogonal scaling behaviours enrich the description of spacetime structure.
There are several potential research directions related to spacetime textures.
First, one could consider an even broader set of spacetime structures than those
considered here. For instance, similar to translational motion, parametric mo-
tion (e.g., affine motion) can potentially be reconsidered as another instance of
a deterministic texture, where the parametric model defines the deterministic
placement rule. Second, the use of combinations of second- and higher-order
statistics of spacetime oriented structure with more sophisticated distance mea-
sures and classifiers for recognition could be considered. The consideration of
higher-order statistics would allow for discriminating spacetime textures where
the representation differs spatially (e.g., rotational motion), temporally (e.g., ac-
celeration) and spatiotemporally (e.g., general affine motion) across the region of
analysis. Third, complementary spatial appearance attributes may be augmented
with dynamics to enhance the description of the spacetime texture. Fourth, an
additional possibility not considered in this dissertation is the recovery of shape
information. The use of spatial textures as a cue for shape recovery has a long
history in the vision sciences with its origins in (Gibson, 1950). In contrast,
apparently only a single paper (Sheikh et al., 2006) has addressed the topic of
218
![Page 233: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/233.jpg)
shape from “dynamic textures”, with their starting point being recovered flow.
It would be interesting to study the shape information available in the proposed
representation.
In terms of grouping and local boundary detection, there are several directions
to pursue. First, both proposed segregation approaches were strictly focused on
dynamics. One direction forward is the integration of a variety of complementary
cues, including those related to intensity, colour and spatial texture, cf. (Malik
et al., 2001; Konishi et al., 2003; Martin et al., 2004; Zalesny et al., 2005; Sharon
et al., 2006; Stein & Hebert, 2007). Joint consideration of these cues would
address the general segregation limitations related to filter-based segmentation
schemes (see Chapter 5 - Section 5.4 for a discussion). Second, consideration
should be given to establishing new benchmarks. This dissertation takes a first
step in this direction by providing a publicly available data set; however, a much
larger set of diverse videos is needed. Previous benchmarks in some sense can be
considered unrepresentative of natural scenes in that they have been primarily fo-
cused on scenes that can be described solely by juxtaposed motions, e.g., (Stein &
Hebert, 2009b), or alternatively, solely by juxtaposed dynamic textures (Chan &
Vasconcelos, 2008). One possible source is the human annotated video data from
the recently introduced “LabelMe video” project (Yuen et al., 2009). The project
219
![Page 234: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/234.jpg)
is an online and openly accessible video annotation system for real-world videos.
Third, the proposed boundary saliency measure may be useful in the context of
patch-based dynamic texture synthesis applications. In short, patch-based tex-
ture synthesis approaches create a new texture by copying and stitching together
a given set of input textures at various offsets (Efros & Freeman, 2001; Kwatra
et al., 2003). A key part in the stitching process is in identifying compatible
merging interfaces between textures such that boundary inconsistencies are not
salient. The proposed boundary saliency measure may be useful in identifying
compatible seam candidates.
A current limitation of the proposed action spotting approach is its modest
ability to generalize across the range of observed variability as an action is de-
picted in natural conditions. Large spatial and temporal deformations, such as
anthropomorphic attributes (e.g., height and shape) and action execution speed,
respectively, and performance nuances are not currently handled but may be ad-
dressed partly through the use of a multi-scale (spacetime) framework, deformable
action templates and an automatic procedure to weight the relative spatiotem-
poral importance of parts of the action template. Further, false detections may
be reduced through the integration of complementary cues, such as shape cues.
A possible source for shape information is the proposed grouping method.
220
![Page 235: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/235.jpg)
A more general direction for action spotting concerns encapsulating a large set
of actions. By limiting focus to the challenge of detecting a single action, the need
for segmentation has been avoided altogether by simply running the detector at
all spatiotemporal locations and potentially at multiple spatiotemporal scales. It
is clear that such an approach will not scale with numerous actions of interest. A
direction forward could be the use of grouping mechanisms, such as the spacetime
grouping mechanism proposed in this dissertation, to group together causally
related spatiotemporal dynamics into parts as a first pass of the data, followed
by computing the proposed spatiotemporal descriptor over the various parts. A
similar approach has long been advocated in the context of object recognition;
for a historical perspective, see (Dickinson, 2009). Notice that once the actions
are detected, they themselves could be considered as parts that can be combined
together to form activities. In addition, it would be interesting to build a set of
dynamic pattern primitives that can be shared among the actions of interest.
Finally, additional visual tasks may benefit from this representation and be
brought within its unifying framework. Examples include, spacetime saliency, cf.
(Bruce & Tsotsos, 2008; Gao et al., 2008; Seo & Milanfar, 2009b) and coupling
dynamic attributes captured by the proposed representation with static texture
cues, e.g., (Renninger & Malik, 2004), for the purpose of scene classification, cf.
221
![Page 236: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/236.jpg)
(Shroff et al., 2010).
222
![Page 237: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/237.jpg)
Appendix A
Frequency Domain Analysis of
Spacetime Structure
A.1 Image translation
Consider an image signal, I(x), with x = (x, y, t)> denoting the spatial, (x, y),
and temporal, t, coordinates. Let the image signal be defined by a 2D spatial
intensity profile, I(x, y), moving with velocity, v = (u, v)>. In the spatiotemporal
domain, the model for a translating image signal is given by,
I(x) = I(x− ut, y − vt). (A.1)
Using the Fourier shift property (Bracewell, 2000), the corresponding spec-
223
![Page 238: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/238.jpg)
trum is given by (Fahle & Poggio, 1981; Watson & Ahumada, 1985; Adelson &
Bergen, 1985; Heeger, 1988)
I(k) =
∫ ∫ ∫I(x− ut, y − vt)e−i(x>k)dx (A.2)
=
∫I(ωx, ωy)e
−it(uωx+vωy)e−itωtdt (A.3)
= I(ωx, ωy)
∫e−it(uωx+vωy)e−itωtdt (A.4)
= I(ωx, ωy)δ(ωxu+ ωyv + ωt). (A.5)
A.2 Aperture problem
Consider an image signal, I(x), defined by a 1D spatial intensity profile I(x)
within the region of analysis (i.e., the aperture) and let n be a unit vector normal
to the spatial oriented structure and translating with a velocity vn in the direction
of n (i.e., the normal velocity). In the spatiotemporal domain, the aperture
problem can be written as
I(x) = I[(n>,−‖vn‖)x]. (A.6)
Similar to the derivation of the Fourier transform of a translating image sig-
nal (Appendix A.1), the Fourier transform for the aperture problem can be de-
rived using the Fourier shift property (Bracewell, 2000). In addition, a vari-
able substitution of the spatial variables, (x, y)>, is introduced. Specifically,
224
![Page 239: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/239.jpg)
let (y1, y2)> = U(x, y)>, where U is a rotation matrix (i.e., U−1 = U> and
det U = 1) defined as
U =
n>
u>
, (A.7)
with u orthogonal to n.
The following derivation of the Fourier transform of the aperture problem
is adapted from (Fleet, 1992; Bigun, 2006). The Fourier transform of (A.6) is
written as,
I(k) =
∫ ∫ ∫I[(n>,−‖vn‖)x]e−ix
>kdxdydt. (A.8)
Substituting the change of spatial variables into (A.8), yields
I(k) =
∫ ∫ ∫I(y1 − ‖vn‖t)e−i[(y1,y2)U(ωx,ωy)>+tωt]dy1dy2dt. (A.9)
Finally, the Fourier transform of the aperture problem is given by
I(k) =
∫ ∫ ∫I(y1 − ‖vn‖t)e−i[(ωx,ωy)ny1+(ωx,ωy)uy2+ωtt]dy1dy2dt (A.10)
=
∫ ∫ ∫I(y1 − ‖vn‖t)e−i(ωx,ωy)ny1e−i(ωx,ωy)uy2e−iωttdy1dy2dt (A.11)
=
∫ ∫I[(ωx, ωy)n]e−i(ωx,ωy)n‖vn‖te−i(ωx,ωy)uy2e−iωttdy2dt (A.12)
= I[(ωx, ωy)n]
∫e−i(ωx,ωy)uy2dy2
∫e−i(ωx,ωy)n‖vn‖te−iωttdt (A.13)
= I[(ωx, ωy)n]δ[(ωx, ωy)u]δ[(v>n , 1)k]. (A.14)
225
![Page 240: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/240.jpg)
Appendix B
3D Separable Steerable Filters
B.1 Introduction
In many vision and image processing tasks, the output of filters at arbitrary ori-
entations is necessary. A prime example is the construction of the representation
described in Chapter 3. A computationally expensive approach is the applica-
tion of many rotated versions of a filter. This appendix describes a far more
efficient approach to synthesizing filter outputs at arbitrary orientations via the
interpolation of the output of a small number of filters.
In (Freeman & Adelson, 1991), the authors demonstrate that for a certain
class of functions (i.e., filters), rotated copies can be synthesized by taking linear
226
![Page 241: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/241.jpg)
combinations of a small set of basis functions that span the space of all rotations;
a filter amenable to this synthesis process is termed a “steerable filter”. Their
steerability condition is based on the Fourier angular decomposition of the filter
(expressed in polar coordinates) consisting of the sum of a finite number of fre-
quencies. Additionally, the authors present the construction of two-dimensional
separable filters (for polynomial functions) that provide a significant gain in com-
putational efficiency over their non-separable equivalents. The computational
efficiency gain is achieved by simplifying the complexity of the two-dimensional
convolution from O(k2n2) to O(kn2), where k is the size of the filter and n is the
size of the image (n-by-n).
Figure B.1 illustrates a general architecture for computing steerable filter
outputs described in (Freeman & Adelson, 1991). A bank of fixed basis filters
first process the input video. This step need only be applied once per video.
The outputs are multiplied by a set of gain maps that represent the orientation
dependent coefficients. Finally, the outputs are summed to yield the adaptively
filtered video. The key to realizing this architecture is the distributive property
of linear filters.
Further related work to steerable filters include, early work in image restora-
tion, e.g., (Knutsson et al., 1983), analyzing oriented patterns, e.g., (Kass &
227
![Page 242: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/242.jpg)
x
x
x
+
k ( )θi
InputImageSequence
GainMaps
OutputImageSequence
BasisFilterBank
Summation
Figure B.1: Three-dimensional steerable filter architecture. A bank of fixed basisfilters first process the input video. The outputs are multiplied by a set of gainmaps that represent the orientation dependent coefficients. Finally, the outputsare summed to yield the adaptively filtered video. Adapted from (Freeman &Adelson, 1991).
Witkin, 1987), alternative analytic descriptions of steerable filters, e.g., (Huang
& Chen, 1995; Lenz, 1990; Beil, 1994; Simoncelli & Freeman, 1995; Michaelis &
Sommer, 1995; Teo & Hel Or, 1998; Yu et al., 2001), generating steerable filters
through SVD approximations, e.g., (Greenspan et al., 1994; Perona, 1995; Shy
& Perona, 1994) and work considering various orders of Gaussian derivatives,
e.g., (Danielsson & Seger, 1990; Haralick, 1984; Koenderink & van Doorn, 1987).
Portions of the work described in this appendix appear in (Derpanis & Gryn,
2005).
228
![Page 243: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/243.jpg)
The current appendix provides explicit forms for three dimensional (e.g., x-
y-z) Nth degree polynomial separable steerable filters based on the minimum
spanning set. The approach presented is an extension of the construction of two-
dimensional (x-y) separable steerable filters for polynomial functions outlined in
(Freeman & Adelson, 1991). As examples, both analytic and discrete forms for
the second derivative of a Gaussian (G2) and its Hilbert transform (H2). Addi-
tionally, numerical evaluations of the discrete separable versions of the G2 and H2
filters are provided. It appears that explicit analytic and numerical formulations
for these filters have not previously appeared. In Section B.5, the steering formu-
las for several higher-order derivatives of Gaussians and corresponding Hilbert
transforms are given.
B.2 Technical approach
B.2.1 Preliminaries
In this appendix it is assumed that the functions to be steered are of the form of
a polynomial times a separable windowing function. Additionally, the functions
are assumed to have an axis of rotational symmetry. These functions, rotated by
a transformation R such that their axis of symmetry points along the direction
229
![Page 244: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/244.jpg)
cosines α, β and γ, can be written as,
fR(x, y, z) = W (r)PN(x′), (B.1)
where W (r) is any spherically symmetric function (e.g., a three-dimensional
Gaussian-like function: e−r2, r =
√x2 + y2 + z2) and PN(x′) is an Nth order
polynomial in
x′ = αx+ βy + γz. (B.2)
The following theorem, Theorem 4 in (Freeman & Adelson, 1991), provides
the means for constructing steerable filters of axially symmetric three-dimensional
functions.
Theorem 1 Given a three-dimensional axially symmetric function f(x, y, z) =
W (r)PN(x), where PN(x) is an even or odd symmetry N th order polynomial in
x. Let α, β and γ be the direction cosines of the axis of symmetry of fR(x, y, z)
and αj, βj and γj be the direction cosines of the axis of symmetry of fRj(x, y, z).
Then the steering equation,
fR(x, y, z) =M∑j=1
kj(α, β, γ)fRj(x, y, z), (B.3)
holds if and only if
1. M ≥ (N + 1)(N + 2)/2 and
230
![Page 245: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/245.jpg)
2. the kj(α, β, γ) satisfy
αN
αN−1β
αN−1γ
αN−2β2
...
γN
=
αN1 αN2 . . . αNM
αN−11 β1 αN−1
2 β2 . . . αN−1M βM
αN−11 γ1 αN−1
2 γ2 . . . αN−1M γM
αN−21 β2
1 αN−22 β2
2 . . . αN−2M β2
M
......
......
γN1 γN2 . . . γNM
k1(α, β, γ)
k2(α, β, γ)
k3(α, β, γ)
. . .
kM(α, β, γ)
. (B.4)
Note that all polynomial functions, f , can be written as a separable steerable
basis in x-y-z; however, it may have many more basis functions than the mini-
mum spanning set. A simple way of realizing such an overcomplete basis is by
applying the rotation to the x, y and z terms of f . The result is a linear combina-
tion of products of the powers of x, y and z (the separable basis functions) with
coefficients that are functions of the direction cosine (the steering/interpolation
functions). This approach was used in (Huang & Netravali, 1994). In this ap-
pendix, focus is placed on steerable functions that can be written in terms of
the minimum number of basis filters that span the rotation space. Reducing the
231
![Page 246: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/246.jpg)
number of filters yields gains in computational efficiency.
B.2.2 Basis functions separable in x-y-z
The cases of even and odd parity filters are considered, written as
fΩ(x, y, z) = G(r)QN(x′), (B.5)
where G(r) is a separable windowing function (e.g., Gaussian-like function e−r2
=
e−(x2+y2+z2) = e−x2e−y
2e−z
2) and QN(x′) is an Nth order polynomial in,
x′ = αx+ βy + γz, (B.6)
(cf., (Freeman & Adelson, 1991) that does not deal explicitly with 3D separability,
only 2D).
By Theorem 1, (N+1)(N+2)/2 functions can form a basis set for fΩ(x, y, z).
Assuming that a basis of M = (N + 1)(N + 2)/2 x-y-z separable functions exists,
then there will be some set of separable basis functions Qi,j(x)Ri,j(y)Si,j(z) such
that
fΩ(x, y, z) = G(r)∑
ki,j(Ω)Qi,j(x)Ri,j(y)Si,j(z). (B.7)
The interpolation functions, ki,j(Ω) are found by equating the highest order
products of x, y and z in Eq. (B.5) with those in Eq. (B.7), i.e., equating the
coefficients of xN−i−jyizj for i, j|i, j ∈ N and i + j ≤ N with N denoting the
232
![Page 247: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/247.jpg)
set of natural numbers 0, 1, . . . . Substituting Eq. (B.6) into Eq. (B.5), yields
M different products of x, y and z of order N , since by the specialization of the
multinomial theorem (Rosen, 1998),
(x′)N =∑ N !
(N − i− j)!i!j!αN−i−jβiγjxN−i−jyizj, (B.8)
where the summation is taken over terms with all possible integer values of i, j
between 0 and N subject to the constraint that i+ j ≤ N .
Each basis function Qi,j(x)Ri,j(y)Si,j(z) can contribute only one product of
powers of x, y and z of order N ; otherwise, Qi,j(x)Ri,j(y)Si,j(z) would be a
polynomial of x, y and z of order higher than N . Thus,
Qi,j(x)Ri,j(y)Si,j(z) = c(xN−i−j + · · · )(yi + · · · )(zj + · · · ), (B.9)
where c is a constant. Therefore, Eq. (B.5) shows that the coefficients of the
highest order terms xN−i−jyizj, in fΩ(x, y, z) are
ki,j(α, β, γ) =N !
(N − i− j)!i!j!αN−i−jβiγj. (B.10)
Note that the lower order terms can appear in more than one separable basis
function, so their coefficients will be the result of a sum of different ki,j(α, β, γ).
To find the separable basis functions Qi,j(x)Ri,j(y)Si,j(z) from the original
function f(x, y, z), note that the steering equation for the separable basis func-
233
![Page 248: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/248.jpg)
tions, Eq. (B.7), yields the following system of equations,
fΩ1(x, y, z)
fΩ2(x, y, z)
...
fΩM (x, y, z)
= G(r)
· · · ki,j(Ω1) · · ·
· · · ki,j(Ω2) · · ·
......
...
· · · ki,j(ΩM) · · ·
...
Qi,j(x)Ri,j(y)Si,j(z)
...
.
(B.11)
Finally, the Qi,j(x)Ri,j(y)Si,j(z) can be written as a linear combination of fΩj by
inverting the matrix of ki,j’s of Eq. (B.11).
B.2.3 Example: G2/H2 steerable quadrature filter pairs
A three-dimensional Gaussian-like function can be written as,
G(x, y, z) = e−(x2+y2+z2). (B.12)
The second derivative with respect to x of the three-dimensional Gaussian-like
function is written as,
G2 ≡∂2G
∂x2= (4x2 − 2)e−(x2+y2+z2). (B.13)
Using Theorem 1 with N = 2, six basis functions are needed to span the rotation
space of the G2 filter. The corresponding steering formulas given in Tables B.2,
B.3 and B.4 of Section B.5 were arrived at by using the construction outlined in
234
![Page 249: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/249.jpg)
Section B.2.2. The direction cosines α, β and γ used were a distinct subset of the
unit normals to the faces of the dodecahedron (or the vertices of the icosahedron),
which can be generated according to a cyclic permutation of (±1, 0,±r), where r
is the golden mean, (√
5 + 1)/2 (Weisstein, 2003). The dodecahedron was chosen
because it provides a uniform 12-face tessellation of a sphere.
An approximation to the Hilbert transform10 of the second derivative of the
three-dimensional Gaussian function is written as,
H2(x, y, z) = (−2.254x+ x3)e−(x2+y2+z2). (B.14)
Using Theorem 1 with N = 3, ten basis functions are needed to span the rotation
space of the H2 filter. The corresponding steering formulas given in Tables B.5,
B.6 and B.7 of Section B.5 were arrived at by using the construction outlined in
Section B.2.2. The direction cosines α, β and γ used were a distinct subset of the
unit normals to the faces of the icosahedron (or the vertices of the dodecahedron),
which can be generated according to a cyclic permutation of (0,±r,±1/r) and
(±1,±1,±1) where r is the golden mean (Weisstein, 2003).
235
![Page 250: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/250.jpg)
B.3 Empirical evaluation
For each of the filters evaluated, the convolution output of a test image with ro-
tated copies realized with the non-separable and separable filters were compared.
The test image used was a three-dimensional zone plate, specifically a 3D analog
of the 1D linear chirp (i.e., f(x) = cos[(ωx)x] = cos(ωx2), where ωx denotes the
instantaneous frequency), defined as,
f(x, y, z) = cos[(ω√x2 + y2 + z2)
√x2 + y2 + z2]
= cos[ω(x2 + y2 + z2)]. (B.15)
Given the orientation and frequency selectivity of the filters, the zone plate was
selected because it captures a continuum in both orientations and frequencies
above and beyond those that the individual filters are selective. In Fig. B.2
a two-dimensional slice of the three-dimensional zone plate is presented. The
rotated versions of the non-separable filters were arrived at by conducting the
rotations in the continuous domain followed by discretization.
The sampling of the orientation of the filters were arrived at by representing
the direction cosines (α, β, γ) of the axis of symmetry parametrically by spherical
10The Hilbert transform of the second derivative of the three-dimensional Gaussian functionwas approximated by finding the least squares fit to a third-order polynomial times a Gaussian(Freeman & Adelson, 1991).
236
![Page 251: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/251.jpg)
coordinates (θ, φ) as follows,
α = cos(θ) sin(φ)
β = sin(θ) sin(φ) (B.16)
γ = cos(φ)
and sampling θ and φ at 5° intervals with the following bounds, 0 ≤ θ ≤ 2π and
−π/2 ≤ φ ≤ π/2.
The maximum root mean square (RMS) errors for the G2 and H2 were found
to be 6.63 × 10−13 and 3.55 × 10−12 across all orientations, respectively. In Fig.
B.3(a) and Fig. B.3(b) plots of the RMS errors for the G2 and H2 filter are pre-
sented. The symmetrical nature of the RMS error plots is due to the symmetrical
nature of the filters in the frequency domain about the origin. The prominent
minima of the H2 RMS error correspond to the filters with a single non-zero
coefficient ki(·). The increased errors in the remaining filter outputs may be due
to accumulation errors in the construction of the basis filters (i.e., more than
one non-zero coefficient ki(·)); nonetheless, the maximum error encountered is
negligible.
237
![Page 252: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/252.jpg)
(a) zone plate
(b) G2: (1, 0, 0) (c) H2: (1, 0, 0)
(d) G2: (0, 1, 0) (e) H2: (0, 1, 0)
(f) G2: (1, 1, 0) (g) H2: (1, 1, 0)
Figure B.2: Zone plate slices. (a) x-y slice from a three-dimensional zone plate.(b)-(g) various corresponding G2/H2 filtered x-y slices. The triples in the captionsof the filtered outputs denote the orientation of the filter’s axis of symmetry.
238
![Page 253: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/253.jpg)
3
3.5
4
4.5
5
5.5
6
6.5
7
x 10−13
G2 rms error
2π
−π/2 0
π/2
0φ
θ
(a) G2 RMS error
0
0.5
1
1.5
2
2.5
3
3.5
4
x 10−12
H2 rms error
2π
−π/2 0
π/2
0φ
θ
(b) H2 RMS error
Figure B.3: G2/H2 RMS error plots. RMS error plots are shown between thecorresponding non-separable and separable oriented filter outputs on the three-dimensional zone plate.
239
![Page 254: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/254.jpg)
B.4 Summary
In this appendix, the construction of three-dimensional separable steerable filtersof Nth degree polynomials based on the minimum spanning set was presented.As a proof of concept, the second derivative of the Gaussian and its Hilbert trans-form, both in the continuous and discrete domains, were presented; additionalsteering formulas for higher-order derivatives of Gaussians and correspondingHilbert transforms can be found in Section B.5. Experimental evaluation showsthat the error in the construction of the separable steerable second derivative ofthe Gaussian and Hilbert transform is negligible. In the context of the presentthesis, the filters developed allow for the computationally efficient realization ofthe representation of spacetime oriented structure described in Chapter 3.
B.5 x-y-z Separable, Steerable Quadrature Pair Basis Fil-ters
G2 = 0.41146(−2 + 4x2) ∗ e−(x2+y2+z2)
H2 = (−1.9355x+ 0.85846x3)e−(x2+y2+z2)
G3 = 0.1840(12x− 8x3)e−(x2+y2+z2)
H3 = (−0.84564 + 2.6468x2 − 0.58875x4)e−(x2+y2+z2)
G4 = 0.0696(12− 48x2 + 16x4)e−(x2+y2+z2)
Table B.1: Several Gaussian derivatives and polynomial fits to their Hilbert trans-forms. The functions are normalized such that the integral over all space of thesquare of the function equals one.
240
![Page 255: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/255.jpg)
Basis Interpolation
G2a = C(2x2 − 1)e−(x2+y2+z2) k(α, β, γ) = α2
G2b = C(2xy)e−(x2+y2+z2) k(α, β, γ) = 2αβ
G2c = C(2y2 − 1)e−(x2+y2+z2) k(α, β, γ) = β2
G2d = C(2xz)e−(x2+y2+z2) k(α, β, γ) = 2αγ
G2e = C(2yz)e−(x2+y2+z2) k(α, β, γ) = 2βγ
G2f = C(2z2 − 1)e−(x2+y2+z2) k(α, β, γ) = γ2
Table B.2: x-y-z separable basis set and interpolation functions for the secondderivative of the Gaussian. C = 0.41146 is a normalization constant introducedso that the integral over all space of the square of the function equals one. To con-struct a second derivative of a Gaussian where the axis of symmetry (i.e., x-axis)is mapped to the direction cosine Ω = (α, β, γ), use GΩ
2 =∑
i∈a,...,f ki(Ω)G2i.
1D Function Tap #0 1 2 3 4
f1 C(2t2 − 1)e−t2
-0.4115 -0.0268 0.1770 0.0513 0.0042
f2 e−t2
1.0000 0.6383 0.1660 0.0176 0.0008
f3 2Cte−t2
0 0.3519 0.1831 0.0291 0.0017
f4 te−t2
0 0.4277 0.2225 0.0354 0.0020
Table B.3: 9-tap filters for x-y-z separable basis set for G2. Filters f1 and f2 haveeven symmetry; f3 and f4 have odd symmetry. These filters were taken fromTable B.2, with a sample spacing of 0.67.
G2 Basis Filter Filter in x Filter in y Filter in zG2a f1 f2 f2G2b f3 f4 f2G2c f2 f1 f2G2d f3 f2 f4G2e f2 f3 f4G2f f2 f2 f1
Table B.4: G2 basis filters. Summarized is the construction of the G2 basis filters(a-f) using the filters given in Table B.3.
241
![Page 256: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/256.jpg)
Basis Interpolation
H2a = C(x3 − 2.254x)e−(x2+y2+z2) k(α, β, γ) = α3
H2b = Cy(x2 − 0.751333)e−(x2+y2+z2) k(α, β, γ) = 3α2β
H2c = Cx(y2 − 0.751333)e−(x2+y2+z2) k(α, β, γ) = 3αβ2
H2d = C(y3 − 2.254y)e−(x2+y2+z2) k(α, β, γ) = β3
H2e = Cz(x2 − 0.751333)e−(x2+y2+z2) k(α, β, γ) = 3α2γ
H2f = Cxyze−(x2+y2+z2) k(α, β, γ) = 6αβγ
H2g = Cz(y2 − 0.751333)e−(x2+y2+z2) k(α, β, γ) = 3β2γ
H2h = Cx(z2 − 0.751333)e−(x2+y2+z2) k(α, β, γ) = 3αγ2
H2i = Cy(z2 − 0.751333)e−(x2+y2+z2) k(α, β, γ) = 3βγ2
H2j = C(z3 − 2.254z)e−(x2+y2+z2) k(α, β, γ) = γ3
Table B.5: x-y-z separable basis set and interpolation functions for fit to Hilberttransform of the second derivative of the Gaussian. C = 0.877776 is a normal-ization constant introduced so that the integral over all space of the square ofthe function equals one. To construct a second derivative of a Gaussian Hilberttransform where the axis of symmetry (i.e., x-axis) is mapped to the directioncosine Ω = (α, β, γ), use HΩ
2 =∑
i∈a,...,j ki(Ω)H2i. Due to space limitations thefilters have been truncated.
1D Function Tap #0 1 2 3 4
f1 C(t3 − 2.254t)e−t2
0 -0.6776 -0.0895 0.0554 0.0088
f2 C(t2 − 0.751333)e−t2
-0.6595 -0.1695 0.1522 0.0508 0.0043
f3 e−t2
1.0000 0.6383 0.1660 0.0176 0.0008
f4 Cte−t2
0 0.3754 0.1953 0.0310 0.0018
f5 te−t2
0 0.4277 0.2225 0.0354 0.0020
Table B.6: 9-tap filters for x-y-z separable basis set for H2. Filters f2 and f3 haveeven symmetry; f1, f4 and f5 have odd symmetry. These filters were taken fromTable B.5, with a sample spacing of 0.67.
242
![Page 257: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/257.jpg)
H2 Basis Filter Filter in x Filter in y Filter in zH2a f1 f3 f3H2b f2 f5 f3H2c f5 f2 f3H2d f3 f1 f3H2e f2 f3 f5H2f f4 f5 f5H2g f3 f2 f5H2h f5 f3 f2H2i f3 f5 f2H2j f3 f3 f1
Table B.7: H2 basis filters. Summarized is the construction of the H2 basis filters(a-j) using the filters given in Table B.6.
Basis Interpolation
G3a = −4C(2x3 − 3x)e−(x2+y2+z2) k(α, β, γ) = α3
G3b = −4Cy(2x2 − 1)e−(x2+y2+z2) k(α, β, γ) = 3α2β
G3c = −4Cx(2y2 − 1)e−(x2+y2+z2) k(α, β, γ) = 3αβ2
G3d = −4C(2y3 − 3y)e−(x2+y2+z2) k(α, β, γ) = β3
G3e = −4Cz(2x2 − 1)e−(x2+y2+z2) k(α, β, γ) = 3α2γ
G3f = −8Cxyze−(x2+y2+z2) k(α, β, γ) = 6αβγ
G3g = −4Cz(2y2 − 1)e−(x2+y2+z2) k(α, β, γ) = 3β2γ
G3h = −4Cx(2z2 − 1)e−(x2+y2+z2) k(α, β, γ) = 3αγ2
G3i = −4Cy(2z2 − 1)e−(x2+y2+z2) k(α, β, γ) = 3βγ2
G3j = −4C(2z3 − 3z)e−(x2+y2+z2) k(α, β, γ) = γ3
Table B.8: x-y-z separable basis set and interpolation functions for the thirdderivative of the Gaussian, G3. C = 0.184 is a normalization constant introducedso that the integral over all space of the square of the function equals one. Toconstruct a third derivative of a Gaussian where the axis of symmetry (i.e., x-axis)is mapped to the direction cosine Ω = (α, β, γ), use GΩ
3 =∑
i∈a,...,j ki(Ω)G3i.
243
![Page 258: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/258.jpg)
1DF
unct
ion
Tap
#0
12
34
56
f1−
4C(2t3−
3t)e−t2
00.
7165
0.27
08-0
.174
5-0
.134
8-0
.033
7-0
.004
1
f2te−t2
00.
3894
0.36
790.
1581
0.03
660.
0048
0.00
04
f3−
4C(2t2−
1)e−
t20.
7360
0.28
66-0
.270
8-0
.271
5-0
.094
4-0
.016
3-0
.001
5
f4e−
t21.
0000
0.77
880.
3679
0.10
540.
0183
0.00
190.
0001
f5−
8Cte−t2
0-0
.573
2-0
.541
5-0
.232
7-0
.053
9-0
.007
1-0
.000
5
Tab
leB
.9:
13-t
apfilt
ers
forx
-y-z
separ
able
bas
isse
tfo
rG
3.
Filte
rsf3
and
f4hav
eev
ensy
mm
etry
;f1
,f2
and
f5hav
eodd
sym
met
ry.
Thes
efilt
ers
wer
eta
ken
from
Tab
leB
.8,
wit
ha
sam
ple
spac
ing
of0.
5.
244
![Page 259: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/259.jpg)
G3 Basis Filter Filter in x Filter in y Filter in zG3a f1 f4 f4G3b f3 f2 f4G3c f2 f3 f4G3d f4 f1 f4G3e f3 f4 f2G3f f5 f2 f2G3g f4 f3 f2G3h f2 f4 f3G3i f4 f2 f3G3j f4 f4 f1
Table B.10: G3 basis filters. Summarized is the construction of the G3 basisfilters (a-j) using the filters given in Table B.9.
245
![Page 260: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/260.jpg)
Basi
sIn
terp
ola
tion
H3a
=C
(−0.9
454
+2.9
59x2−
0.6
582x4)e−(x
2+y2+z2)
k(α,β,γ
)=α4
H3b
=Cy(1.4
795x−
0.6
582x3)e−(x
2+y2+z2)
k(α,β,γ
)=
4α3β
H3c
=C
(−√
0.6
582x2
+0.7
585√
0.6
582)(√
0.6
582y2−
0.7
585√
0.6
582)e−(x
2+y2+z2)
k(α,β,γ
)=
6α2β2
H3d
=Cx
(1.4
795y−
0.6
582y3)e−(x
2+y2+z2)
k(α,β,γ
)=
4αβ3
H3e
=C
(−0.9
454
+2.9
59y2−
0.6
582y4)e−(x
2+y2+z2)
k(α,β,γ
)=β4
H3f
=Cz(1.4
795x−
0.6
582x3)e−(x
2+y2+z2)
k(α,β,γ
)=
4α3γ
H3g
=Cyz(0.4
93168−
0.6
582x2)e−(x
2+y2+z2)
k(α,β,γ
)=
12α2βγ
H3h
=Cxz(0.4
93168−
0.6
582y2)e−(x
2+y2+z2)
k(α,β,γ
)=
12αβ2γ
H3i
=Cz(1.4
795y−
0.6
582y3)e−(x
2+y2+z2)
k(α,β,γ
)=
4β3γ
H3j
=C
(−√
0.6
582x2
+0.7
585√
0.6
582)(√
0.6
582z2−
0.7
585√
0.6
582)e−(x
2+y2+z2)
k(α,β,γ
)=
6α2γ2
H3k
=Cxy(0.4
93168−
0.6
582z2)e−(x
2+y2+z2)
k(α,β,γ
)=
12αβγ2
H3l
=C
(−√
0.6
582y2
+0.7
585√
0.6
582)(√
0.6
582z2−
0.7
585√
0.6
582)e−(x
2+y2+z2)
k(α,β,γ
)=
6β2γ2
H3m
=Cx
(1.4
795z−
0.6
582z3)e−(x
2+y2+z2)
k(α,β,γ
)=
4αγ3
H3n
=Cy(1.4
795z−
0.6
582z3)e−(x
2+y2+z2)
k(α,β,γ
)=
4βγ3
H3o
=C
(−0.9
454
+2.9
59z2−
0.6
582z4)e−(x
2+y2+z2)
k(α,β,γ
)=γ4
Tab
leB
.11:
x-y
-zse
par
able
bas
isse
tan
din
terp
olat
ion
funct
ions
for
the
(appro
xim
ate)
Hilb
ert
tran
sfor
mof
the
thir
dder
ivat
ive
ofth
eG
auss
ian,H
3.C
=0.
8944
8is
anor
mal
izat
ion
const
ant
intr
oduce
dso
that
the
inte
gral
over
all
spac
eof
the
squar
eof
the
funct
ion
equal
son
e.T
oco
nst
ruct
the
Hilb
ert
tran
sfor
mof
ath
ird
der
ivat
ive
ofa
Gau
ssia
nw
her
eth
eax
isof
sym
met
ry(i
.e.,x
-axis
)is
map
ped
toth
edir
ecti
onco
sine
Ω=
(α,β,γ
),useH
Ω 3=∑ i∈
a,...,oki(
Ω)H
3i.
Not
eth
atth
eH
3is
not
exac
tlyx
-y-z
separ
able
,sp
ecifi
cally
H3c,j,l
repre
sent
leas
t-sq
uar
esfits
;how
ever
,th
ese
separ
able
funct
ions
clos
ely
appro
xim
ateH
3,
cf.
the
2DH
4
filt
erin
Tab
le9
of(F
reem
an&
Adel
son,
1991
).
246
![Page 261: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/261.jpg)
1D
Fu
nct
ion
Tap
#0
12
34
56
f1C
(−0.9
454
+2.9
59t2−
0.6
582t4
)e−t2
-0.8
456
-0.1
719
0.4
460
0.2
244
0.0
059
-0.0
141
-0.0
030
f2C
(1.4
795t−
0.6
582t3
)e−t2
00.4
580
0.2
703
-0.0
002
-0.0
378
-0.0
114
-0.0
015
f3√C
(−√
0.6
582t2
+0.7
585√
0.6
582)e−t2
0.5
820
0.3
039
-0.0
682
-0.1
206
-0.0
456
-0.0
081
-0.0
008
f4√C
(√0.6
582t2−
0.7
585√
0.6
582)e−t2
-0.5
820
-0.3
039
0.0
682
0.1
206
0.0
456
0.0
081
0.0
008
f5C
(0.4
93168−
0.6
582t2
)e−t2
0.4
411
0.2
289
-0.0
543
-0.0
931
-0.0
351
-0.0
063
-0.0
006
f6te−t2
00.3
894
0.3
679
0.1
581
0.0
366
0.0
048
0.0
004
f7e−t2
1.0
000
0.7
788
0.3
679
0.1
054
0.0
183
0.0
019
0.0
001
Tab
leB
.12:
13-t
apfilt
ers
forx
-y-z
separ
able
bas
isse
tfo
rH
3.
Filte
rsf1
,f3
,f4
,f5
and
f7hav
eev
ensy
mm
etry
;f2
and
f6hav
eodd
sym
met
ry.
Thes
efilt
ers
wer
eta
ken
from
Tab
leB
.11,
wit
ha
sam
ple
spac
ing
of0.
5.
247
![Page 262: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/262.jpg)
H3 Basis Filter Filter in x Filter in y Filter in zH3a f1 f7 f7H3b f2 f6 f7H3c f3 f4 f7H3d f6 f2 f7H3e f7 f1 f7H3f f2 f7 f6H3g f5 f6 f6H3h f6 f5 f6H3i f7 f2 f6H3j f3 f7 f4H3k f6 f6 f5H3l f7 f3 f4H3m f6 f7 f2H3n f7 f6 f2H3o f7 f7 f1
Table B.13: H3 basis filters. Summarized is the construction of the H3 basisfilters (a-o) using the filters given in Table B.12.
248
![Page 263: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/263.jpg)
Basis Interpolation
G4a = C(16x4 − 48x2 + 12)e−(x2+y2+z2) k(α, β, γ) = α4
G4b = Cy(16x3 − 24x)e−(x2+y2+z2) k(α, β, γ) = 4α3β
G4c = C(−4x2 + 2)(−4y2 + 2)e−(x2+y2+z2) k(α, β, γ) = 6α2β2
G4d = Cx(16y3 − 24y)e−(x2+y2+z2) k(α, β, γ) = 4αβ3
G4e = C(16y4 − 48y2 + 12)e−(x2+y2+z2) k(α, β, γ) = β4
G4f = Cz(16x3 − 24x)e−(x2+y2+z2) k(α, β, γ) = 4α3γ
G4g = Cyz(16x2 − 8)e−(x2+y2+z2) k(α, β, γ) = 12α2βγ
G4h = Cxz(16y2 − 8)e−(x2+y2+z2) k(α, β, γ) = 12αβ2γ
G4i = Cz(16y3 − 24y)e−(x2+y2+z2) k(α, β, γ) = 4β3γ
G4j = C(−4x2 + 2)(−4z2 + 2)e−(x2+y2+z2) k(α, β, γ) = 6α2γ2
G4k = Cxy(16z2 − 8)e−(x2+y2+z2) k(α, β, γ) = 12αβγ2
G4l = C(−4y2 + 2)(−4z2 + 2)e−(x2+y2+z2) k(α, β, γ) = 6β2γ2
G4m = Cx(16z3 − 24z)e−(x2+y2+z2) k(α, β, γ) = 4αγ3
G4n = Cy(16z3 − 24z)e−(x2+y2+z2) k(α, β, γ) = 4βγ3
G4o = C(16z4 − 48z2 + 12)e−(x2+y2+z2) k(α, β, γ) = γ4
Table B.14: x-y-z separable basis set and interpolation functions for the fourthderivative of the Gaussian, G4. C = 0.0696 is a normalization constant introducedso that the integral over all space of the square of the function equals one. To con-struct a fourth derivative of a Gaussian where the axis of symmetry (i.e., x-axis)is mapped to the direction cosine Ω = (α, β, γ), use GΩ
4 =∑
i∈a,...,o ki(Ω)G4i.
249
![Page 264: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/264.jpg)
1DF
unct
ion
Tap
#0
12
34
56
f1C
(16t
4−
48t2
+12
)e−t2
0.83
520.
0542
-0.5
121
-0.1
100
0.09
690.
0453
0.00
75
f2C
(16t
3−
24t)e−
t20
-0.5
420
-0.2
048
0.13
200.
1020
0.02
550.
0031
f32√
C(−
2t2
+1)e−
t20.
5276
0.20
55-0
.194
1-0
.194
6-0
.067
6-0
.011
7-0
.001
1
f4C
(16t
2−
8)e−
t.2
-0.5
568
-0.2
168
0.20
480.
2054
0.07
140.
0124
0.00
12
f5e−
t21.
0000
0.77
880.
3679
0.10
540.
0183
0.00
190.
0001
f6te−t2
00.
3894
0.36
790.
1581
0.03
660.
0048
0.00
04
Tab
leB
.15:
13-t
apfilt
ers
forx
-y-z
separ
able
bas
isse
tfo
rG
4.
Filte
rsf1
,f3
,f4
and
f5hav
eev
ensy
mm
etry
;f2
and
f6hav
eodd
sym
met
ry.
Thes
efilt
ers
wer
eta
ken
from
Tab
leB
.14,
wit
ha
sam
ple
spac
ing
of0.
5.
250
![Page 265: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/265.jpg)
G4 Basis Filter Filter in x Filter in y Filter in zG4a f1 f5 f5G4b f2 f6 f5G4c f3 f3 f5G4d f6 f2 f5G4e f5 f1 f5G4f f2 f5 f6G4g f4 f6 f6G4h f6 f4 f6G4i f5 f2 f6G4j f3 f5 f3G4k f6 f6 f4G4l f5 f3 f3G4m f6 f5 f2G4n f5 f6 f2G4o f5 f5 f1
Table B.16: G4 basis filters. Summarized is the construction of the G4 basisfilters (a-o) using the filters given in Table B.15.
251
![Page 266: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/266.jpg)
Appendix C
Proof of the Smoothness of the
Ring-Shaped Function
Without loss of generality, the two-dimensional image case will be considered.
Claim: The level curves of the sum of the power responses of N + 1 Nth deriva-
tive of Gaussians forms a smooth ring-shaped function in the power spectrum.
Equivalently, the resulting summed energy measure is dependent on radial fre-
quency and independent of angular frequency.
Formally, the summed energy response is given by,
η(ωx, ωy) =N∑n=0
∣∣∣∣[(ωx, ωy)dn]NG(ωx, ωy)
∣∣∣∣2, (C.1)
252
![Page 267: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/267.jpg)
where G(ωx, ωy) represents the spectrum of the Gaussian, N denotes the deriva-
tive order and dn represents the unit direction vector defined as,
dn = (cos θn, sin θn)>, where θn =2πn
N + 1, 0 ≤ n ≤ N. (C.2)
Proof:
η(ωx, ωy) =N∑n=0
∣∣∣∣[(ωx, ωy)dn]NG(ωx, ωy)
∣∣∣∣2 (C.3)
=N∑n=0
[(ωx, ωy)dn]2NG(ωx, ωy)2 (C.4)
=N∑n=0
[(‖ω‖ω>)dn]2NG(ωx, ωy)2 (C.5)
=N∑n=0
[ω>dn]2N [‖ω‖NG(ωx, ωy)]2, (C.6)
where ω denotes the unit vector in the direction of (ωx, ωy)>. Since the last
(square) bracketed term is only a function of the radial component, it can be
omitted from further consideration. Thus, discussion is now restricted to
η(ωx, ωy) =N∑n=0
[ω>dn]2N . (C.7)
Let φ represent the angular frequency component of ω. Then since, ω>dn corre-
sponds to the inner-product of unit vectors, it can be rewritten as the cosine of
the angular difference between the unit vectors ω and d, as follows
η(ωx, ωy) =N∑n=0
[cos
(φ− 2πn
N + 1
)]2N
. (C.8)
253
![Page 268: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/268.jpg)
Based on a standard trigonometric identity, the cosine term can be rewritten in
terms of the sum of complex exponentials
η(ωx, ωy) =1
22N
N∑n=0
[exp
(j(φ− 2πn
N + 1)
)+ exp
(−j(φ− 2πn
N + 1)
)]2N
. (C.9)
Finally, the exponentiated term can be rewritten as a summation by applying
the binomial theorem (Rosen, 1998)
η(ωx, ωy) =1
22N
N∑n=0
2N∑k=0
(2N
k
)exp
(jk(φ− 2πn
N + 1)
)exp
(−j(2N − k)(φ− 2πn
N + 1)
),
(C.10)
and rearranged to yield the following
η(ωx, ωy) =1
22N
2N∑k=0
(2N
k
) N∑n=0
exp
(2j(k −N)(φ− 2πn
N + 1)
)(C.11)
=1
22N
[(2N
N
) N∑n=0
exp(j0) +2N∑
k=0,k 6=N
(2N
k
) N∑n=0
exp
(2j(k −N)(φ− 2πn
N + 1)
)](C.12)
=1
22N
[(2N
N
)(N + 1) +
2N∑k=0,k 6=N
(2N
k
) N∑n=0
exp
(2j(k −N)(φ− 2πn
N + 1)
)].
(C.13)
Since (k−N) is an integer, the second term in (C.13) corresponds to the discrete
Fourier transform (DFT) of a constant. As such, the second term is equal to
zero. Thus, the remaining portion is dependent only on the radial frequency, as
sought.
254
![Page 269: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/269.jpg)
Appendix D
Summary of Alternative
Approaches Compared
D.1 The state-space model of dynamic texture
The standard dynamic texture state-space model (Doretto et al., 2003a) is a
stochastic generative model that treats video as a sample from a linear dynamical
system. This model forms the basis of the Kalman filter and can be considered as
a (continuous-state) extension of the hidden Markov model (HMM). The model
255
![Page 270: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/270.jpg)
is defined as xt+1 = Axt + vt, vt ∼ N (0,Σv)
yt = Cxt + wt, wt ∼ N (0,Σw)
, (D.1)
where xt ∈ Rn is a temporal sequence of hidden state random variables, yt ∈ Rm
a temporal sequence of generated video frames, A ∈ Rn×n the hidden state
transition matrix, C ∈ Rm×n the observation matrix and vt, wt independent and
identically normally distributed noise processes.
The standard approach to learning the model parameters (Doretto et al.,
2003a) considers the columns of C as the principal components of the video
frames, and the state vector as a set of principal component coefficients for the
video frame that evolve according to a Gauss-Markov process. A variant of the
model (Chan & Vasconcelos, 2007) replaces the observation matrix, C, with a
non-linear function C(xt).
For the task of recognition, further choices are made concerning the distance
metric between learned dynamic texture models and the classifier. In both cases,
a variety of alternatives have been proposed. For example, proposed distance met-
rics include, a measure of the principal angles between subspaces of the model
parameters (Saisan et al., 2001; Chan & Vasconcelos, 2007) and a measure of
the divergence between the distributions of the model realizations (Chan & Vas-
256
![Page 271: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/271.jpg)
concelos, 2005b; Woolfe & Fitzgibbon, 2006). As classifiers, the simple nearest
neighbour (Saisan et al., 2001; Woolfe & Fitzgibbon, 2006) and support vector
machines (SVM) (Chan & Vasconcelos, 2005b; Chan & Vasconcelos, 2007) have
been popular choices.
D.2 Motion boundary detection using optical flow
Several early efforts in the literature for identifying motion boundaries considered
the application of edge detectors to a given optical flow field. A major requirement
of these motion boundary detectors is that they are sensitive to rapid spatial (and
possibly temporal) changes in both the magnitude and direction of the flow. In
the following, details are provided regarding the recovery of motion boundaries
via the application of an edge detector to a given optic flow field, see (Thompson
et al., 1985) for additional discussion.
Conceptually, motion boundary detection using optical flow in conjunction
with an edge detector proceeds as follows. The optical flow field is first split
into two separate scalar image sequences corresponding to motion along the hor-
izontal and vertical dimensions. An edge detector is then applied to each image
separately. Finally, a pixel is marked as a boundary if an edge is found in either
flow component.
257
![Page 272: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/272.jpg)
Let u(x, y, t) = (u(x, y, t), v(x, y, t))> denote the optical flow at pixel (x, y) in
time t. A motion boundary point is formally defined as a point where the gradient
magnitude assumes a maximum in the gradient direction in either component of
the flow field, cf. (Canny, 1986; Lindeberg, 1998). In terms of directional deriva-
tives, a motion boundary is defined as a point where the second-order directional
derivative in the gradient direction is zero and the third-order directional deriva-
tive in the same direction is negative in either component of the flow:udd = 0
uddd < 0
or
vd′d′ = 0
vd′d′d′ < 0
, (D.2)
where d and d′ denote the gradient direction in the corresponding flow compo-
nent.
A major drawback of this scheme is that flow discontinuities reside at exactly
those image locations where the computation of optical flow is least reliable.
D.3 Motion boundary detection using the structure ten-
sor
The following provides details regarding the recovery of motion boundaries via
an eigenvalue analysis of the gradient structure tensor defined about a neighbour-
hood at each image point.
258
![Page 273: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/273.jpg)
Let I(x, y, t) denote the intensity at pixel (x, y) in frame t. The average of
the second moment matrix over a neighbourhood, Ω, around a pixel is given by,
G(x, y, t) =∑
Ω
∇I(∇I)> =∑
Ω
I2x IxIy IxIt
IxIy I2y IyIt
IxIt IyIt I2t
(D.3)
This matrix (or tensor) is commonly referred to as the gradient structure tensor.
In (Middendorf & Nagel, 2001), it was suggested that the eigenvalues of G can
be used as an indicator of motion discontinuities (i.e., neighbourhoods containing
multiple motions). To detect these pixels, the smallest eigenvalue was compared
with the sum of all eigenvalues:
S =λ3
λ1 + λ2 + λ3
> τ, (D.4)
where λ1 ≥ λ2 ≥ λ3 and τ is a user-defined threshold. High values of this
measure (i.e., values closer to one) are indicative of the presence of a motion
boundary. As noted by the authors, high responses are not exclusively due to
motion boundaries. For example, short-time specular reflections may also trig-
ger high responses in (D.4). More generally, (D.4) responds to deviations from
the brightness constancy constraint used to recover flow. In a different context,
(Laptev, 2005) used these “boundary points” as interest points to seed further
analysis.
259
![Page 274: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/274.jpg)
Given that the boundary saliency measure, S, can be modeled locally as a
ridge around motion discontinuities (Feldman & Weinshall, 2008), (potential)
boundary points can be defined as points where S assumes a maximum in the
main principal curvature direction, p (Lindeberg, 1998):Sp = 0
Spp < 0
. (D.5)
D.4 Flow-based action spotting
In (Shechtman & Irani, 2007), a scanning volume search approach was proposed
for action spotting based on the pointwise motion consistency between the tem-
plate and the search video under the template. In the following, details of this
approach (Shechtman & Irani, 2007) are provided.
Let I(x, y, t) denote the intensity at pixel (x, y) in frame t. The sum of the
second moment matrix over a spacetime neighbourhood, Ω, around a pixel is
given by,
G(x, y, t) =∑
Ω
∇I(∇I)> =∑
Ω
I2x IxIy IxIt
IxIy I2y IyIt
IxIt IyIt I2t
. (D.6)
This matrix (or tensor) is commonly referred to as the gradient structure ten-
sor. Further, let G♦(x, y, t) define the upper-left minor of the gradient structure
260
![Page 275: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/275.jpg)
tensor, (D.6):
G♦(x, y, t) =∑
Ω
I2x IxIy
IxIy I2y
. (D.7)
To determine whether the motions contained in two spacetime neighbour-
hoods (volumetric spacetime patches, e.g., 7× 7× 3), Gi and Gj, are consistent,
Shechtman and Irani considered whether the concatenation of the two patches
(i.e., Gi,j = Gi+Gj) contains more than one motion, or equivalently, determined
if there is a rank increase between G♦i,j and Gi,j. Ideally, the rank increase could
be measured as the difference of the individual ranks of G and G♦, where the
rank of a matrix is based on the number of non-zero eigenvalues present; how-
ever, due to noise and the fact that consistent motions are generally never exact,
zero-valued eigenvalues are not observed in practice. Instead, Shechtman and
Irani proposed the following continuous rank increase measure,
∆r =λ2λ3
λ♦1λ♦2
=det(G)
det(G♦)λ1
, (D.8)
where λ1 ≥ λ2 ≥ λ3 and λ♦1 ≥ λ♦2 denote the eigenvalues of G and G♦, respec-
tively. Notice that 0 ≤ ∆r ≤ 1, with ∆r = 0 corresponding to the ideal case
of no rank increase and ∆r = 1 a definite rank increase. (In practice, an ap-
proximation of Eq. (D.8) is used to avoid the relatively expensive computation
of the eigenvalue λ1.) The motion consistency between two patches was defined
261
![Page 276: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/276.jpg)
as 1/mi,j where
mi,j =∆ri,j
min(∆ri,∆rj) + ε, (D.9)
and ε is a small constant introduced to avoid division by zero.
Finally, to compute the global motion consistency for a template centered over
a given position within the search video, the pointwise motion consistencies be-
tween the template and the corresponding points in the video were summed over
the entire template volume. This is done at each spacetime position of the search
video in order to recover a volumetric similarity map. Locations of consistency
peaks (above a given threshold) were deemed as containing the action.
D.5 Parts-based shape plus flow action spotting
In (Ke et al., 2007), a scanning volume search approach was proposed for action
spotting based on the amalgamation of the following three ideas: (i) motion
consistency was computed between each point in the template with corresponding
points in the search video and aggregated over the template volume, (ii) shape-
based consistency was computed by matching the silhouette of the volumetric
action template with the over-segmented input video and (iii) the template was
represented as a set of volumetric parts. For motion consistency, the approach
of Shechtman and Irani (Shechtman & Irani, 2007) was used; see Appendix D.4
262
![Page 277: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/277.jpg)
for details. The remainder of this section contains a brief description of (ii) and
(iii).
The shape similarity measure was based on the region intersection between
the template volume centered about a point in the search volume and the set
of color-based over-segmented volumes in the search video under the template.
More specifically, given two binary shapes, T and S, representing the template
and search silhouettes (assuming for the moment that the search volume under
the template contains a single silhouette), respectively, their distance was defined
as the set difference between the union and the intersection of the regions: |T ∪
S \ T ∩ S| with | · | denoting set cardinality. To accommodate the fact that the
search video is over-segmented, an additional step is needed to determine the set
of K regions under the template in the search volume (i.e., S = ∪Kk=1Sk) that
minimize the set difference. Generally, brute-force enumeration of all possible
subsets would be computationally prohibitively expensive. Ke et al. proposed a
relatively efficient approach to determine the optimal set.
To compute the final similarity score at a particular position in the search
video, the motion and shape-based consistency measures were embedded in a
deformable parts-based framework (Felzenszwalb & Huttenlocher, 2005). This
allowed for a degree spatiotemporal variability of the query action.
263
![Page 278: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/278.jpg)
Finally, to detect and localize the action, the similarity score was computed
at each spacetime position of the search video. Locations with a similarity score
below a specified threshold were deemed as matches.
264
![Page 279: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/279.jpg)
Appendix E
Crop Windows for Shift-Invariant
Recognition Experiment
The “shift-invariant recognition” experiment reported in Chapter 4 - Sec. 4.3.2
closely follows previous shift-invariant experiments using the UCLA data set
(Saisan et al., 2001), as described in (Woolfe & Fitzgibbon, 2006). Each se-
quence was spatially partitioned into left and right halves (window pairs), with a
few exceptions. The exceptions arise as several of the imaged dynamic textures
are not spatially stationary; therefore, the cropping regimen described above
would result in left and right views of different dynamic textures for these cases.
For instance, in several of the fire and candle samples, one view would capture
265
![Page 280: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/280.jpg)
Filename prefix Frame index
Crop windowsView 1 View 2
Top left Bottom right Top left Bottom right(x, y) (x, y) (x, y) (x, y)
fountain-a-far-11 to 75 (1,20) (80,80) (81,30) (160,110)
76 to 150 (1,20) (80,80) (81,30) (160,110)
fountain-a-far-21 to 75 (1,20) (80,80) (81,30) (160,110)
76 to 150 (1,20) (80,80) (81,30) (160,110)
fountain-a-mid-11 to 75 (1,10) (60,110) (70,30) (140,110)
76 to 150 (1,10) (60,110) (70,30) (140,110)
fountain-a-mid-21 to 75 (1,10) (60,110) (70,30) (140,110)
76 to 150 (1,10) (60,110) (70,30) (140,110)
fire-31 to 75 (70,1) (105,110) (106,1) (160,110)
76 to 150 (70,1) (110,110) (111,1) (160,110)
fire-71 to 75 (1,1) (50,110) (51,1) (101,110)
76 to 150 (1,1) (50,110) (51,1) (101,110)
candle-41 to 75 (1,14) (160,48) (1,49) (160,83)
76 to 150 (1,14) (160,48) (1,49) (160,83)
candle-51 to 75 (1,14) (160,53) (1,54) (160,83)
76 to 150 (1,14) (160,53) (1,54) (160,83)
Table E.1: Special crop windows for non-stationary cases in UCLA data set.“Filename prefix” refers to the image sequence filename used in the original UCLAdata set. The origin is located at the top left corner of the image sequences withthe x and y axes increasing in the right and down directions, respectively.
a static background, while the other would capture the flame. In evaluation, all
cases in the data set were retained with special manual cropping introduced to the
non-stationary cases to include their key dynamic features; Table E.1 documents
the special crop windows used for these cases.
266
![Page 281: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/281.jpg)
Bibliography
Aach, T., Kaup, A. & Mester, R. (1995). On texture analysis: Local energytransforms versus quadrature filters. Signal Processing, 45(2), 173–181.
Adelson, E. & Anandan, P. (1990). Ordinal characteristics of transparency. Tech-nical Report TR-150, Media Lab, Massachusetts Institute of Technology.
Adelson, E. & Bergen, J. (1985). Spatiotemporal energy models for the perceptionof motion. Journal of the Optical Society of America - A, 2(2), 284–299.
Adelson, E. & Bergen, J. (1986). The extraction of spatio-temporal energy inhuman and machine vision. In Proceedings of IEEE Workshop on Motion(pp. 151–155).
Aggarwal, J. (1986). Motion and time-varying imagery: An overview. In Pro-ceedings of Workshop on Motion: Representation and Analysis (pp. 1–6).
Aggarwal, J. & Cai, Q. (1999). Human motion analysis: A review. ComputerVision and Image Understanding, 73(3), 428–440.
Aggarwal, J. & Nandhakumar, N. (1988). On the computation of motion fromsequences of images: A review. Proceedings of the IEEE, 76(8), 917–935.
Ali, S., Basharat, A. & Shah, M. (2007). Chaotic invariants for human actionrecognition. In Proceedings of IEEE International Conference on ComputerVision.
Allmen, M. & Dyer, C. (1991). Long-range spatiotemporal motion understandingusing spatiotemporal flow curves. In Proceedings of IEEE Conference onComputer Vision and Pattern Recognition (pp. 303–309).
267
![Page 282: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/282.jpg)
Anandan, P. (1984). Computing dense fields displacement with confidence mea-sures in scenes containing occlusion. In Proceedings of Image UnderstandingWorkshop (pp. 236–246).
Apostoloff, N. & Fitzgibbon, A. (2005). Learning spatiotemporal T-junctions forocclusion detection. In Proceedings of IEEE Conference on Computer Visionand Pattern Recognition (pp. II: 553–559).
Ayer, S. & Sawhney, H. (1995). Layered representation of motion video using ro-bust maximum-likelihood estimation of mixture models and MDL encoding.In Proceedings of IEEE International Conference on Computer Vision (pp.777–784).
Bainbridge-Smith, A. & Lane, R. (1997). Determining optical-flow using a dif-ferential method. Image and Vision Computing, 15(1), 11–22.
Baker, S., Roth, S., Scharstein, D., Black, M., Lewis, J. & Szeliski, R. (2007).A database and evaluation methodology for optical flow. In Proceedings ofIEEE International Conference on Computer Vision.
Bar-Joseph, Z., El-Yaniv, R., Lischinski, D. & Werman, M. (2001). Texture mix-ing and texture movie synthesis using statistical learning. IEEE Transactionson Visualization and Computer Graphics, 7(2), 120–135.
Barnea, D. & Silverman, H. (1972). A class of algorithms for fast digital imageregistration. IEEE Transactions on Computers, 21(2), 179–186.
Barnum, P., Narasimhan, S. & Kanade, T. (2010). Analysis of rain and snow infrequency space. International Journal of Computer Vision, 86(2-3), 256–274.
Barron, J., Beauchemin, S. & Fleet, D. (1994a). On optical flow. In Proceedings ofInternational Conference on Artificial Intelligence and Information-ControlSystems of Robots (pp. 3–14).
Barron, J., Fleet, D. & Beauchemin, S. (1994b). Performance of optical flowtechniques. International Journal of Computer Vision, 12(1), 43–77.
Beauchemin, S. & Barron, J. (1995). The computation of optical flow. ACMComputing Surveys, 27(3), 433–467.
268
![Page 283: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/283.jpg)
Beauchemin, S. & Barron, J. (2000a). The frequency structure of 1D occludingimage signals. IEEE Transactions on Pattern Analysis and Machine Intelli-gence, 22(2), 200–206.
Beauchemin, S. & Barron, J. (2000b). On the Fourier properties of discontinuousmotion. Journal of Mathematical Imaging and Vision, 13(3), 155–172.
Beil, W. (1994). Steerable filters and invariance theory. Pattern RecognitionLetters, 15(5), 453–460.
de Berg, M., Cheong, O., van Kreveld, M. & Overmars, M. (2008). ComputationalGeometry: Algorithms and Applications, third edition. New York, New York:Springer.
Bergen, J. (1991). Theories of visual texture perception. In D. Regan (Ed.),Vision and Visual Dysfunction, Volume 10B (pp. 114–134). New York, NewYork: Macmillan.
Bergen, J. & Adelson, E. (1988). Early vision and texture perception. Nature,333(6171), 363–367.
Bergen, J., Burt, P., Hingorani, R. & Peleg, S. (1992). A three-frame algorithmfor estimating two-component image motion. IEEE Transactions on PatternAnalysis and Machine Intelligence, 14(9), 886–896.
Bhattacharyya, A. (1943). On a measure of divergence between two statisticalpopulations defined by their probability distribution. Bulletin of CalcuttaMathematical Society, 35, 99–110.
Bigun, J. (2006). Vision with Direction: A Systematic Introduction to ImageProcessing and Computer Vision. New York, New York: Springer.
del Bimbo, A., Pala, P. & Tanganelli, L. (2000). Video retrieval based on dy-namics of color flows. In Proceedings of IEEE International Conference onPattern Recognition (pp. I: 851–854).
Black, M. & Anandan, P. (1990). Constraints for the early detection of dis-continuity from motion. In Proceedings of AAAI Conference on ArtificialIntelligence (pp. 1060–1066).
269
![Page 284: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/284.jpg)
Black, M. & Anandan, P. (1996). The robust estimation of multiple motions:Parametric and piecewise-smooth flow-fields. Computer Vision and ImageUnderstanding, 63(1), 75–104.
Black, M. & Fleet, D. (2000). Probabilistic detection and tracking of motionboundaries. International Journal of Computer Vision, 38(3), 231–245.
Bober, M. & Kittler, J. (1994). Estimation of complex multimodal motion: Anapproach based on robust statistics and Hough transform. Image and VisionComputing, 12(10), 661–668.
Bobick, A. & Davis, J. (2001). The recognition of human movement using tem-poral templates. IEEE Transactions on Pattern Analysis and Machine In-telligence, 23(3), 257–267.
Bouthemy, P. & Fablet, R. (1998). Motion characterization from temporal cooc-currences of local motion-based measures for video indexing. In Proceedingsof IEEE International Conference on Pattern Recognition (pp. I: 905–908).
Bracewell, R. (2000). The Fourier Transform and Its Applications. New York,New York: McGraw-Hill.
Brodatz, P. (1966). Textures: A Photgraphic Album for Artists and Designers.New York, New York: Dover.
Brox, T., Bruhn, A. & Weickert, J. (2006). Variational motion segmentation withlevel sets. In Proceedings of European Conference on Computer Vision (pp.I: 471–483).
Bruce, N. & Tsotsos, J. (2008). Spatiotemporal saliency: Towards a hierarchicalrepresentation of visual saliency. In Proceedings of International Workshopon Attention in Cognitive Systems.
Burt, P., Hingorani, R. & Kolczynski, R. (1991). Mechanisms for isolating com-ponent patterns in the sequential analysis of multiple motion. In Proceedingsof Motion Workshop (pp. 187–193).
Burt, P., Yen, C. & Xu, X. (1982). Local correlation measures for motion anal-ysis: A comparative study. In Proceedings of IEEE Conference on PatternRecognition and Image Processing (pp. 269–274).
270
![Page 285: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/285.jpg)
Campbell, F. & Robson, J. (1968). Application of Fourier analysis to the visibilityof grating. Journal of Physiology, 197(3), 551–566.
Cannons, K., Gryn, J. & Wildes, R. (2010). Visual tracking using a pixelwisespatiotemporal oriented energy representation. In Proceedings of EuropeanConference on Computer Vision.
Canny, J. (1986). A computational approach to edge detection. IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 8(6), 679–698.
Chan, A. & Vasconcelos, N. (2005a). Classification and retrieval of traffic videousing auto-regressive stochastic processes. In Proceedings of IEEE IntelligentVehicle Symposium (pp. 771–776).
Chan, A. & Vasconcelos, N. (2005b). Probabilistic kernels for the classificationof auto-regressive visual processes. In Proceedings of IEEE Conference onComputer Vision and Pattern Recognition (pp. I: 846–851).
Chan, A. & Vasconcelos, N. (2007). Classifying video with kernel dynamic tex-tures. In Proceedings of IEEE Conference on Computer Vision and PatternRecognition.
Chan, A. & Vasconcelos, N. (2008). Modeling, clustering, and segmenting videowith mixtures of dynamic textures. IEEE Transactions on Pattern Analysisand Machine Intelligence, 30(5), 909–926.
Chan, A. & Vasconcelos, N. (2009). Layered dynamic textures. IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 31(10), 1862–1879.
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactionson Pattern Analysis and Machine Intelligence, 17(8), 790–799.
Chetverikov, D. & Peteri, R. (2005). A brief survey of dynamic texture descriptionand recognition. In Proceedings of International Conference on ComputerRecognition Systems (pp. 17–26).
Chomat, O. & Crowley, J. (1999). Probabilistic recognition of activity using localappearance. In Proceedings of IEEE Conference on Computer Vision andPattern Recognition (pp. II: 104–109).
271
![Page 286: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/286.jpg)
Comaniciu, D. & Meer, P. (2002). Mean shift: A robust approach toward fea-ture space analysis. IEEE Transactions on Pattern Analysis and MachineIntelligence, 24(5), 603–619.
Comaniciu, D., Ramesh, V. & Meer, P. (2003). Kernel-based object tracking.IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5),564–577.
Cormen, T., Leiserson, C. & Rivest, R. (1990). Introduction to algorithms. Cam-bridge, Massachusetts: MIT Press.
Costeira, J. & Kanade, T. (1998). A multibody factorization method for inde-pendently moving-objects. International Journal of Computer Vision, 29(3),159–179.
Cremers, D. & Soatto, S. (2005). Motion competition: A variational approach topiecewise parametric motion segmentation. International Journal of Com-puter Vision, 62(3), 249–265.
Cucchiara, R., Piccardi, M. & Mello, P. (2000). Image analysis and rule-basedreasoning for a traffic monitoring system. IEEE Transactions on IntelligentTransportation Systems, 1(2), 119–130.
Cula, O. & Dana, K. (2004). 3D texture recognition using bidirectional featurehistograms. International Journal of Computer Vision, 59(1), 33–60.
Danielsson, P. & Seger, O. (1990). Rotation invariance in gradient and higherorder derivative detectors. Computer Vision Graphics and Image Processing,49(2), 198–221.
Darrell, T. & Pentland, A. (1995). Cooperative robust estimation using layers ofsupport. IEEE Transactions on Pattern Analysis and Machine Intelligence,17(5), 474–487.
DeMenthon, D. (2002). Spatio-temporal segmentation of video by hierarchicalmean shift analysis. In Proceedings of Statistical Methods in Video ProcessingWorkshop.
Deng, Y. & Manjunath, B. (1998). NeTra-V: Toward an object-based video repre-sentation. IEEE Transactions on Circuits and Systems for Video Technology,8(5), 616–627.
272
![Page 287: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/287.jpg)
Deng, Y. & Manjunath, B. (2001). Unsupervised segmentation of color-textureregions in images and video. IEEE Transactions on Pattern Analysis andMachine Intelligence, 23(8), 800–810.
Derpanis, K. & Gryn, J. (2005). Three-dimensional nth derivative of Gaussianseparable steerable filters. In Proceedings of IEEE International Conferenceon Image Processing (pp. III: 553–556).
Derpanis, K., Sizintsev, M., Cannons, K. & Wildes, R. (2010). Efficient actionspotting based on a spacetime oriented structure representation. In Proceed-ings of IEEE Conference on Computer Vision and Pattern Recognition.
Derpanis, K. & Wildes, R. (2009a). Detecting spatiotemporal structure bound-aries: Beyond motion discontinuities. In Proceedings of Asian Conference onComputer Vision.
Derpanis, K. & Wildes, R. (2009b). Early spatiotemporal grouping with a dis-tributed oriented energy representation. In Proceedings of IEEE Conferenceon Computer Vision and Pattern Recognition.
Derpanis, K. & Wildes, R. (2010a). Dynamic texture recognition based on distri-butions of spacetime oriented structure. In Proceedings of IEEE Conferenceon Computer Vision and Pattern Recognition.
Derpanis, K. & Wildes, R. (2010b). The structure of multiplicative motionsin natural imagery. IEEE Transactions on Pattern Analysis and MachineIntelligence, 32(7).
Di Zenzo, S. (1986). A note on the gradient of a multi-image. Computer VisionGraphics and Image Processing, 33(1), 116–125.
Dickinson, S. (2009). The evolution of object categorization and the challengeof image abstraction. In S. Dickinson, A. Leonardis, B. Schiele & M. Tarr(Eds.), Object Categorization: Computer and Human Vision Perspectives(pp. 1–37). Cambridge: Cambridge University Press.
Dollar, P., Rabaud, V., Cottrell, G. & Belongie, S. (2005). Behavior recognitionvia sparse spatio-temporal features. In Proceedings of IEEE Workshop onPerformance Evaluation of Tracking and Surveillance (pp. 65–72).
273
![Page 288: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/288.jpg)
Doretto, G., Chiuso, A., Wu, Y. & Soatto, S. (2003a). Dynamic textures. Inter-national Journal of Computer Vision, 51(2), 91–109.
Doretto, G., Cremers, D., Favaro, P. & Soatto, S. (2003b). Dynamic texture seg-mentation. In Proceedings of IEEE International Conference on ComputerVision (pp. 1236–1242).
Duc, B., Schroeter, P. & Bigun, J. (1995). Spatio-temporal robust motion esti-mation and segmentation. In Proceedings of Computer Analysis of Imagesand Patterns (pp. 238–245).
Duda, R., Hart, P. & Stork, D. (2001). Pattern Classification. New York, NewYork: Wiley.
DuFaux, F. & Moscheni, F. (1995). Motion estimation techniques for digital TV:A review and a new contribution. Proceedings of the IEEE, 83(6), 858–876.
Edwards, M. & Greenwood, J. (2005). The perception of motion transparency:A signal-to-noise limit. Vision Research, 45(14), 1877–1884.
Efros, A., Berg, A., Mori, G. & Malik, J. (2003). Recognizing action at a distance.In Proceedings of IEEE International Conference on Computer Vision (pp.726–733).
Efros, A. & Freeman, W. (2001). Image quilting for texture synthesis and transfer.In Proceedings of SIGGRAPH.
Estrada, F. & Jepson, A. (2009). Benchmarking image segmentation algorithms.International Journal of Computer Vision, 85(2), 167–181.
Fahle, M. & Poggio, T. (1981). Visual hyperacuity: Spatio-temporal interpolationin human vision. Proceedings of the Royal Society of London - B, 213(1193),451–477.
Fanti, C., Zelnik Manor, L. & Perona, P. (2005). Hybrid models for humanmotion recognition. In Proceedings of IEEE Conference on Computer Visionand Pattern Recognition (pp. I: 1166–1173).
Feldman, D. & Weinshall, D. (2008). Motion segmentation and depth orderingusing an occlusion detector. IEEE Transactions on Pattern Analysis andMachine Intelligence, 30(7), 1171–1185.
274
![Page 289: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/289.jpg)
Felzenszwalb, P. & Huttenlocher, D. (2005). Pictorial structures for object recog-nition. International Journal of Computer Vision, 61(1), 55–79.
Fennema, C. & Thompson, W. (1979). Velocity determination in scenes con-taining several moving objects. Computer Graphics Image Processing, 9(4),301–315.
Fitzgibbon, A. (2001). Stochastic rigidity: Image registration for nowhere-staticscenes. In Proceedings of IEEE International Conference on Computer Vi-sion (pp. I: 662–669).
Fleet, D. (1992). Measurement of Image Velocity. Norwell, Massachusetts:Kluwer.
Fleet, D., Black, M. & Jepson, A. (1998). Motion feature detection using steerableflow fields. In Proceedings of IEEE Conference on Computer Vision andPattern Recognition (pp. 274–281).
Fleet, D. & Jepson, A. (1985). A cascaded filter approach to the construction ofvelocity selective mechanisms. Technical Report RBV TR-85-6, Departmentof Computer Science, University of Toronto.
Fleet, D. & Jepson, A. (1990). Computation of component image velocity fromlocal phase information. International Journal of Computer Vision, 5(1),77–104.
Fleet, D. & Langley, K. (1994). Computational analysis of non-Fourier motion.Vision Research, 34(22), 3057–3079.
Fleet, D. & Weiss, Y. (2005). Optical flow estimation. In N. Paragios, Y. Chen &O. Faugeras (Eds.), Handbook of Mathematical Methods in Computer Vision(pp. 241–260). New York, New York: Springer.
Fogel, I. & Sagi, D. (1989). Gabor filters as texture discriminator. BiologicalCybernetics, 61, 102–113.
Fothergill, A. (2007). Planet Earth [DVD]. BBC Warner.
Fowlkes, C., Belongie, S. & Malik, J. (2001). Efficient spatiotemporal groupingusing the Nystrom method. In Proceedings of IEEE Conference on ComputerVision and Pattern Recognition (pp. I:231–238).
275
![Page 290: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/290.jpg)
Freeman, W. & Adelson, E. (1991). The design and use of steerable filters. IEEETransactions on Pattern Analysis and Machine Intelligence, 13(9), 891–906.
Fukunaga, K. & Hostetler, L. (1975). The estimation of the gradient of a densityfunction, with applications in pattern recognition. IEEE Transactions onInformation Theory, 21(1), 32–40.
Galun, M., Sharon, E., Basri, R. & Brandt, A. (2003). Texture segmentation bymultiscale aggregation of filter responses and shape elements. In Proceedingsof IEEE International Conference on Computer Vision (pp. 716–723).
Galvin, B., McCane, B., Novins, K., Mason, D. & Mills, S. (1998). Recoveringmotion fields: An evaluation of eight optical flow algorithms. In Proceedingsof British Machine Vision Conference (pp. 1563–1567).
Gao, D., Mahadevan, V. & Vasconcelos, N. (2008). Static and space-time visualsaliency detection by self-resemblance. Journal of Vision, 8(7), 1–18.
Gavrila, D. (1999). The visual analysis of human movement: A survey. ComputerVision and Image Understanding, 73(1), 82–98.
Gear, C. (1998). Multibody grouping from motion images. International Journalof Computer Vision, 29(2), 133–150.
Ghoreyshi, A. & Vidal, R. (2006). Segmenting dynamic textures with Ising de-scriptors, ARX models and level sets. In Proceedings of Workshop on Dy-namical Vision (pp. 127–141).
Gibson, J. (1950). The Perception of the Visual World. New York, New York:Houghton Mifflin.
Gorelick, L., Blank, M., Shechtman, E., Irani, M. & Basri, R. (2007). Actionsas space-time shapes. IEEE Transactions on Pattern Analysis and MachineIntelligence, 29(12), 2247–2253.
Granlund, G. & Knutsson, H. (1995). Signal Processing for Computer Vision.Norwell, Massachusetts: Kluwer.
Greenspan, H., Belongie, S., Perona, P., Goodman, R., Rakshit, S. & Anderson,C. (1994). Overcomplete steerable pyramid filters and rotation invariance.In Proceedings of IEEE Conference on Computer Vision and Pattern Recog-nition (pp. 222–228).
276
![Page 291: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/291.jpg)
Greenspan, H., Goldberger, J. & Mayer, A. (2004). Probabilistic space-time videomodeling via piecewise GMM. IEEE Transactions on Pattern Analysis andMachine Intelligence, 26(3), 384–396.
Grimson, W. (1991). Object Recognition by Computer: The Role of GeometricConstraints. Cambridge, Massachusetts: MIT Press.
Hansen, M., Anandan, P., Dana, K., van der Wal, G. & Hansen, M. (1994).Real-time scene stabilization and mosaic construction. In IEEE Workshopon Applications of Computer Vision (pp. 54–62).
Haralick, R. (1979). Statistical and structural approaches to texture. Proceedingsof the IEEE, 67(5), 786–804.
Haralick, R. (1984). Digital step edges from zero crossing of second directionalderivatives. IEEE Transactions on Image Processing, 11(1), 58–68.
Harris, F. (1978). On the use of windows for harmonic analysis with the discreteFourier transform. Proceedings of the IEEE, 66(1), 51–83.
Haußecker, H. & Spies, H. (1999). Motion. In B. Jahne, H. Haußecker & P. Geißler(Eds.), Handbook of Computer Vision and Applications (pp. 309–396). NewYork, New York: Academic Press.
Heeger, D. (1988). Optical flow from spatiotemporal filters. International Journalof Computer Vision, 1(4), 279–302).
Heeger, D. (1992). Normalization of cell responses in cat striate cortex. VisualNeuroscience, 9(2), 181–198.
Heeger, D. & Pentland, A. (1986). Seeing structure through chaos. In Proceedingsof Workshop on Motion (pp. 131–136).
Heitz, F. & Bouthemy, P. (1993). Multimodal estimation of discontinuous opticalflow using Markov random fields. IEEE Transactions on Pattern Analysisand Machine Intelligence, 15(12), 1217–1232.
Hildreth, E. (1984). Measurement of Visual Motion. Cambridge: MIT Press.
Hildreth, E. & Koch, C. (1987). The analysis of visual motion: From computa-tional theory to neuronal mechanisms. Annual Review of Neuroscience, 10,477–533.
277
![Page 292: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/292.jpg)
Hoffman, D. (2000). Visual Intelligence: How We Create What We See. NewYork, New York: W.W. Norton and Company.
Horn, B. (1986). Robot Vision. Cambridge, Massachusetts: MIT Press.
Horn, B. & Schunck, B. (1981). Determining optical flow. Artificial Intelligence,17(1-3), 185–203.
Huang, C. & Chen, Y. (1995). Motion estimation method using a 3D steerablefilter. Image and Vision Computing, 13(1), 21–32.
Huang, T. & Netravali, A. (1994). Motion and structure from feature correspon-dences: A review. Proceedings of the IEEE, 82(2), 252–268.
Huang, T. & Tsai, R. (1981). Image sequence analysis: Motion estimation. InT. Huang (Ed.), Image Sequence Analysis (pp. 1–18). New York, New York:Springer.
Ichimura, N. (2000). A robust and efficient motion segmentation based on orthog-onal projection matrix of shape space. In Proceedings of IEEE Conferenceon Computer Vision and Pattern Recognition (pp. II: 446–452).
Irani, M., Rousso, B. & Peleg, S. (1994). Computing occluding and transparentmotions. International Journal of Computer Vision, 12(1), 5–16.
Jain, R. & Nagel, H. (1979). On the analysis of accumulative difference picturesfrom image sequences of real world scenes. IEEE Transactions on PatternAnalysis and Machine Intelligence, 1(2), 206–213.
Jahne, B. (2005). Digital Image Processing, sixth edition. Berlin: Springer.
Jepson, A. & Black, M. (1993). Mixture models for optical flow computation. InProceedings of IEEE Conference on Computer Vision and Pattern Recogni-tion (pp. 760–761).
Jepson, A., Richards, W. & Knill, D. (1996). Modal structure and reliable infer-ence. In D. Knill & W. Richards (Eds.), Perception as Bayesian Inference(pp. 63–92). Cambridge: Cambridge University Press.
Jhuang, H., Serre, T., Wolf, L. & Poggio, T. (2007). A biologically inspired systemfor action recognition. In Proceedings of IEEE International Conference onComputer Vision.
278
![Page 293: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/293.jpg)
Jojic, N. & Frey, B. (2001). Learning flexible sprites in video layers. In Proceedingsof IEEE Conference on Computer Vision and Pattern Recognition (pp. I:199–206).
Jung, Y., Lee, K. & Ho, Y. (2001). Content-based event retrieval using semanticscene interpretation for automated traffic surveillance. IEEE Transactionson Intelligent Transportation Systems, 2(3), 151–163.
Kass, M. & Witkin, A. (1987). Analyzing oriented patterns. Computer VisionGraphics and Image Processing, 37(3), 362–385.
Ke, Y., Sukthankar, R. & Hebert, M. (2005). Efficient visual event detectionusing volumetric features. In Proceedings of IEEE International Conferenceon Computer Vision (pp. I: 166–173).
Ke, Y., Sukthankar, R. & Hebert, M. (2007). Event detection in crowded videos.In Proceedings of IEEE International Conference on Computer Vision.
Kersten, D. (1991). Transparency and the cooperative computation of sceneattributes. In M. Landy & J. Movshon (Eds.), Computational Models ofVisual Processing (pp. 209–228). Cambridge, Massachusettes: MIT Press.
Knutsson, H. & Granlund, G. (1983). Texture analysis using two-dimensionalquadrature filters. In Proceedings of IEEE Workshop on Computer Architec-ture for Pattern Analysis and Image Database Management (pp. II: 768–784).
Knutsson, H., Wilson, R. & Granlund, G. (1983). Anisotropic nonstationaryimage estimation and its applications part I: Restoration of noisy images.IEEE Transactions on Communications, 31(3), 388–397.
Koenderink, J. (1984). What does the occluding contour tell us about solidshape? Perception, 13(3), 321–330.
Koenderink, J. (1991). Solid Shape. Cambridge, Massachusetts: MIT Press.
Koenderink, J. & van Doorn, A. (1987). Representation of local geometry in thevisual system. Biological Cybernetics, 55(6), 367–375.
Koenderink, J. J. & Richards, W. (1988). Two-dimensional curvature operators.Journal of the Optical Society of America-A, 5(7), 1136–1141.
279
![Page 294: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/294.jpg)
Konishi, S., Yuille, A., Coughlan, J. & Zhu, S. (2003). Statistical edge detection:Learning and evaluating edge cues. IEEE Transactions on Pattern Analysisand Machine Intelligence, 25(1), 57–74.
Kreyszig, E. (1959). Differential Geometry. Toronto: University of Toronto Press.
Kung, T. & Richards, W. (1988). Inferring “water” from images. In W. Richards(Ed.), Natural Computation (pp. 224–233). Cambridge, Massachusetts: MITPress.
Kwatra, V., Schodl, A., Essa, I., Turk, G. & Bobick, A. (2003). Graphcut tex-tures: Image and video synthesis using graph cuts. ACM Transactions onGraphics, 22(3), 277–286.
Landy, M. & Bergen, J. (1991). Texture segregation and orientation gradient.Vision Research, 31(4), 679–691.
Langer, M. & Mann, R. (2003). Optical snow. International Journal of ComputerVision, 55(1), 55–71.
Langley, K., Fleet, D. & Atherton, T. (1992). Multiple motions from instanta-neous frequency. In Proceedings of IEEE Conference on Computer Visionand Pattern Recognition (pp. 846–849).
Laptev, I. (2005). On space-time interest points. International Journal of Com-puter Vision, 64(2-3), 107–123.
Laptev, I., Marszalek, M., Schmid, C. & Rozenfeld, B. (2008). Learning realistichuman actions from movies. In Proceedings of IEEE Conference on ComputerVision and Pattern Recognition.
Lenz, R. (1990). Group Theoretic Methods in Image Processing. New York, NewYork: Springer.
Leung, T. & Malik, J. (2001). Representing and recognizing the visual appear-ance of materials using three-dimensional textons. International Journal ofComputer Vision, 43(1), 29–44.
Li, X. & Porikli, F. (2004). A hidden Markov model framework for traffic eventdetection using video features. In Proceedings of IEEE International Con-ference on Image Processing (pp. V: 2901–2904).
280
![Page 295: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/295.jpg)
Lin, W., Hays, J., Wu, C., Kwatra, V. & Liu, Y. (2006). Quantitative evaluationon near regular texture synthesis. In Proceedings of IEEE Conference onComputer Vision and Pattern Recognition (pp. I:427–434).
Lindeberg, T. (1997). Linear spatio-temporal scale-space. In Proceedings of Scale-Space (pp. 113–127).
Lindeberg, T. (1998). Feature detection with automatic scale selection. Interna-tional Journal of Computer Vision, 30(2), 79–116.
Little, J. & Verri, A. (1989). Analysis of differential and matching methodsfor optical flow. In Proceedings of IEEE Workshop on Visual Motion (pp.173–180).
Liu, C., Freeman, W., Adelson, E. & Weiss, Y. (2008). Human-assisted motionannotation. In Proceedings of IEEE Conference on Computer Vision andPattern Recognition.
Liu, H., Hong, T., Herman, M., Camus, T. & Chellappa, R. (1996). Accuracy vs.efficiency trade-offs in optical flow algorithms. In Proceedings of EuropeanConference on Computer Vision (pp. II:174–183).
Liu, J., Luo, J. & Shah, M. (2009). Recognizing realistic actions from videos‘in the wild’. In Proceedings of IEEE Conference on Computer Vision andPattern Recognition.
Lowe, D. (1985). Perceptual Organization and Visual Recognition. Norwell, Mas-sachusetts: Kluwer.
Lu, Z., Xie, W., Pei, J. & Huang, J. (2005). Dynamic texture recognition byspatio-temporal multiresolution histograms. In Proceedings of Workshop onMotion (pp. II: 241–246).
Lubin, J. (1992). Interactions among motion-sensitive mechanisms in humanvision. PhD thesis, University of Pennsylvania, Department of Psychology.
Lucas, B. & Kanade, T. (1981). An iterative registration technique with anapplication to stereo vision. In Proceedings of International Joint Conferenceon Artificial Intelligence (pp. 674–679).
281
![Page 296: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/296.jpg)
Malik, J., Belongie, S., Leung, T. & Shi, J. (2001). Contour and texture analysisfor image segmentation. International Journal of Computer Vision, 43(1),7–27.
Malik, J. & Perona, P. (1990). Preattentive texture discrimination with earlyvision mechanism. Journal of the Optical Society of America - A, 7(5), 923–932.
Marr, D. (1982). Vision: A Computational Investigation into the Human Repre-sentation and Processing of Visual Information. New York, New York: W.H.Freeman.
Marr, D. & Ullman, S. (1981). Directional selectivity and its use in early visualprocessing. Proceedings of the Royal Society of London - B, 211, 151–180.
Martin, D., Fowlkes, C. & Malik, J. (2004). Learning to detect natural imageboundaries using local brightness, color, and texture cues. IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 26(5), 530–549.
McCane, B., Novins, K., Crannitch, D. & Galvin, B. (2001). On benchmarkingoptical flow. Computer Vision and Image Understanding, 84(1), 126–143.
Meer, P., Mintz, D., Kim, D. & Rosenfeld, A. (1991). Robust regression methodsfor computer vision: A review. International Journal of Computer Vision,6(1), 59–70.
Megret, R. & DeMenthon, D. (2002). A survey of spatio-temporal grouping tech-niques. Technical Report CS-TR-4403, Department of Computer Science,University of Maryland.
Michaelis, M. & Sommer, G. (1995). A Lie group approach to steerable filters.Pattern Recognition Letters, 16(11), 1165–1174.
Middendorf, M. & Nagel, H. (2001). Estimation and interpretation of discontinu-ities in optical flow fields. In Proceedings of IEEE International Conferenceon Computer Vision (pp. I: 178–183).
Min, C. & Medioni, G. (2008). Inferring segmented dense motion layers using5D tensor voting. IEEE Transactions on Pattern Analysis and MachineIntelligence, 30(9), 1589–1602.
282
![Page 297: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/297.jpg)
Mitiche, A. & Bouthemy, P. (1996). Computation and analysis of image mo-tion: A synopsis of current problems and methods. International Journal ofComputer Vision, 19(1), 29–55.
Moeslund, T. & Granum, E. (2001). A survey of computer vision-based humanmotion capture. Computer Vision and Image Understanding, 81(3), 231–268.
Moeslund, T., Hilton, A. & Kruger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and ImageUnderstanding, 103(2-3), 90–126.
Morrone, M. & Owens, R. (1987). Feature detection from local energy. PatternRecognition Letters, 6(5), 303–313.
Mounts, F. (1969). A video encoding system with conditional picture-elementreplenishment. The Bell Systems Technical Journal, 48(7), 2545–2554.
Mulligan, J. (1992). Motion transparency is restricted to two planes. InvestigativeOphthalmology and Visual Science Supplemental (ARVO), 33, 1049.
Mutch, K. & Thompson, W. (1985). Analysis of accretion and deletion at bound-aries in dynamic scenes. IEEE Transactions on Pattern Analysis and Ma-chine Intelligence, 7(2), 133–138.
Nagel, H. (1986). Image sequences - ten (octal) years - from phenomenology to-wards a theoretical foundation. In Proceedings of IEEE International Con-ference on Pattern Recognition (pp. 1174–1185).
Nakayama, K. & Loomis, J. (1974). Optical velocity patterns, velocity sensitiveneurons, and space perception: A hypothesis. Perception, 3(1), 63–80.
Nelson, R. & Polana, R. (1992). Qualitative recognition of motion using temporaltexture. Computer Vision Graphics and Image Processing, 56(1), 78–89.
Nicolescu, M. & Medioni, G. (2003). Layered 4D representation and voting forgrouping from motion. IEEE Transactions on Pattern Analysis and MachineIntelligence, 25(4), 492–501.
Niebles, J., Wang, H. & Fei-Fei, L. (2008). Unsupervised learning of human actioncategories using spatial-temporal words. International Journal of ComputerVision, 79(3), 299–318.
283
![Page 298: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/298.jpg)
Ning, H., Han, T., Walther, D., Liu, M. & Huang, T. (2009). Hierarchical space-time model enabling efficient search for human actions. IEEE Transactionson Circuits and Systems for Video Technology, 19(6), 808–820.
Niyogi, S. (1995). Detecting kinetic occlusion. In Proceedings of IEEE Interna-tional Conference on Computer Vision (pp. 1044–1049).
Odobez, J. & Bouthemy, P. (1995). Robust multiresolution estimation of para-metric motion models. Journal of Visual Communication and Image Repre-sentation, 6(4), 348–365.
Oikonomopoulos, A., Patras, I. & Pantic, M. (2006). Spatiotemporal salientpoints for visual recognition of human actions. IEEE Transactions on Sys-tems, Man, and Cybernetics, Part B, 36(3), 710–719.
Otte, M. & Nagel, H. (1994). Optical flow estimation: Advances and comparisons.In Proceedings of European Conference on Computer Vision (pp. A:51–60).
Parent, P. & Zucker, S. (1989). Trace inference, curvature consistency, and curvedetection. IEEE Transactions on Pattern Analysis and Machine Intelligence,11(8), 823–839.
Pentland, A. (1986). Perceptual organization and the representation of naturalform. Artificial Intelligence, 28(3), 293–331.
Perona, P. (1995). Deformable kernels for early vision. IEEE Transactions onPattern Analysis and Machine Intelligence, 17(5), 488–499.
Polana, R. & Nelson, R. (1997). Temporal texture and activity recognition. InM. Shah & R. Jain (Eds.), Motion-based recognition. Norwell, Massachusetts:Kluwer.
Poppe, R. (2010). Higher order differential structure of images. Image and VisionComputing, 28(6), 976–990.
Porikli, F. & Wang, Y. (2001). An unsupervised multi-resolution object ex-traction algorithm using video-cube. In Proceedings of IEEE InternationalConference on Image Processing (pp. II: 359–362).
Potter, J. (1977). Scene segmentation using motion information. ComputerGraphics and Image Processing, 6(6), 558–581.
284
![Page 299: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/299.jpg)
Ramanan, D. Forsyth, D. (2003). Automatic annotation of everyday movements.In Proceedings of Neural Information Processing Systems.
Renninger, L. & Malik, J. (2004). When is scene recognition just texture recog-nition? Vision Research, 44(19), 2301–2311.
Rodriguez, M., Ahmed, J. & Shah, M. (2008). Action MACH a spatio-temporalmaximum average correlation height filter for action recognition. In Proceed-ings of IEEE Conference on Computer Vision and Pattern Recognition.
Rosch, E. & Mervis, C. (1975). Family resemblances: Studies in the internalstructure of categories. Cognitive Psychology, 7, 573–605.
Rosen, K. (1998). Discrete Mathematics and its Applications. Boston, Mas-sachusetts: McGraw-Hill.
Rousson, M., Brox, T. & Deriche, R. (2003). Active unsupervised texture segmen-tation on a diffusion based feature space. In Proceedings of IEEE Conferenceon Computer Vision and Pattern Recognition (pp. II: 699–704).
Rubner, Y., Puzicha, J., Tomasi, C. & Buhmann, J. (2001). Empirical evaluationof dissimilarity measures for color and texture. Computer Vision and ImageUnderstanding, 84(1), 25–43.
Rubner, Y. & Tomasi, C. (1996). Coalescing texture descriptors. In Proceedingsof ARPA Image Understanding Workshop (pp. 927–935).
Rubner, Y., Tomasi, C. & Guibas, L. (2000). The Earth Mover’s Distance as ametric for image retrieval. International Journal of Computer Vision, 40(2),99–121.
Saisan, P., Doretto, G., Wu, Y. & Soatto, S. (2001). Dynamic texture recogni-tion. In Proceedings of IEEE Conference on Computer Vision and PatternRecognition (pp. II:58–63).
Sapiro, G. & Ringach, D. (1996). Anisotropic diffusion of multivalued imageswith applications to color filtering. IEEE Transactions on Image Processing,5(11), 1582–1586.
Sawhney, H. & Ayer, S. (1996). Compact representations of videos throughdominant and multiple motion estimation. IEEE Transactions on PatternAnalysis and Machine Intelligence, 18(8), 814–830.
285
![Page 300: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/300.jpg)
Schuldt, C., Laptev, I. & Caputo, B. (2004). Recognizing human actions: A localSVM approach. In Proceedings of IEEE International Conference on PatternRecognition (pp. III: 32–36).
Seo, H. & Milanfar, P. (2009a). Detection of human actions from a single example.In Proceedings of IEEE International Conference on Computer Vision.
Seo, H. & Milanfar, P. (2009b). Static and space-time visual saliency detectionby self-resemblance. Journal of Vision, 9(12), 1–27.
Sharon, E., Galun, M., Sharon, D., Basri, R. & Brandt, A. (2006). Hierarchyand adaptivity in segmenting visual scenes. Nature, 442(17), 810–813.
Shechtman, E. & Irani, M. (2007). Space-time behavior-based correlation - OR- How to tell if two underlying motion fields are similar without computingthem? IEEE Transactions on Pattern Analysis and Machine Intelligence,29(11), 2045–2056.
Sheikh, Y., Haering, N. & Shah, M. (2006). Shape from dynamic texture forplanes. In Proceedings of IEEE Conference on Computer Vision and PatternRecognition (pp. II: 2285–2292).
Shi, J. & Malik, J. (1998). Motion segmentation and tracking using normalizedcuts. In Proceedings of IEEE International Conference on Computer Vision(pp. 1154–1160).
Shizawa, M. & Mase, K. (1991). A unified computational theory for motiontransparency and motion boundaries based on eigenenergy analysis. In Pro-ceedings of IEEE Conference on Computer Vision and Pattern Recognition(pp. 289–295).
Shroff, N., Turaga, P. & Chellappa, R. (2010). Motion vistas: Exploiting mo-tion for describing scenes. In Proceedings of IEEE Conference on ComputerVision and Pattern Recognition.
Shy, D. & Perona, P. (1994). X-Y separable pyramid steerable scalable kernels.In Proceedings of IEEE Conference on Computer Vision and Pattern Recog-nition (pp. 237–244).
Simoncelli, E. (1993). Distributed Analysis and Representation of Visual Motion.PhD thesis, Massachusetts Institute of Technology, Department of ElectricalEngineering and Computer Science.
286
![Page 301: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/301.jpg)
Simoncelli, E. & Freeman, W. (1995). The steerable pyramid: A flexible ar-chitecture for multi-scale derivative computation. In Proceedings of IEEEInternational Conference on Image Processing (pp. 444–447).
Simoncelli, E. & Heeger, D. (1998). A model of neuronal responses in visual areaMT. Vision Research, 38(5), 743–761.
Simoncelli, E. & Olshausen, B. (2001). Natural image statistics and neural rep-resentation. Annual Review of Neuroscience, 24, 1193–1216.
Spoerri, A. & Ullman, S. (1987). The early detection of motion boundaries.In Proceedings of IEEE International Conference on Computer Vision (pp.209–218).
Stein, A. & Hebert, M. (2007). Combining local appearance and motion cuesfor occlusion boundary detection. In Proceedings of British Machine VisionConference.
Stein, A. & Hebert, M. (2009a). Local detection of occlusion boundaries in video.Image and Vision Computing, 27(5).
Stein, A. & Hebert, M. (2009b). Occlusion boundaries from motion: Low-level de-tection and mid-level reasoning. International Journal of Computer Vision,82(3), 325–357.
Stevens, S. (1974). Patterns in Nature. Boston, Massachusetts: Little Brown.
Stiller, C. & Konrad, J. (1999). Estimating motion in image sequences: A tu-torial on modeling and computation of 2D motion. IEEE Signal ProcessingMagazine, 16(4), 70–91.
Szeliski, R., Avidan, S. & Anandan, P. (2000). Layer extraction from multi-ple images containing reflections and transparency. In Proceedings of IEEEConference on Computer Vision and Pattern Recognition (pp. I: 246–253).
Szummer, M. & Picard, R. (1996). Temporal texture modeling. In Proceedingsof IEEE International Conference on Image Processing (pp. III: 823–826).
Teo, P. & Hel Or, Y. (1998). Lie generators for computing steerable functions.Pattern Recognition Letters, 19(1), 7–17.
287
![Page 302: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/302.jpg)
Theodoridis, S. & Koutroumbas, K. (2006). Pattern Recognition, third edition.New York, New York: Academic Press.
Thompson, D. (1952). On Growth and Form. Cambridge: University Press.
Thompson, W., Mutch, K. & Berzins, V. (1985). Dynamic occlusion analysisin optical flow fields. IEEE Transactions on Pattern Analysis and MachineIntelligence, 7(4), 374–383.
Torr, P. & Zisserman, A. (1998). Concerning Bayesian motion segmentation,model averaging, matching and the trifocal tensor. In Proceedings of Euro-pean Conference on Computer Vision (pp. I: 511–527).
Tschumperle, D. & Deriche, R. (2005). Vector-valued image regularization withPDEs: A common framework for different applications. IEEE Transactionson Pattern Analysis and Machine Intelligence, 27(4), 506–517.
Tuceryan, M. & Jain, A. (1998). Texture analysis. In C. Chen, L. Pau & P. Wang(Eds.), Handbook of Pattern Recognition and Computer Vision (2nd Edi-tion). River Edge, New Jersey: World Scientific Publishing.
Turaga, P., Chellappa, R., Subrahmanian, V. & Udrea, O. (2008). Machinerecognition of human activities: A survey. IEEE Transactions on Circuitsand Systems for Video Technology, 18(11), 1473–1488.
Ullman, S. (1979). The Interpretation of Visual Motion. Cambridge, Mas-sachusetts: MIT Press.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York, NewYork: Springer.
Varma, M. & Zisserman, A. (2005). A statistical approach to texture classificationfrom single images. International Journal of Computer Vision, 62(1-2), 61–81.
Varma, M. & Zisserman, A. (2009). A statistical approach to material classifica-tion using image patch exemplars. IEEE Transactions on Pattern Analysisand Machine Intelligence, 31(11), 2032–2047.
Vega-Riveros, J. & Jabbour, K. (1989). Review of motion analysis techniques.IEE Proceedings, 136(6), 397–404.
288
![Page 303: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/303.jpg)
Verri, A. & Poggio, T. (1989). Motion field and optical flow: Qualitative proper-ties. Pattern Analysis and Machine Intelligence, 11(5), 490–498.
Vishwanathan, S., Smola, A. & Vidal, R. (2007). Binet-Cauchy kernels on dy-namical systems and its application to the analysis of dynamic scenes. In-ternational Journal of Computer Vision, 73(1), 95–119.
Wang, J. & Adelson, E. (1994). Representing moving images with layers. IEEETransactions on Image Processing, 3(5), 625–638.
Wang, J., Thiesson, B., Xu, Y. & Cohen, M. (2004). Image and video segmenta-tion by anisotropic kernel mean shift. In European Conference on ComputerVision (pp. II: 238–249).
Wang, Y. & Zhu, S. (2003). Modeling textured motion: Particle, wave and sketch.In Proceedings of IEEE International Conference on Computer Vision (pp.213–220).
Watson, A. & Ahumada, A. (1983). A look at motion in the frequency domain.In Proceedings of Motion Workshop (pp. 1–10).
Watson, A. & Ahumada, A. (1985). Model of human visual-motion sensing.Journal of the Optical Society of America - A, 2(2), 322–342.
Weinland, D., Ronfard, R. & Boyer, E. (2006). Free viewpoint action recognitionusing motion history volumes. Computer Vision and Image Understanding,103(2-3), 249–257.
Weiss, Y. (1997). Smoothness in layers: Motion segmentation using nonparamet-ric mixture estimation. In Proceedings of IEEE Conference on ComputerVision and Pattern Recognition (pp. 520–526).
Weisstein, E. (2003). CRC Concise Encyclopedia of Mathematics. Boca Raton,Florida: Chapman and Hall/CRC.
Wildes, R. & Bergen, J. (2000). Qualitative spatiotemporal analysis using anoriented energy representation. In Proceedings of European Conference onComputer Vision (pp. 768–784).
Willems, G., Tuytelaars, T. & Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In Proceedings of EuropeanConference on Computer Vision (pp. II: 650–663).
289
![Page 304: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/304.jpg)
Williams, D. & Sekuler, R. (1984). Coherent global motion percepts from stochas-tic local motions. Vision Research, 24(1), 55–62.
Williams, D., Tweten, S. & Sekuler, R. (1991). Using metamers to explore motionperception. Vision Research, 31(2), 275–286.
Willick, D. & Yang, Y. (1991). Experimental evaluation of motion constraintequations. Computer Vision Graphics and Image Processing, 54(2), 206–214.
Winston, P. (1994). Artificial Intelligence. Don Mills, Ontario: Addison-Wesley.
Witkin, A. & Tenenbaum, J. (1983). On the role of structure in vision. In J. Beck,B. Hope & A. Rosenfeld (Eds.), Human and Machine Vision (pp. 481–543).New York, New York: Academic Press.
Wong, S., Kim, T. & Cipolla, R. (2007). Learning motion categories using bothsemantic and structural information. In Proceedings of IEEE Conference onComputer Vision and Pattern Recognition.
Woolfe, F. & Fitzgibbon, A. (2006). Shift-invariant dynamic texture recognition.In Proceedings of European Conference on Computer Vision (pp. II: 549–562).
Yacoob, Y. & Black, M. (1999). Parameterized modeling and recognition ofactivities. Computer Vision and Image Understanding, 73(2), 232–247.
Yilmaz, A. & Shah, M. (2008). A differential geometric approach to representingthe human actions. Computer Vision and Image Understanding, 109(3),335–351.
Yu, W., Daniilidis, K., Beauchemin, S. & Sommer, G. (1999). Detection and char-acterization of multiple motion points. In Proceedings of IEEE Conferenceon Computer Vision and Pattern Recognition (pp. I: 171–177).
Yu, W., Daniilidis, K. & Sommer, G. (2001). Approximate orientation steerabilitybased on angular Gaussians. IEEE Transactions on Image Processing, 10(2),193–205.
Yu, W., Sommer, G., Beauchemin, S. & Daniilidis, K. (2002). Oriented structureof the occlusion distortion: Is it reliable? IEEE Transactions on PatternAnalysis and Machine Intelligence, 24(9), 1286–1290.
290
![Page 305: On the Role of Representation in the Analysis of Visual ...kosta/Links/Derpanis_Spactime_representation… · Abstract The problems under consideration in this dissertation centre](https://reader035.fdocuments.us/reader035/viewer/2022070822/5f26d4abc727f408af554301/html5/thumbnails/305.jpg)
Yuen, J., Russell, B., Liu, C. & Torralba, A. (2009). LabelMe video: Buildinga video database with human annotations. In Proceedings of IEEE Interna-tional Conference on Computer Vision.
Zaharescu, A. & Wildes, R. (2010). Anomalous behaviour detection using spa-tiotemporal oriented energies, subset inclusion histogram comparison andevent-driven processing. In Proceedings of European Conference on Com-puter Vision.
Zalesny, A., Ferrari, V., Caenen, G. & Van Gool, L. (2005). Composite texturesynthesis. International Journal of Computer Vision, 62(1-2), 161–176.
Zelnik-Manor, L. & Irani, M. (2006). Statistical analysis of dynamic actions.IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9),1530–1535.
Zucker, S., Rosenfeld, A. & Davis, L. (1975). General purpose models: Expecta-tions about the unexpected. In Proceedings of International Joint Conferenceon Artificial Intelligence (pp. 716–721).
291