
CONTENT-BASED RETRIEVAL OF DIGITAL VIDEO

Jolon Faichney, BIT (Hons.)

September 2004

A research thesis submitted in

fulfilment of the requirements for the degree of

Doctor of Philosophy

School of Information Technology

Gold Coast Campus

Principal Supervisor: Dr. Ruben Gonzalez

Associate Supervisor: Dr. Wayne Pullan


This work has not previously been submitted for a degree or diploma in any university. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

———————————————–

Jolon Faichney

September 2004


Abstract

In the next few years consumers will have access to large amounts of video and image data, either created by themselves with digital video and still cameras or by having access to other image and video content electronically. Existing personal computer hardware and software have not been designed to manage large quantities of multimedia content. As a result, research in the area of content-based video retrieval (CBVR) has been underway for the last fifteen years. This research aims to improve CBVR by providing an accurate and reliable shape-colour representation and by providing a new 3D user interface called DomeWorld for the efficient browsing of large video databases.

Existing feature extraction techniques designed for use in large databases are typically simple techniques, as they must conform to the limited processing and storage constraints exhibited by large scale databases. Conversely, more complex feature extraction techniques provide higher-level descriptions of the underlying data but are time consuming and require large amounts of storage, making them less useful for large databases. In this thesis a technique for medium to high-level shape representation is presented that exhibits efficient storage and query performance. The technique uses a very accurate contour detection system that incorporates a new asymmetry edge detector, which is shown to perform better than other contour detection techniques, combined with a new summarisation technique to efficiently store contours. In addition, contours are represented by histograms, further reducing space requirements and increasing query performance. A new type of histogram, called the fuzzy histogram, is introduced and applied to content-based retrieval systems for the first time. Fuzzy histograms improve the ranking of query results over non-fuzzy techniques, especially in low bin-count histogram configurations. The fuzzy contour histogram approach is compared with an exhaustive contour comparison technique and is found to provide equivalent or better results.

A number of colour distribution representation techniques were investigated for integration with the contour histogram, and the fuzzy HSV histogram was found to provide the best performance. When the colour and contour histograms were integrated, fewer bins were required overall, as each histogram compensates for the other's weaknesses. The result is that only a quarter of the bins were required compared with either the colour or contour histogram alone, further reducing query times and storage requirements.


This research also improves the user experience with a new user interface called DomeWorld that uses three-dimensional translucent domes. Existing user interfaces are designed either for image databases, for browsing videos, or for browsing large non-multimedia data sets. DomeWorld is designed to browse both image and video databases through a number of innovative techniques including hierarchical clustering, radial space-filling layout of nodes, three-dimensional presentation, and translucent domes that allow the hierarchical nature of the data to be viewed whilst also revealing the relationships between child nodes a number of levels deep.

A taxonomy of existing image, video, and large data set user interfaces is presented and the proposed user interface is evaluated within this framework. It is found that video database user interfaces have four requirements: context and detail, gisting, clustering, and integration of video and images. None of the 27 evaluated user interfaces satisfies all four requirements. The DomeWorld user interface is designed to satisfy all of the requirements and presents a step forward in CBVR user interaction.

This thesis investigates two important areas of CBVR, structural indexing and user interaction, and presents techniques which advance the field. These two areas will become very important in the future when users must access and manage large collections of image and video content.


Acknowledgements

I would like to thank my supervisor Dr. Ruben Gonzalez for his invaluable guidance and contribution to this research.

Portions of this research have been published in the following articles: [1, 2, 3].


Contents

Abstract v

Acknowledgements vii

1 Introduction 1

1.1 Requirements of a Content-based Retrieval System . . . . . . . . . . . . . . . . . . 1

1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 Conceptual View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2.2 Logical View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3.1 Ideal Conceptual View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3.2 Ideal Logical View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Background 17

2.1 Content-based Image and Video Retrieval Systems . . . . . . . . . . . . . . . . . . 17

2.1.1 Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.1.2 Video Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.2 User Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.1 Query-Result User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.2 Browsing User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


2.2.3 User Interaction Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.1 Temporal Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3.2 Spatial Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.3.3 Colour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.3.4 Texture and Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.3.5 Contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.3.6 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.3.7 Combining Points, Lines, and Surfaces . . . . . . . . . . . . . . . . . . . . . 43

2.3.8 Shape from Contour, Shading, and Texture . . . . . . . . . . . . . . . . . . 44

2.4 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.4.1 Shape Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.4.2 Spatial Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2.4.3 MPEG-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.5 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.6 Computational Models of the Visual Cortex . . . . . . . . . . . . . . . . . . . . . . 51

2.6.1 Primal Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

2.6.2 Grossberg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2.6.3 Heitger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

2.6.4 Walters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

2.6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3 Colour 55

3.1 Colour Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2 Colour Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.3 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.3.1 Colour Histogram Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.3.2 RGB Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.3.3 HSV Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.4 Fuzzy Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


3.4.1 Fuzzy Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.5 Colour Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.5.1 Colour Set Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.6 Prominent Colours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.6.1 Prominent Colours Storage and Querying . . . . . . . . . . . . . . . . . . . 69

3.6.2 Prominent Colours Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

3.6.3 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4 Edge and Texture 75

4.1 Edge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.2 Edge Detector Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.3 Multi-orientation Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.3.1 Multi-orientation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.3.2 Multi-orientation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.4 Asymmetry Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4.4.1 Asymmetry Detector Results . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.5 Thinning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.5.1 Gaussian Thinning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4.5.2 Gaussian Thinning Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.6 Asymmetry Edge Detector as a Computational Model of the Visual Cortex . . . . 94

4.7 Texture Inhibition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.7.1 Psychological and Perceptual Basis . . . . . . . . . . . . . . . . . . . . . . . 96

4.7.2 Texture Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.7.3 Texture Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.7.4 Texture Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.7.5 Texture Inhibition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.7.6 Comparison with Other Techniques . . . . . . . . . . . . . . . . . . . . . . . 105

4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5 Contour 109


5.1 Contour Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.1.1 Local Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.1.2 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.2 Contour Extraction Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.3 Identifying Edge Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.4 True Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.5 Edge Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.5.1 Edge Linking Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

5.5.2 Edge Linking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.5.3 Edge Linking Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5.6 Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.6.1 Contour-ends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

5.6.2 Thinning End-stopped Responses . . . . . . . . . . . . . . . . . . . . . . . . 125

5.6.3 Vertex Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

5.7 Contour Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

5.8 Hausdorff Distance Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.8.1 Hausdorff Distance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.8.2 Hausdorff Distance Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.9 Contour Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

5.9.1 Contour Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

5.9.2 Contour Similarity Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 132

5.9.3 Contour Similarity Experiments and Results . . . . . . . . . . . . . . . . . 136

5.9.4 Contour Similarity Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5.10 Contour Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5.10.1 Contour Histogram Experiments . . . . . . . . . . . . . . . . . . . . . . . . 140

5.10.2 Contour Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

5.10.3 Contour Histogram Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 141

5.11 Combined Contour and Colour Histograms . . . . . . . . . . . . . . . . . . . . . . 142

5.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143


6 Video 145

6.1 Video Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

6.1.1 Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

6.1.2 Camera Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.1.3 Shots and Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.1.4 Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.2 Video Retrieval Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.3 Shot Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.3.1 Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.3.2 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

6.3.3 Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.3.4 Compressed Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.3.5 X-ray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.3.6 Fast X-ray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

6.3.7 Colour + Contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.3.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.3.9 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

6.3.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

6.4 Scene Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

6.4.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

6.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

6.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

7 User Interaction 171

7.1 Existing Content-based Retrieval User Interfaces . . . . . . . . . . . . . . . . . . . 171

7.2 User Interface Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

7.2.1 Browsing User Interface Requirements . . . . . . . . . . . . . . . . . . . . . 173

7.2.2 Content-based Video Retrieval User Interface Requirements . . . . . . . . . 174

7.3 Visualisation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174


7.3.1 2D Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

7.3.2 Distortion-oriented Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 175

7.3.3 3D Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

7.3.4 Hypermedia Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

7.4 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

7.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.5 New Video User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.6 MountainView . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.7 Disc Tree and Goldleaf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

7.8 DomeWorld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

7.8.1 Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

7.8.2 Representative Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

7.8.3 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

7.8.4 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

7.9 VideoBrowser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

7.10 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

7.10.1 Context+Detail Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

7.10.2 Gisting Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

7.10.3 Clustering Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

7.10.4 Video and Image Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 197

7.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

8 Clustering 201

8.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

8.2 Weighted Springs Spatial Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 203

8.2.1 Hooke’s Weighted Springs Approach . . . . . . . . . . . . . . . . . . . . . . 206

8.2.2 Logarithmic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

8.2.3 Summing Attractive and Repulsive Forces Individually . . . . . . . . . . . . 208

8.2.4 Energy-based Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

8.2.5 Inserting Dummy Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209


8.2.6 MountainView Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

8.3 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

8.3.1 Multidimensional Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

8.3.2 Agglomeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

8.3.3 Hierarchical Divisive Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 220

8.3.4 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

8.3.5 DomeWorld Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

8.4 Clustering Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

9 Conclusions and Future Work 229

9.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

9.2 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

9.3 User Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

9.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

9.4.1 Structural Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

9.4.2 Spatial Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233

9.4.3 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233

9.5 Content-based Retrieval of Digital Video . . . . . . . . . . . . . . . . . . . . . . . . 234

A Human Vision 237

A.1 Visual System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

A.2 Retina – Colour and luminance reception . . . . . . . . . . . . . . . . . . . . . . . 239

A.2.1 Retinal Neurones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

A.2.2 Ganglion Cells – Non-directional Edge Detectors . . . . . . . . . . . . . . . 241

A.3 Lateral Geniculate Nucleus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

A.4 Primary Visual Cortex (V1, Area 17) . . . . . . . . . . . . . . . . . . . . . . . . 242

A.4.1 Simple Cells – Line and Bar Detectors . . . . . . . . . . . . . . . . . . . . . 244

A.4.2 Complex Cells – Movement Detectors . . . . . . . . . . . . . . . . . . . . . 245

A.4.3 End-inhibited Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

A.4.4 Spatial Frequency Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246


A.5 V2 and V3 (Areas 18 and 19) - Line-end and Corner Detection . . . . . . . . . . . 246

A.6 V4 and Inferotemporal Cortex (IT) - Shape, Colour, and Texture Detection . . . . 247

A.7 Medial Temporal Area (MT) - Global and Local Motion Detection . . . . . . . . . 248

A.8 High Level Vision Processing Theories . . . . . . . . . . . . . . . . . . . . . . . . . 248

A.8.1 Primal Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249

A.8.2 Recognition-By-Components . . . . . . . . . . . . . . . . . . . . . . . . . . 249

A.8.3 High Level Theory for Seeing and Imagining . . . . . . . . . . . . . . . . . 249

A.8.4 Features of Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

A.8.5 Motion Processing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

A.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252

B Texture 253

B.1 Texture Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

B.1.1 Harmonic Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

B.1.2 Evanescent Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

B.1.3 Indeterministic Component . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

B.2 Texture Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257


List of Figures

1.1 Conceptual and logical views of a content-based video retrieval system . . . . . . . 4

1.2 Object diagram for the structural elements and attributes of video. . . . . . . . . . 4

1.3 The DomeWorld user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 The feature extraction process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1 Content-based retrieval system architecture. . . . . . . . . . . . . . . . . . . . . . . 18

2.2 Image segmentation algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.3 Scale invariance by storing the angle between tangent vectors. . . . . . . . . . . . . 46

2.4 2D string. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.1 Colour wheel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.2 Colour histogram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.3 Test images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.4 Histogram results for a search on the Car image. . . . . . . . . . . . . . . . . . . . 61

3.5 Histogram results for a search on the Wedding image. . . . . . . . . . . . . . . . . 62

3.6 Histogram results for a search on the Bush image. . . . . . . . . . . . . . . . . . . 63

3.7 Fuzzy histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.8 Colour Set and Prominent Colours Results. . . . . . . . . . . . . . . . . . . . . . . 68

3.9 Prominent colours of the three car images. . . . . . . . . . . . . . . . . . . . . . . . 70

3.10 Results for the Car, Wedding, and Bush images using Gong’s histogram [4]. . . . . 72

4.1 Application of common edge detectors . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.2 Some common edge detectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.3 Kirsch mask example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78


4.4 Filters tested. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4.5 Gabor tuning response curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

4.6 Canny tuning response curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.7 Asymmetry detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.8 Asymmetry tuning curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.9 Combined edge detector and asymmetry inhibitor . . . . . . . . . . . . . . . . . . . 86

4.10 Tuned edge detector at 7.5° orientation offset. . . . . . . . . . . . . . . . . 87

4.11 Possible problem when tuned edge detector is placed over a corner. . . . . . . . . . 88

4.12 Corner tuning curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

4.13 Edge results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.14 Cube test image and edge responses . . . . . . . . . . . . . . . . . . . . . . . . . . 90

4.15 Thinning results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.16 Thinning neighbourhood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4.17 Position of Gaussian filters used for thinning. . . . . . . . . . . . . . . . . . . . . . 93

4.18 Potential double pixel lines after Gaussian thinning. . . . . . . . . . . . . . . . . . 93

4.19 Diagonal removal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.20 Asymmetry edge detector model of the visual cortex. . . . . . . . . . . . . . . . . . 95

4.21 Edge responses of a texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.22 Patch-suppressed cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

4.23 Moving average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.24 Variance of moving average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.25 The SAR moving window effect on the X ′ matrix. . . . . . . . . . . . . . . . . . . 103

4.26 The SAR parameters of Figure 4.21 (a). . . . . . . . . . . . . . . . . . . . . . . . . 105

4.27 Variance of SAR parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.1 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.2 Multiple orientation response scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.3 Edge linking scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5.4 Difference between relative location and orientation . . . . . . . . . . . . . . . . . . 116

5.5 Input images for edge linking experiments . . . . . . . . . . . . . . . . . . . . . . . 119


5.6 Edge linking results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.7 Edge linking comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.8 Contours extracted from the Plane image . . . . . . . . . . . . . . . . . . . . . . . 123

5.9 Vertex types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

5.10 Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5.11 Hausdorff results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

5.12 Colinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

5.13 Contour histogram and contour similarity results . . . . . . . . . . . . . . . . . . . 138

5.14 Combined colour and contour results . . . . . . . . . . . . . . . . . . . . . . . . . . 142

6.1 Temporal structure of a video sequence. . . . . . . . . . . . . . . . . . . . . . . . . 147

6.2 Optical flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.3 Template matching intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.4 Histogram matching intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 153

6.5 Intensity graph for the difference between frames using optical flow analysis. . . . . 155

6.6 X-ray images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

6.7 X-ray process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

6.8 X-ray intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.9 Colour + Contour histogram intensity graphs . . . . . . . . . . . . . . . . . . . . . 161

6.10 Scene intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

7.1 MountainView concept rendering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

7.2 MountainView user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

7.3 Disc Tree user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

7.4 Goldleaf user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

7.5 DomeWorld user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7.6 Circle layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

7.7 VideoBrowser user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

7.8 DomeWorld presenting the ‘Spy Game’ movie . . . . . . . . . . . . . . . . . . . . . 198

8.1 Basic weighted springs implementation based on Hooke’s Law. . . . . . . . . . . . 210


8.2 Other weighted springs implementations . . . . . . . . . . . . . . . . . . . . . . . . 211

8.3 Weighted springs with feature distance cubed . . . . . . . . . . . . . . . . . . . . . 212

8.4 Weighted springs with feature distance threshold . . . . . . . . . . . . . . . . . . . 214

8.5 Weighted springs with relaxed springs . . . . . . . . . . . . . . . . . . . . . . . . . 215

8.6 SS Tree structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

8.7 DomeWorld agglomeration grouping rules. . . . . . . . . . . . . . . . . . . . . . . . 223

8.8 DomeWorld agglomeration clustering technique. . . . . . . . . . . . . . . . . . . . . 225

8.9 DomeWorld agglomeration clustering of ‘Spy Game’ shots. . . . . . . . . . . . . . . 227

A.1 Kanizsa triangle and optic nerve pathway . . . . . . . . . . . . . . . . . . . . 238

A.2 Visual pathway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

A.3 Opponent colour. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

A.4 Ganglion cell receptive field. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

A.5 Primary visual cortex. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

A.6 Blob cell receptive fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

A.7 Simple cells. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

A.8 Orientation tuning curve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

A.9 Kosslyn’s [5] high level theory for seeing and imagining. . . . . . . . . . . . . . . . 250

B.1 Autocorrelation function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

B.2 Wavelet decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

B.3 Co-occurrence matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256


List of Tables

2.1 Levels of understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.1 Range of each of the colour zones used by Gong [4]. . . . . . . . . . . . . . . . . . 72

5.1 Bin parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.1 Peak detection convolution kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.2 Video cut detection results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

6.3 Video scene detection results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

7.1 Taxonomy of existing visualisation user interfaces. . . . . . . . . . . . . . . . . . . 180

7.2 Requirements for a content-based video retrieval user interface and their relationship with the taxonomy attributes. . . . . . . . . . . . . . . . . . . . . . . . . . 180

7.3 Features of new video browsing user interfaces. . . . . . . . . . . . . . . . . . . . . 195

8.1 Clustering properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

8.2 Agglomeration implementation comparison. . . . . . . . . . . . . . . . . . . . . . . 224


Chapter 1

Introduction

The recent conversion of video storage from analogue to digital has generated a greater need for content-based video retrieval (CBVR) research. This thesis progresses the field of CBVR by addressing a number of existing problems. In this chapter the requirements of a CBVR system are presented and the problems facing CBVR are identified. The vision for CBVR is then described and the scope of the project is defined. Finally an overview of each chapter in the thesis is presented.

1.1 Requirements of a Content-based Retrieval System

Content-based video retrieval has three requirements:

1. Users must be able to communicate their query through an interface;

2. Relationships between queries and content must be understood;

3. The system must be able to automatically decompose content for requirement (2).

Each requirement is deeply dependent on the query. The form of the query is a compromise between the capabilities of the user and the capabilities of the retrieval system. As will be revealed in the following paragraphs, all challenges surrounding content-based video retrieval result from our limited understanding of how the human brain works.

Ideally the user would present their query in a simple natural language expression and the computer would give the exact result the user is after. With this approach the computer takes the role of an expert. The expert must have the same frame of reference as the user. The expert's frame of reference has resulted from many years of professional experience combined with a lifetime of human experience, allowing the expert to relate very efficiently to the user. Giving a computer the same professional experience as the expert is a challenge because many of the expert's skills may be undocumented and difficult to define. Even more difficult is giving the computer the ability to communicate with humans with the same level of understanding that another human would possess, as computers currently do not have the same experiences that humans have. Therefore there is a compromise between the understanding of the computer system and the understanding of the user when the user delivers the query to the system.

A retrieval system must be able to process the representation it has of the content with a query and present the results. Ideally, the computer's representation of the content would be the same as ours, allowing it to execute queries using methods similar to our own. Unfortunately, it is only partially known how humans internally represent visual content [6, 5] and how queries on that content are performed [7, 8, 9]. Furthermore, it is only partially known how humans process content to achieve an internal representation [10, 6]. Therefore, a CBVR system will contain limited feature extraction, representation, and query execution compared with human beings.

A CBVR system achieves its internal representation of content through feature extraction and coding. The result is an image or video described by a series of numbers. The computer must take these numbers and provide meaningful results based on the query. The algorithms used to form the results must provide similar results to human perception and cognition even though the techniques used to arrive at the results may differ from those used in human vision.

To achieve an internal representation of the content the retrieval system must first process the content. Existing feature extraction techniques include neurophysiological [11, 12], signal processing [13], and statistical approaches [14, 15]. Due to limitations of computers such as storage, processing power, and working memory, the feature extraction techniques must be optimised to provide a succinct description of the content that is relatively fast to extract and provides sufficient information to accurately perform the user's query.

Therefore a CBVR system must provide retrieval techniques that operate similarly to human perception, but if this cannot be achieved then the CBVR system must at least provide similar results to those expected by human perception. In addition, these techniques are constrained by existing technological capabilities including processing power, storage, and working memory. Therefore the problems facing content-based video retrieval stem from the requirement to simulate human perception whilst working within the technological constraints of existing computer architectures.

1.2 Problem Definition

This section describes the problems currently facing the field of CBVR in light of the requirements identified in the previous section. A CBVR system can be viewed both conceptually and logically. The conceptual view of a CBVR system describes the characteristics of each component and the relationship with other parts of the system. The logical view is closer to the actual implementation and deals with the processes involved in each component. Therefore the conceptual view can be considered the ‘what’ whilst the logical view can be considered the ‘how’. The next two sections describe the components of the conceptual and logical views of a CBVR system and highlight problems currently facing each component.

1.2.1 Conceptual View

The conceptual view deals with the characteristics of each component and how they relate to other components, and is less concerned with the processes involved. Figure 1.1 (a) shows the three components of the conceptual view: feature representation, query representation, and user interaction. These components are described in the following sections.

Feature Representation

Feature representation forms the crux of a content-based retrieval system. The capabilities of all of the other components are dependent on the feature representation. The representation includes the types of features that will be extracted and the detail with which they will be represented. This directly affects the queries that the user can perform.

The feature representation depends on what the user wants to query and how the user conceptualises content. Immediately, we find a difference between how image and video retrieval systems represent features. Content-based image retrieval (CBIR) systems such as the QBIC system [16] often represent each image as a feature vector. Each element in the feature vector is a number describing one feature, for example, colour, texture, or shape. Video retrieval systems may also include simple image features but place a greater focus on the video structure [17]. For example, a video may consist of episodes, scenes, shots, and camera operations. A structural representation is object-oriented in nature and each object may contain attributes. A generalised view of the integrated structure of video and images is shown in Figure 1.2.
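
To make the contrast concrete, the sketch below is a hypothetical Python illustration (the class and field names are placeholders, not drawn from any particular system described in this thesis) of a flat feature-vector representation alongside a simplified structural, object-oriented representation of the kind depicted in Figure 1.2.

    from dataclasses import dataclass, field
    from typing import List

    # Flat representation: an image reduced to a fixed-length feature vector,
    # e.g. colour, texture, and shape statistics concatenated together.
    ImageFeatureVector = List[float]

    # Structural representation: video decomposed into nested objects, each
    # carrying its own attributes (names are illustrative only).
    @dataclass
    class Frame:
        colour_histogram: ImageFeatureVector
        shape_histogram: ImageFeatureVector

    @dataclass
    class Shot:
        start: int                              # first frame index
        end: int                                # last frame index
        frames: List[Frame] = field(default_factory=list)

    @dataclass
    class Scene:
        location: str
        shots: List[Shot] = field(default_factory=list)

    @dataclass
    class Video:
        title: str
        scenes: List[Scene] = field(default_factory=list)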

Existing research has generally focused on the structural aspects of video [18] and the attributes of images [15, 19]. These are highlighted with thick borders and bold italics in Figure 1.2. Less research has represented images as object-oriented structures because object-oriented structures are difficult to extract and comparing object-oriented structures at query time can become very complex.

The feature representation can also affect the user interface. Structural representations lend themselves to browsing user interfaces where the querying is implicit in the interaction. Conversely, feature vector representations lend themselves to conventional parametric query user interfaces and linear ordering of results ranked by similarity.


Figure 1.1: (a) Conceptual view of a CBVR system (feature representation, query representation, and user interaction) and (b) logical view of a CBVR system (user interface, query execution, feature indexing, and feature extraction).

Figure 1.2: Object diagram for the structural elements and attributes of video (Video, Episode, Scene, Shot, Transition, Camera Operation, Frame, 3D Object, 3D Surface, 2D Region, and Contour/Vertex, each with its associated attributes).


Query Representation

The query forms the link between how the computer understands the content and how the user understands the content. The query is limited by the feature representation, as this is the computer's understanding of the content. A query can be formed to compare structural relationships between objects as well as the attributes of each object. Queries on image attributes are common in CBIR systems as they are simple to implement [16]. However, less research has been performed on structural queries, for two reasons. The first reason is that images are rarely described structurally because of limitations in feature extraction techniques. Secondly, the possible permutations between image objects make query execution inefficient. However, structural queries are important. As an example, a user may want to search for a bus, which is identifiable by an elongated rectangle with small wheels at the front and rear base of the rectangle. Without the structural relationship of “front and rear base” a query system would return all images containing rectangles and circles, as opposed to only those images that contain rectangles and circles in an arrangement similar to a bus.
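
As a rough illustration only (a hypothetical Python sketch, not a query mechanism from this thesis), a structural predicate for the bus example might accept a set of regions only when circles lie near the front and rear base of an elongated rectangle:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        shape: str      # 'rectangle' or 'circle' (illustrative labels)
        x: float        # centre x
        y: float        # centre y
        width: float
        height: float

    def looks_like_bus(body: Region, others: List[Region]) -> bool:
        """Hypothetical structural test: an elongated rectangle with circles
        positioned near its front and rear base."""
        if body.shape != 'rectangle' or body.width < 2 * body.height:
            return False
        base_y = body.y + body.height / 2       # bottom edge of the body
        front_x = body.x - body.width / 2       # front of the body
        rear_x = body.x + body.width / 2        # rear of the body
        tol = 0.2 * body.width                  # arbitrary spatial tolerance

        def wheel_near(target_x: float) -> bool:
            return any(r.shape == 'circle'
                       and abs(r.y - base_y) < tol
                       and abs(r.x - target_x) < tol
                       for r in others)

        return wheel_near(front_x) and wheel_near(rear_x)

A plain attribute query, by contrast, would only test for the presence of rectangles and circles, without testing their spatial arrangement.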

User Interaction

The user interaction component of the conceptual view defines how the user will communicate the query to the CBVR system and how the computer will communicate the results. CBVR systems generally use a browsing interface to browse the hierarchical structure of video [17, 18]. This approach has proved successful as there has been considerable research into browsing hierarchical structures on computer systems. Communicating the query and results in image retrieval systems has been less successful. CBIR systems usually require the user to present a query image or enter some query parameters that return a list of results ranked by similarity [16, 20]. There are issues if the user is required to present a query image. Where does the query image come from and how does the user search for it? If the user is required to draw the query image then the query process could become very time consuming and the results will depend on their drawing ability. If the user is required to enter query parameters there are problems with selecting values that have visual meaning. If the user is required to select from a visual palette of default query parameters it can also be difficult for the user to perceive how those visual features will integrate into the image they are searching for. Therefore the current state of image retrieval user interfaces has a number of issues. CBVR systems have fewer issues; however, little work has been done to integrate both image and video user interfaces.

1.2.2 Logical View

The logical view deals primarily with the processes required to achieve the characteristics of the conceptual view. Each conceptual component consists of one or more logical components, as is shown in Figure 1.1 (b). This section describes the four logical components and current issues facing each component.

Feature Extraction

The types of features to be extracted are defined by the feature representation of the conceptual view. Whether these features can be extracted depends on the feature extraction techniques available. Feature extraction in images and video covers a multitude of techniques to extract colour [21, 15], texture [14], shape [20], objects [16], motion [17], relationships between objects [22], camera operations, shots, and scenes [23]. The purpose of feature extraction is to construct a feature representation that is similar to the way humans represent features. Researchers have taken three approaches to solving this problem:

1. Physical

2. Physiological

3. Statistical

The physical approach considers that an image is formed by detecting light rays in a scene. Therefore, by reversing the light paths the original scene can be reconstructed. Techniques for determining shape from shading use this approach [24]. The physiological approach models human vision and can be argued to be one step better than the physical approach. Even though it may not accurately reconstruct the original physical scene, it should still give similar results to what a human would perceive [25, 26]. The third approach is usually resorted to when the other two approaches are either too complex or the process is unknown. In this case the focus is on providing similar results rather than simulating the physical or physiological process. For example, a colour histogram is neither a physical approach nor a physiological approach but is simple to implement on a computer and provides good results [21]. The only limitation to the implementation of a full physiological approach is our limited understanding of how human vision works. Therefore most content-based retrieval systems use a combination of physical, physiological, and statistical approaches [16, 27]. Existing content-based retrieval systems are lacking in providing feature extraction techniques that are physiologically similar to human vision.
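
As a minimal sketch of the statistical approach mentioned above, a global colour histogram can be computed by counting how many pixels fall into each quantised colour bin. The Python/NumPy fragment below is illustrative only; the choice of 8 bins per channel is an arbitrary placeholder rather than a value used in this thesis.

    import numpy as np

    def colour_histogram(image: np.ndarray, bins_per_channel: int = 8) -> np.ndarray:
        """Global RGB colour histogram normalised to sum to 1.
        `image` is an H x W x 3 array of 8-bit RGB values."""
        # Quantise each channel into bins_per_channel levels.
        quantised = (image.astype(np.uint32) * bins_per_channel) // 256
        # Combine the three channel indices into a single bin index.
        index = (quantised[..., 0] * bins_per_channel + quantised[..., 1]) \
                * bins_per_channel + quantised[..., 2]
        hist = np.bincount(index.ravel(), minlength=bins_per_channel ** 3)
        return hist / hist.sum()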

Another considerable issue in feature extraction is performance. Feature extraction often involves computationally intensive techniques such as frequency transforms, convolutions, and statistical analysis. These techniques will be applied to every image in an image database or to a significant number of images in a video database, numbering into the hundreds of thousands. However, since feature extraction is generally an offline process, more time can be afforded in generating a more accurate representation of the content which will provide better query results later.


Feature Indexing

Feature indexing is performed to improve query performance. Feature indexing generally involves two steps. The first step is to transform the extracted features into a compact form that is easily indexed. The second step is to index the features so that common content-based querying techniques, such as nearest neighbour searches, can be performed efficiently. Most feature indexing techniques employed today represent a feature vector as a point in multi-dimensional space [28, 29, 30]. The similarity between images is described by the Euclidean distance in multi-dimensional space. The advantage of this approach is that very efficient multi-dimensional indexing techniques derived from B-tree-like structures [31, 32] can be applied. The problem is that a restriction is now applied to the query representation: only Euclidean distance comparisons, or measures based on a monotonic space, are possible. This is a problem because many features, such as histograms, do not perform well when a Euclidean distance comparison is used. In addition, structural relationships between objects cannot be mapped to a multi-dimensional feature vector. Therefore, even though existing indexing techniques are very efficient, if the field is to progress new feature indexing techniques are required.
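
The following sketch (illustrative only, not a component of this thesis) shows the two sides of this trade-off: a brute-force nearest-neighbour search over feature vectors using Euclidean distance, which is the operation a multi-dimensional index accelerates, together with histogram intersection, a bin-wise similarity measure often preferred over Euclidean distance when the features are histograms.

    import numpy as np

    def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
        # Feature vectors treated as points in multi-dimensional space
        # (smaller value = more similar).
        return float(np.linalg.norm(a - b))

    def histogram_intersection(a: np.ndarray, b: np.ndarray) -> float:
        # Intersection of two normalised histograms
        # (larger value = more similar).
        return float(np.minimum(a, b).sum())

    def nearest_neighbours(query: np.ndarray, database: np.ndarray, k: int = 5):
        # Brute-force k-nearest-neighbour search; an index structure would
        # avoid scanning every entry in a large database.
        distances = np.linalg.norm(database - query, axis=1)
        order = np.argsort(distances)[:k]
        return order, distances[order]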

Query Execution

As described in the previous section, query execution is dependent on the feature indexing technique. If multi-dimensional indexing is employed, the query execution component has little work to do as most of the work is performed by the feature index. Without an index the query component must perform the query itself on every image and object in the database. For large databases this can be very time consuming. For structural queries between video and/or image objects the permutations can be large even for just one pair of images. More research needs to be conducted in both the feature indexing and query execution components of content-based retrieval systems to allow complex queries to be efficiently executed.

User Interface

User interaction has already been discussed at the conceptual level. At the logical level the content-based retrieval system must be able to quickly return the results of a user's query. Generally a user interface operation should take no longer than 2 seconds to execute. The execution time depends upon other components such as the query execution component. User interface responsiveness is also dependent on the ability to access the images in the result set. For browsing user interfaces there are two major performance focal points: the layout of the images on the screen and the simultaneous display of a large number of images. Image layout may transform dynamically as the user adjusts query parameters. If the layout uses techniques such as force-directed springs then the number of pairwise calculations can be quite large [33]. Techniques are required to localise the spatial layout calculations. The other issue is displaying a large number of objects simultaneously and in realtime. Usually context+detail techniques are employed, which involve a transform of the viewing plane [34]; alternatively the layout is viewed directly in three dimensions [35]. Three-dimensional interaction involves further rendering complexities. Hardware acceleration can be used to aid in rendering the scene; however, hardware accelerators come with a limited amount of memory for the simultaneous storage of a large number of images, requiring regular access to images stored in main memory and restricting the memory bandwidth available to the hardware accelerator. Existing browsing techniques are quite limited due to these challenges.

1.3 Vision

In this section our vision for the ideal solution to the problems of CBVR identified in the previous sections is presented. The ideal content-based retrieval system begins with the conceptual and logical components of Figure 1.1.

1.3.1 Ideal Conceptual View

Feature Representation

It is our view that a content-based retrieval system should be able to represent features to the level of Figure 1.2. Therefore a video is represented structurally and each video object contains images that are also further decomposed into two and three dimensional objects. Each image may contain a number of objects that may persist between images in a video sequence. Each object in the structure is called a visual object and may contain both attributes and further constituent visual objects.

Query Representation

Queries should be represented both parametrically and structurally. Parametric queries can be executed more efficiently, whilst structural queries require further complexity. Structural queries operate on the relationships between visual objects. The primary structural relationship is spatial, both in two and three dimensions.

User Interaction

There are sufficient problems with the current querying approach in image retrieval that we believe the optimal approach is to browse a data space rather than enter query criteria. This approach has not been greatly investigated for image retrieval. There are two important aspects to presenting a video database to the user. Firstly, the visual objects must be organised by similarity in a layout that allows the user to locate the clusters of visual objects that they are searching for. Secondly, since the data set is large, the user interface must be able to present a large number of visual objects simultaneously with context and detail, allowing the user to see what it is they are currently looking at and where they can go from their current location. The next two sections describe our solutions to these two problems.

Figure 1.3: The DomeWorld user interface.

Layout The layout of the information space should be proportional to the similarity between visual objects. There are a number of ways to achieve this. One is to construct an hierarchical clustering of visual objects and arrange these clusters by their similarity. Alternatively, no explicit groupings are formed; instead, using a force-directed springs approach, the visual objects automatically align themselves into clusters. The problem with the force-directed springs approach is that its ability to form visible clusters relies on the feature axes being uncorrelated, which is rarely the case in image and video retrieval. However, clustering may not be as effective as force-directed springs in representing a global arrangement of visual object similarity.

Presentation Either technique can allow the user to navigate using both context and detail. We

believe the best approach for presenting a database that has both similarity between objects as well

as a hierarchical structure is in three dimensions. The similarity between objects can be represented

on a two dimensional plane whilst the hierarchical clustering can be represented in the third

dimension. We have developed two user interfaces to achieve this. One is called MountainView

which produces mountain peaks around dense clusters and the other is called DomeWorld where

translucent domes encapsulate clusters of images (Figure 1.3). MountainView is used for data sets

that do not have an explicit clustering, whilst DomeWorld is for those that do.


1.3.2 Ideal Logical View

Feature Extraction

A feature extraction process is required that is able to produce the feature representation of

Figure 1.2. The process we have devised is based on the neurophysiological path in the human

vision system up to the point where less is known about how the brain performs vision processing

functions. At this point other physical techniques are employed as well as high-level psychological

grouping theories. The process is shown in Figure 1.4.

Low-level Processing The low-level processing stages aim to follow the human vision pathway

as closely as possible from the retina to the visual cortex. These stages include opponent colour

representation, ganglion cells, simple cells, and complex cells. The low-level processing stage occurs

at 12 orientations 15° apart, approximating the 10° separation of simple and complex cells in the

visual cortex.
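As an illustration only, the sketch below builds a bank of 12 oriented filters spaced 15° apart. It uses odd-symmetric Gabor kernels as a generic stand-in for the simple-cell stage; the kernel size, sigma, and wavelength are illustrative assumptions, and the asymmetry edge detector actually used in this research is described later in the thesis.

    import numpy as np

    def gabor_kernel(theta, size=15, sigma=3.0, wavelength=6.0):
        # Odd-symmetric Gabor kernel at orientation theta (radians):
        # a Gaussian envelope multiplied by a sine carrier.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
        return envelope * np.sin(2.0 * np.pi * xr / wavelength)

    # Twelve orientations, 15 degrees apart, covering 0 to 180 degrees.
    bank = [gabor_kernel(np.deg2rad(15 * i)) for i in range(12)]

Convolving an image with each kernel in the bank gives one oriented response map per orientation, from which thinning and contour following can proceed.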

Medium-level Processing There are two paths at the medium-level processing stage. One

path is to process the contours in the image. The other path is to process the textures. As contour

processing is time consuming the texture processing stage is used to detect areas of simple texture

and subdue them so that they are not processed by contour following algorithms.

Texture processing occurs at multiple resolutions using the simple and complex cells as the

harmonic and directional components of a 2-D Wold decomposition [36]. The indeterministic

component is determined using MR-SAR (Multi-Resolution Simultaneous Auto Regression).

The contour following technique follows contours only 15° apart, resulting in contours that

have no sudden change in contour direction. This results in a large number of contours but

provides a very accurate representation for higher level processing. Additionally the medium-level

processing stage continues to approximate the visual cortex by modelling V2 which is the area of

the visual cortex that detects corners and vertices. These vertices are extracted, along with the number and orientation of their connecting edges.

High-level Processing High-level processing groups the medium-level components together.

All objects are formed by contours. Contours are firstly grouped into 2D regions by combining

vertex, contour, colour, and texture information. It is at this stage that attributes are extracted for

objects such as colour, texture, shape, and motion. These 2D regions may then form larger com-

posite objects through perceptual groupings such as linked movement. Using physical techniques,

the angle and shape of the 2D region in three dimensions can be extracted.

Video is segregated into shots and scenes using these high-level visual object descriptions. By

comparing the objects within a video frame more reliable scene extraction can be performed.


Figure 1.4: The feature extraction process. (The stages, from low-level to high-level, are: raw image data, local edge detection, texture identification and inhibition, contour extraction, end-stopped detection, vertex extraction, 2D region formation, shape from shading, shape from texture, surface extraction, 3D object extraction, motion identification, shot extraction, and scene extraction.)


Feature Indexing

As noted earlier existing multidimensional indexing techniques based on the Euclidean distance

are not suitable for histograms or structural comparisons. A new technique is presented that works

solely on the similarity between objects. A hierarchy is created with nodes that contain similar

objects as well as a representative object. Finding a similar visual object is performed by finding the most similar representative in the top-level node and drilling down until a leaf node is reached. Such a

technique is less deterministic than multidimensional indexing techniques but is more flexible and

is well suited for browsing user interfaces.
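A minimal sketch of this drill-down idea is given below. The node structure, the greedy descent, and the similarity callback are illustrative assumptions rather than the indexing scheme developed in Chapter 8.

    class Node:
        # A cluster node: a representative object plus either child nodes
        # (internal node) or stored objects (leaf node).
        def __init__(self, representative, children=None, objects=None):
            self.representative = representative
            self.children = children or []
            self.objects = objects or []

    def find_similar(root, query, similarity):
        # Greedily follow the child whose representative is most similar to
        # the query until a leaf is reached, then return the best stored object.
        node = root
        while node.children:
            node = max(node.children,
                       key=lambda child: similarity(query, child.representative))
        return max(node.objects, key=lambda obj: similarity(query, obj))

Because the descent only ever compares the query against representatives, any similarity measure can be plugged in, including histogram and structural comparisons that have no Euclidean embedding.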

User Interface

The user interface should employ good user interaction techniques such as animation, zooming,

point and click interaction, and simplicity. The user should know where they are, what they can

do, and where they can go. The user interface should be presented simply in three dimensions so

that ornaments do not distract from the purpose of the user interface. Images should be displayed clearly as billboards that always face the user. Ideally, smooth shaped objects should be used rather than polygon approximations for shapes such as the mountains in MountainView and the translucent domes

in DomeWorld (see Figure 1.3).

1.4 Scope

The ideal CBVR system presented in the previous section consists of multiple components and

each component consists of many complex subcomponents. Investigating all of these components

and subcomponents is beyond the scope of this research. Boundaries have been set on the scope

of the research so that a depth may be achieved rather than a shallow breadth. The scope of each

component of this research is defined below.

Feature extraction Ideally a complete spatio-temporal three dimensional decomposition would

be provided as the output from a feature extraction process. Even though the ultimate target

is for a complete decomposition, it is beyond the scope of this research. Therefore this research

will aim to improve current two dimensional feature extraction techniques providing accurate and

robust two dimensional representations that will support further research into the stages of feature

extraction that provide a complete three dimensional decomposition.

Feature indexing Our approach to user interaction is different to existing content-based re-

trieval systems and therefore a different emphasis is placed on the feature indexing phase. Rather

than providing a technique that allows for efficient behind the scenes indexing and retrieval, a

technique is required that allows for efficient structuring of information for presentation to the


user. Therefore our goal is not to improve on the efficiency of existing multidimensional indexing

techniques that are used in existing CBVR systems but instead to devise a hierarchical clustering

scheme for the main purpose of presentation rather than retrieval.

Query representation and execution Ideally a CBVR system would allow the user to query

based on all object attributes and spatial relationships. Even though this research investigates

spatial similarities between images it does not attempt to investigate queries that involve the

spatial relationships between objects. A major contribution of this research is the ability for the

query to be implicit in the browsing of the information space.

User interaction In terms of the user interface, this research attempts to address as many as possible of the user interaction problems that currently exist with CBVR systems. This is achieved

by providing a user interface that combines the best features of existing video and image retrieval

user interfaces and also user interfaces that are designed for browsing large data spaces. However,

it is beyond the scope of this research to support the editing and authoring of video and image

content within the user interface.

1.5 Contributions

The goal of this research is not to solve all of the problems facing content-based video retrieval but to

lay a foundation that will pave the way for the ideal system described in Section 1.3. As a result this

research has made a number of major contributions in edge detection, contour extraction, histogram

representation, and user interaction along with minor contributions in texture identification and

interaction, vertex extraction, colour representation, and video segregation.

A major contribution of this research is the edge detector which has been designed to provide

tight positional and orientation tuning. It is unique in that it combines an asymmetry detector

with a standard edge detector. The precision of the new edge detector allows thinning techniques

to be developed based on assumptions of how and where edge responses will occur resulting in

an edge map that is ideal for contour following. The edge responses are also useful for texture

identification and can be combined with other texture processing techniques.

A contour following algorithm has been developed which also makes assumptions about the

types of edge responses that will occur based on the new edge detector. The contour following

technique is designed to detect sharp changes in contour direction to separate out mostly straight

and curved contours at their junctions. The result is a contour extraction method that is very

accurate and robust and extracts contours that rarely contain false junctions. A vertex extraction

technique is also presented based on the physiological evidence of contour-end detectors. Once

again the vertices are very accurate and can detect vertices with small angles due to the orientation

tuning and thinning of the new edge detector.


A new histogram construction technique is presented called the fuzzy histogram that allows

colour, contour, and other distribution information to be represented with a smaller number of

bins and is less sensitive to small changes between images. For colour representation the fuzzy

histogram achieves results that are just as accurate but with fewer bins than standard colour histograms. For contour representation, fuzzy histograms provide results as good as a brute-force comparison of individual contours. Combining colour and contour information allows even smaller fuzzy histograms to be used. The advantage of fuzzy histograms is that existing histogram comparison techniques can still be used, as only the method of constructing the histogram differs.
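The exact membership functions used in this thesis are defined in later chapters; the sketch below is a minimal illustration that spreads each sample over its nearest bins using a triangular membership function, which is the essential difference from hard binning.

    import numpy as np

    def fuzzy_histogram(values, n_bins, value_range=(0.0, 1.0)):
        # Each sample contributes to bins in proportion to a triangular
        # membership centred on each bin, instead of incrementing one bin.
        lo, hi = value_range
        width = (hi - lo) / n_bins
        centres = lo + (np.arange(n_bins) + 0.5) * width
        hist = np.zeros(n_bins)
        for v in np.asarray(values, dtype=float):
            weights = np.clip(1.0 - np.abs(v - centres) / width, 0.0, None)
            if weights.sum() > 0:
                hist += weights / weights.sum()
        return hist / max(len(values), 1)

Because the output is still an ordinary histogram, existing comparison measures such as L1 distance or histogram intersection can be applied unchanged.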

A new user interface called DomeWorld is presented that takes a unified approach to both video

and image retrieval. For image retrieval the DomeWorld user interface is significantly different to

existing user interfaces taking a browsing approach rather than a query-results approach avoiding

the many problems facing query-result user interfaces. For video retrieval DomeWorld provides

the advantages of a three dimensional user interface as well as a hierarchical presentation which is

ideal for the temporal structure of video content.

When combined, the new feature extraction, representation, and user interaction techniques provide

a better experience for the user and allow images and video content to be more accurately organised

by shape, texture, and colour information.

1.6 Thesis Outline

This introduction chapter and the following background chapter provide a basis for the rest of

the thesis. Each chapter after the background chapter focuses on one main component of the

research performed for this thesis including colour processing, edge and texture detection, contour

extraction, video representation, user interaction, and clustering. The last chapter provides a

discussion on the research including conclusions and future directions. Additional information on

human vision and texture processing is provided in the appendices. Below is a summary of each

chapter in the thesis.

Chapter 2 - Background Chapter 2 presents the background literature relevant to this the-

sis. It begins by reviewing existing content-based image and video retrieval systems. It then

investigates each component of a CBVR system including user interaction, feature extraction, and

representation. A review of human vision research is presented to provide a basis for new techniques

presented in the following chapters.

Chapter 3 - Colour Colour is the most basic form of feature extraction. This chapter inves-

tigates existing colour models and distribution representations. Two new distribution representa-

tions are presented, fuzzy histograms and prominent colours, and these are compared with existing

techniques.


Chapter 4 - Edge and Texture Chapter 4 investigates the detection of edges for the purpose

of contour extraction. A new edge detector is developed and is shown to be superior to existing

edge detection techniques in terms of positional and orientation tuning. The edges are used to

identify texture regions and the boundaries between them. Texture regions are inhibited to allow

more reliable higher level processing of edges.

Chapter 5 - Contour Chapter 5 presents a contour following technique that takes advantage

of the positional and orientation tuning of the edge detector presented in the previous chapter.

Techniques for representing and comparing contours are investigated including contour summaries

and fuzzy histograms. It is shown that fuzzy histogram representation and comparison performs

as well as contour summary representation and comparison. Vertices are extracted using the edge

detector of the previous chapter arranged in a physiological form to detect contour ends. Contour-

ends are combined to form vertices which can then be linked to contours extracted using the

contour following technique.

Chapter 6 - Video We present a video segregation technique to separate video into shots and

scenes that is based on high-level image features. Our technique is compared with other techniques

such as colour histograms and X-ray. We also present an optimised X-ray approach which performs

better than existing techniques.

Chapter 7 - User Interaction In Chapter 7 existing CBVR and information space user inter-

faces are investigated and a taxonomy of user interfaces is produced based on attributes relevant to

CBVR. The taxonomy identifies the weaknesses of existing image and video retrieval user interfaces

and how the feature sets of the two user interfaces are largely disjoint. Three new browser-based

user interfaces are presented to solve the problems of interacting with a CBVR system. The Dome-

World user interface is found to address more of the CBVR user interface issues than existing user

interfaces.

Chapter 8 - Clustering The DomeWorld and MountainView user interfaces of Chapter 7

require spatial and hierarchical clustering techniques. This chapter investigates many forms of

clustering data including conventional multidimensional indexing techniques for the purposes of

visualising the CBVR information space. New spatial clustering and hierarchical clustering tech-

niques are presented for both the MountainView and DomeWorld user interfaces.

Chapter 9 - Conclusions and Future Work In Chapter 9 we collate the results of the previous

chapters discussing the contributions that have been made and discuss future directions for this

research.


Appendix A - Human Vision Appendix A provides a detailed review of human vision pro-

cessing from low-level neurophysiological processing to high-level theories of vision.

Appendix B - Texture Appendix B provides a detailed review of various techniques and models

for extracting and segmenting texture.


Chapter 2

Background

This chapter presents the state of the art in content-based video retrieval. Since there are overlaps

in functionality between CBIR and CBVR systems both will be presented here. Our review begins

with complete CBIR and CBVR systems followed by a more detailed review of the components

of a CBVR system. This chapter is completed with a review of physiological and psychological

knowledge of the workings of human vision as a basis for image processing and matching.

2.1 Content-based Image and Video Retrieval Systems

The focus of this research is on content-based video retrieval, which differs in many aspects from content-based image retrieval; even so, there is a great deal of shared functionality between

image and video retrieval systems. These common components and their interactions are shown in

Figure 2.1. A CBIR system deals with a large collection of potentially independent images which

have no temporal information or relationship. Therefore, the user may not have any preconception

of a structural relationship between images in the database. The result is that the user is looking

for either a particular image or a particular type of image and is not concerned with that image’s

relationship with other images. Content-based video retrieval could also be approached in the

same way, however, since video adds the temporal dimension the user’s interaction can be entirely

different.

Videos and movies are a form of communication where the story is told or portrayed over a

period of time. A video is a logical progression of elements from start to finish and is generally

intended to be viewed in that order. Therefore, if a user has seen a video before and is searching

for a part of the video, then the frames, shots, and scenes surrounding the target image are helpful

in the user’s quest for the desired portion of video. The temporal hierarchy of frames, shots, and

scenes also allows for another form of query where the user searches for the collection of images

which forms a particular scene.


Figure 2.1: Content-based retrieval system architecture (components shown: feature extraction; data representation and access; video to be indexed; query video; query; browse).

The most important difference between CBVR and CBIR systems is that the user’s intentions

are most likely to be different for each system. The user’s intentions will affect the type of user

interface, the features that need to be extracted, and the form of data representation and access.

Even so, many of the techniques used in CBIR systems can also be used in CBVR systems. In this

section a representative selection of CBIR systems are presented as well as a selection of existing

CBVR systems.

2.1.1 Image Retrieval

The World Wide Web is well suited to CBIR applications due to the large store of images readily

accessible by web crawlers and the accessibility of CBIR servers by millions of clients all over the

world through a common and simple HTML user interface. As a result many CBIR systems today

are web-accessible [37]. CBIR systems designed to search the web must be able to handle all types

of images including natural photos, cartoons, paintings, technical drawings, objects, landscapes,

medical imagery, and so on. Therefore most CBIR systems are very generic in the types of images

they support. Even though there is a broad variety of images available on the web, existing systems

primarily cater for natural photos focusing on colour, texture, shape, and locality query attributes.

A recent survey on CBIR systems reviewed 58 systems from both research and industry [38].

The vast number of essentially complete systems indicates that the field of CBIR is no longer in

its infancy. However, this is not to suggest that existing systems have been perfected; in fact, if the human brain is to be used as a benchmark then there is a great deal of research still to be

performed. Below is a review of six prominent CBIR systems that are representative of the field

because of their distinct approaches.

QBIC

IBM’s Query By Image Content (QBIC) [16] system is perhaps the oldest and most well known

complete CBIR system. QBIC addresses most of the issues of a CBIR system including database

indexing, colour and texture feature extraction, object identification, and querying.

The QBIC system allows user annotation of images when they are added to the database.

The annotation may take the form of a simple text description or semi-automatic identification

of objects within the image. The semi-automatic identification of objects requires the user to

select the object with the mouse whilst the computer shrinks the outline to the edges of the object

through a process called interactive outlining. Colour, texture, and shape features are computed

from each object. Average RGB, YIQ, Lab, and MTM co-ordinates are computed for each object.

An RGB histogram is constructed and the colours are clustered based on the MTM values of each

bin producing 256 representative colours. Texture is represented using Tamura’s texture features

[39] which include the three dimensions of coarseness, contrast, and directionality. Finally, shape is

represented by area, circularity, eccentricity, major axis orientation, and a set of algebraic moment

invariants.

QBIC allows the user to perform nearest neighbour queries where the system returns the N

most similar images to the query parameters. The query parameters may be specified through

pickers such as colour and texture pickers, by selecting an object and querying by the attributes

of the identified object, through a sketch, or by an entire image search. Features are treated

as points in a Euclidean space and Euclidean distances between the points are used as feature

distances. Euclidean feature distances have the advantage of allowing the data to be stored in

multi-dimensional indexes such as the R*-tree used in the QBIC system. The access efficiencies

provided by the R*-tree are analogous to the access efficiency B-trees provide for one-dimensional

databases.
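A minimal sketch of this Euclidean treatment of features is shown below; the weighting and the linear scan are illustrative assumptions, with the R*-tree serving only to answer the same nearest-neighbour query more efficiently.

    import numpy as np

    def feature_distance(a, b, weights=None):
        # Weighted Euclidean distance between two feature vectors.
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        w = np.ones_like(a) if weights is None else np.asarray(weights, dtype=float)
        return float(np.sqrt(np.sum(w * (a - b) ** 2)))

    def n_most_similar(query, feature_vectors, n=10):
        # Linear-scan nearest-neighbour search; an index replaces this loop.
        return sorted(feature_vectors,
                      key=lambda vec: feature_distance(query, vec))[:n]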

The QBIC system is notable due to being one of the first systems to integrate many now-

standard CBIR techniques which continue to be used in more recent systems and also due to its

ongoing use in industry.

CBVQ, SaFe, VisualSEEK, and WebSEEK

CBVQ [27], SaFe [22], VisualSEEK [40], and WebSEEK [37] are all based around a similar CBIR

platform. These systems are unique in that both colour and texture are represented in relatively

simple one bit forms. Instead of representing colour distribution with a histogram, a representation

called colour sets is used. A colour set is a colour histogram where the value of each bin can only


be one or zero. A large number of bins are used to compensate for the highly quantised nature of

the binary bin count representation. 166 bins are used in the HSV colour space resulting in 21

bytes required to represent the colour set. Since natural images contain subtle changes in colour

the images are subsampled and a median filter is applied before the image is quantised into the

166 bin HSV colour space. Regions are extracted by selecting one colour bin at a time and must

meet certain criteria such as containing a minimum number of pixels.
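The sketch below illustrates the one-bit colour set idea on a coarse uniform HSV quantisation; the 166-bin quantiser actually used by these systems is non-uniform, and the presence threshold here is an assumption.

    import numpy as np

    def colour_set(hsv_pixels, bins=(18, 3, 3), min_fraction=0.01):
        # Quantise HSV pixels (N x 3, components in [0, 1]) into a coarse
        # histogram, then keep one bit per bin: 1 if the bin holds at least
        # min_fraction of the pixels, otherwise 0.
        hsv = np.asarray(hsv_pixels, dtype=float)
        hist, _ = np.histogramdd(hsv, bins=bins, range=[(0.0, 1.0)] * 3)
        fractions = hist.ravel() / max(len(hsv), 1)
        return (fractions >= min_fraction).astype(np.uint8)

Packed into bits, a binary vector of roughly this size occupies around the 21 bytes quoted above.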

Textures are also represented in a one bit form called texture sets. Texture regions are identified

from the grey-level image through a QMF subband decomposition using the Haar filter followed by

thresholding. Regions are extracted by merging textures identified in each subband. The binary

texture set represents which subbands are active for a region of texture resulting in a total of 9

bits to represent texture.

Even though the original CBVQ system [27] allowed the user to select colours for image querying

its primary purpose was for finding similar images to a query image. Subsequent systems such as

SaFe (Integrated Spatial and Feature Image Query) [22] supported more advanced forms of region

queries. A region is defined by its centroid, width, and height, whilst spatial relationships between

regions are represented by 2D strings. The SaFe region querying system is quite flexible in allowing

the user to specify rectangular regions of colour as part of the query. The user can weight the

importance of spatial relationship, features, size, and region properties as part of the query.

The CBVQ and SaFe approaches were extended to the web through the WebSEEK system [37].

WebSEEK provides an HTML interface allowing the user to specify a query image, enter query

parameters, and view results. The major contributions of the CBVQ and related systems are the

relatively simple and compact one bit feature representation and the relatively advanced spatial

querying abilities.

ARBIRS

Unlike QBIC and CBVQ where texture querying is merely an appendage to the system, the Ad-

vanced Region-Based Image Retrieval System (ARBIRS) [4] considers texture to be a fundamental

component in the understanding of an image. The first stage of feature extraction identifies tex-

tured areas. The image is subdivided into 24 × 24 pixel blocks and edge density and coarseness

measures are computed from the first-order local derivative operator. Blocks with an edge density

less than 25% are discarded. Texture regions are identified by joining blocks with similar colour

histograms.
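As a sketch of the block-level texture screening described above: the block size and 25% density threshold come from the text, while the use of simple image gradients as the first-order local derivative operator and the edge threshold value are assumptions.

    import numpy as np

    def textured_blocks(grey, block=24, density_threshold=0.25, edge_threshold=0.1):
        # Mark 24 x 24 blocks whose edge density is at least 25%, using the
        # gradient magnitude of the grey-level image as a first-order edge measure.
        grey = np.asarray(grey, dtype=float)
        gy, gx = np.gradient(grey)
        edges = np.hypot(gx, gy) > edge_threshold
        rows, cols = grey.shape[0] // block, grey.shape[1] // block
        mask = np.zeros((rows, cols), dtype=bool)
        for r in range(rows):
            for c in range(cols):
                patch = edges[r * block:(r + 1) * block, c * block:(c + 1) * block]
                mask[r, c] = patch.mean() >= density_threshold
        return mask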

ARBIRS pays particular attention to the colour model to compensate for lighting effects such

as shadows and reflectance. The HVC colour model is used and Gong [4] shows that when a colour undergoes illumination changes the hue (H) remains constant while both the chroma (C) and value (V) will change but are linearly correlated. These properties of illumination are used by the segmentation method, which groups pixels that have the same hue and a linear C-V

correlation.


Each colour region extracted is represented by the following features: average HVC colour, bounding box, number of pixels, circularity, eccentricity, orientation, and 10-element x and y region shape profiles. Texture regions are represented by HVC colour histograms where the limits

of each bin are set to match 11 perceptual colours including red, orange, yellow, skin colour, green,

cyan, blue, purple, black, grey, and white. The features are indexed using an SR-tree [30] which is

a combination of an SS-tree [29] and an R*-tree [41].

The ARBIRS user interface allows users to select regions within an image for querying. The

user simply draws a bounding box around the desired region and the image segmentation module

processes the selected area to automatically extract the desired region. The system allows the user

to search by colour, texture, shape, and compound regions.

ARBIRS is unique in that it focuses on identifying texture first, supports an illumination model

to form regions that contain variation in shade and reflectance, and uses a perceptual set of colour

histogram bins rather than a uniform set.

Virage

The Virage Image Search Engine [42] is the most well-known commercial CBIR system. It provides

similar features using similar techniques to the systems described above and is most notable because

of its successful commercialisation through the extensible VIR Image Engine framework.

The Virage image search engine provides an open framework where primitives can either be

very general, such as colour, shape, and texture, or domain specific, such as face recognition or

cancer cell detection. The major contribution of the Virage image search engine is that the open

framework allows developers to plug on primitives to solve specific image management problems.

Therefore Virage can be used as a tool for researchers investigating just one portion of the CBIR

problem.

Photobook

Photobook [20] is actually three or more different image retrieval systems including Face Photo-

book, Shape Photobook, and Texture Photobook. Face Photobook employs eigenimage represen-

tations for face matching. The eigenimage is formed from the eigenvectors of the normalised image

covariance. Different orientations of each face image are stored for reliable comparison. Shape

Photobook allows shapes of differing deformations such as stretch, bent, tapered, or dented to be

matched to non-deformed objects. Shape Photobook uses Finite Element Method (FEM) models

of objects to align, compare, and describe objects despite both rigid and non-rigid deformations.

Texture Photobook is used for finding similar textures. Whole textures are represented through a

2D Wold-like decomposition into harmonic, evanescent, and indeterministic components.
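A minimal eigenimage sketch is given below, assuming a stack of aligned, equally sized greyscale face images; Photobook's actual normalisation and matching pipeline is considerably more involved.

    import numpy as np

    def eigenimages(faces, n_components=8):
        # faces: N x H x W array of aligned greyscale images. The eigenimages
        # are the leading eigenvectors of the mean-removed image covariance,
        # obtained here via the SVD of the data matrix.
        X = np.asarray(faces, dtype=float).reshape(len(faces), -1)
        mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, vt[:n_components]

    def describe(face, mean, components):
        # A face is described by its coefficients in the eigenimage basis;
        # faces are then compared by the distance between coefficient vectors.
        return components @ (np.asarray(face, dtype=float).ravel() - mean)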

Photobook is a disjoint system with a limited user interface. However, the face, shape, and

texture extraction methods are well constructed and are still used today. For example, their face


detection method is used in several US police departments [38].

NeTra

NeTra [43] is a toolbox for navigating large image databases. NeTra supports colour, shape, and

texture features. Colour is indexed using a one bit representation similar to the Colour Sets used by

CBVQ [27]. However, the one bit representation is only used to improve efficiency by simplifying

the search for the most common colours and is only the first stage of the feature distance evaluation.

The binary AND operator is used to detect the similarity of two binary feature vectors. The binary

result of the AND operation must contain greater than a predetermined threshold of ones. The

set of similar binary vectors is then further analysed to determine the colour feature distance

between the two images. A colour histogram is stored with each image representing the percentage

of colours that fall into each bin. The colour histogram contains 256 bins and the bin limits are

calculated using the Generalised Lloyd Algorithm (GLA) and a training set of image samples. The

final feature distance is the sum of the smallest distances between each colour in each histogram.
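The two-stage colour matching can be sketched as below; the minimum-overlap threshold and the L1 distance used in the second stage are assumptions standing in for NeTra's closest-colour pairing.

    import numpy as np

    def passes_prefilter(query_bits, candidate_bits, min_common=4):
        # Stage one: a candidate survives if its binary colour vector shares
        # at least min_common active bins with the query's binary vector.
        return int(np.logical_and(query_bits, candidate_bits).sum()) >= min_common

    def colour_search(query_bits, query_hist, database, min_common=4):
        # Stage two: rank the surviving candidates by a full histogram distance.
        survivors = [(bits, hist) for bits, hist in database
                     if passes_prefilter(query_bits, bits, min_common)]
        return sorted(survivors,
                      key=lambda s: float(np.abs(np.asarray(query_hist) -
                                                 np.asarray(s[1])).sum()))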

Texture is represented using Gabor filters at six orientations and four scales. Both colour and

texture are used to detect region boundaries. A zero crossing edge detector is used to detect the

boundaries between areas of homogeneous colour and texture. Connected boundaries are extracted

as regions and the shape of the region is represented through Fourier descriptors. The region

centroid and minimum bounding box are also stored.

Three indices are used for the region features. The first is the colour existence table which

contains the binary colour feature vectors described above and the second is an SS-tree [29] which

stores the texture and shape feature vector. The SS-tree is created using a modified k-means

clustering algorithm to balance the tree so that more efficient browsing can be achieved. The third

index contains four sets of sorted image region lists which represent the region’s top, bottom, left,

and right minimum bounding box co-ordinates.

2.1.2 Video Retrieval

The temporal dimension of video allows for additional feature extraction, data representation, and user interaction techniques beyond those used for image retrieval. However, existing systems tend to take either a purely image retrieval approach or a purely video retrieval approach. For example, some systems may

support the automatic extraction of temporal features from video yet the feature extraction, repre-

sentation, and user interaction stages are identical to a conventional image database. Conversely,

existing video retrieval systems may provide excellent support for the temporal structure of video

yet provide no useful support for still images which lack a temporal structure. In this section a

representative sample of both types of systems are presented.


CueVideo

CueVideo [44] is an extension of QBIC [16] to support video retrieval. Shots are extracted from

video sequences by detecting abrupt and gradual changes between frames; however, the technique

used is not described. Representative frames are extracted from each shot and stored in a database.

The resulting database is much smaller than the original compressed video source and the CueV-

ideo team envision the thumbnails being used as an efficient video table of contents that can be

transmitted over the web. The representative frames are presented to the user in a chronologi-

cal, tabular, 2D layout called the storyboard. The user has the opportunity to manually remove

frames from the storyboard to provide a higher level of temporal granularity. Also the user has

the ability to order frames based on similarity to a selected frame. The similarity between frames

is determined using the QBIC image retrieval system.

Even though the CueVideo system is one of the few CBVR systems that integrate CBIR techniques, it does so in a very limited way. There is no support in the system to view the temporal

hierarchy of video sequences nor is there any way for the system to be used more extensively as a

conventional CBIR system taking advantage of the QBIC features.

Other Systems

Zhang et al. [17, 45] have developed a system that incorporates many of the features unique

to video. The system extracts shots by processing video in the compressed domain using DCT

coefficients for spatial characteristics and motion vectors for temporal characteristics. Processing

the video in the compressed domain can be faster as decompression is not required. The motion

vectors are also used to determine camera operations such as panning, tilting, and zooming. Key-

frames are extracted from shots and are indexed based on colour, texture, shape, and edge features.

Colour is represented by mean brightness, colour histogram, dominant colours, and statistical

moments. Texture is represented by Tamura features and SAR coefficients. Key-frames are either

automatically or manually segmented and region shape is represented using cumulative turning

circles which are invariant to translation, rotation, and scaling. Key-frames are also processed

using the Sobel filter and the frames are compared by calculating the correlation between the

binary edge maps, however these comparisons are limited by their dependency on image resolution,

size, and orientation. In addition to spatial features, temporal features such as camera operations,

and temporal changes in brightness and colours are also extracted for each key-frame.
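The system of Zhang et al. detects shots in the compressed domain; purely as an illustration of shot boundary detection, the sketch below works on decoded frames and flags a cut wherever the difference between consecutive frame histograms exceeds a threshold. The bin count and threshold are assumptions, and gradual transitions require an accumulated difference over several frames rather than this single test.

    import numpy as np

    def frame_histogram(frame, bins=64):
        # Coarse grey-level histogram of one frame, normalised to sum to 1.
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        return hist / max(hist.sum(), 1)

    def shot_cuts(frames, threshold=0.4):
        # Flag frame i as the start of a new shot when its histogram differs
        # from the previous frame's histogram by more than the threshold.
        cuts, prev = [], frame_histogram(frames[0])
        for i in range(1, len(frames)):
            cur = frame_histogram(frames[i])
            if np.abs(cur - prev).sum() > threshold:
                cuts.append(i)
            prev = cur
        return cuts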

The system developed by Zhang et al. allows a number of methods for interacting with the

video database. The first is a more traditional CBIR technique of treating each key-frame in

the database as an independent image and allowing the user to perform similarity queries using

predefined template images such as forest, bush, and grass. The second type of querying allows the

user to specify temporal attributes of a shot such as camera operations and temporal variations

in colour. The third form of interacting with the video database involves video browsing. A user

interface is provided that allows the user to progressively drill down through the temporal hierarchy


of video key-frames until the desired shot is found. Content features are not used in constructing

the hierarchy but instead the hierarchy is constructed at regular intervals of shots.

The system developed by Zhang et al. is different to other CBVR systems in that it extracts and

allows for the query of temporal features such as camera operations and colour changes. The system

also allows for a method of interaction that takes advantage of the inherent temporal hierarchy in

video. However, the browsing and query interfaces are not united.

Other systems that have been presented in the literature have not been mentioned here either

because they are incomplete as CBVR systems or because they are merely CBIR systems that also

support the indexing of key-frames. Therefore, the review of these systems is left to the following

sections of this chapter and following chapters that discuss in more detail the components of a

CBVR system including user interaction, feature extraction, and representation.

2.2 User Interaction

The form of user interaction employed by a content-based retrieval system usually depends on

whether image or video content is being retrieved. Traditionally CBIR systems use a query-result

user interface where the user inputs query parameters and the system returns an ordered list of

images based on similarity to the query parameters. CBVR systems on the other hand have focussed

more on browsing the temporal hierarchy of video. In the next two sections, user interfaces that

employ query-result and browsing user interfaces are discussed.

2.2.1 Query-Result User Interfaces

In query-result user interfaces there are two phases of interaction: presenting the query, and viewing

the results. The query phase of interacting with a content-based retrieval system presents more

challenges than the second phase of viewing the results. Conventional text database query interfaces

are limited only by the user’s typing ability and knowledge of spelling. In contrast, visual content-

based retrieval systems require the user to be able to convert the internal visual representation of the

query within their minds into numeric parameters or to similar visual representations in a graphical

user interface. The visual skills required of users working with CBVR systems are much greater than those required for conventional databases. The challenge for content-based retrieval systems is to present a query interface that makes the mapping of the user's internal representation of the query to the user interface's representation as simple as possible, reducing the skill requirements of the

end user.

There are primarily two forms of querying: query by specification and query by example. Query

by specification requires the user to enter parameters that describe features of the target image.

Query by example on the other hand requires the user to present an existing image or part of an

existing image or a sketch which is analysed and used as the query parameters. Both techniques


as well as techniques for presenting the results are described in the following subsections.

Query By Specification

Query by specification can be especially challenging for the user as it requires the user to convert

their internal representation of the target image into numeric or visual parameters afforded by the

query user interface. Content-based retrieval systems attempt to use visual rather than numeric

widgets where possible. For example, Zhang et al. [17] used a colour picker instead of requiring the

user to enter an RGB value in numeric form making the process of specifying a colour far more

intuitive. However, such an approach can still be challenging for the user. For example, if a user is

searching for a red car they may enter an RGB value of (255, 0, 0) but due to lighting conditions

in the database’s red car photos, the images contain large amounts of unsaturated red and low

intensity red which are significantly different to the colour specified by the user. This problem can

be avoided by searching on hue rather than RGB colour, however, it highlights the need for the

user to have an understanding of colour models and lighting interactions which would be beyond

most users. This issue is present in all forms of query by specification. The next few subsections

discuss techniques that have been presented for querying by specification with colour, texture,

spatial, and motion features.

Colour Query The example given above shows how a single colour may be specified using a

colour picker. A content-based retrieval system may also store the distribution of colours within

an image. The QBIC system [19] stores a colour histogram for each image and allows the user to

specify a number of colours in the query as well as the amount of each colour in the histogram.

For example, a user may specify a query to search for images with 50% blue and 50% green to find

green landscape images with blue skies. Smith and Chang [27] provide a similar system to be used

with the colour sets method of colour distribution representation. Colour sets use only a single bit

to represent the presence of each colour in an image but compensate by having 166 HSV bins. For

querying, the user simply selects which of the 166 HSV colours to query with.

Both querying approaches are affected by the histogram comparison technique used. Most

histogram comparison techniques do not compare adjacent bins; therefore, if the user selects the

wrong bin then the results will not be what the user expects. To avoid this problem the user must

specify many colours and distributions. However, this would be tedious in the QBIC system. The

colour sets system makes it slightly easier as the user only has to check whether colours are on or

off, although the user must still select a range of colours.
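A small worked example of the adjacent-bin problem: two images dominated by nearly identical hues can fall into neighbouring bins and appear maximally dissimilar under a bin-by-bin comparison. The bin count and hue values here are arbitrary.

    import numpy as np

    bins = 8
    hue_a, hue_b = 0.124, 0.126          # nearly identical hues
    hist_a = np.zeros(bins); hist_a[int(hue_a * bins)] = 1.0   # falls in bin 0
    hist_b = np.zeros(bins); hist_b[int(hue_b * bins)] = 1.0   # falls in bin 1
    print(np.abs(hist_a - hist_b).sum())  # 2.0: the maximum possible L1 distance

This is the behaviour that the fuzzy histograms described elsewhere in this thesis are designed to soften.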

Texture Query Textures consist of harmonic, directional, and noise components. Smith and

Chang [27] use a similar technique to colour sets to represent texture called texture sets. Each

bin reflects an orientation and spatial frequency of the texture. The user can search for textures

by turning bins on or off. The problem with this technique is the difficulty the user has in mapping individual texture components to a complete texture in a target image. Most other texture query systems only

allow texture queries to be specified by using a predefined texture image [16, 20] which falls into

the query by example category of querying techniques. Kato [46] also allowed users to describe the

texture they are looking for subjectively using keywords such as “lively” to describe patterns. Such

a system requires personal annotation and suffers from the problems associated with manually

annotated content-based retrieval systems.

Spatial Query Spatial queries allow the user to specify the locality of features within a query and

also the spatial relationships between the features. Smith and Chang [22] developed the Spatial and

Feature (SaFe) query system which allows the user to specify a query by drawing rectangles that

represent regions of similar features. For each region the user can specify the colour of the region

and also whether the region’s location is absolute or relative. The user also has the ability to weight

the different aspects of the query such as spatial relationship, features, size, and region. Zhang et

al. [17] allowed the user to associate template images with nine subdivisions of an image. Spatial

queries are useful for locating images where certain features must occur in a certain location. For

example, the sky is usually in the top half of the image whilst the landscape occupies the bottom

half. Spatial queries can also be useful for finding images that contain certain objects in them

regardless of the absolute location of the object within the image. For example, a car consists of

primarily a solid block of colour and normally two visible black wheels at the base of the car. A

spatial query can specify the relative size of the three objects as well as the relative location whilst

the absolute location of the object may vary. Specifying a spatial query can be more natural for the

user, as the user is beginning to interact with ‘pseudo-objects’; however, as with the other query

by specification techniques, the success of spatial queries is limited by the user’s visual skills.

Query By Example

Querying by example requires the user to select a pre-existing image to use as the basis for a

query. Querying by example avoids some of the problems associated with query by specification as

the user only needs to present a pre-existing image, however, querying by example also introduces

some new problems. There are three approaches to query by example. The first is for the user to

present a complete image of their own or out of the database. The second is for the user to select

one out of a set of predefined images that characterise certain features such as texture. The third

is for the user to sketch the image they are looking for. Each of these forms of querying by example

is discussed in the following subsections.

User Image Query The user image approach requires the user to find a pre-existing image to

search with. Systems such as Photobook [20] and QBIC [16] use this approach. The QBIC system

[19] also allows the user to specify parts of the image to search by, allowing the background of the

image to be ignored by the query engine. This is achieved by allowing the user to identify objects

in an image by using any of nine drawing tools: polygon, rectangle, ellipse, paint brush, eraser, line


draw, object move, fill area, and snake outline. Of these the most useful is the snake tool which

allows an outline to be shrink wrapped to the edges of an object. The primary problem with this

approach is that the user must already have ready access to an image to search with. This can

be difficult if the user, for example, has no landscape images but is looking for some in the image

database. If the user does have access to some images to search with, the user must spend time finding an initial query image to begin querying, which could be a time consuming process. In reality, user-image querying cannot be considered a primary query technique but it is a useful

addition to other techniques.

Template Query The template query approach differs from the user image approach in that

the system contains a set of predefined images that the user can select for the query. For example,

Zhang et al. [17] used predefined templates of common images such as plants, grass, and rocks. The

advantage of the template query approach is that the user does not need an image of their own to

begin querying the database. However, a primary limitation is that the user’s query abilities are

limited by the template set of images.

Sketch Query The sketch query approach requires the user to sketch a line drawing of what

the target image will look like. Since only line information is being drawn the approach does not

allow colour or texture to be specified. QBIC [16] allows the user to sketch the outline of an image

and uses template matching to compare the sketch with the edges of images in the database.

Kato’s TRADEMARK system [46] allows the user to provide a sketch of a trademark and the

system will find similar images based on spatial distribution, spatial frequency, local correlation

and contrast. The obvious challenge with sketch-based queries is the user’s drawing ability which

can vary dramatically from user to user.

Presenting Query Results

The second phase of a query-result user interface is presenting the results. The majority of content-

based retrieval systems return results ordered by a single similarity value. The effect is a one-

dimensional ordering of results for which there are few alternatives for presenting to the user. Most

systems use a two-dimensional flow layout with thumbnails being ordered from left to right then

top to bottom [17, 16, 20, 22, 46].

Arman et al. [47] used an innovative approach for displaying the results of a video search.

Thumbnails of the result set are laid out horizontally for the user to scroll through. The relevance

of each thumbnail to the query is represented by the width of the thumbnail. In addition motion

in the scene is indicated with motion tracks displayed in a thick border around each thumbnail

allowing the user to quickly grasp the motion within a shot.

In general little research has been conducted in presenting query results. There are most likely

two factors that contribute to the lack of research in this area. Firstly, the current one-dimensional


ordering of results in a two-dimensional flow layout of thumbnails seems to be satisfactory for

current queries. Secondly, the querying phase presents arguably more important challenges that

need to be addressed. Nonetheless, more work can be done in improving the presentation of query

results. Currently no systems indicate which features in the resulting images were the primary

contributors to the similarity value. Also the current techniques for presenting the results do not

indicate relationships between images in the result set. Since viewing the query results is essentially

an act of browsing it is possible that browsing techniques could be applied to the results phase to

improve the user’s experience.

2.2.2 Browsing User Interfaces

The difference between browsing user interfaces and query-result user interfaces is that in the brows-

ing user interface the ‘query’ is represented by the user’s location and the ‘results’ are represented

by the layout of the data. The effect is that the query and the results are viewed simultaneously and

in real time. Most of the browsing research for content-based retrieval systems has been applied

to browsing video sequences. Even though both a video sequence and an image database consist

of a large number of individual images, video sequences have the advantage of an implicit hierar-

chical structure consisting of scenes, shots, and camera operations. A browsing user interface may

present one or more of these hierarchical levels to the user. In this section a number of techniques

for browsing video sequences are presented, classified into 2D, hierarchical, temporal, distortion,

mosaic, and 3D techniques.

2D Video Browsing

The 2D video browsing technique is much the same as the flow layout technique used for presenting

query results in most content-based image retrieval systems. An example of this technique is

PaperVideo [18] where thumbnails of shots are laid out in two dimensions flowing left to right

then top to bottom on a page the size of a piece of paper. The advantage of this technique is that

it can be printed out and stored with a video cassette [18]. Other researchers have also used 2D

layout techniques for browsing video sequences [48, 47]. The main limitation with two-dimensional

techniques is that they only represent one level of the video hierarchy.
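A 2D flow layout of this kind reduces to a simple position calculation, as the illustrative sketch below shows; the thumbnail and page dimensions are arbitrary.

    def flow_layout(n_thumbnails, page_width, thumb_w, thumb_h, gap=4):
        # Left-to-right, top-to-bottom (x, y) positions for each thumbnail.
        per_row = max(1, (page_width + gap) // (thumb_w + gap))
        return [((i % per_row) * (thumb_w + gap), (i // per_row) * (thumb_h + gap))
                for i in range(n_thumbnails)]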

Hierarchical Video Browsing

Since video sequences have an implicit hierarchical structure it appears logical for the video to

be presented in a hierarchical browser. Existing techniques for browsing video hierarchies differ

from typical tree layouts and instead display a fixed number of rows of video thumbnails. Selecting

a thumbnail from one row changes the contents of the rows below it. The Hierarchical Video

Magnifier [49] uses this concept and rather than performing shot analysis the magnifier selects

thumbnails representing equidistant points within the video sequence. The range of the magnifier


can be adjusted to fine tune the amount of data displayed. Zhang et al. [17] proposed a similar

system to the Hierarchical Video Magnifier, the main difference being that shots are used rather

than equidistant frames. The result is more efficient use of the available screen real estate. One of

the benefits of hierarchical video browsers is that the user is able to simultaneously view the detail

of individual frames whilst having the context of shots and scenes.

Temporal Video Browsing

Temporal video browsers attempt to present the temporal changes within a video to the user. The

result is a summary of the video that Christel et al. [50] have termed gisting. Christel et al. have

explored video skims which are short video presentations that provide a summary of a much longer

video sequence. The video sequence is comprised of short video segments from the original video

source appended together.

Ueda et al. [48] used a different approach for the IMPACT multimedia authoring system that

allows users to visualise the motion of an object within a video sequence using lines drawn in

three dimensions between frames. Temporal video browsing techniques are still in their infancy

and more research should be conducted into integrating temporal browsing techniques with other

video browsing techniques.

Distortion-based Video Browsing

Distortion can be used to provide context and detail simultaneously. The VideoStreamer micro-

viewer [51] uses distortion to allow previous and future frames to be displayed alongside the current

frame. The adjacent frame widths are reduced in a fish-eye manner providing the user with detail

in the current frame and the context of future and previous frames. The micro-viewer does not

exploit the hierarchical structure of video but could be integrated with the Hierarchical Video

Magnifier [49].
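The fish-eye effect amounts to shrinking frame widths with distance from the frame in focus; the sketch below is illustrative only, and the geometric falloff and minimum width are assumptions rather than the VideoStreamer's actual mapping.

    def fisheye_widths(n_frames, focus, full_width=120, min_width=8, falloff=0.6):
        # The focused frame is drawn at full width; neighbouring frames shrink
        # geometrically with their distance from the focus.
        return [max(min_width, int(full_width * falloff ** abs(i - focus)))
                for i in range(n_frames)]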

Mosaic Video Browsing

Mosaicking is the process of creating one image from a series of images. For browsing video se-

quences, mosaics are used to represent an entire shot. The camera motion within the shot is

detected and the images are warped and aligned together to provide a form of panoramic image.

Moving objects can be eliminated or shown separately. Many systems have been proposed for mo-

saicking or constructing panoramas [18, 52]. Mosaics provide an efficient method of viewing the

spatial contents of a shot but cannot represent temporal characteristics of a shot such as camera

operations.


3D Video Browsing

Three-dimensional video browsing techniques have attempted to use the third dimension to simul-

taneously display both the spatial and temporal characteristics of a video sequence. Tonomura et

al. [53] proposed the Video Icon which displays a video sequence as a 3D icon where the depth

of the icon represents the duration of the video sequence. The Video Icon was later revised [23]

to vanish at a single point to handle video sequences with large differences in duration. A similar

technique was used in the Video Streamer [51] user interface, however, the time axis also contained

an indication of content. Scene changes and motion within shots could easily be seen by viewing

the time axis. Another three dimensional technique is the VideoSpaceIcon [18] which is like a three-

dimensional mosaic. The primary advantage over two-dimensional mosaicking techniques is that

camera motion is easier to grasp for the user interacting with the VideoSpaceIcon. Ueda et al. [48]

used a more direct approach for displaying motion by drawing a line between frames arranged in

three dimensions. The lines track the placement of moving objects between frames. Even though

there have been a number of three-dimensional video browsers presented in the literature none

take advantage of the hierarchical nature of video sequences.

2.2.3 User Interaction Summary

The current state of user interaction in content-based retrieval systems is disjoint between CBIR

and CBVR systems where two distinct modes of interaction are used depending on whether images

or video are being queried. In Chapter 7 the browsing approach is investigated in more detail and

user interfaces that employ browsing techniques that have not been applied to CBVR are reviewed

for possible integration into a CBVR system. Much research has been performed in representing

the temporal aspects of video, however, little work has been done to incorporate these temporal

representations into other aspects of the user interface such as the query mechanism or the temporal

hierarchy browser. In Chapter 7 a taxonomy of existing CBVR and non-CBVR user interfaces is

presented that highlights the features that are currently being integrated by current user interfaces.

Using the taxonomy new user interfaces have been developed to overcome the weaknesses of existing

approaches.

2.3 Feature Extraction

Content-based retrieval is about retrieving multimedia objects based on their content. The feature

extraction stage produces a representation of the content that is useful for retrieval. Usually more

than one type of feature is extracted and each feature representation is kept as compact as possible

for the purposes of efficient storage and retrieval. The types of features extracted for content-based

retrieval systems are similar to perceptual features that allow the human brain to discriminate

between images, such as colour, texture, shape, and arrangement of objects. Video sequences


Level         Basis                             Examples
Statistical   Probability                       Mean, variance, distribution
Syntactical   Requires structural constraints   Regions, shape
Semantic      Requires prior knowledge          Names, categories

Table 2.1: Levels of understanding

incorporate the temporal dimension which allows further features such as object trajectories and

global motion to be extracted. There is also a temporal structure to video sequences that includes

camera operations, shots, and scenes.

CBVR attempts to determine the similarity between visual objects to satisfy the user’s query.

The CBVR similarity ordering should be analogous to the similarity ordering performed naturally

by a human being. Therefore, the features used to determine measures of similarity must be based

on features used to measure similarity in the human brain. The brain however is a complex organ

that changes over time based on environmental experience. Providing a computer system with the

knowledge that is learnt by a human being in their lifetime is not a trivial task and presents some

monumental challenges. Since providing a computer with human knowledge is a difficult task, the question arises as to whether human knowledge is required to determine similarity between images.

The human brain is shaped by both its structure and its soft-wiring. Provided with good nutrition and the absence of genetic defects, the vision system of all human beings will develop to form roughly

the same vision processing structure. However, the environment that the human is brought up in

may affect their classification of visual objects. For example, somebody raised in the modern era

may classify a mobile phone and a payphone as being similar since they have a similar function.

However, for someone raised in a time without telephony a mobile phone may appear more similar

to a compact camera since the shapes of the two objects are similar. The difference is that the

first person has associated meaning with the objects, whereas the second person can only

base their terms of similarity on the structural similarity of the objects. This simple example shows

that experience is not required to determine the structural similarity of objects but is required to

determine the semantic similarity of objects. Research in the field of psychology has identified

that many visual characteristics that are used by the brain are formed by the brain’s structure

as opposed to its soft-wiring [10]. Therefore it should be possible, using only visual processing techniques, to determine the structural similarity between objects. However, visual processing techniques by themselves cannot determine the semantic similarity between objects.

The difference between the types of processes used to determine similarity has been classified as

levels of understanding by Gonzalez [54]. The levels of understanding distinguish between the pro-

cesses involved to determine structural and semantic understanding and also distinguish between

the base level of statistical understanding. Table 2.1 presents the three levels of understanding.

Statistical understanding treats each sample of data as being completely independent from

every other sample. An image consists of thousands of samples of colour information in the form of

pixels. A statistical process would result in the same value regardless of the order of the pixels as


long as the pixels contain the same values. Therefore statistical processes extract information that

represents the distribution of pixel data. Examples include mean, variance, moments, and frequency

histograms. Simple statistical measures such as mean applied to global image data provide little

value as images with varying colours may result in the same average colour. Frequency histograms

are more useful as they can represent groups of dominant colours.

Syntactical understanding considers the relationship between a sample and its neighbours. The

neighbourhood may be small, such as just the next pixel, or it may be large including the en-

tire image, video sequence, or database. Syntactical processes consider the spatial and temporal

alignment and occurrence of pixels. Syntactical processes are not only limited to the relation-

ship between individual pixels but can be used to construct an entire structural representation of a

multimedia object. Examples of syntactical processes include filtering, edge detection, morphology,

edge linking, region extraction, texture processing, and high-level perceptual grouping. Syntactical

understanding can be achieved through using localised statistical processes. For example, statisti-

cal modelling of local neighbourhoods can provide an indication of the structural components of

texture.

Semantic understanding classifies an object from a predetermined framework of classes. Se-

mantic processes form an association between objects and meanings. Examples include object

recognition, face recognition, landscape classification, and medical classification.

There is also a fine line between what is considered syntactical and what is considered semantic.

If scenes need to be identified in a video sequence as being camera shots of the same location then

from a syntactical point of view it can be said that each shot in the scene must be of the same

location. But how does the system know that the shots are in fact of the same location? One shot

may be a close up of an object on a table and the next shot may not include the table at all. In

this example, some semantic information must be provided to allow the physical link between the

two shots to be determined. Therefore purely syntactical processes are not always able to fully

decompose a scene into a structural representation. This is due to a lack of physical information

in the source content. In such cases a semantic process may be able to provide a more complete

decomposition.

Semantic processes are largely different to syntactical processes and generally involve taking the

output of a syntactical process and matching it with a database of template features. Therefore the

performance of semantic understanding depends on the performance of syntactical understanding.

In this research the emphasis is on improving the current state of syntactical understanding which

by itself provides a sufficient platform for CBVR but will also aid systems that incorporate semantic

processes. In this section statistical and syntactical techniques for extracting temporal and spatial

features are reviewed.


2.3.1 Temporal Features

Figure 1.4 shows that the intention of this research is to provide a temporal decomposition based

on the spatial features of video. For performance reasons, existing systems usually have a two-part spatial feature extraction mechanism. A low complexity mechanism is used to analyse the large number of frames in a video to produce the temporal structure, while a high complexity mechanism is used

to represent the higher level intra-image objects for querying.

The temporal structure is produced by finding the boundaries between temporal video objects

such as shots and scenes. For abrupt cuts between shots the cut can be detected by looking for

a sharp change in the content of adjacent frames. Statistical techniques that compare the colour

distribution of pixels between frames are often used [55]. Statistical techniques are often combined

with spatial techniques to provide a localisation of the change in colour distribution [55]. These

techniques are relatively simple to implement and are discussed in more detail in Chapter 6.
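As a concrete illustration of this class of technique, the following Python sketch flags a cut whenever the grey-level histogram difference between adjacent frames exceeds a threshold. The function names, bin count, and threshold are illustrative assumptions and not the cut detection method developed later in this thesis.

```python
import numpy as np

def grey_histogram(frame, bins=64):
    """Normalised grey-level histogram of one frame (an H x W x 3 uint8 array)."""
    grey = frame.mean(axis=2)
    hist, _ = np.histogram(grey, bins=bins, range=(0, 256))
    return hist / grey.size

def detect_cuts(frames, threshold=0.4):
    """Flag an abrupt cut wherever the histogram difference between adjacent
    frames exceeds the (hypothetical) threshold."""
    cuts, prev = [], grey_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = grey_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```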

Low complexity statistical and spatial techniques can be affected by motion in the scene. To

avoid false classification of motion as a cut, motion can be modelled in the scene. Determining the

optical flow in a scene can be very slow and much research has been conducted in using motion

vectors that are already stored in a compressed video sequence rather than computing them at run

time [45]. Using compressed motion vectors can be very fast as the video sequence does not need

to be decompressed; however, these motion vectors are optimised for compression rather than for an accurate representation of the motion in the scene, and they can be unreliable for video segmentation.

Most research into the temporal decomposition of video sequences has focussed on extracting

shots by detecting cuts, analysing camera motion, and grouping shots into scenes [18]. Other higher

level temporal groupings such as episodes or acts require semantic knowledge and are outside the

scope of this research. In many respects the process of temporal decomposition appears simpler

than spatial decomposition in the sense that only the way the video was authored needs to be

understood in terms of camera operations and editing rather than the seemingly more complex

spatial decomposition which requires a detailed understanding of human perception. Even though

this is largely true, there are also spatio-temporal aspects of video which investigate object motion

and changes throughout the video sequence. A complex spatial decomposition must take place

initially followed by a potentially complex temporal tracking of objects as they undergo various

transformations within the scene. Since a complex spatial decomposition must take place initially,

many of the existing low complexity approaches to temporal decomposition are not suitable. In

this research the goal is to completely understand the contents of a frame before performing

temporal decomposition, paving the way for a complete spatio-temporal decomposition; however, a full spatio-temporal decomposition is beyond the scope of this research.


2.3.2 Spatial Features

As mentioned in the last section the spatial decomposition of an image can be considered more

complex than the temporal decomposition of a video sequence. The concept of complexity however

is different in both scenarios and also depends on how complete the decomposition is required to

be. For example, temporal decomposition can be considered complex due to the possible thousands

of frames in a video sequence even if the individual frame processing technique is quite simple.

Conversely, spatial decomposition could be considered simple if the techniques used to extract the

spatial features are also simple. However, if a complete structural decomposition of an image is

required then the complexity of the problem increases.

The purpose of image representation is to determine image similarity in the same way that a

human would perceive images as being similar. Therefore the visual aspects that involve processing,

representing, and comparing images in the human brain must be understood. The major challenge

facing spatial feature extraction today is that it is not fully known how the human brain processes

images. So the problem is complex firstly because we don’t fully know how the human brain

works, which is our benchmark, and secondly because there are many complex components in

vision processing involving colour processing, edge detection, contour and shape extraction, texture

representation, image segmentation, perceptual illusions and accounting for partially represented

objects that may be occluded, compensating for lighting effects such as highlights and shadows,

and determining three dimensional shape from two dimensional features such as texture. The

human brain is able to employ billions of individual processors in parallel to tackle these tasks.

The computer on the other hand is largely serial and such parallelism is not available without

specialised hardware. Therefore the problem is also complex in the sheer amount of processing

power required to achieve the representation that the human brain achieves so effortlessly. Faced

with these complexities researchers have decided to focus on only one or a few features at a time or

support many features but in a simplified manner. In this section techniques for visual processing

that have been used in CBIR research are reviewed.

2.3.3 Colour

Colour feature extraction involves analysing the absolute colour value of each pixel. Colour is

generally represented by the colour distribution of the image. Colour distribution is a statistical

feature and techniques such as moments and colour histograms are commonly used. In this section

moments, colour histograms, and methods for comparing colour distributions are discussed.


Moments

Moments are a generalised form of statistical features such as average, standard deviation, and

kurtosis. Moments have the general form:

M_n = \frac{\sum (x - \bar{x})^n}{N} \qquad (2.1)

where N is the number of data points and n is the order of the moment. The first moment is

related to the mean, the second to the variance, the third determines the skew, and the fourth can

be used to calculate kurtosis. Stricker and Orengo [15] proposed using the first three moments for

colour representation using the following equations:

E_i = \frac{1}{N}\sum_{j=1}^{N} p_{ij}, \qquad
\sigma_i = \left( \frac{1}{N}\sum_{j=1}^{N} (p_{ij} - E_i)^2 \right)^{\frac{1}{2}}, \qquad
s_i = \left( \frac{1}{N}\sum_{j=1}^{N} (p_{ij} - E_i)^3 \right)^{\frac{1}{3}} \qquad (2.2)

where pij is the j-th pixel of the i-th colour channel. The moments for each colour channel are

stored separately resulting in only 9 floating point numbers per image.
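The nine-number representation of Equation 2.2 translates directly into code. The following is a minimal Python sketch, assuming the image is an H x W x 3 array; the signed cube root for the skew term is an implementation choice made for the example.

```python
import numpy as np

def colour_moments(image):
    """First three colour moments (mean, standard deviation, skew) per channel,
    following Equation 2.2; returns 9 floating point values."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    mean = pixels.mean(axis=0)                             # E_i
    sigma = np.sqrt(((pixels - mean) ** 2).mean(axis=0))   # sigma_i
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))    # s_i (signed cube root)
    return np.concatenate([mean, sigma, skew])
```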

Histograms

A more common form of colour representation is through colour frequency histograms. A colour

histogram consists of three axes, one for each colour channel. Each axis is quantised into a series of

ranges. The intersection of ranges from each axis produces the histogram bins. Rather than storing

each colour pixel value, a histogram only needs to store the number of pixels that have landed

in each bin. The number of bins is determined by the number of divisions on each axis. A highly

quantised colour space will result in fewer bins and hence less storage but can also result in poorer

retrieval performance.
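A minimal Python sketch of such a quantised colour histogram is given below; with 8 divisions per axis it produces the 512 integer bin counts discussed later in this section. The function name and default bin count are illustrative.

```python
import numpy as np

def colour_histogram(image, bins_per_axis=8):
    """Quantised RGB colour histogram: each axis is divided into `bins_per_axis`
    ranges, giving bins_per_axis**3 bins, each holding an integer pixel count."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_axis,) * 3,
                             range=((0, 256),) * 3)
    return hist.astype(np.int64).ravel()
```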

Colour histogram comparison involves comparing corresponding bins from each histogram.

Problems can arise if many pixels vary in colour only slightly between two images which causes

them to enter another bin. A comparison technique that only compares corresponding bins will

indicate that there is a vast difference between the two images when in fact the colour difference

is only small and unfortunately straddles the boundary between two bins. One solution to this

problem is to increase the number of bins. However, since the colour space is three dimensional

even only 8 bins per axis results in a total of 512 values needing to be stored.

The advantage with colour histograms over moments is that integer values can be used to

represent the contents of each bin rather than the floating point values required to represent

statistical moments. Smith and Chang’s [27] colour sets expanded on this concept using 166 bins

but each bin is only represented by a single bit, indicating simply whether the bin has pixels from

the image or not. Since natural images generally have slight variations in colour, a single colour would fill a few adjacent bins; if a different image had a slightly different colour there is a greater

chance that the bins would overlap. The problem with Smith and Chang’s approach is that there is


no indication of how much of each colour there is in a scene, although for their application colour

sets were only used to represent the contents of relatively homogeneous regions as opposed to an

entire image.

Histogram Comparison

There are a number of ways to compare histograms. Two simple methods include the absolute

difference between two histograms (Equation 2.3) or the Euclidean distance (Equation 2.4). In

these two cases a lower distance value represents a greater similarity between images.

d_{RGB}(I_i, I_j) = \sum_{k=1}^{n} \left( |H^r_i(k) - H^r_j(k)| + |H^g_i(k) - H^g_j(k)| + |H^b_i(k) - H^b_j(k)| \right) \qquad (2.3)

d^2_{RGB}(I_i, I_j) = \sum_{k=1}^{n} \left( (H^r_i(k) - H^r_j(k))^2 + (H^g_i(k) - H^g_j(k))^2 + (H^b_i(k) - H^b_j(k))^2 \right) \qquad (2.4)

Another method for comparing histograms is to use the histogram intersection [21] (Equation

2.5). The histogram intersection adds up the minimum values from each corresponding bin in the

histograms. Two images are considered similar if they have a large intersection. The intersection

is then divided by the total number of pixels in the second image to normalise the value. A

disadvantage with these approaches is that the computational complexity depends linearly on the

product of the size of the histogram and the size of the database. The complexity can be reduced by

only comparing the bins with the largest number of pixels. Swain [21] combined this technique with

histogram intersection to perform an incremental intersection. Using incremental intersection the

computational complexity can be reduced from O(nm) to O(n log n + cm), where c is the number

of bins to compare from each histogram.

d(I_i, I_j) = \frac{\sum_{k=1}^{n} \min(H_i(k), H_j(k))}{\sum_{k=1}^{n} H_j(k)} \qquad (2.5)
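The measures of Equations 2.3 to 2.5 are straightforward to implement; the Python sketch below assumes the histograms are NumPy vectors (per-channel histograms can simply be concatenated) and is intended only as an illustration.

```python
import numpy as np

def l1_distance(h_i, h_j):
    """Absolute (bin-wise) histogram difference, as in Equation 2.3."""
    return np.abs(h_i - h_j).sum()

def euclidean_distance(h_i, h_j):
    """Squared Euclidean histogram distance, as in Equation 2.4."""
    return ((h_i - h_j) ** 2).sum()

def intersection(h_i, h_j):
    """Normalised histogram intersection, as in Equation 2.5;
    values close to 1 indicate a large overlap between the two images."""
    return np.minimum(h_i, h_j).sum() / h_j.sum()
```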

Another problem with these histogram comparison techniques is that bins are not compared

with adjacent bins which may represent perceptually similar colours. The QBIC (Query by Image

Content) [19] system uses the colour histogram cross distance which considers the cross-correlation

between histogram bins based on perceptual similarity (Equation 2.6). The cross-correlation is

determined by a matrix with entries apq. When the matrix is an identity matrix the formula

becomes the Euclidean distance.

d(I_i, I_j) = \sum_{p=1}^{n} \sum_{q=1}^{n} (H_i(p) - H_j(p)) \cdot a_{pq} \cdot (H_i(q) - H_j(q)) \qquad (2.6)

Stricker and Orengo [15] argue that the problem is not with histogram comparison techniques

but with the formulation of the histogram. They propose a cumulative histogram where each bin

Ci in the cumulative histogram is the sum of all bins Hj≤i in the colour histogram. However, their

results do not show a significant improvement over standard colour histograms.


In Chapter 3 a new technique is presented called fuzzy histograms which addresses the issues

surrounding the quantisation of colour space by applying anti-aliasing techniques. The technique

retains the same histogram representation and therefore allows for existing histogram comparison

techniques to be used. The improved performance of fuzzy histograms allows a smaller number of bins to be used.

2.3.4 Texture and Edge

Texture is the pattern of change in colour of an image. Some textures are uniform such as the weave

in a textile whilst others are non-uniform like the leaves on a tree. Since texture is a repeating

pattern, texture processing involves identifying the features of the pattern. Assuming the pattern

remains uniform throughout the texture then the pattern will have a scale and an orientation. If the

pattern changes throughout the texture then the texture will also have a measure of randomness.

The pattern will have a form which consists of contours and colour. Colour information is not as

important as contour information in texture processing as it is often sufficiently represented by

the colour processing techniques presented in the previous section. Contours consist of a series of

connected edge points. Edges represent a change in colour amplitude. The rate of change can be

transformed from the spatial domain into the frequency domain where edges can be described in

terms of spatial frequency. Since the locality of the edge is also important, finite impulse response

filters are often used such as small masks and wavelets as opposed to infinite impulse response filters

such as sine and cosine waves which are used in the fast Fourier and discrete cosine transforms.

However, since texture is a repeating pattern within an area techniques such as the fast Fourier

and discrete cosine transforms can be used.

Edge Detectors

Texture is often represented by the distribution of edge within an area. Simple mask-oriented edge

detectors such as Laplacian [56], difference of Gaussians, Sobel, Roberts, Prewitt, Kirsch [57], Frei-

Chen [58], and Robinson [59] operators provide a fast method for determining edges. Some edge detectors such as the Laplacian and difference of Gaussians are non-directional, whereas others such

as the Sobel, Roberts, Prewitt, Kirsch, Frei-Chen, and Robinson have more than one orientation.

Oriented edge detectors require more computation as more than one mask must be applied. If

the individual orientation responses are not used at higher levels of edge or texture processing then

there is little advantage in using oriented edge detectors over non-directional edge detectors. The

Sobel, Roberts, and Prewitt operators consist of two orientations 90° apart, whereas the Kirsch, Frei-Chen, and Robinson edge detectors consist of four orientations 45° apart, allowing greater

precision in identifying the orientation of an edge. These masks are simple to implement and have

fast execution times; however, they have the disadvantage that finer orientation precision cannot be gained, nor can they operate at multiple spatial frequencies.

To allow for greater specificity in the orientation and spatial frequency of an edge, edge detectors


that are described by a scalable and rotatable continuous function must be used. The two most

common examples of edge detectors that are described by a continuous function are the Gabor [60]

and Canny [13] edge detectors. The Gabor filter consists of a sine or cosine wave within a Gaussian

envelope:

G_{odd}(x) = e^{-\frac{x^2}{2\sigma^2}} \sin[2\pi v_0 x] \qquad (2.7)

where σ is the bandwidth of the Gaussian envelope and v0 is the spatial frequency of the sine wave.

The Canny edge detector is simply the first derivative along one dimension of a two dimensional

Gaussian filter:

C(x) = \frac{x}{\sigma^2} e^{-\frac{x^2}{2\sigma^2}} \qquad (2.8)
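Both filters are easily sampled into discrete kernels. The Python sketch below evaluates Equations 2.7 and 2.8 on a one-dimensional grid; the values of σ, v0, and the kernel extent are arbitrary illustrative choices, and in practice such profiles would be oriented and combined into 2D masks.

```python
import numpy as np

def gabor_odd(x, sigma, v0):
    """Odd (sine-phase) Gabor filter of Equation 2.7: a sine wave of frequency v0
    inside a Gaussian envelope of bandwidth sigma."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.sin(2 * np.pi * v0 * x)

def canny_kernel(x, sigma):
    """Canny edge detector of Equation 2.8: the first derivative of a Gaussian."""
    return (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))

# Sampled 1D kernels (illustrative parameters).
x = np.arange(-8, 9)
gabor = gabor_odd(x, sigma=2.0, v0=0.25)
canny = canny_kernel(x, sigma=2.0)
```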

Frequency Domain

The Gabor filter can also be the basis function for a wavelet. Wavelets are a set of functions at

multiple scales, orientations, and positions that can be used to identify the local spatial frequency

of an image. Ma and Manjunath [60] used Gabor wavelets to represent multiple spatial frequen-

cies of textures. Wavelets are generally combined with a hierarchical decomposition to produce

coefficients representing scales that are powers of 2 (see Appendix B for information on the wavelet

decompositions).

Since a primary component of texture representation is spatial frequency, techniques such as

the FFT and DCT can be used to transform the texture from the spatial domain into the frequency

domain. A two dimensional FFT can be applied to an image to provide an indication of both

horizontal and vertical spatial frequency, which in turn indicates the orientation of the texture

pattern. Picard and Liu [61] used the Fourier transform of an image to determine the spatial

frequency of textures in Texture Photobook. Since texture is often only a part of an image and can

change over the image it is better to use block-based forms of the FFT and DCT. Since the DCT is

often used in image and video compression formats such as JPEG, M-JPEG, and MPEG, the transform coefficients can be extracted from the compressed stream and used for texture analysis, which greatly speeds up processing. The advantage

of wavelet techniques however is that per pixel precision can be provided for texture description

as opposed to the per block description of the FFT and DCT.
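As an illustration of block-based frequency-domain texture description, the sketch below takes the 2D FFT of each block of a grey-level image and records the dominant vertical and horizontal spatial frequencies; the block size and the particular summary used are assumptions made for the example.

```python
import numpy as np

def block_fft_features(grey, block=16):
    """For each block x block tile, summarise texture by the (vertical, horizontal)
    spatial frequency of the strongest 2D FFT component."""
    rows, cols = grey.shape[0] // block, grey.shape[1] // block
    features = np.zeros((rows, cols, 2))
    for by in range(rows):
        for bx in range(cols):
            tile = grey[by * block:(by + 1) * block,
                        bx * block:(bx + 1) * block].astype(np.float64)
            mag = np.abs(np.fft.fft2(tile - tile.mean()))
            mag[0, 0] = 0.0                       # ignore any residual DC component
            fy, fx = np.unravel_index(np.argmax(mag), mag.shape)
            # Fold to the positive half of the spectrum so frequencies are comparable.
            features[by, bx] = (min(fy, block - fy), min(fx, block - fx))
    return features
```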

Statistical Models

Statistical models are used to represent the dependency of neighbouring pixels. Therefore the sta-

tistical model can represent the structure of the texture element as well as the random component

of texture. Various statistical models have been used in the literature including moving average

(MA), auto-regressive (AR) [62], auto-regressive moving average (ARMA) [63], simultaneous auto-

regressive (SAR) [61], multi-resolution SAR (MRSAR) [64], Gauss-Markov, Gibbs [65], and fractal

[66, 67] models. These techniques have performed very well and are often combined with oriented

spatial frequency techniques. Statistical models such as the MRSAR can operate at multiple scales


which is important for textures that often consist of multiple spatial frequencies.

Unified Texture Models

Researchers have proposed three part models to describe texture. These models generally represent

the spatial frequency, orientation, and noise components of texture. Tamura et al. [39] through

psychological studies identified the three dimensions as coarseness, contrast, and directionality.

A similar study by Rao and Lohse [68] found that the three salient dimensions to texture are

repetitiveness, directionality, and complexity.

Based on the decomposition by Rao and Lohse [68], Francos et al. [36] developed the 2D Wold decomposition, which decomposes a texture into harmonic, evanescent, and indeterministic components. In essence, each model uses different terminology to

describe the same three components. QBIC [16] uses Tamura features for texture representation

whilst Picard and Liu [61] use the 2D Wold decomposition in Texture Photobook for determining

texture similarity.

See Appendix B for a detailed review of techniques for representing and segmenting texture

using unified models.

2.3.5 Contour

In content-based retrieval, edges are most commonly used for texture representation, however,

edges are also required for contour extraction. Contours consist of linked edge points with a similar

orientation. The process of linking edge points together is called edge linking, contour following,

or simply local processing [69]. The local processing approach is very simple in that an edge is

linked to one of its eight neighbours if both the magnitude of the response and the orientation of

the edge are within a predefined threshold. The global processing approach for extracting contours

is where the edge points are expected to lie on geometric primitives such as lines and circles. The

edge points are transformed from the x, y space to the parametric space of the geometric primitive

using a technique called the Hough transform [69]. Clusters of activity in the parametric space

are identified as geometric shapes and are extracted as contours. The problem with the global

processing approach is that it assumes the contours will conform to relatively simple geometric

objects and therefore is more useful for pattern matching applications where the geometric shape

is known before processing.
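A minimal sketch of the local processing approach is given below: a pixel is linked to an 8-connected neighbour when both edge magnitudes exceed a threshold and their orientations agree to within a tolerance. The thresholds and array conventions are illustrative assumptions, not the contour detection method developed in this thesis.

```python
import numpy as np

def link_edges(magnitude, orientation, mag_thresh=0.2, angle_thresh=np.pi / 8):
    """Label connected contours by local processing: grow a label through
    8-neighbours with strong, similarly oriented edge responses
    (angle wrap-around is ignored for brevity)."""
    h, w = magnitude.shape
    labels = np.zeros((h, w), dtype=np.int32)   # 0 = not part of any contour
    strong = magnitude > mag_thresh
    next_label = 1
    for y in range(h):
        for x in range(w):
            if not strong[y, x] or labels[y, x]:
                continue
            labels[y, x] = next_label
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w and strong[ny, nx]
                                and not labels[ny, nx]
                                and abs(orientation[ny, nx] - orientation[cy, cx]) < angle_thresh):
                            labels[ny, nx] = next_label
                            stack.append((ny, nx))
            next_label += 1
    return labels
```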

2.3.6 Image Segmentation

Image segmentation involves decomposing an image into areas of homogeneous features. A variety

of features and techniques may be used. Generally images are segmented based on colour and

texture, however images can also be segmented based on contour information. Images segmented


based on colour and texture generally use pixel grouping techniques based on locality or clustering.

Locality grouping techniques group pixels together which have similar values and are in a local

neighbourhood. Grouping techniques based on clustering do not have to operate within a local neighbourhood; instead, clustering occurs globally, under the assumption that global clustering is enough to identify the distinguishable portions of a natural image. Even though image segmentation

techniques can also apply to texture, in this section we will focus on segmenting by colour.

Segmentation using Colour Distribution

The simplest method for segmenting an image is to apply a global grey level threshold. However,

images generally have more than two prominent colours and require more than a single partition

in the colour distribution to accurately segment the image. Segmentation by colour distribution is

often referred to as histogram splitting. An example is shown in Figure 2.2 (a) where the original

and segmented images are shown.

A major limitation with most histogram splitting techniques is that they require the number

of clusters to be initially specified. Fortunately, the number of clusters does not necessarily refer

to the number of ground truth regions and can be estimated from the histogram itself. Segmenting

an image by histogram is an optimisation problem and has been solved using the hard c-means

(HCM) method [70], fuzzy logic [71], unsupervised neural networks [70, 72] and genetic algorithms

[73].

The hard c-means method initially chooses a number of class centres which represent the central grey level of each cluster. Every grey level is then assigned to its nearest class centre based

on the Euclidean distance measure between the grey level and class centre. The class centres zi are

then updated using the following formula:

z_i = \frac{\sum_{x \in C_i} h_x g_x}{\sum_{x \in C_i} h_x} \qquad (2.9)

where zi is the class centre, i is the class, Ci is the set of grey levels in class i, hx is the number

of pixels for grey level x, and gx is the grey level of bin x. The convergence of the class centres is

checked and, if the change is below a threshold, the segmentation process stops; otherwise the grey levels are reassigned to class centres and the class centres are updated and checked for convergence again.
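A minimal Python sketch of this iterative procedure, operating on a grey-level histogram, is shown below; the evenly spaced initialisation and the convergence tolerance are illustrative assumptions.

```python
import numpy as np

def hcm_histogram_split(hist, n_classes, max_iter=100, tol=1e-3):
    """Hard c-means splitting of a grey-level histogram, following Equation 2.9.
    hist[x] is the number of pixels with grey level x; returns the class centres."""
    hist = np.asarray(hist, dtype=np.float64)
    levels = np.arange(len(hist), dtype=np.float64)           # g_x
    centres = np.linspace(levels[0], levels[-1], n_classes)   # initial z_i
    for _ in range(max_iter):
        # Assign every grey level to its nearest class centre.
        labels = np.argmin(np.abs(levels[:, None] - centres[None, :]), axis=1)
        new_centres = centres.copy()
        for i in range(n_classes):
            members = labels == i
            if hist[members].sum() > 0:
                new_centres[i] = (hist[members] * levels[members]).sum() / hist[members].sum()
        converged = np.abs(new_centres - centres).max() < tol
        centres = new_centres
        if converged:
            break
    return centres
```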

Techniques other than HCM have provided marginal improvement. However, all methods are

limited by the fact that histogram splitting essentially quantises the colour space. A problem with

colour quantisation is that a smooth region varying from one colour to another may be quantised

into two or more colours resulting in the region being split when in reality it is perceived as only

one region. Histogram splitting is most effective in specific domains such as medical imaging [70].


Figure 2.2: Image segmentation algorithms. Histogram splitting (a), 8 regions. Watershed (b), 51,746 regions. Region growing and merging (c), 90 regions (threshold = 6). Region splitting and merging (d), 79 regions (threshold = 15).


Region Splitting

Region splitting is based on a quadtree decomposition of the image, driven by the variance within blocks

[69]. Initially the entire image is considered as the starting block and its variance is analysed. If

the variance is above a specified threshold then the block is decomposed into four blocks. If the

variance is below the threshold then the block is considered to contain homogeneous pixel values and

is stored as a region. The operation continues recursively until either a block size of one pixel

is reached or until the variance falls below the threshold. The final segmentation is block-based

and will not accurately represent natural images. Region splitting is usually combined with region

merging to merge neighbouring regions with similar average colour. This technique is called split

and merge. An example of the split and merge technique is shown in Figure 2.2(d).
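The splitting half of the technique can be sketched as a simple recursive quadtree, as below. The variance threshold echoes the kind of value quoted in Figure 2.2(d) but is otherwise arbitrary, and a subsequent merging pass would join neighbouring blocks with similar average colour.

```python
import numpy as np

def quadtree_split(image, threshold=15, min_size=1):
    """Recursively split a grey-level image into four blocks while the block
    variance exceeds `threshold`; returns homogeneous blocks as (y, x, h, w)."""
    regions = []

    def recurse(y, x, h, w):
        block = image[y:y + h, x:x + w]
        if block.var() <= threshold or min(h, w) <= min_size:
            regions.append((y, x, h, w))
            return
        h2, w2 = h // 2, w // 2
        recurse(y, x, h2, w2)
        recurse(y, x + w2, h2, w - w2)
        recurse(y + h2, x, h - h2, w2)
        recurse(y + h2, x + w2, h - h2, w - w2)

    recurse(0, 0, image.shape[0], image.shape[1])
    return regions
```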

Region Growing

Region growing assumes that neighbouring pixels of similar intensity will be part of the same

region. Therefore the technique involves walking throughout the image and comparing each pixel

with its neighbours. If the difference between a pixel and its neighbour is below a threshold then

the neighbour is added to the same region as the pixel being tested. Regions are grown until all

of the pixels have been labelled. Region growing has been shown to be suitable for natural image

segmentation [74].
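The following Python sketch captures the basic idea: starting from unlabelled pixels, 4-connected neighbours whose grey levels differ by no more than a threshold are flooded into the same region. The threshold of 6 echoes the value quoted in Figure 2.2(c), but the implementation details are illustrative assumptions.

```python
import numpy as np
from collections import deque

def region_growing(grey, threshold=6):
    """Grow regions by flood-filling 4-connected pixels whose grey levels
    differ by at most `threshold`; returns an integer label per pixel."""
    labels = np.full(grey.shape, -1, dtype=np.int32)
    next_label = 0
    for seed in zip(*np.nonzero(labels < 0)):     # every pixel, in scan order
        if labels[seed] >= 0:
            continue
        labels[seed] = next_label
        queue = deque([seed])
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < grey.shape[0] and 0 <= nx < grey.shape[1]
                        and labels[ny, nx] < 0
                        and abs(int(grey[ny, nx]) - int(grey[y, x])) <= threshold):
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
        next_label += 1
    return labels
```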

Watershed

The watershed algorithm for image segmentation is similar to region growing, however, the tech-

nique works on a grey level version of the image and progressively floods local minima in the

image until the entire image is submerged. The technique works by first starting at the lowest

grey level and searching for neighbouring pixels which have this grey level value. Neighbouring

pixels at the same grey level are joined into a region. As the grey level increases new pixels are

added to neighbouring regions or new regions are created if the pixels are isolated. This technique

is suitable for images which have meaningful grey levels. In natural images where the grey level

is not significant but rather the change in grey level, an edge map can be computed to make the

edges significant for region boundaries. In general, watershed tends to produce more segments than

region growing or histogram splitting. Manual placement of markers can help reduce the number of

regions detected. Watershed’s sensitivity to local changes can also be reduced by initially blurring

the image which acts as a low-pass filter. However, the performance of the watershed algorithm

when applied to natural images is much worse than region growing or splitting as can be seen in

Figure 2.2(b) and shown in [74].


Region Merging

The image segmentation algorithms described so far generally over-segment and require an addi-

tional stage to merge regions which are part of the same perceptually significant region. Merging

criteria vary in complexity from simply comparing average colour to merging regions based on

Gestalt grouping laws. Some techniques for region merging include average colour, size, edge in-

formation, and laws of perceptual organisation. An example of region growing and merging using

average colour and size is shown in Figure 2.2(c).

The laws of perceptual organisation (Gestalt laws) [75] can be used to group regions. Gestalt

laws treat the brain as a black box and don’t try to describe how the groupings are performed but

rather what groupings occur. The groupings are based on Prägnanz (the law of good figure),

similarity, good continuation, proximity, connectedness, common fate, and meaningfulness [75].

Wardhani and Gonzalez [74] have attempted to approximate some of the Gestalt laws using

good continuation, surroundedness, symmetry and common fate for grouping. The groupings are

stored in a tree structure preserving the original segmentation and also allowing for overlapped

groupings. Good continuation grouping is achieved through extracting edge and line information

from the original image. If the two regions lie on the same continuous line then they can be grouped.

Surroundedness occurs when a region is completely surrounded by another region. Surroundedness

can be an indication that the surrounded region is part of the surrounding region or is in front of

the surrounding region. Symmetry is determined by comparing the shape of two regions that lie

on an axis of symmetry. Symmetrical regions such as the two halves of a jacket may be from the

same object. Common fate can be determined by analysing the motion vectors of regions between

frames. Similar trajectories imply that the two regions are either from the same object or are related.

2.3.7 Combining Points, Lines, and Surfaces

Up to this stage we have investigated techniques for identifying boundaries between regions through

image segmentation and edge extraction. These features, by themselves, may not be adequate to

describe the boundaries between regions in a scene. Some boundaries may only be partially visible,

generating only partial edges. For humans the continuation of a broken boundary is obvious, and

can be filled in preattentively [25]. Therefore, we are able to see lines and surfaces which may not

have well defined boundaries in an image. The boundaries that can be identified must be grouped

together to form inferred lines and regions.

One technique is to identify the vertices within an image. Vertices can be linked together if

they share a common edge, even if the edge is incomplete. Other lines may be inferred from

vertices indicating occlusion. The vertices aid in inferring lines and also inferring regions because

the vertices indicate a connection between lines. The vertices and lines approach will handle objects

with straight edges and sharp corners but may find it difficult to extract partially visible curved

objects.


Another approach is to use a feature adjacency graph (FAG) which groups points, lines, and

segments (regions) [76]. The grouping allows lines to be grouped with points that form part of the

line or the vertex of a line. The scheme also allows lines and points to be grouped, or associated,

with adjacent segments of colour. As mentioned above the extraction of low-level features may not

be complete so certain points, lines, or segments may not be present in the graph to allow for an

accurate grouping. Therefore, the scheme proposed by Fuchs and Forstner [76] tests for perceptual

grouping properties such as identity, point-line-incidence, colinearity, parallelity, and orthogonality.

An iterative procedure generates hypotheses of plausible groupings. Fuchs and Forstner have found

that convergence is usually achieved after 5-10 iterations. Unfortunately, this model does not take

advantage of the presence of vertices in the hypothesis validation process limiting its ability to

detect occluded lines and regions.

Rao [77] has proposed a system for extracting 3D rectangular solids from images and videos. The

system identifies vertices and lines which are grouped together and used to indicate the presence

of a 3D rectangular solid. The system is robust in that it only needs the presence of a few lines

and/or vertices. However, the system is restricted to one particular type of 3D object and does not

extract regions or surfaces.

2.3.8 Shape from Contour, Shading, and Texture

In the literature shape can be the instantaneous three dimensional orientation of every point within

a region or the overall three dimensional orientation of a region. Researchers have been able to

extract the shape of a region by analysing contours, shading, and texture.

Witkin [78] proposed that shape can be extracted by assuming that texture elements do not

simulate projection. Therefore the contours of the texture would obey a perspective projection.

Their technique can be applied to contours as well as textures. However, the method fails when a

texture element is not uniform such as an ellipse or parallelogram [79].

Another method to determine shape from contour is to maximise the area of a region with

respect to the square of the perimeter [80]. This approach assumes that an object with a large tilt

will project a small area on the image whilst retaining a similarly-sized perimeter. The method

has been improved and optimised by Davis et al. [81].

Methods for shape from shading are relatively few and are based on a reflectance model of the

image [24]. The surface radiance, R(x) is determined by the radiosity equation [24]:

R(x) = \frac{\rho}{\pi} \int_{V(x)} R_{src}\, N(x) \cdot u \, d\Omega + \frac{\rho}{\pi} \int_{H(x) \setminus V(x)} R(\Pi(x,u))\, N(x) \cdot u \, d\Omega \qquad (2.10)

where x is a surface point, N(x) is the surface normal, H(x) = {u : N(x) · u > 0} is the hemisphere

of outgoing unit vectors, V(x) is the set of unit directions in which the diffuse source is visible from

x, dΩ is an infinitesimal solid angle, and Π(x,u) is the surface point visible from x in direction u (Π

denotes projection) [24]. The algorithm works by associating each pixel with a node N (x, y) which


initially starts out with a depth of zero and increases as the algorithm progresses. The algorithm

can be very time consuming because new nodes are inserted into the skyline of every other node.

The shape-from-shading algorithm has been shown to be able to accurately retrieve the depth from

intensity data. However, the technique does not reliably detect the depth of surfaces which subtend

a small angle from the source of light. The change in intensity is not large enough for the depth to

be reliably detected [24].

Shape can be extracted from texture by analysing the spatial frequency of the texture. A high

spatial frequency can indicate a large tilt whilst a low spatial frequency can indicate a small tilt. As

the spatial frequency changes depth information can also be extracted. Early work in extracting

shape-from-texture was conducted by Bajcsy et al. [82] using the Fourier transform of moving

windows. The wavelength of the texture can be computed from the Fourier transform and is used

to determine the relative depth of regions of a surface. A generalisation of this technique has

been proposed by Jau and Chin [79] using the Wigner distribution [83]. The Wigner distribution

provides a 4D representation of the spatial frequency content of an image. The 4D representation

is computed through a Fourier transform of a neighbourhood of pixels for every pixel in the image.

The texture density is determined for every pixel from the Fourier transform providing a measure

of the spatial frequency at a point. The texture density is used to create a texture density map

which is used to determine the surface orientation and relative depth. Experimental results show

that the technique works with less than 10% error when a window size of 16 × 16 pixels or larger

is used [79].

2.4 Representation

The previous section discussed features that are useful for CBVR. The form of the features ex-

tracted may provide a good similarity measure but may not be efficient for storage or querying.

This section discusses techniques for representing features both in terms of shape representation

and relative position.

2.4.1 Shape Representation

A useful tool in an image retrieval system is to query by the shape of an object. A simple technique

for representing the shape of extracted objects would be to store a pixel level outline of each object.

Finding similar shapes would consist of a point-by-point comparison. This technique is limited in

its application because the query shape and indexed shape may have different positions, scales and

orientations. In addition, the technique would not be robust to noise or to shape distortions of the

object. It is clear that a shape representation technique is required which is invariant to a number

of transforms.

Positional invariance is not difficult to achieve even with a pixel level representation as the


Figure 2.3: Scale invariance by storing the angle between tangent vectors.

co-ordinates of each point can be stored relative to the centre of an object rather than as absolute

values. Scale invariance is a little more difficult but can be achieved by storing the difference

between tangential angles of adjacent points (Figure 2.3).

Using the tangential angles of each point a Fourier description of the shape of the object can

be generated. The Fourier descriptors describe the shape in terms of the frequency of the outline.

The major form of the object can be described by the low frequency coefficients of the Fourier

transform, whilst fine changes in the outline are represented by high frequency coefficients. By

only comparing a few of the low frequency coefficients the comparison can be made robust to

slight variations caused by noise or distortions. In addition, the Fourier descriptors are invariant to

rotation because the descriptors aren’t ordered by their orientation relative to the object’s centroid.
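A common way to realise such descriptors is via the complex boundary representation rather than tangential angles. The Python sketch below is an illustrative variant, not the formulation used elsewhere in this thesis: it keeps a few low-frequency FFT magnitudes, normalised so that the comparison is insensitive to translation, scale, and rotation.

```python
import numpy as np

def fourier_descriptors(contour, n_coeffs=10):
    """Low-frequency Fourier descriptors of a closed contour (an N x 2 point array)."""
    z = contour[:, 0] + 1j * contour[:, 1]   # complex boundary representation
    coeffs = np.fft.fft(z)
    coeffs[0] = 0                            # drop the DC term: translation invariance
    mags = np.abs(coeffs)                    # drop phase: rotation/start-point invariance
    mags /= mags[1]                          # normalise by the first harmonic: scale invariance
    return mags[1:n_coeffs + 1]              # keep only a few low-frequency coefficients

def shape_distance(d1, d2):
    """Compare two shapes by the Euclidean distance between their descriptors."""
    return np.linalg.norm(d1 - d2)
```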

Because of the limitation of existing feature extraction techniques many shape representation

methods are designed for 2D objects. Two dimensional objects generally represent three dimen-

sional objects and from scene to scene may change position and orientation. Through a 2D projec-

tion a change in 3D position may generate a 2D position or scale change. If a 3D object rotates, the

2D projection may rotate but could also include a more complex 2D transform. The affine trans-

form is an approximation to the perspective transform and is composed of translation, rotation,

scale, and shear transforms. Fourier descriptors are invariant to the first three affine transforms

(known as the similarity transform) but are not invariant to shear transforms. Arbter et al. [84]

developed a set of affine invariant Fourier descriptors which were applied successfully to silhouettes

of rotating aircraft models.

A different approach has been proposed by Scarloff and Pentland [85] using eigenmodes. Their

method involves describing a shape using Galerkin interpolation which produces a finite element

model. The eigenmodes (eigenvectors) are computed from the finite element model, which describe

how each mode deforms the shape. The first three eigenmodes represent translation and rotation,


and the rest are non-rigid modes [85]. The non-rigid modes are ordered by frequency where low

frequency modes represent global deformations whilst high frequency modes represent local defor-

mations. The eigenmodes can be used for object recognition which is invariant to affine transforms,

noise and deformations. The eigenmodes have been used in the Shape Photobook [20] where the

first 22 modes were used for comparison.

Transform invariant shape descriptors are required for 2D shapes to compare the projections of

3D objects in different images. However, such descriptors may not be necessary if the full 3D shape

of the 2D projection can be determined. Even so, a 3D shape description should still be invariant

to position and rotation. Furthermore, for a system to achieve the human ability to match objects,

a 3D shape description would need to be robust to noise and also invariant to global and local

deformations.

2.4.2 Spatial Representation

The problem of spatial representation stems from the need for spatial relationship queries between

extracted objects. For example, a user may want to retrieve images which contain blue sky at the

top and green mountains at the bottom. The problem can explode as every object is compared

with every other object to determine their spatial relationships. To reduce the size of the problem

at query time, researchers have tried to index some of the spatial relationships when an image is

being added to the system.

One of the original techniques for spatial indexing was 2D strings [86]. 2D strings represent the

relationships between objects by storing the order of objects for each column and then also for

each row without storing the actual position of each object. An example of a 2D string is shown

in Figure 2.4. Pictures are compared by comparing substrings and subsequences. For a string that

represents the order of objects in columns, a local substring is the string for one column (Figure

2.4), and for a string representing rows, a local substring is the string for one row. Matching occurs

by determining whether one picture contains a subsequence of another. There are type-0, type-1,

and type-2 subsequence matchings. A type-0 matching occurs when objects of the query image

must be in the same order on an axis as the database image or project to the same position.

Type-1 matching is more strict because objects in different positions can not project to the same

position. Type-2 matching is the strictest as all of the relative positions of objects in the query

image must match the relative positions of some of the objects in the database image. The primary

limitation with the 2D string approach is that exact matches must occur and only along two axes.
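To make the representation concrete, the short Python sketch below derives the column and row orderings of a set of objects from their (hypothetical) centroids. The full notation additionally distinguishes '<' and '=' relations and supports repeated symbols, which are omitted here for brevity.

```python
def two_d_string(objects):
    """Build a simplified 2D string from named objects with (x, y) centroids:
    the object symbols ordered along the x axis and along the y axis."""
    by_x = [name for name, _ in sorted(objects.items(), key=lambda item: item[1][0])]
    by_y = [name for name, _ in sorted(objects.items(), key=lambda item: item[1][1])]
    return by_x, by_y

# Example with hypothetical centroids.
print(two_d_string({'a': (0, 0), 'b': (1, 1), 'c': (2, 1), 'd': (0, 2)}))
# -> (['a', 'd', 'b', 'c'], ['a', 'b', 'c', 'd'])
```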

Another problem with 2D strings is that they treat objects as points and ignore other spatial

relationships such as disjoint, touches, intersects, and contains. There are 13 of these relationships

for one dimension which make up 169 types for two dimensions [87]. To cater for such queries Liang

et al. [87] proposed the R string representation which stores the co-ordinates of the minimum

bounding rectangle (MBR) for each object. To minimise the complexity of executing queries,

complex queries are broken down into simple queries which are executed in order of complexity.


Figure 2.4: 2D string; the pictured arrangement of objects yields (ad < ab < c, aa < bc < d).

Gudivada and Raghavan [88] proposed a technique where an image is represented as a graph

with the edges indicating spatial relationships between objects. Edges are stored with the object

ids and also the slope of the edge. The number of edges stored for an image is n(n− 1)/2, where n

is the number of objects in an image. The similarity between two images is based on the number

of common edges and also the difference in angle between common edges. If all edges have the

same rotation angle then the database image is a perfect rotational variant of the query image.

If the rotation angles differ between edges then the database image is a multiple rotation variant

of the query image. The smaller the number of multiple rotations, the more similar the images are.

Experiments performed by Gudivada and Raghavan [88] have shown that their spatial similarity

performs better than type-0, type-1, and type-2 2D string queries. Type-0 2D string queries per-

formed similarly to the proposed algorithm although the complexity of the type-0 match is far

greater.

El-kwae and Kabuka [89] extended the work of Gudivada and Raghavan [88] to allow for

topological spatial queries. The topological relations used by El-kwae and Kabuka were similar to

those used by Liang et al. [87], including the disjoint, meets, contains, inside, overlap, covers, and equals relations. Topological relations can be useful because they are invariant under translation, scaling, and rotation transforms. El-kwae and Kabuka [89] also incorporated a rotation correction

angle (RCA) into the similarity function which can make comparisons more robust under rotations.

Smith and Chang [22] have developed a system for spatial and feature image query called SaFe.

The system indexes the minimum bounding rectangles of regions, the area of the MBR, and other

features such as colour and texture. The similarity between regions is determined by the difference

in position, area, spatial extent (width and height), and object features. Multiple region queries

are handled using 2D strings. The 2D strings are created at query time after all other comparisons

have been made to reduce the complexity of the query. Smith and Chang [22] also provide a simple

mechanism for rotational invariance which uses an additional 2D string projection at 45° to the normal projection. Image rotations of 90° and 135° are handled by flipping the x and y projections of the 0° and 45° 2D strings. Generating 2D strings at query time eliminates the need to store

them, and the generated 2D strings only contain objects relevant to the query. Smith and Chang

[22] show that the SaFe system is able to produce much better query results than colour histogram

comparisons alone.

Li et al. [90] have proposed a query mechanism which is independent of the indexing scheme

and allows for searching by content, spatial and temporal rules, fuzzy conjunctions, and semantics.


The scheme uses sub-goal ordering, query block management, and dynamic search to execute the

query in the most efficient way.

2.4.3 MPEG-7

MPEG-7 [91] is a standard for feature representation in audiovisual systems. Conforming MPEG-7

systems need only support the MPEG-7 format when interfacing with other systems thereby making

MPEG-7 largely an interchange format. However, many of the descriptors are compact vectors

and could also be used as the primary storage format. MPEG-7 defines formats for representing

colour, texture, shape, motion, and face recognition within an image or video sequence. Even

though MPEG-7 is a format as opposed to a feature extraction process, some of the descriptors

will assume that certain feature extraction techniques are used in generating the descriptor. For

example, the Scalable Color Descriptor requires that the HSV colour space is used as opposed

to the HVC colour space. Likewise the Homogeneous Texture Descriptor uses Gabor filters with 6

orientations and 5 scales. By placing some constraints on the feature extraction techniques used,

MPEG-7 becomes a partial standard for feature extraction as well.

MPEG-7 is a large work and is the most comprehensive standard for representing many kinds

of audiovisual features. The breadth of MPEG-7 and the focus on providing a standard interchange

format will allow for the integration of different commercial systems that may focus on disjoint

features such as audio and video. For the purposes of research, the MPEG-7 format can be a little

restrictive; for example, it may be shown that Gabor filters with 12 orientations instead of 6 provide

better texture retrieval for certain applications. Fortunately, MPEG-7 has also been designed to be

extensible and such modifications can be made, although support for these extensions from other

systems cannot be guaranteed.

2.5 Psychology

The purpose of content-based retrieval is to retrieve multimedia objects with the same ranking

of similarity that would be given by a human. Therefore rather than looking at the problem

of feature extraction and image similarity from a purely signal processing point of view it is

worthwhile to consider how the human brain processes images and determines similarities. Some

of the techniques reviewed in this chapter show a psychological basis for their construction such

as the three perceptual dimensions of texture [68] used in the 2D Wold decomposition [36] and

the Gestalt grouping laws [75] used in grouping segmented image regions [74]. In this section we

discuss aspects of the human vision system that are useful for feature extraction and determining

image similarity. The conclusions in this section are drawn from a more detailed review of human

vision presented in Appendix A.

The human brain is structurally different to a conventional computer. The human brain con-


sists of many parallel processing neurones whereas conventional computers are largely serial. Even

though CPUs (central processing units) are becoming increasingly parallel, the order of parallelism

is usually around 10 units that can operate independently at the same time. This is in contrast to

the billions of neurones in the human brain. The difference in architecture between the brain and

a computer does place some limitations on the usefulness of simulating human vision mechanisms.

However, it is worthwhile to investigate what is currently known about human vision for inspiration

in determining feature extraction and image similarity techniques.

Vision processing occurs in the brain along a number of parallel pathways flowing from the

retina to the visual cortex at the back of the brain and then along the sides and top of the brain

before synapsing with other systems such as memory and the central executive system [75] (see

Section A.1 and Figure A.2). The multi-staged parallel architecture of vision processing provides

some clues to how vision is processed in the brain. Some features that are processed in parallel

pathways from the retina to higher level components of the visual cortex include motion, structure,

colour, orientation, and separate left and right eye processing [75].

The retina consists of short, medium, and long wavelength cone photoreceptors which are used

to detect colour (see Section A.2). The short, medium, and long wavelengths roughly correlate

with the RGB colour space and provide a basis for using RGB images as input for a feature

extraction process. The output from the photoreceptors is immediately processed by Ganglion

cells to transform the colour signals into an opponent colour model (see Figure A.3). One of the

advantages of using an opponent colour model is that the chrominance is separated from the

luminance and hence there is less correlation between the colour components. Separate pathways

handle the luminance and chrominance signals in the visual pathway.

The signals from the retina flow down the optic nerve through the lateral geniculate nucleus

(LGN), which is in the centre of the brain, to the primary visual cortex (V1), which is at the back

of the brain. Hubel and Wiesel [92], through experiments on the cat visual cortex, found that the

neurones in V1 had receptive fields that responded to oriented stimulus. Later research by Hubel

and Wiesel [10] found that these oriented cells, known as simple cells, were arranged in repeating

10° oriented columns called hypercolumns (see Section A.4 and Figure A.5). The receptive fields change at 10° intervals over 180° before repeating, resulting in 18 orientations being used to represent edge orientation. This is vastly greater than the 2 or 4 orientations used by fixed mask edge detectors [57] and even the 6 orientations often used with Gabor filters [60]. The orientation tuning curves of simple cells show that they will respond to a stimulus with an orientation greater than 10° in deviation from the orientation of the receptive field, indicating that multiple simple cell responses

are used to determine the exact orientation of an edge (see Figure A.8).

Signals flow from V1 to V2 and V3. Hubel and Wiesel [93] found that V2 consists mainly

of complex cells and a few hypercomplex cells and contains no simple cells. Hypercomplex cells

exhibit very specific receptive fields responding to complex features such as line-ends, corners, and

particular directions of motion. The fact that there are no simple cells in V2 indicates that the

complex cells take input from the simple cells in V1, and the greater number of complex cells than


hypercomplex cells in V2 probably indicates that hypercomplex cells take input from complex cells.

Since neurones in the visual cortex can be simulated by image convolution filters such as the Gabor

filter it is possible that human vision is most accurately simulated using multiple stages of image

convolution filters for the ganglion cells in the retina, the simple and complex cells in V1, and

the complex and hypercomplex cells of V2. Marr [56], Grossberg et al. [94], and Heitger et al. [12]

have proposed signal processing approaches to simulate the multi-staged architecture of the visual

cortex (see Section 2.6 for a discussion of these computational models).

Higher up the visual pathway, processing splits into the form-colour pathway and the motion-

structure pathway. V4 and IT have been found to process shape, colour, and texture [95]. Neurones

further along the visual pathway respond to more complex stimuli but become less specific to

orientation, size, and position. Also neurones in V4 and posterior IT respond to specific shapes,

textures, and colours, whilst cells in anterior IT respond to combinations of shape, colour, and

texture and become less dependent on size and position of stimulus. Once again the multistage

architecture of combining inputs from previous stages becomes obvious. This parallel architecture

for representing objects regardless of orientation, size, and position may be unnecessarily complex

for computers which could parse the features of an image and store objects in an array without

requiring a massively parallel neural simulation to achieve the same representation.

Less is known about vision processing beyond V4 and IT. At this point we must look at high-

level vision processing theories such as Biederman’s recognition-by-components [96] and Kosslyn’s

high-level theory for seeing and imaging [5] (see Section A.8). However, these theories do not provide enough detail to use as a basis for a feature extraction method, although they can be used to guide the design of feature extraction techniques.

Even less is known about how the brain determines similarity between images. Tversky [9]

proposed hierarchical feature sets and methods for determining similarities between feature sets

based on intersections and differences. Such an approach could be applied to the representation

and querying phases of a content-based retrieval system.

2.6 Computational Models of the Visual Cortex

To validate vision processing models researchers have implemented systems based on neurophysio-

logical evidence [56, 94, 12]. These models show how cortical architectures can be mapped to digital

signal processing techniques and provide a physiological basis and motivation for edge detection

techniques.

2.6.1 Primal Sketch

Marr [56] proposed that human vision is highly localised and parallel, and that vision is constructed

from sharp and smooth intensity changes. Sharp intensity changes may represent the junction of


two surfaces or the occlusion of one surface over another. Smooth intensity changes may represent

the curvature of a surface or the edge of a shadow. Intensity changes can be detected locally, and this detection is proposed to be the first stage of processing. Marr proposed the zero-crossing detector, which detects the two-dimensional zero-crossings of the image gradient. Zero-crossings can be detected simply by subtracting the values of adjacent pixels; however, a more accurate method also involves a Gaussian smoothing operation. The optimal operator is known as the Laplacian, ∂²I/∂x² + ∂²I/∂y².

The Laplacian operator has a receptive field very similar to ganglion and LGN cells which

represent the first stages of visual processing. To detect intensity changes of different spatial sizes

multiple Laplacian operators of different sizes are applied to the image. The raw primal sketch is the

description of each of the channels of operators with different sizes. At the next stage of processing

lines and edges are detected by AND-ing outputs from lines of zero-crossing operators. The result

is a representation of local orientations similar to simple and complex cells in the primary visual

cortex.
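As a rough illustration of this first stage (a sketch in Python using NumPy and SciPy, not code from the thesis), the Laplacian of a Gaussian-smoothed image can be computed at one scale and its zero-crossings marked by sign changes between adjacent pixels:

import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def zero_crossing_map(image, sigma=2.0):
    # Gaussian smoothing followed by the Laplacian (d2I/dx2 + d2I/dy2).
    response = laplace(gaussian_filter(image.astype(float), sigma))
    # Mark pixels where the response changes sign horizontally or vertically.
    zc = np.zeros(response.shape, dtype=bool)
    zc[:, :-1] |= np.sign(response[:, :-1]) != np.sign(response[:, 1:])
    zc[:-1, :] |= np.sign(response[:-1, :]) != np.sign(response[1:, :])
    return zc

Applying the same operator at several values of sigma gives the multiple channels that make up the raw primal sketch.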

The next stage of processing attempts to integrate the local information to derive global prop-

erties of an image. For example, a line of simple cells firing will indicate a connected contour. Also

at this stage “virtual” or illusory contours are detected such as the border between textures or

partially occluded contours. The result is called the full primal sketch.

2.6.2 Grossberg

Another pioneer in computational modelling of low- to intermediate-level vision processing is Gross-

berg [94]. Grossberg’s model is based on two subsystems which process boundary contours (BCS)

and feature contours (FCS). The BC system detects discontinuities in images to form boundary

contours which may also include illusory contours. The FC System responds to colour and texture

and acts as a filling in process, filling in spatial areas up to boundary contours, whether real or

illusory. Grossberg’s model has evolved over the years gradually including more types of cells found

in the visual pathway. This synopsis is taken from [97].

The BC system begins at the LGN where on and off cells detect local discontinuities which

are not directionally sensitive. Simple cells receive input from LGN cells allowing them to detect

edges and bars. Complex cells integrate simple cell responses of opposing contrast, representing

local orientations which are independent of contrast direction. The model moves on to simulate

hypercomplex cells which are activated by complex cells with orientations roughly 90° apart to

detect line-ends and corners. Also hypercomplex cells are inhibited by nearby complex cells to

perform spatial sharpening. Higher order hypercomplex cells perform orientation competition be-

tween hypercomplex cells at the same position which then activates long-range bipole cells. Bipole

cells initiate long-range boundary completion and grouping by being activated by same orientation

higher-order hypercomplex cells and inhibited by different orientation higher-order hypercomplex

cells. Outputs from bipole cells feed back to hypercomplex cells in a cooperative-competitive (CC)


loop. The hypercomplex cells also provide feedback to LGN cells which has been confirmed through

the discovery of length tuned cells in LGN [98].

Grossberg’s model has been applied to illusions [25, 99], occluded images [11], and synthetic

aperture radar processing [99] and appears to simulate perceptual responses. Grossberg’s cooperative-

competitive feedback loop is supported by Biederman’s [96] results where it takes almost a second

to perform contour filling in, indicating a more complex process than a simple feed forward network.

Grossberg’s model is currently the most complete, however comparing it to evidence presented in

the human vision literature review of Appendix A shows that Grossberg’s model is still a simplified

implementation of the visual cortex.

2.6.3 Heitger

The main opposing model to Grossberg’s is by Heitger et al. [12, 26]. Heitger et al. propose a model

which contains no feedback loops and can be represented entirely by mathematical operators. The

model begins with simple cell operators based on even and odd Gabor filters (Grossberg also used

Gabor filters in later models [99]). The even Gabor filter acts as a line detector whilst the odd

Gabor filter acts as an edge detector. Heitger et al. [12] used a modified Gabor filter called a

“stretched Gabor” or “S-Gabor” which reduces the frequency of the periodic component at the

extents of the Gaussian envelope (see Equations 4.3 to 4.5).

The S-Gabor filters represent the direction of contrast by the sign of the response. Complex cells

integrate the response of simple cells by squaring and adding the Gabor responses. The value is

also square rooted to obtain the same contrast response as the simple cells. Complex cells feed into

end-stopped cells which are either single-stopped or double-stopped. Single-stopped cells subtract

the responses of two complex cells at a distance d apart:

E_S(x, y) = [C(x − d, y) − C(x + d, y)]⁺  (2.11)

while double-stopped cells subtract the responses of two complex cells at a distance 2d from a

centre complex cell:

E_D(x, y) = [C(x, y) − ½(C(x − 2d, y) + C(x + 2d, y))]⁺  (2.12)
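The two responses can be sketched directly from a 2-D array of complex cell responses C for one orientation; the Python code below is only an illustration of Equations 2.11 and 2.12 (with the half-wave rectification [·]⁺ applied to the whole expression), not Heitger et al.'s implementation, and the lobe separation d is a free parameter.

import numpy as np

def end_stopped_responses(C, d=2):
    rectify = lambda r: np.maximum(r, 0.0)        # half-wave rectification [.]+
    shift = lambda k: np.roll(C, k, axis=1)       # shift(k)[y, x] = C(x - k, y)
    ES = rectify(shift(d) - shift(-d))                      # Equation 2.11
    ED = rectify(C - 0.5 * (shift(2 * d) + shift(-2 * d)))  # Equation 2.12
    return ES, ED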

However, using this approach end-stopped cells can also respond to the middle of lines in

addition to just line ends. Heitger et al. solved this problem by implementing surround inhibition

from complex cells. In a later paper Heitger et al. [26] described a process for grouping end-

stopped cell responses. Their approach is similar to Grossberg’s bipole cells although they are able

to detect which side of the illusory contour is in the foreground and which is in the background.

The approach by Heitger et al. is simpler than Grossberg’s because it doesn’t contain any feedback

loops. However, feedback loops may be necessary for contours to emerge when surrounded by

contradictory features.


2.6.4 Walters

Another approach has been proposed by Walters [100] which only processes black and white images.

The result is a model which can be described purely by individual bits being on or off. The model has

the ability to represent illusory contours and to enhance cartoon images. It also has the advantage

of being very fast to execute. Its applicability to content-based retrieval however is limited because

it is not designed for natural images.

2.6.5 Conclusion

Much research has been performed to understand the mechanism of human vision; however, most of the knowledge is focussed on the earlier stages of vision processing, such as the retina and V1, which are relatively simple to simulate and verify with computer technology. Less is known about

how individual features are grouped to form complex objects which still remains a challenge for

content-based retrieval research. Even so, the outcomes of vision research are used as a guide for

this thesis for extracting colour, edge, texture, and contour features and also for clustering similar

images.


Chapter 3

Colour

Colour is one of the primary features used to represent and compare visual content. Colour extrac-

tion can be relatively simple and much research has been performed in the area [15, 21, 27, 101].

The challenge lies in extracting colour quickly in a compact representation that can be queried

efficiently. Image colour is readily accessible as most image storage formats can be converted to

RGB format for display purposes. The two problems with raw RGB pixel data are:

1. The RGB colour model is not perceptually uniform, and

2. The raw pixel data is not compact

Therefore a better colour model is required along with a better form of representation. This

chapter discusses colour models, representation techniques, and comparison techniques. New rep-

resentation and comparison techniques are presented including fuzzy histograms and prominent

colours that provide better results using more compact representations than existing techniques.

3.1 Colour Models

Content-based retrieval systems must be able to compare features using a relatively simple compu-

tational process that produces results similar to human perception. The RGB format often used

for images in computer memory is not perceptually uniform which means that the Euclidean dis-

tance between two sets of points in RGB colour space will not give the same results as human

perception.

The RGB colour space has some similarity to the short, medium, and long wavelength photore-

ceptors in the retina but with overlapping response curves [75]. The outputs from the photorecep-

tors are processed by ganglion cells to produce an opponent colour model consisting of white-black,

blue-yellow, and red-green components [75]. The advantage of opponent colour models is that the


luminance component (white-black) is extracted separately from the chrominance components.

This can be important as luminance is the primary component for determining boundaries and

shape [96]. The result is that opponent colour models are less correlated than RGB making oppo-

nent colour models more suitable for compression.

The chrominance components do not represent the amplitude of light waves but instead repre-

sent the hue of a colour. For example, YUV represents chrominance as the colour ranges blue-to-yellow (U) and red-to-green (V):

Y = 0.265R + 0.670G + 0.065B

U = (B − Y) sin 33° / 2.03

V = (R − Y) sin 33° / 1.14

Swain and Ballard [21] used a simpler integer computation of the opponent colour axes which

are defined as:

rg = r − g (3.1)

by = 2× b− r − g (3.2)

wb = r + g + b (3.3)

Even though the visible light spectrum begins with the colour red and ends with the colour vi-

olet, the brain perceives colours as a colour wheel (Figure 3.1) where violet merges again with red.

Colour models that represent the hue as a colour wheel more closely model the human perception

of colour than opponent or RGB colour models. Hue-based colour models generally have saturation and value as their other components. Saturation refers to the amount of the hue's wavelength relative to the presence of other wavelengths, and value refers to the overall amplitude of the light signal. Perceptually motivated colour models include HVC (hue, value, chroma) [102]

and HSV (hue, saturation, value) [27].

A content-based retrieval system requires a colour model that can be efficiently transformed

from RGB and will model human perception. The performance of RGB and perceptual colour

models will be evaluated in Section 3.3 in the context of histogram representations of colour.

3.2 Colour Representation

The basic requirement of colour representation is to represent the distribution of colours in an

image. In the following sections we will discuss two existing approaches for representing colours:

histograms and colour sets, followed by a new approach that attempts to extract the most promi-

nent colours in an image.


Figure 3.1: Colour wheel. Hue runs from red (700 nm) through orange, yellow, green, and blue to violet (420 nm), with purple joining violet back to red.

3.3 Histograms

Histograms attempt to represent the most significant colours in a scene by quantising the colour

space into bins and quantifying the number of pixels that fall into each bin [21]. The bins with the

largest number of pixels will contain the most significant colours. Colour histograms can be simple

to construct. They generally consist of three dimensions representing each colour axis such as RGB

or HSV. To produce a compact feature vector the number of bins, N, along each dimension must be kept small as the total number of bins increases proportionally to N³. A sample RGB colour

histogram is shown in Figure 3.2. As can be seen in this example the three colour axes are highly

correlated as each distribution roughly follows the other.

Figure 3.2: Colour histogram.


When histograms are compared the goal is to find similar quantities of the largest colours from

the two images. One approach is to treat the histogram bins as one feature vector in multidimen-

sional space and use the Euclidean distance as a distance measure [16].

One of the more reliable histogram comparison techniques implemented so far is histogram

intersection [21]. Histogram intersection adds up the minimum values between each pair of corre-

sponding bins. If there is a large overlap between the bin values of two corresponding histograms

then the minimum values will also be large. Therefore two similar histograms will result in a large

intersection value. This makes histogram intersection a similarity measure as opposed to a dis-

tance measure. Swain and Ballard [21] showed that the histogram intersection similarity measure

is equivalent to the absolute difference distance measure if the absolute difference is divided by 2

and subtracted from the number of pixels:

I(i, j) = k −A(i, j)/2 (3.4)

where I(i, j) is the histogram intersection between histograms i and j, A(i, j) is the absolute

difference between histograms i and j, and k is the number of pixels. Either technique may be used

depending on whether a similarity measure or distance measure is required.
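As a minimal sketch (assuming the histograms are NumPy arrays of bin counts built from images with the same number of pixels k), both measures reduce to one line each:

import numpy as np

def histogram_intersection(h1, h2):
    # Similarity: sum of the minimum of each pair of corresponding bins.
    return np.minimum(h1, h2).sum()

def absolute_difference(h1, h2):
    # Distance: sum of absolute bin differences.
    return np.abs(h1 - h2).sum()

# Equation 3.4: histogram_intersection(h1, h2) == k - absolute_difference(h1, h2) / 2
# when both histograms contain k pixels.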

3.3.1 Colour Histogram Experiments

Colour histograms are a well established colour representation technique and are evaluated first to

be used as a benchmark for the other techniques investigated. The experiments were performed on

a database of real world photos [103] by performing three different similarity searches each with a

different photo and comparing the top ten results.

Two different colour spaces were used, RGB and a perceptually motivated colour space. The

RGB colour space is not perceptually uniform and as a result the fixed histogram ranges may

over-represent one portion of the colour space and under-represent another portion. A colour space

that has been designed to imitate human colour perception is the HV C colour space [102] which

has the three components hue, value, and chroma. RGB colour co-ordinates may be transformed

to the HV C colour space using the CIE(1976)L*a*b* transformation [101]. The CIE(1976)L*a*b*

transformation begins by converting the RGB values into CIE XY Z values using the formulae:

X = 0.607R + 0.174G + 0.201B (3.5)

Y = 0.299R + 0.587G + 0.114B (3.6)

Z = 0.066G + 1.117B (3.7)

The L∗a∗b∗ values can then be obtained from the XY Z values, where X0, Y0, and Z0 represent

the X, Y , and Z values for the reference white.

L∗ = 116 (Y/Y0)^(1/3) − 16  (3.8)

a∗ = 500 [(X/X0)^(1/3) − (Y/Y0)^(1/3)]  (3.9)

b∗ = 200 [(Y/Y0)^(1/3) − (Z/Z0)^(1/3)]  (3.10)

Figure 3.3: Test images for the colour histogram experiments and the most similar images that should be returned as the first results after a query.

Finally the HV C values can be derived from the L∗a∗b∗ values

H = arctan(b∗/a∗) (3.11)

V = L∗ (3.12)

C = √((a∗)² + (b∗)²)  (3.13)

Determining the HV C values from RGB can be a difficult process as can be seen with the

preceding formulas. Smith and Chang [27] used a more tractable transform to HSV colour space.

The algorithm assumes input in the range R, G, B ∈ [0, 1] and produces output H ∈ [0, 6] and S, V ∈ [0, 1]. The algorithm for transforming a point in RGB colour space to HSV is shown in

Algorithm 1.

Since the RGB → HSV transform is simpler and more tractable than the RGB → HV C trans-

form we have decided to use it for the perceptually motivated colour space. The colour histogram

experiments evaluated the performance of RGB and HSV colour histograms using the histogram

intersection technique.

Three test images were used from the sample image database which had definably similar

images which are shown in Figure 3.3. The first image, ‘Car’, is similar to two other car images

which should be returned as the first two results. The second image, ‘Wedding’, is similar to 10

other wedding photos which should be returned as the top ten results. The third image, ‘Bush’, is

an image consisting of a lot of texture in a bush environment and is most similar to one other bush

image and is also similar, but to a lesser extent, to other images in a bush setting. The ten most

similar images returned using each histogram technique for the three query images are shown in

Figures 3.4 to 3.6.


Algorithm 1 RGB → HSV
V ← max ← max(r, g, b)
min ← min(r, g, b)
range ← max − min
S ← range/max
r1 ← (max − r)/range
g1 ← (max − g)/range
b1 ← (max − b)/range
if r = max then
    if g = min then
        H ← 5 + b1
    else
        H ← 1 − g1
    end if
else if g = max then
    if b = min then
        H ← 1 + r1
    else
        H ← 3 − b1
    end if
else if b = max then
    if r = min then
        H ← 3 + g1
    else
        H ← 5 − r1
    end if
end if
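A direct Python rendering of Algorithm 1 is sketched below; the early returns for black and grey pixels are our additions to avoid division by zero, which the pseudocode leaves implicit.

def rgb_to_hsv(r, g, b):
    """r, g, b in [0, 1]; returns h in [0, 6], s and v in [0, 1]."""
    v = mx = max(r, g, b)
    mn = min(r, g, b)
    rng = mx - mn
    if mx == 0.0:                    # black: hue and saturation are undefined
        return 0.0, 0.0, 0.0
    s = rng / mx
    if rng == 0.0:                   # grey: hue is undefined
        return 0.0, 0.0, v
    r1 = (mx - r) / rng
    g1 = (mx - g) / rng
    b1 = (mx - b) / rng
    if r == mx:
        h = 5 + b1 if g == mn else 1 - g1
    elif g == mx:
        h = 1 + r1 if b == mn else 3 - b1
    else:                            # b == mx
        h = 3 + g1 if r == mn else 5 - r1
    return h, s, v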


Figure 3.4: Histogram results for a search on the Car image, for RGB histograms with (3,3,3), (4,4,4), and (5,5,5) bins and HSV histograms with (3,2,2), (6,2,2), (6,3,3), (18,2,2), and (18,3,3) bins; configurations marked 'F' use the fuzzy histogram variant.


Figure 3.5: Histogram results for a search on the Wedding image (same histogram configurations as Figure 3.4).


Figure 3.6: Histogram results for a search on the Bush image (same histogram configurations as Figure 3.4).


3.3.2 RGB Histogram Results

The RGB histogram was tested with 3, 4, and 5 bins for each dimension resulting in a total of

27, 64, and 125 bins, respectively. Even with 125 bins the “Car” query in Figure 3.4 is not able

to return both cars in the top ten results. This can be explained by the high correlation between

each axis in the RGB colour space. The database became noticeably slow when 125 bins were used

indicating that the number of bins used must be much less than 125.

3.3.3 HSV Histogram Results

The HSV histogram was implemented so that the hue, saturation and value dimensions were

quantised uniformly by the number of bins. The hue dimension was segmented so that primary

colours always fell in the centre of bins and when 6 or more bins were used the secondary colours

were also aligned to bin centres. Hue was evaluated with 3, 6, and 18 bins whilst saturation and

value axes were evaluated with 2 and 3 bins.

The HSV results were much better than the RGB results even when only 12 bins were used.

However, 54 bins were required (HSV 633) for the Car test image so that the other two car images

were returned as the top two results. Three bins for saturation and value dimensions performed

better than two bins. However, the increase in hue bins from 6 to 18 did not improve results

dramatically. From these results it can be seen that the HSV colour space performs significantly

better than the RGB colour space for determining image similarity based on colour histograms.

3.4 Fuzzy Histograms

Our goal is to find a representation that is compact but also produces good results. One of the

problems with the colour histogram representation evaluated in the last section is that as the

number of bins decreases the accuracy of the results also decreases. One cause for the decrease in

accuracy is aliasing effects caused by the small number of bins.

Figure 3.7 shows an extreme example where only two bins are used to represent an axis. Figures

3.7 (a) and (e) contain histogram compressed and shifted versions of the plane image. Even though

there is a slight difference in the lightness of the two images, they remain highly similar. The

problem is that even if there is only a small change in the overall colour, a major change in the

histogram may occur. If this shift occurs near a bin border then a substantial number of pixels

can shift to a neighbouring bin (Figure 3.7 (c) and (g)). The result is that the two images have

very different bin quantities. Researchers have attempted to improve the comparison techniques to

produce better results [21]. We have taken a different approach by actually modifying the histogram

creation technique to produce a representation that more accurately reflects the true distribution

of the data whilst using a small number of bins.


Figure 3.7: (a, e) Plane images with compressed and shifted histograms. (b, f) Grey-level frequency distributions. (c, g) Two-bin histograms. (d, h) Fuzzy histograms and membership functions.

Our solution is to use fuzzy membership functions to determine how much a pixel belongs

to one bin. The membership function uses a linear function which decreases from a bin’s centre

to adjacent bin centres, represented by the dashed lines in Figures 3.7 (d) and (h). The linear

membership function is defined as:

b_i = (b_r − |x − b_c|) / b_r    if |x − b_c| < b_r
b_i = 0                          otherwise        (3.14)

where x is the value of the pixel being added to the histogram, b_c and b_r are the centre point and range respectively of bin b, and b_i is the resulting bin increment (0 → 1) for bin b.

The fuzzy increment can be extended to multiple dimensions by combining the bin increment

for each dimension. The product of each dimensional bin increment becomes the final increment

for the bin:

b_i = ∏_{n=1}^{N} b_i(n)  (3.15)

where N is the number of histogram dimensions and bi(n) is calculated using Equation 3.14.

For circular dimensions such as hue the fuzzy increment must consider the wrapping of bins

around the dimension. However, for non-circular dimensions, such as the simple example of Figure

3.7, the membership function is constant from the centre of the outermost bin to the extreme of the axis, as shown

in Figures 3.7 (d) and (h).

The result of applying the fuzzy histogram to the first distribution in Figure 3.7 (b) can be

seen in Figure 3.7 (d) and is much more similar to the fuzzy histogram of Figure 3.7 (h). The fuzzy

histogram is much more descriptive of the distribution and still compatible with existing histogram

comparison techniques resulting in better query results.
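A one-dimensional sketch of the accumulation step is shown below (our illustration, not thesis code); the wrap-around branch handles circular dimensions such as hue, the clipping branch implements the constant membership beyond the outermost bin centres, and a multi-dimensional histogram would multiply the per-dimension increments as in Equation 3.15.

import numpy as np

def fuzzy_histogram_1d(values, n_bins, lo=0.0, hi=1.0, circular=False):
    hist = np.zeros(n_bins)
    b_r = (hi - lo) / n_bins                          # bin range
    centres = lo + (np.arange(n_bins) + 0.5) * b_r    # bin centre points
    for x in values:
        if not circular:
            # Beyond the outermost bin centres the membership stays at 1.
            x = min(max(x, centres[0]), centres[-1])
        for b, b_c in enumerate(centres):
            d = abs(x - b_c)
            if circular:
                d = min(d, (hi - lo) - d)             # wrap distance around the axis
            if d < b_r:
                hist[b] += (b_r - d) / b_r            # Equation 3.14
    return hist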


3.4.1 Fuzzy Histogram Results

Figures 3.4 to 3.6 show the results of the fuzzy histogram technique using the same colour spaces

and number of bins as the previous experiments. The fuzzy histogram results are marked with

an ‘F’. As can be seen the fuzzy histogram technique improves results considerably especially for

histograms with small numbers of bins on each axis. A noticeable improvement is seen in the

RGB colour space when 4 bins are used for each axis. In the HSV colour space fuzzy histograms

performed much better than conventional histograms when (3,2,2), (6,2,2), and (18,2,2) bins were

used.

3.5 Colour Sets

Our work on fuzzy histograms allows histograms to be used with a lower number of bins. In this

section we compare it with another well known method of colour representation called Colour Sets

[27].

Colour Sets take a different approach to conventional histograms. Only one bit is used for each

bin, therefore an image either has the colour or it doesn’t. Because only one bit is used for each bin

up to 8 times the number of bins can be used over conventional histograms (assuming conventional

histograms use one byte per bin). The HSV colour space is used and the hue dimension is divided

into 18 bins whilst 3 are allocated to both the saturation and value axes. Four additional bins are

provided for grey levels. The total number of bins is 166 but because only one bit is required for

each bin only 21 bytes are required to represent the Colour Set. In addition, histogram comparison

is simplified to a simple AND operation. The number of bits set after the AND operation provides

the similarity between the two images [27].

Since a Colour Set can only represent the presence of a colour and not its intensity it can easily

be affected by noise as only one stray pixel can indicate the presence of a colour. To minimise

these problems the source images are first scaled down to smaller sized images and a median filter

is applied to each image.
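A Colour Set can be held as a single Python integer used as a bit field; the sketch below sets a bin's bit whenever any pixel falls into that bin (the thesis relies on down-scaling and median filtering, rather than a count threshold, to suppress stray pixels) and compares two sets by counting the bits common to both.

def colour_set(bin_counts):
    # bin_counts: the 166 HSV bin counts; set a bit wherever the colour is present.
    bits = 0
    for i, count in enumerate(bin_counts):
        if count > 0:
            bits |= 1 << i
    return bits

def colour_set_similarity(set_a, set_b):
    # AND the two bit fields and count the bins present in both images.
    return bin(set_a & set_b).count('1')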

3.5.1 Colour Set Results

Figure 3.8 shows the results for Colour Sets. For the Car test image the two closest car images

were returned in the top three positions. For the Wedding image 9 out of the 10 wedding photos

were returned. For the Bush image Colour Sets performed quite poorly returning the closest bush

image in the sixth position. In comparison to the fuzzy histogram results of Figures 3.4 to 3.6,

Colour Sets provide no better performance than a HSV 322 fuzzy histogram. In addition the

HSV 322 fuzzy histogram is simpler to compute as it doesn’t require a median filtering stage. The

HSV 322 histogram consumes less storage space, 12 bytes compared with 21 bytes. The comparison

complexity is roughly the same for HSV 322 fuzzy histograms and Colour Sets. Therefore, the


HSV 322 fuzzy histogram compares well with the Colour Sets approach and uses less than two

thirds of the storage space.

3.6 Prominent Colours

In looking at colour extraction we took a step back from the existing approaches to try to form

a more idealistic solution. When colour is considered in an image there is generally a handful

of prominent colours which can be used to describe the overall image. Some small variations in

colour may not be visible, hence these variations should either be ignored or grouped with a similar

prominent colour. In this section we present a new technique for extracting the N most prominent

colours and the relative prominence of each colour.

The prominent colours in an image are extracted using the following algorithm:

1. Generate a fine-grained histogram

2. Group all bins into local peaks

3. Select the N most prominent peaks

Each step is described below:

Fine-grained Histogram Generation The prominent colours technique begins by generating

a frequency histogram that allows the most prominent colours to be identified. The number of bins

used in this histogram is much higher than the number bins used in the preceding experiments

because the purpose of this histogram is to more precisely determine the colour values of the most

prominent colours as opposed to generating a compact representation. The HSV colour space was

used which was shown to perform considerably better than the RGB colour space in the preceding

experiments. 120 hue bins, 6 saturation bins and 6 value bins were used, resulting in a total of

4320 bins. Fuzzy histograms were not used as there are many bins and as shown in Section 3.4.1

fuzzy histograms are more beneficial when fewer bins are used.

Bin Clustering Before identifying the prominent colours, pixels from neighbouring bins are

grouped into one bin which represents the central colour of a cluster of colours. The technique we

have used is to iteratively merge neighbouring bins into the bin with the highest quantity.

The algorithm is as follows. For each bin x, it is determined how many neighbours have a larger

quantity than bin x. If the number of neighbours is greater than zero then the quantity of bin x

is divided by the number of larger neighbours and distributed to each larger neighbour. The value

of bin x is then set to zero. This process continues until no further bins are distributed amongst

neighbours.
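A one-dimensional sketch of this merging step is given below (the thesis applies it to the full three-dimensional HSV histogram, where the neighbours of a bin are the adjacent bins along every dimension):

import numpy as np

def cluster_bins(hist):
    """Iteratively distribute each bin's quantity to its larger neighbours;
    the non-zero bins that remain are the local peaks (1-D illustration)."""
    h = hist.astype(float).copy()
    changed = True
    while changed:
        changed = False
        for x in range(len(h)):
            if h[x] == 0.0:
                continue
            neighbours = [n for n in (x - 1, x + 1) if 0 <= n < len(h)]
            larger = [n for n in neighbours if h[n] > h[x]]
            if larger:
                share = h[x] / len(larger)
                for n in larger:
                    h[n] += share
                h[x] = 0.0
                changed = True
    return h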


Figure 3.8: Colour Set and Prominent Colours results for the Car, Wedding, and Bush query images (Colour Set, and 4, 8, and 16 prominent colours).


Prominent Colour Selection The final step is to select the N most prominent colours. This

is achieved simply by finding the remaining bins with the N largest quantities. A good value of N

is determined experimentally and results are presented in Section 3.6.2.

3.6.1 Prominent Colours Storage and Querying

It is the storage stage that distinguishes the prominent colours technique from the histogram

technique. Where histograms use a fixed quantisation space, prominent colours store the central

colour of each cluster along with the number of pixels in each cluster. Assuming 3 bytes are required

to represent the 3 colour components and an extra byte to represent the quantity, then a total of

N × 4 bytes are required for prominent colour storage. If four colours are required for good results

then only 16 bytes are required for storage per image.

Histograms and colour set comparisons are relatively simple as each image contains correspond-

ing bins. With prominent colours both the quantity and the colour vary between images. Therefore

a special comparison technique is required to compare the prominent colours. The comparison tech-

nique is an iterative approach where the two most similar colours between two images are found

and one or both colours are removed until there are no colours left to compare. The technique

essentially attempts to determine the overlap between the two sets of colours if both were laid out

in a pie chart (see Figure 3.9). The overlap between two prominent colours is determined using

the following formula:

O(i, j) = ||PA(i)− PB(j)||min(PA(i), PB(j)) (3.16)

where PA(i) is the quantity of prominent colour i in the set of prominent colours of image A,

PB(j) is the quantity of prominent colour j in the set of prominent colours of image B, and

||PA(i) − PB(j)|| represents the Euclidean distance in the HSV colour space between the two

prominent colours. This overlap value is then added to the total overlap between the two sets of

prominent colours.

Since the algorithm determines the overlap between colour quantities, it is quite likely that

two very similar colours may only overlap by a very small amount. In this case, after the overlap

between the two colours has been determined, the colour with the smaller quantity is removed as

its quantity has accounted for the entire overlap. The colour with the larger quantity remains as

only part of it overlapped with another colour, however its quantity is reduced by the intersection

amount. The benefit of taking this approach is that one image may have 100 green pixels whilst

another image may have 70 slightly darker green pixels and 30 slightly lighter green pixels. This

approach will provide roughly the same results whether a colour is split or not. If both colours

contain exactly the same number of pixels then both quantities are set to zero.

The prominent colours comparison algorithm is iterative and includes an expensive step to find

the most similar pair of colours making it more complex than the histogram intersection or colour

sets to compute.
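The comparison loop can be sketched as follows; prominent colours are given as (hsv, quantity) pairs, and for simplicity the overlap of the closest pair is taken here as the minimum of the two quantities rather than the distance-weighted value of Equation 3.16, so the weighting is an assumption to be adjusted.

import math

def prominent_colour_similarity(colours_a, colours_b):
    """colours_a, colours_b: lists of (hsv, quantity), hsv being a 3-tuple."""
    a = [[list(c), float(q)] for c, q in colours_a]
    b = [[list(c), float(q)] for c, q in colours_b]
    total = 0.0
    while a and b:
        # Find the most similar remaining pair of colours (the expensive step).
        i, j = min(((i, j) for i in range(len(a)) for j in range(len(b))),
                   key=lambda ij: math.dist(a[ij[0]][0], b[ij[1]][0]))
        overlap = min(a[i][1], b[j][1])
        total += overlap
        if a[i][1] == b[j][1]:          # both quantities fully accounted for
            a.pop(i); b.pop(j)
        elif a[i][1] < b[j][1]:         # remove the smaller colour and
            b[j][1] -= overlap          # reduce the larger by the overlap
            a.pop(i)
        else:
            a[i][1] -= overlap
            b.pop(j)
    return total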


Figure 3.9: Prominent colours of the three car images.


3.6.2 Prominent Colours Results

The results for the prominent colours experiments are shown in Figure 3.8. Four, eight, and sixteen

prominent colours were generated for the experiments. The results improve

as the number of prominent colours increases. Unfortunately, even with 16 colours the prominent

colours approach falls just short of the Colour Sets approach for the Car and Wedding images,

however it was able to successfully return the most similar image to the Bush image in the first

position. Generating prominent colours is slower than generating a fuzzy histogram or a Colour Set.

In addition, the comparison complexity is quite high as it is an iterative approach. The prominent

colours approach did however provide better results than a standard HSV 322 colour histogram.

3.6.3 Other Approaches

QBIC [19] uses a similar concept to the prominent colours approach by selecting 256 representative

colours using a greedy minimum sum of squares clustering. Two histograms x and y with K elements are compared using a similarity metric defined by the following formula:

d²(x, y) = Σ_{i}^{K} Σ_{j}^{K} a_ij (x_i − y_i)(x_j − y_j)  (3.17)

where a_ij is the similarity between the two colours represented by histogram elements i and j:

a_ij ≡ 1 − d(i, j)/d_max  (3.18)

where d_max is the maximum distance between any two colours. The difference between the QBIC

approach and the prominent colours approach is that the prominent colours approach represents

the exact central colour as opposed to the histogram bin ‘supercells’ used by the QBIC system.
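Equation 3.17 can be written directly with NumPy; in this sketch colour_dist is the K × K matrix of distances d(i, j) between the bins' representative colours, and the code is an illustration of the metric rather than QBIC's own implementation.

import numpy as np

def quadratic_form_distance(x, y, colour_dist):
    A = 1.0 - colour_dist / colour_dist.max()   # a_ij, Equation 3.18
    diff = x - y
    return float(diff @ A @ diff)               # d^2, Equation 3.17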

Gong [4] also used a non-uniform histogram based on predefined bins in the HV C colour space.

Each bin corresponds to a human identifiable colour as shown in Table 3.1. Results for the Car,

Wedding, and Bush query images using Gong’s histogram are shown in Figure 3.10. The Car search

results are relatively poor with only one car image being returned and in the third position. The

Wedding results are also quite poor as five of the ten images returned are not of the wedding.

Gong’s histogram performed better with the Bush image correctly returning the most similar bush

image. The poor performance of Gong’s histogram can be explained by a number of factors. Firstly,

the histogram only contains 11 bins, which is fewer than any of the other histograms evaluated, and

it does not use any form of bin anti-aliasing such as the fuzzy histogram technique presented in

Section 3.4 which would be a challenge to apply since the bins are not uniformly quantised in

the HV C colour space. Secondly, the bins are aligned to commonly classifiable colours which may

not allow for good discrimination between natural colours in images. Finally, Gong’s histogram is

designed to represent texture regions as opposed to entire natural images which may contribute to

its poor performance.


Table 3.1: Range of each of the colour zones used by Gong [4].

Colour Name    Hue (degrees)   Value     Chroma
Red            0 – 36          4 – 9     1.5 – 30
               36 – 64         4 – 9     15 – 30
Orange         64 – 112        4 – 8     9 – 30
Yellow         80 – 112        9 – 10    1.5 – 30
Skin Colour    36 – 64         4 – 9     1.5 – 15
               64 – 112        4 – 8     1.5 – 9
Green          112 – 196       4 – 10    1.5 – 30
Cyan           196 – 256       6 – 8     1.5 – 30
Blue           256 – 312       4 – 8     1.5 – 30
Purple         312 – 359       4 – 8     1.5 – 30
Black          –               < 3       –
Grey           –               4 – 8     < 1.5
               –               3 – 4     –
White          –               > 9       < 1.5

Figure 3.10: Results for the Car, Wedding, and Bush images using Gong’s histogram [4].


Further work can be done to improve the prominent colours approach by improving the cluster-

ing and comparison algorithms. Clustering approaches such as k-means centred clustering may be

more applicable as the number of prominent colours is known from the start. Also the comparison

technique could be further optimised and refined.

3.7 Summary

In this chapter we have presented our findings in determining solutions for extracting, representing,

and comparing colours in images within the context of a structural content-based retrieval system.

Our goals were to have colours extracted, represented, and compared efficiently whilst providing

robust results. We began with the broadly used colour histograms and improved them when small

numbers of bins are used by applying fuzzy histograms. We also presented a new prominent colours

technique which is designed to achieve the goals laid out. However, the prominent colours technique

did not perform as well as the existing Colour Set approach or as well as our new fuzzy histogram

approach. It did however perform better than a standard histogram and Gong’s colour histogram

[4] but at the expense of more complex generation and comparison algorithms.

The prominent colours approach shows promise and more work can be done in the area to

improve the selection of the prominent colours and the comparison of two prominent colour sets.

The best results came from the fuzzy histogram approach which is able to dramatically improve

results when very small numbers of bins are used, satisfying the goals laid out for this portion of

our research.


Chapter 4

Edge and Texture

Edges describe the spatial differences across an image. These differences form boundaries that allow

the human visual system to distinguish between homogeneous colour regions in an image. Simi-

larly, content-based image retrieval systems use low-level edges in higher level feature extraction

techniques such as contour extraction and texture analysis to differentiate between regions within

an image.

Edges have been used extensively for content-based image retrieval and much research has been

conducted [13, 60, 19, 4, 104]. However, many of the techniques proposed in the literature only use

very simple edge detectors [27, 19, 4]. By themselves simple edge detectors perform well against

complex edge detectors; however, they perform poorly when used for higher level feature extraction

such as contour extraction and texture analysis. Specific edge detectors have been designed to

extract features required by texture analysis [60, 105, 39] but few edge detectors have been designed

with the intent of accurate contour following. Contours are important in content-based retrieval

systems as they are one of the high-level structural representations within an image. Since the

performance of contour following is intrinsically dependent on edge detection, the primary purpose

of this chapter is to investigate edge detection techniques for contour following and to build upon

these techniques to produce an edge detector tuned for contour following that can also be used for

texture analysis. The resulting edge detector, called the Asymmetry edge detector, is able to provide

the best single pixel responses across multiple orientations compared with existing techniques.

4.1 Edge Detection

Edges form where the pixel intensity changes rapidly over a small area. Edges are detected by

centring a window over a pixel and detecting the strength of edge within the window. The result is

stored at the same pixel location. Edge responses produced by a number of common edge detectors

are shown in Figure 4.1.


Figure 4.1: Some common edge detectors applied to image (a): (b) simple difference operator, (c) Laplacian, (d) Roberts, (e) Prewitt, (f) Sobel, (g) Frei-Chen, (h) Kirsch, (i) Robinson. Each result image represents the absolute maximum magnitude at each pixel after the individual masks have been applied.


Figure 4.2: Some common edge detector masks: (a) simple difference operator, (b) Laplacian, (c) Roberts, (d) Prewitt, (e) Sobel.

Edge detection techniques often use a mask that is convolved with the pixels in the window.

A simple difference mask is shown in Figure 4.2 (a). The difference mask is directional. An edge

detector can also be non-directional such as the Laplacian or difference of Gaussians as shown in

Figure 4.2 (b). Since edges are directional and contours consist of oriented edges, we are primarily

interested in directional edge detectors.
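As a small illustration (a sketch using NumPy and SciPy, not thesis code), a directional mask such as the Sobel pair of Figure 4.2 (e) can be applied by 2-D convolution, and an edge magnitude image like those in Figure 4.1 is obtained by taking the maximum absolute response over the individual masks:

import numpy as np
from scipy.signal import convolve2d

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

def edge_magnitude(image, masks=(sobel_x, sobel_y)):
    # Convolve the image with each oriented mask and keep the maximum
    # absolute magnitude at every pixel.
    responses = [convolve2d(image, m, mode='same') for m in masks]
    return np.max(np.abs(responses), axis=0)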

Other simple, but extensively used edge detectors, include the Sobel, Roberts, and Prewitt

operators (Figure 4.2) [69]. Such operators are directional and can be used to detect orientations at

90° intervals. Other operators such as the Frei-Chen [58], Kirsch [57], and Robinson [59] operators can also be oriented at 45° intervals allowing up to 4 orientations to be detected (Figure 4.3 (a)). In contrast, the human vision system detects 18 different orientations at 10° intervals [10].

Operators that are specified by a continuous function rather than a fixed mask can be rotated

to any arbitrary orientation. The Gabor filter [60] and the Canny operator [13] (Figure 4.4) can

both be described mathematically and are two of the most advanced edge detectors as they have

a similar receptive field to the edge detectors of the human vision system [12].

Figure 4.1 shows the output of the various edge detectors discussed in this section applied to a

test image. However, it is not possible to determine a good edge detector simply by looking at its

output. Instead we must look at the design and features of an edge detector with respect to the

requirements of contour following.


Figure 4.3: The Kirsch masks [57] applied to the image in Figure 4.1 (a). The masks detect edges at 0°, −45°, −90°, and −135°.


4.2 Edge Detector Requirements

For each pixel, contour following requires the orientation and strength of each edge. Contour

following also requires highly tuned edge responses. Tuning can occur across orientations and also

across spatial locations. Figure A.8 shows the orientation tuning response curve for a simple cell

in human vision. Likewise, oriented edge detectors produce different edge responses depending on

the orientation of the edge input. The output will peak when the orientation of the edge and the

detector are aligned and will fall off as their orientations change. Since contour following will follow

the orientation with the largest strength it is important that the edge detectors are tuned tightly so

that the contour following algorithm doesn’t inadvertently follow the wrong orientation. However,

the tuning cannot be too tight as responses by two edge detectors with adjacent orientations can

be used to determine the exact orientation of an edge that lies between the two orientations.

Position tuning is also important as a contour following algorithm will also consider a neigh-

bourhood of pixels to determine the next pixel to include in the contour. If two adjacent pixels

produce a strong response then the contour following algorithm may unnecessarily create two

contours at that point rather than following the pixel that the edge is truly aligned to.

Adjacent orientation responses are used to determine the exact orientation of an edge. In the

same manner it is possible to use adjacent position responses to determine the exact position of

an edge. This process is called subpixel edge detection [106], however subpixel edge detection is

beyond the scope of this research, primarily because each stage of edge and contour processing

assumes that each edge is aligned with the centre of a pixel.

In summary, the edge detector must satisfy the following requirements:

• Produce multi-orientation output

• Orientation-tuned with only two adjacent responses generated

• Position-tuned with only one adjacent response generated

• Efficient, small window, convolution-style operator

4.3 Multi-orientation Operators

The Gabor and Canny operators are the most suitable operators for multi-orientation edge de-

tection as they are described by a continuous function (and therefore can be used to construct

multi-orientation detectors), resemble edge detectors in the human vision system, and have been

extensively investigated [60, 13, 107]. Other fixed mask operators such as the Laplacian, difference

of Gaussians, Roberts, Prewitt, and Sobel operators are not suitable because they only support 1

to 4 orientations. An additional benefit of the Gabor and Canny operators is that they are scalable

and can be used to identify edges of different resolutions.


In this research we have decided to use the S-Gabor filter proposed by Heitger et al. [12] over the

standard Gabor filter. The standard Gabor filter modulates a sine or cosine wave with a Gaussian

envelope:

G_odd(x) = e^(−x²/2σ²) sin[2πv0x]  (4.1)

G_even(x) = e^(−x²/2σ²) cos[2πv0x]  (4.2)

where σ is the bandwidth of the Gaussian envelope and v0 is the wavelength of the sine wave.

The odd Gabor filter is used for edge detection whilst the even Gabor filter can be used for line

detection. The Gaussian envelope of the Gabor filter is not able to curtail the periodic nature of the

sine or cosine wave and therefore additional fluctuations of the wave may appear at the extremities

of the filter. Since edges are a local phenomenon there is no need for a periodic wave and the

S-Gabor filter reduces the frequency of the sine wave as x increases so that only one wavelength is

present:

S_odd(x) = e^(−x²/2σ²) sin[2πv0x ξ(x)]  (4.3)

S_even(x) = e^(−x²/2σ²) cos[2πv0x ξ(x)]  (4.4)

ξ(x) = k e^(−λx²/σ²) + (1 − k)  (4.5)

where k determines the change of wavelength. The Canny operator is simpler as it does not use

periodic functions:

C(x) = (x/σ²) e^(−x²/2σ²)  (4.6)

When a multi-orientation operator is applied to an image multiple edge images are generated.

Therefore, the greater the number of orientations per edge detector, the greater the amount of memory required to store the result images and the longer it takes to generate them. For the purposes of optimisation it is beneficial for the number of orientations to be as small as possible. We have decided to use 12 orientations at 15° intervals as a compromise between

the 18 orientations of human vision and the 1 to 4 orientations offered by the fixed mask operators.

4.3.1 Multi-orientation Experiments

The S-Gabor and Canny operators were chosen because they can be used at any orientation and

resemble the receptive fields of visual cortex simple cells. The odd S-Gabor filter was constructed

in two dimensions using the following formulae:

S_odd(x′, y′) = e^(−(x′² + y′²)/2σ²) sin[2πv0 y′ ξ(x′, y′)]  (4.7)

ξ(x′, y′) = k e^(−λ(x′² + y′²)/σ²) + (1 − k)  (4.8)

where x′ and y′ are the rotated and scaled pixel co-ordinates defined below in Equations 4.10 and

4.11. The remaining parameters were adjusted to provide a filter that produces only one period of

the sine wave under the Gaussian envelope with a wavelength of 2 pixels, resulting in σ = 0.646,

v0 = 0.5, λ = 0.3, and k = 0.5.


The Canny filter was constructed in two dimensions using the following formula:

C(x′, y′) = (−y′/σ²) e^(−(x′² + y′²)/2σ²)  (4.9)

where σ = 0.35 to also provide a separation of one pixel between lobe peaks.

The filters were rotated and scaled by pre-rotating and scaling the x and y pixel co-ordinates:

x′ = [x cos(−θ) − y sin(−θ)] / s_x  (4.10)

y′ = [x sin(−θ) + y cos(−θ)] / s_y  (4.11)

where θ = nπ/12, n = 0 → 11, s_y = 1, and s_x determines the elongated aspect ratio of the filter.
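The following sketch samples one oriented Canny kernel on a pixel grid using Equations 4.9 to 4.11; the parameter names follow the text, while the kernel half-width of 5 pixels and the sampling at integer pixel offsets are our assumptions for illustration.

import numpy as np

def canny_kernel(theta, sx, sy=1.0, sigma=0.35, half_width=5):
    coords = np.arange(-half_width, half_width + 1)
    x, y = np.meshgrid(coords, coords)
    xr = (x * np.cos(-theta) - y * np.sin(-theta)) / sx    # Equation 4.10
    yr = (x * np.sin(-theta) + y * np.cos(-theta)) / sy    # Equation 4.11
    return -yr * np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) / sigma**2   # Equation 4.9

# Twelve orientations at 15 degree intervals, here with a 3:1 aspect ratio:
kernels = [canny_kernel(n * np.pi / 12, sx=3.0) for n in range(12)]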

The S-Gabor and Canny operators are very similar in shape and the similarity is shown in

their respective tuning response curves. The tuning response curves display the magnitude of the

response of the operator at different lateral positions and orientations to the edge stimulus. A

vertical black and white edge was used as the stimulus and 12 orientations of the operator were

convolved with the stimulus. Position response values were taken from the few pixels either side of

the edge whilst orientation responses were taken from each of the 12 resulting images.

In our analysis we are primarily interested in the highest frequency edges representable by the

image. These edges are formed between two adjacent pixels. Therefore the filters have a width of 2

pixels with each lobe centred on a pixel. The length of the filter must be greater than one pixel and

should be less than 10 pixels so that curves are detectable. A longer filter is desirable to filter out

noisy edges. Because there is no exact restriction on filter length we will first analyse the tuning

response curves at different lengths to determine the best length.

4.3.2 Multi-orientation Results

Figure 4.4 shows the aspect ratios of the S-Gabor and Canny operators tested. Figures 4.5 and 4.6

show the tuning response curves for the S-Gabor and Canny operators respectively at the different

aspect ratios. By comparing the graphs it can be seen that the Canny and S-Gabor operators

show very similar results (although at different aspect ratios). This can be explained by the Canny

operator being shorter than the S-Gabor operator. Since there is no difference in orientation and

position tuning between the two operators either one may be used. We have selected the Canny

operator because it requires fewer parameters.

The tuning response curves show that shorter filters provide very good position tuning but

poor orientation tuning, whilst the longer filters provide good orientation tuning but poor position

tuning at orientations slightly different to that of the edge. These tuning response curves can be

explained by visualising the overlapping of the operator lobes over a test edge (see Figure 4.7 (a)

to (d)). These scenarios indicate that whenever the edge stimulus is asymmetrical over the length

of the filter a response shouldn’t be generated. What is required is an asymmetry detector whose

response is negated from the response of the edge detector.


Figure 4.4: Filters tested: the S-Gabor operator at aspect ratios 1:1, 1.5:1, 2:1, 3:1, and 4:1; the Canny operator at 1:1, 1.5:1, 2:1, 3:1, 4:1, and 6:1; and the Canny asymmetry filter at 1.33:1, 2:1, 2.67:1, and 4:1.

4.4 Asymmetry Detector

A simple approach to identify asymmetry of edge response along the length of an edge detector

could be to simply use the same edge detector but at a 90° orientation. However, such a filter would give the same tuning responses as those in Figure 4.6 but shifted 90° and wouldn't be sufficient to nullify erroneous responses. What is required is a filter which is the same shape as the edge detector but at a 90° orientation (see Figure 4.7).

The same formula for constructing the Canny edge detector in Equation 4.9 is used for the

asymmetry filter (σ = 0.5); however, the rotation and scaling equations are modified to allow for an orthogonal orientation and aspect ratio:

x′ = 3[x cos(π/2 − θ) − y sin(π/2 − θ)] / (2 s_x)  (4.12)

y′ = [x sin(π/2 − θ) + y cos(π/2 − θ)] / s_y  (4.13)

The direction of asymmetry is not relevant so the absolute asymmetry response is subtracted

from the Canny edge detector modulated by a tuning factor t:

EA = |C| − t|A| (4.14)

where C is the response of the Canny edge detector, A is the response from the asymmetry filter,

and EA is the final edge response.
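Given the oriented edge response C and the matching asymmetry response A for one orientation (for example, images produced by convolving with kernels built as in the sketch above), Equation 4.14 reduces to a few array operations; clamping negative values to zero is an assumption about how they are treated.

import numpy as np

def asymmetry_suppressed_edge(C, A, t=2.0):
    # Equation 4.14: subtract the scaled absolute asymmetry response from
    # the absolute edge response; negative results are clamped to zero.
    return np.maximum(np.abs(C) - t * np.abs(A), 0.0)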


Figure 4.5: S-Gabor tuning response curves at aspect ratios 1:1, 1.5:1, 2:1, 3:1, and 4:1 (response against orientation, 0° to 165°, and position, S1 to S7).


[Figure: six panels of tuning response curves for the Canny operator at aspect ratios 1:1, 1.5:1, 2:1, 3:1, 4:1, and 6:1; each panel plots response against orientation (0° to 165° in 15° steps) and position (S1 to S7).]

Figure 4.6: Canny tuning response curves.



Figure 4.7: (a) to (d) Edge operator scenarios. (a) Alignment of operator with edge; (b) orientation

misalignment; (c) orientation and position misalignment; (d) position misalignment. (e) Asymmetry

detector overlaid on edge detector.

4.4.1 Asymmetry Detector Results

The tuning curves of asymmetry filters for the 3:1, 4:1, and 6:1 Canny edge detectors are shown in

Figure 4.8. The tuning curves are sufficient to nullify erroneous responses (however, aspect ratios below 3:1 were not sufficient). The result of the asymmetry edge detector with tuning t = 1

is shown in Figure 4.9 (a). With a tighter tuning parameter of t = 2 the result is a perfectly tuned

edge detector in both orientation and position (Figure 4.9 (b)).

The edge stimulus used is a perfect vertical edge aligned to one of the edge detector orientations.

To test whether the Asymmetry detector performs as well with edge orientations which are not aligned with one of the edge detector orientations, the same vertical edge was tested with the edge detector orientations at a 7.5° offset, which is halfway between the usual 15° interval between edge detector orientations. Figure 4.10 shows the results for the 7.5° offset edge detector. The Asymmetry edge detector successfully provides two identical responses for each adjacent orientation, indicating that the orientation of the edge lies exactly halfway between the two orientations.

The tuned operator appears to work well for any aspect ratio greater than or equal to 3:1.

However, because the tuned operator is inhibited by asymmetrical stimulus it may have problems

at corners (Figure 4.11). The tuning curves at corners for the three aspect ratios are shown in Figure

4.12. Figure 4.12 shows that there is no response for the edge as the edge detector approaches the

corner. Larger aspect ratio operators fall off early whilst the 3:1 aspect ratio operator falls off only

one pixel before the end of the contour. Therefore, the best operator for all scenarios is the 3:1

aspect ratio operator.


[Figure: four panels of tuning curves for the Canny asymmetry filters at aspect ratios 1.33:1 (matches 2:1), 2:1 (matches 3:1), 2.67:1 (matches 4:1), and 4:1 (matches 6:1); each panel plots response against orientation and position.]

Figure 4.8: Asymmetry tuning curves.

[Figure: two panels, "Canny Tuned 3:1" and "Canny Tuned 3:1 minus 2x Asymmetry", plotting response against orientation and position.]

Figure 4.9: Combined edge detector and asymmetry inhibitor at 3:1 aspect ratio. (a) t = 1, (b) t = 2.


[Figure: one panel, "Canny Tuned 3:1 minus 2x Asymmetry, 7.5° offset", plotting response against orientation and position.]

Figure 4.10: Tuned edge detector at 7.5° orientation offset.

The one pixel fall off in edge response before the end of the contour may affect contour extraction

as the contours extracted will not include the last pixel of the corner. However, losing one pixel

before the end of a contour appears to be a fair trade off for the improved orientation and position

tuning gained. In addition, contour-end detection and vertex extraction which are investigated in

the following chapter would be able to identify the corner and higher level processing stages would

be able to link the vertex to the edges.

Figure 4.13 shows how the Asymmetry detector compares with the standard Canny detector

for sample test images. The Asymmetry edge detector results of Figure 4.13 (c) show tighter

positional tuning than the Canny edge detector results of Figure 4.13 (b). The orientation tuning performance is not as easily seen in a single aggregate image; however, the impact of improved orientation tuning in the Asymmetry detector can be seen in later stages of processing, as indicated by the thinned Canny and Asymmetry responses of Figure 4.13 (d) and (e) respectively.

The thinned Asymmetry edges using the thinning technique discussed in the next section contain

fewer spurious responses than the thinned Canny edges.

4.5 Thinning

Using the Asymmetry edge detector developed in the previous sections the edge responses should

be tightly tuned in both orientation and position. However, it is still possible that an edge may

generate responses over a number of positions because its wavelength is greater than that of the edge detector. Therefore it is still necessary to perform some thinning on the edge responses to reduce contours to one pixel in thickness, which is required by the contour following algorithm.

We are only interested in thinning along the direction of a contour. Current thinning techniques

such as morphological thinning and skeletonisation ignore the direction of a contour. As a result

thinning will occur in all directions. Figure 4.15 (a)-(d) shows the results of thinning the cube responses of Figure 4.14 using the skeletonisation and morphological thinning algorithms.


Figure 4.11: Possible problem when tuned edge detector is placed over a corner.

[Figure: four panels of corner tuning curves for the tuned Canny operator at aspect ratios 2:1, 3:1, 4:1, and 6:1; each panel plots response against position (1 to 12) and orientation.]

Figure 4.12: Corner tuning curves.



Figure 4.13: Canny and Asymmetry edge responses for the Chapel, Plane, and Claire images. (a)

Sample images; (b) Results after applying the Canny edge detector; (c) Results after applying the

Asymmetry edge detector; (d) Thinned Canny edges; (e) Thinned Asymmetry edges.



Figure 4.14: (a) Cube test image; (b) asymmetry edge detector responses.


Thinning can be applied to either the individual orientation responses or to the aggregate

edge responses. However, as can be seen in Figure 4.15 (a)-(d), skeletonisation and morphological

thinning techniques ignore the direction of the contour responses and any interaction between

different orientations. Therefore a new thinning process has been developed which only thins along

the direction of a contour whilst taking into account adjacent orientations.

Morphological thinning approaches process a small neighbourhood of an image, for example a

3×3 neighbourhood of pixels. In non-directional techniques the goal is to remove any pixel adjacent

to another which lies on an edge. In directional techniques the approach is similar but pixels are

only considered adjacent along the perpendicular to the orientation of the edge response (Figure

4.16). Morphological approaches work well for edge responses that are aligned to the horizontal,

vertical, and diagonal layout of pixels. Thinning occurs by removing a pixel if two pixels are found

to be in adjacent locations. Which pixel is removed depends on the depth of the image. For binary

images there is often an iterative process where the pixels lying on the edge of the region are

removed first and the process stops when no more pixels are removed. For greyscale images, the

magnitude of the edge response can be used to determine which pixel will be removed. Usually the

pixel with the lesser magnitude is removed.

Using a neighbourhood aligned to pixel positions becomes less useful when working with more than four orientations because the positions of adjacent responses no longer align to the centre of existing pixels (Figure 4.16 (c) and (d)). In fact this is also true even for 45° orientations because the distance between diagonally adjacent pixel centres is greater than the distance between the centres of horizontally and vertically aligned pixels. Therefore if more than the horizontal and vertical orientations are to be used for thinning then a more sophisticated technique is required to determine neighbourhood responses.



Figure 4.15: (a) Skeletonisation of the aggregate edge responses; (b) aggregate of the skeletonisation

of individual orientation edge responses; (c) morphological thinning of the aggregate edge responses;

(d) aggregate of the morphological thinning of individual orientation edge responses; (e) Gaussian

thinning; (f) diagonal removal.



Figure 4.16: Positions of perpendicularly adjacent responses used for thinning. (a) Vertical, (b) horizontal, (c) 45°, and (d) 15°.


4.5.1 Gaussian Thinning

The problem facing morphological techniques is a sampling problem where the sampling no longer

occurs at pixel centres. To solve the sampling problem we have created three elongated Gaussian

filters that sample at three positions orthogonal to the orientation of the edge (see Figure 4.17).

The distance between each filter remains constant regardless of the orientation, thereby solving

the sampling problem. The outputs from the three filters are then used to thin laterally along the

orientation.

The three Gaussian filters are based on the two-dimensional Gaussian envelope:

G = e^(−(x′² + y′²)/(2σ²))    (4.15)

where σ is the bandwidth of the envelope and is set to 0.5, and x′ and y′ are the scaled, translated,

and rotated pixel co-ordinates:

x′ = [x cos(−θ) − y sin(−θ)] / sx    (4.16)

y′ = [x sin(−θ) + y cos(−θ) + ty] / sy    (4.17)

where θ is the orientation of the elongated Gaussian filter ranging in 15° increments from 0° to 165°, sx and sy determine the shape of the filter and are set to sx = 4 and sy = 1, and ty

determines the lateral translation of the elongated Gaussian filter and has the values (−1, 0, 1) for

the three lateral filters centring each Gaussian filter one pixel from the centre pixel in an orthogonal

direction from the orientation of the edge.
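A minimal Java sketch of how the three lateral filters of Equations 4.15 to 4.17 might be sampled on a discrete grid is given below; the kernel radius and the absence of any normalisation are assumptions of the sketch.

public final class LateralGaussianFilters {

    // Builds one elongated Gaussian kernel for orientation theta (radians) and lateral offset ty
    // (-1, 0 or 1), following Equations 4.15-4.17 with sigma = 0.5, sx = 4 and sy = 1.
    // The kernel radius of 5 pixels is an assumption made for this sketch.
    public static double[][] kernel(double theta, double ty) {
        final double sigma = 0.5, sx = 4.0, sy = 1.0;
        final int radius = 5;
        double[][] k = new double[2 * radius + 1][2 * radius + 1];
        for (int y = -radius; y <= radius; y++) {
            for (int x = -radius; x <= radius; x++) {
                double xp = (x * Math.cos(-theta) - y * Math.sin(-theta)) / sx;      // Equation 4.16
                double yp = (x * Math.sin(-theta) + y * Math.cos(-theta) + ty) / sy; // Equation 4.17
                k[y + radius][x + radius] =
                        Math.exp(-(xp * xp + yp * yp) / (2 * sigma * sigma));        // Equation 4.15
            }
        }
        return k;
    }
}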

An edge response is cleared if either of the two lateral Gaussian samples in the same orientation or any of the four lateral samples in adjacent orientations is greater than the Gaussian sample centred at the current pixel, that is, if any of the following are true:

G−1(θ) > G0(θ)    (4.18)

G1(θ) > G0(θ)    (4.19)

G−1(θ + π/12) > G0(θ)    (4.20)

G1(θ + π/12) > G0(θ)    (4.21)

G−1(θ − π/12) > G0(θ)    (4.22)

G1(θ − π/12) > G0(θ)    (4.23)

Figure 4.17: Position of Gaussian filters used for thinning.

Figure 4.18: Potential double pixel lines after Gaussian thinning.

where G−1, G0, and G1 are the three lateral Gaussian samples and θ is the orientation of the

elongated Gaussian filter. This first set of criteria thins laterally within and across adjacent orientations but does not perform orientation competition at the centre pixel.

Orientation competition is performed by preserving the largest two adjacent edge responses in

a local neighbourhood along the orientation axis. Two adjacent edge responses are preserved so

that the true orientation of the edge can be interpolated. To be preserved, the current orientation

Gaussian response must be greater than or equal to the two adjacent orientation responses:

G0(θ) ≥ G0(θ ± π/12)    (4.24)

or, the current orientation may have a greater adjacent orientation but it must be greater than the

responses adjacent to these two, that is:

G0(θ) < G0(θ − π/12) and G0(θ) > G0(θ + π/12) and G0(θ) > G0(θ + 2π/12)    (4.25)

or:

G0(θ) < G0(θ + π/12) and G0(θ) > G0(θ − π/12) and G0(θ) > G0(θ − 2π/12)    (4.26)
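The two sets of criteria can be combined into a single per-pixel, per-orientation test, sketched below in Java. It assumes the lateral Gaussian samples have already been computed into an array g[o][i] for the current pixel (o indexing the 12 orientations and i = 0, 1, 2 corresponding to ty = −1, 0, +1), and that the orientation index wraps around; both are assumptions of the sketch.

public final class GaussianThinning {

    // g[o][i] holds the lateral Gaussian samples at the current pixel for orientation index o
    // (12 orientations in 15 degree steps) and lateral offset i = 0, 1, 2 meaning ty = -1, 0, +1.
    // Returns true if the edge response at orientation o should be kept.
    public static boolean keep(double[][] g, int o) {
        int n = g.length;
        int prev = (o + n - 1) % n;   // adjacent orientation theta - pi/12 (wrap-around assumed)
        int next = (o + 1) % n;       // adjacent orientation theta + pi/12
        int prev2 = (o + n - 2) % n;
        int next2 = (o + 2) % n;
        double centre = g[o][1];

        // Lateral suppression (Equations 4.18-4.23): cleared if any lateral sample in the same or
        // an adjacent orientation exceeds the centred sample.
        if (g[o][0] > centre || g[o][2] > centre) return false;
        if (g[next][0] > centre || g[next][2] > centre) return false;
        if (g[prev][0] > centre || g[prev][2] > centre) return false;

        // Orientation competition (Equations 4.24-4.26): preserve the largest two adjacent responses.
        if (centre >= g[prev][1] && centre >= g[next][1]) return true;                        // Eq 4.24
        if (centre < g[prev][1] && centre > g[next][1] && centre > g[next2][1]) return true;  // Eq 4.25
        if (centre < g[next][1] && centre > g[prev][1] && centre > g[prev2][1]) return true;  // Eq 4.26
        return false;
    }
}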

The result of applying the Gaussian thinning technique is shown in Figure 4.15 (e) where it can

be seen that the edges are successfully thinned along the orientation of the edges. There is still one



Figure 4.19: Diagonal removal.

problem with this technique in that it is not able to reduce the 45° orientation edge responses to a one pixel thick line (see Figure 4.18). This is because the resulting two pixel line contains very little overlap in the perpendiculars, so the existing two pixels aren't compared with each other. The technique for thinning the 45° orientations is shown in Figure 4.19. If both positions of a diagonal are occupied in a 2×2 block then the other two positions are removed. If not then the reverse is

checked to see if the first diagonal should be removed. The values in adjacent orientations are also

checked. The result after removing diagonals is shown in Figure 4.15 (f). Compared with Figure

4.15 (a)-(d), Gaussian thinning produces thinner lines and conforms to the original orientations of

the edge responses.
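A sketch of the 2×2 diagonal check, applied to a single binary orientation map, might look as follows; collapsing the adjacent-orientation checks into a single map is a simplification made for the sketch.

public final class DiagonalRemoval {

    // edge[y][x] is a binary map of thinned responses for one of the 45 degree orientations.
    // If both pixels on one diagonal of a 2x2 block are set, the other two pixels are cleared
    // (step 1); otherwise the opposite diagonal is checked (step 2). The adjacent-orientation
    // checks described in the text are omitted from this sketch.
    public static void removeDiagonals(boolean[][] edge) {
        for (int y = 0; y + 1 < edge.length; y++) {
            for (int x = 0; x + 1 < edge[y].length; x++) {
                if (edge[y][x] && edge[y + 1][x + 1]) {        // first diagonal occupied
                    edge[y][x + 1] = false;
                    edge[y + 1][x] = false;
                } else if (edge[y][x + 1] && edge[y + 1][x]) { // otherwise check the reverse diagonal
                    edge[y][x] = false;
                    edge[y + 1][x + 1] = false;
                }
            }
        }
    }
}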

4.5.2 Gaussian Thinning Results

Results of Gaussian thinning are shown in Figure 4.13 (d) and (e) applied to Canny and Asymmetry

edge responses respectively. The edges extracted are successfully thinned along the orientation of

the contours. The figure also demonstrates the benefits of using the Asymmetry detector for higher

level edge processing such as thinning. The thinned Asymmetry responses contain fewer spurious

edges than the thinned Canny responses showing that the multi-orientation Gaussian thinning

technique performs better with tightly tuned orientation and position edge detectors.

4.6 Asymmetry Edge Detector as a Computational Model of the Visual Cortex

In Section 2.6 computational models of the visual cortex were presented. These models are designed to validate vision processing theories rather than to be efficient edge detectors for use in CBVR applications. The asymmetry edge detector presented in this chapter is also motivated by the architecture of the visual cortex but is designed to be used in CBVR and other image processing applications. Figure 4.20 shows the asymmetry edge detector and thinner in the context

of vision processing in the visual cortex. Both the Canny edge detector and asymmetry detectors

are represented as simple cells with the output of the asymmetry detector inhibiting the Canny edge detector.


[Figure: block diagram of the model. Photoreceptors (RGB image) feed two simple cell stages (Canny 3:1 and Asymmetry 2:1, the latter providing inhibition), followed by Gaussian thinning (orientation and spatial competition) and diagonal removal (spatial competition).]

Figure 4.20: Asymmetry edge detector model of the visual cortex.

The Gaussian thinning stage represents both orientation and spatial competition in

the visual cortex whilst the remove diagonals stage represents spatial competition between simple

cells. The asymmetry edge detector differs from other models such as Marr’s [56] and Grossberg’s

[94] as it does not attempt to model the non-directional ganglion and LGN cells. It is also a purely

feed-forward implementation resulting in a simpler architecture and faster execution. Higher-level

stages of the model such as edge linking and end-stopped detection are discussed in the following

chapter.

4.7 Texture Inhibition

In this chapter edge detection techniques have been presented that can detect boundaries between

regions of homogeneous colour. Detecting boundaries between regions of heterogeneous colour,

such as texture, is more complex because local edges are also formed within the regions. Consider

Figure 4.21 (a), for example: even though the different textures are easily distinguishable by the

human brain there are no contours formed by a consistent change in homogeneous colour, as can be

seen by the lack of edge response along the texture borders in Figure 4.21 (b). Therefore the edge

techniques presented in this chapter alone are not enough to identify boundaries between regions

of texture.

Identifying texture boundaries is crucial for higher-level processing of contours. Since textures

consist of contours, a contour processing stage will process all of the contours within the texture,

which is unnecessary as these contours do not represent boundaries. Therefore it is beneficial to

inhibit texture regions before higher-level processing such as contour extraction occurs. Identifying

texture regions can be difficult as any occurrence of contours could be considered texture. Therefore rather than simply identifying textures we present a technique that identifies the boundaries

between textures, which would also include non-textural contours. Higher level processes will only

process contours that lie within texture boundaries.


Figure 4.21: (a) A composite of Brodatz textures D9, D38, D92, and D24 histogram equalised [108],

(b) Edge responses of composite texture image, (c) Moving average of maximum edge responses.

4.7.1 Psychological and Perceptual Basis

Through intensive psychological studies Tamura et al. [39] found that humans group textures along three dimensions: coarseness, contrast, and directionality. Coarseness refers to the size of

the repeating pattern, contrast refers to the overall ratio between darkness and lightness in the

texture, and directionality refers to the orientation of the texture. A similar study conducted by

Rao and Lohse [68] found that humans grouped patterns by repetitiveness, directionality, and

complexity. Once again repetitiveness refers to the scale of the pattern and directionality refers to

the orientation of the texture. However, the third texture dimension of complexity refers to how

ordered the placement of the texture patterns are. The complexity could also be considered as

noise.

The first challenge is whether the edge responses of the Asymmetry edge detector are sufficient

to represent the three dimensions of texture. Since the primary component of the Asymmetry edge

detector is the Canny operator, which is similar to a Gabor filter, the edge detector is able to filter

spatial frequencies in a similar way to a wavelet. Therefore, the edge detector is able to detect

Tamura’s coarseness [39] or Rao and Lohse’s repetitiveness [68] which is essentially the spatial

frequency of the texture. Since the edge detector is also oriented, elongated, and uses an asymmetry

inhibitor to fine tune the orientation response, the edge detector is quite capable of representing

the orientation of a texture. Tamura’s contrast can also be represented by the amplitude of the

edge detector response since the edge detector responds to spatial changes which also affect the

contrast of the texture. The component that the edge detector does not represent directly is Rao

and Lohse’s complexity. However, the complexity of the texture is implicit in the location of the

edge responses. Therefore further processing of the edge responses is required to determine the

complexity of the texture. However, our goal is not so much to simply extract the features of the

texture but more importantly to define the spatial extent of a texture and the boundaries between

textures.



Figure 4.22: (a) Patch-suppressed cell; (b) Abutting grating stimulus.

There is some basis for the inhibition of edge responses through texture detectors in human

vision research. Sillito et al. [109] found a majority of cells (33/36) in V1 where the response

was suppressed by an increasing diameter of a circular patch of drifting sinusoidal grating. These

cells are known as patch-suppressed cells. They found that a small disk grating or a large disk

grating with an empty centre will evoke a response but not when both are combined. Therefore,

larger areas of dense edge responses will be inhibited. Sillito et al. [109] also performed cross-

correlation experiments on pairs of cells that were cross-oriented (had preferred stimuli that were approximately 90° to each other). They found a high correlation between cross-oriented simple cells when the stimulus had inner and outer gratings at 90° to each other (see Figure 4.22 (a)),

suggesting functional connectivity. Larkum et al. [110] found pyramidal neurones in layer 5 which

fired if both distal and proximal dendrites received input but not if either alone were activated.

Therefore, larger areas of dense edge responses are inhibited, but only if they do not border another

area of dense edge responses which ideally have a perpendicular orientation. Grosof et al. [111] have

also found cells in V1 which respond to the illusory contour formed at the end of abutting gratings

which are different to the cells found in V2 by Soriano et al. [112] which respond to more general

types of illusory contours. The abutting grating stimulus (see Figure 4.22 (b)), which is essentially

the boundary of a texture, shows that the edge boundary between textures is detected early on in

the visual pathway.

Some textures do not have clearly defined boundaries and segregation is dependent on higher

level processing. One example is that texture elements with differing numbers of line ends are easier

to segregate than those with the same number of terminations [113]. Psychophysical experiments

performed by Beck et al. [114] found that the strength of segregation depended on the contrast

and size difference of texture elements. The size difference can also be represented as a contrast

difference, hence the perception can be explained solely through contrast. They also found that

hue can have the same effect but only if the texture element and background are of the same luminance. Beck et al. [114] were able to simulate the psychophysical results using bandpass filters. It may appear possible that the oriented bandpass filters of the primary visual cortex can perform texture segregation. Based on the results of Beck et al. [114] this appears possible, however neurophysiological recordings have found that global segregation does not occur at this stage [115]. It


is possible that texture segregation can occur at a number of levels which provides a basis for the

low-level approach for processing texture boundaries taken in this chapter.

4.7.2 Texture Identification

Areas of texture need to be identified so that they do not interfere with the extraction of re-

gion boundaries. However, the boundary between two textures should also be considered a region

boundary. Therefore, a technique is required that identifies areas of texture but does not consider

the boundaries between textures as texture. An area of image consists of texture if it contains a

repeating pattern of contours. Therefore the first characteristic of a texture is that it consists of

a uniform spatial distribution of contours. The smallest unit of a repeating pattern is the texture

element, also known as a texton [116]. The distribution of contours within the texture element

does not need to be uniform; however, there must be some uniformity in the distribution of texture elements. Uniformity of distribution can be represented by the moving average of the edge

responses. Changes in the moving average reflect a change in spatial density of edge responses

within a window. The window of the moving average must be equal or greater than the size of the

texture element. For this research we have chosen a window size of 32 pixels wide and high.

Using the composite image formed from the four Brodatz textures of Figure 4.21 (a) the first

step is to extract the edge responses. The edge responses consist of 12 images representing each

15° orientation. Figure 4.21 (b) shows the maximum response from all orientations for each pixel.

Applying a moving average to the maximum edge responses produces the image in Figure 4.21 (c).

Unfortunately, applying a moving average to the maximum edge responses does not reveal much

change between the textures. This is because the textures of Figure 4.21 (a) have a relatively similar

edge density. However, the shapes of the texture elements are different and should be revealed by

processing edge orientations individually.

Figure 4.23 (a) shows the moving average applied to each orientation individually. The results

are multiplied by a factor of 10 to make the differences more visible. The differences between the

four textures begin to be revealed when the orientations are processed individually. This approach is

similar to the bandpass filters used by Beck et al. [114] to simulate visual cortex texture segregation.

A problem with the moving average approach is that the square window produces rectangular

artefacts in the average responses. This is caused by the moving average function giving every

pixel equal weighting, even those on the border of the window. The rectangular artefacts can be

removed by using a window with a Gaussian envelope where pixel weighting decreases as the radius

increases from the centre of the window. A two dimensional Gaussian filter with a bandwidth (σ)

of 10 pixels was used in place of the moving average function.

f(x, y) = e^(−(x² + y²)/(2σ²))    (4.27)

Since the convolution of the Gaussian filter with the edge responses can be applied in the Fourier


domain, the processing time is considerably less than the moving average approach. The results of

applying the Gaussian filter to the edge responses are shown in Figure 4.23 (b). The rectangular

artefacts are now removed, however the texture borders are less defined. Even so, the Gaussian

moving average of the oriented edge responses is able to detect areas of consistent texture.
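A direct (non-Fourier) Java sketch of the Gaussian moving average applied to one orientation response is given below; the kernel radius of 3σ and the normalisation by the total window weight are assumptions of the sketch.

public final class TextureAverage {

    // Smooths a single orientation response with the Gaussian window of Equation 4.27.
    // A direct spatial convolution is used here for clarity; an FFT-based convolution is faster.
    public static double[][] gaussianAverage(double[][] response, double sigma) {
        int radius = (int) Math.ceil(3 * sigma);   // kernel radius is an assumption of this sketch
        int h = response.length, w = response[0].length;
        double[][] out = new double[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double sum = 0, weight = 0;
                for (int dy = -radius; dy <= radius; dy++) {
                    for (int dx = -radius; dx <= radius; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy < 0 || yy >= h || xx < 0 || xx >= w) continue;
                        double g = Math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma));
                        sum += g * response[yy][xx];
                        weight += g;
                    }
                }
                out[y][x] = sum / weight;  // normalised weighted average within the window
            }
        }
        return out;
    }
}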

4.7.3 Texture Edges

Even though the Gaussian moving average approach is able to successfully identify texture regions

it does not identify borders between textures. The borders between textures must be identified so

that they are not included in the texture areas that will inhibit higher level contour processing.

With the Gaussian moving average approach textures are represented by areas with similar moving

average values. Since the moving average is applied to each orientation, differences between textures

containing texture elements that vary by shape can also be identified. Textures that exhibit a strong

orientation will distribute most of the edge responses in one orientation, such as the top right

hand texture of Figure 4.21 (a). However, textures with multiple orientations will distribute edge

responses across multiple orientations. Nonetheless, differences in shape between textures can still

be identified in the individual orientation responses, as can be seen in Figure 4.23 (b). Therefore,

a texture border will occur when there is no consistency of oriented texture within a region. The

lack of consistency can be represented by the variance (σ2) of moving average responses within a

window.

σ² = Σ (x − µ)²    (4.28)

A window of 32× 32 pixels was used to compute the variance. The individual variance images

for each orientation are then summed to produce the final image which is shown in Figure 4.24 (a).

The final image clearly shows the borders between the top right texture and the other textures but

only partially represents the bottom and left borders. The variance of moving averages of the edge

responses is similar to the patch-suppressed cells of the human visual cortex reported by Sillito et

al. [109] in that large areas of similar edge responses will be inhibited unless there is variance in

the edge responses over the area.
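A Java sketch of the windowed variance computation, summed over the per-orientation moving-average images, is given below; centring the window on each pixel and the simple border handling are assumptions of the sketch.

public final class TextureBoundary {

    // Computes the variance (Equation 4.28) of the moving-average responses inside a window
    // centred on each pixel, and sums the per-orientation variance images into one boundary image.
    // avg[o][y][x] holds the moving-average response for orientation o; window = 32 in this research.
    public static double[][] varianceOfAverages(double[][][] avg, int window) {
        int orientations = avg.length, h = avg[0].length, w = avg[0][0].length;
        int half = window / 2;
        double[][] boundary = new double[h][w];
        for (int o = 0; o < orientations; o++) {
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    double sum = 0, sumSq = 0;
                    int count = 0;
                    for (int dy = -half; dy < half; dy++) {
                        for (int dx = -half; dx < half; dx++) {
                            int yy = y + dy, xx = x + dx;
                            if (yy < 0 || yy >= h || xx < 0 || xx >= w) continue;
                            double v = avg[o][yy][xx];
                            sum += v;
                            sumSq += v * v;
                            count++;
                        }
                    }
                    double mean = sum / count;
                    boundary[y][x] += sumSq - count * mean * mean;  // equals the sum of (x - mu)^2
                }
            }
        }
        return boundary;
    }
}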

Since the variance computation also uses a square window similar to the moving average computation it was investigated whether using a Gaussian mask for the variance computation would

improve the results. The computation of µ remains the same however the squared difference of

(x − µ)2 is multiplied by the corresponding Gaussian mask before adding to the variance value.

The results of the Gaussian mask are shown in Figure 4.24 (b) and do not appear to provide

a significant improvement over the square variance approach. Minor differences between the two

images are mainly due to the Gaussian mask being slightly larger than the square window.


Figure 4.23: (a) Moving average applied to individual orientations, (b) Gaussian filter with bandwidth of 15 pixels applied to individual orientations.



Figure 4.24: (a) Variance of moving average, (b) Gaussian variance of moving average.

4.7.4 Texture Noise

The edge responses used to identify texture and texture borders in the last few sections primarily

represent the shape of the texture. The results of the variance computation in Figure 4.24 show

that the shape information alone is not enough to distinguish between textures. The three di-

mensions identified by Rao and Lohse [68] were repetitiveness, directionality, and complexity. The

directionality is represented by the oriented edge responses. However, the edge responses do not

provide a direct indication of the complexity of the texture.

Francos et al. [36] used the Wold decomposition to decompose textures into harmonic and

indeterministic components. The Wold components also relate to the components identified by Rao

and Lohse [68] where the harmonic represents repetitiveness and the indeterministic component

represents complexity. By extending the Wold decomposition into two dimensions Francos et al. [36]

also included a new component called the evanescent component which represents the orientation

of texture. Francos et al. [63] used the auto-regressive moving average (ARMA) model to isolate

the indeterministic component. However, any noise model can and has been used such as moving

average (MA), auto-regressive (AR) [62], simultaneous auto-regressive (SAR) [61], multi-resolution

SAR (MRSAR) [64], Gauss-Markov, and Gibbs [65] models. The SAR model is an instance of

Markov random field (MRF) models [64]. Mao and Jain [64] used SAR and MRSAR models to

perform texture classification and segmentation. In this section we also investigate using the SAR

model for the purpose of identifying boundaries between textures.


SAR Model

The SAR model is as follows [64]:

g(s) = µ + Σ_{r∈D} θ(r)g(s + r) + ε(s)    (4.29)

where g(s) is the grey level value of a pixel at site s = (s1, s2), D is the set of neighbours at

site s which usually consists of the eight adjacent pixels, ε(s) is an independent Gaussian random

variable with zero mean and variance σ2, θ(r), r ∈ D are the model parameters characterising the

dependence of a pixel to its neighbours, and µ is the bias which is dependent on the mean grey

value of the image.

Texture representation using the SAR model involves determining the parameters µ, σ, and

θ(r), r ∈ D. For a symmetric model where θ(r) = θ(−r), all model parameters can be estimated using the least squares error (LSE) technique or the maximum likelihood estimation (MLE) method.

Mao and Jain [64] used the LSE technique because it is less time consuming and yields very similar

results to the MLE method.

SAR Implementation

Since more than one variable needs to be determined multiple regression must be used over simple

linear regression. The challenge with the SAR model is to choose an appropriate window size. In

this research the window size will be kept consistent at 32× 32 pixels. For each window, multiple

regression is used to determine the relationship between every pixel in the window and its eight

immediate neighbours. Multiple regression is usually solved using matrices, so Equation 4.29 must be rewritten in matrix form:

Y = Xβ + ε (4.30)

Given that n is the number of pixels within a window and p is the number of neighbouring predictors for each pixel, Y is the n × 1 matrix of grey level values within the window, X is the n × p matrix of predictors within the window, that is, each row contains the neighbour values for one pixel,

β is a p× 1 matrix containing the parameters θ(r), and ε is a n× 1 matrix of random disturbances

for each pixel. Solving equation 4.30 for β involves isolating the β matrix which is shown in the

following equation:

β = (X′X)⁻¹X′Y    (4.31)
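A Java sketch of this least squares estimation for a single window is given below. It accumulates the normal equations X′X and X′Y directly rather than forming X explicitly, ignores the bias term µ (the window is assumed to be mean-subtracted beforehand), and uses a plain Gaussian elimination in place of a library solver; these are all assumptions of the sketch.

public final class SarParameters {

    // Offsets of the eight neighbours used as predictors (the set D).
    private static final int[][] D = {
            {-1, -1}, {0, -1}, {1, -1}, {-1, 0}, {1, 0}, {-1, 1}, {0, 1}, {1, 1}};

    // Estimates theta(r) for the window whose top-left corner is (wx, wy) using the least squares
    // solution beta = (X'X)^-1 X'Y of Equation 4.31.
    public static double[] estimate(double[][] image, int wx, int wy, int window) {
        int p = D.length;
        double[][] xtx = new double[p][p];
        double[] xty = new double[p];
        for (int y = wy + 1; y < wy + window - 1; y++) {       // stay one pixel inside the window
            for (int x = wx + 1; x < wx + window - 1; x++) {   // so all eight neighbours are available
                double[] row = new double[p];
                for (int j = 0; j < p; j++) row[j] = image[y + D[j][1]][x + D[j][0]];
                for (int j = 0; j < p; j++) {
                    for (int k = 0; k < p; k++) xtx[j][k] += row[j] * row[k];
                    xty[j] += row[j] * image[y][x];
                }
            }
        }
        return solve(xtx, xty);  // beta = (X'X)^-1 X'Y
    }

    // Plain Gaussian elimination with partial pivoting; any linear algebra library could be used.
    private static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int i = 0; i < n; i++) {
            int pivot = i;
            for (int r = i + 1; r < n; r++) if (Math.abs(a[r][i]) > Math.abs(a[pivot][i])) pivot = r;
            double[] tmpRow = a[i]; a[i] = a[pivot]; a[pivot] = tmpRow;
            double tmp = b[i]; b[i] = b[pivot]; b[pivot] = tmp;
            for (int r = i + 1; r < n; r++) {
                double f = a[r][i] / a[i][i];
                for (int c = i; c < n; c++) a[r][c] -= f * a[i][c];
                b[r] -= f * b[i];
            }
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = b[i];
            for (int c = i + 1; c < n; c++) s -= a[i][c] * x[c];
            x[i] = s / a[i][i];
        }
        return x;
    }
}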

SAR Optimisation

The SAR parameter calculations can be computationally expensive. For a window size of 32 pixels

and a neighbourhood of 8 pixels, 32 × 32 × 9 × 9 = 82,944 operations are performed per pixel. For

an image with 256× 256 pixels, 5,435,817,984 computations are required resulting in a processing

time of 14 minutes when implemented in Java on a 400MHz PC. In statistics, Equation 4.31 is


Figure 4.25: The SAR moving window effect on the X ′ matrix.

often optimised using the QR decomposition. However, we investigated an algorithmic approach

for optimisation.

To improve performance, advantage was taken of the fact that a moving window is used to

compute the SAR values. Each subsequent window along the x axis will contain all of the values

of the previous window minus the values in the left column and plus a new column of values for

the right column. This effect can be visualised by looking at X ′. For this example, assume that

the window size is only 16 × 16 pixels. X ′ becomes a matrix with 256 columns and 9 rows. The

256 columns can be divided into groups of 16 columns which represent one column in the original

image window (see Figure 4.25). Since each column in the window is represented by a series of

columns in X ′, when the window moves one pixel to the right, the columns in X ′ which were used

to represent the far left column can be overwritten with the values from the new right column in

the window.

Replacing a section of values in X and X ′ allows an optimisation in the computation of X ′X

to take place. X ′ is a relatively wide matrix and X is a relatively tall matrix, multiplying the two

together results in a small square matrix. Each element (i, j) in the result matrix is calculated by

summing the product of corresponding elements from row j in X ′ and column i in X. When the

window is shifted to the right only the summed product of the old column needs to be subtracted

from the result matrix and the summed product of the new column added in. This results in only

two sets of summed products per pixel rather than the window size, which is 32 in this case.

The number of computations per pixel is reduced to 2 × 32 × 9 × 9 = 5184 and the number of

computations for a 256× 256 image is reduced to 339,738,624. The execution time is reduced from

14 minutes to 2.5 minutes.
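The idea behind the optimisation can be sketched as follows: for every image column a running sum of the 9 × 9 outer products of the predictor vectors in that column is maintained, and sliding the window one pixel to the right then only subtracts the column that leaves the window and adds the column that enters it. The data layout and method name below are illustrative assumptions; an analogous update applies to X′Y and, as described above, to the running column sums themselves as the window moves down the rows.

public final class SlidingNormalEquations {

    // colSum[c][j][k] holds the sum, over the rows of image column c within the current window
    // height, of x_j * x_k, where x is the 9-element predictor vector of a pixel.
    // Sliding the window one pixel to the right updates X'X in O(p*p) operations instead of
    // recomputing the full window.
    public static void slideRight(double[][] xtx, double[][][] colSum,
                                  int leavingCol, int enteringCol) {
        int p = xtx.length;
        for (int j = 0; j < p; j++) {
            for (int k = 0; k < p; k++) {
                // subtract the contribution of the column that leaves the window and
                // add the contribution of the column that enters it
                xtx[j][k] += colSum[enteringCol][j][k] - colSum[leavingCol][j][k];
            }
        }
    }
}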


The same optimisation can be applied to the X′Y matrix multiplication which results in a 9 × 1

matrix. Before the optimisation, the computation of X ′Y requires 32×32×9×1 = 9216 operations

which is reduced to 2× 32× 9× 1 = 576 operations after the optimisation.

The optimisation can be taken even further storing the summed products of the previous

columns rather than recomputing them for every new column that is added. This halves the

number of operations to compute X ′X and X ′Y resulting in 32× 9× 9 = 2592 and 32× 9 = 288

operations per pixel respectively.

Finally, the same optimisation can be applied as the window moves down rows in the source

image. As the window moves along the x axis the new column can be computed by using the

summed multiplication for the same column in the previous row and subtracting the top pixel and

adding the new pixel. This reduces the number of operations per pixel to 9 × 9 = 81 to compute

X ′X and 9 to compute X ′Y . For pixels where x >= 1 and y >= 1 the number of computations is

independent of the window size. The only additional overhead is the additional memory required

to store the summed products of previous pixels and rows.

For a 256 × 256 image and a window size of 32 × 32 pixels 90 computations are required for

255 × 255 pixels resulting in 5,852,250 computations. 32 × 32 × 9 × 9 = 82944 computations are

required for the first pixel and 32× 9× 9 = 5184 computations are required for the remaining 255

pixels in the first row. The first pixel in each subsequent row also requires 32 × 9 × 9 = 5184 computations. Therefore the total computations have been reduced to 8,579,034 from

5,435,817,984, a reduction factor of 633.

SAR Application

Using the multiple regression technique presented in the previous section the eight parameters were

determined for each pixel. We weren’t interested in the average (µ) or variance (σ2) as these have

already been computed in the previous sections. Applying the SAR model with a window size of

32× 32 pixels to the test image of Figure 4.21 (a) produced the eight parameter images of Figure

4.26. The SAR parameters show the distinction between the four textures. However, due to the

square window of the LSE technique rectangular artefacts are also produced.

The variance technique of Section 4.7.3 was applied to the SAR images resulting in Figure

4.27 (a). The results are similar to Figure 4.24; however, some border responses are slightly complementary. Adding the deterministic component (oriented edge responses) to the indeterministic

component (SAR model parameters) results in the combined texture edges image of Figure 4.27

(b). The combined result is slightly better than either individual result.


Figure 4.26: The SAR parameters of Figure 4.21 (a).

4.7.5 Texture Inhibition

The purpose of identifying texture regions and texture borders is to inhibit contours. The texture

edges image of Figure 4.27 (b) is subtracted from the edge response image of Figure 4.21 (b) to

produce Figure 4.27 (c). The resulting image shows that contours within texture areas are largely

inhibited whilst contours near texture borders are not inhibited. Unfortunately the current technique of using the variance of SAR parameters and oriented edge responses is not accurate enough to inhibit texture edge responses before contour processing. Ideally the inhibitory action would result in the suppressed contours of Figure 4.27 (d). The technique could be improved by simulating the illusory contours generated by cells in V1 when presented with abutting grating stimulus

as was discovered by Grosof et al. [111]. The illusory contours would interfere with the texture

identification stages of moving average oriented edge responses and the SAR model producing more

distinct results at the boundaries between textures.

4.7.6 Comparison with Other Techniques

Unlike other systems such as QBIC [16], ARBIRS [4] detects texture first before analysing colour regions. ARBIRS uses a relatively simple non-directional first-order derivative edge detector for

determining the basic texture features. The image is subdivided into 24× 24 pixel blocks and edge

density and coarseness values are calculated from the first-order derivative edge responses. A block

is only considered a textured region if the edge density is greater than 25% of the block. Blocks

are then grouped into regions if they have similar colour histograms. The major difference with the

texture detection used in ARBIRS and the texture inhibition approach presented in this chapter


Figure 4.27: (a) Variance of SAR parameters, (b) Combined variance of SAR parameters and

oriented edge responses, (c) Contour image inhibited by (b), (d) Ideal inhibition.


is that the ARBIRS system uses large 24×24 pixel blocks which do not allow for arbitrary texture

boundaries to be identified. However, for the purposes of image retrieval (rather than contour

extraction) the ARBIRS texture subsystem performs well.

4.8 Conclusion

Edge detection must accurately represent the edges present at each pixel. When used for contour

following the accuracy and tuning of the edge detector becomes paramount. In this chapter a

number of existing edge detectors were analysed for suitability for contour following. We found

that a majority of edge detectors that are commonly used such as the Roberts, Prewitt, Sobel, and

Laplacian are not suitable for contour following. Contour following requires multiple arbitrarily

orientated edge detectors. Of the currently used operators, only the Gabor and Canny operators

satisfy these criteria. The S-Gabor and Canny operators were analysed at multiple aspect ratios

to determine their orientation and position tuning performance. We found that neither operator

had a significant advantage over the other. We also found that as the aspect ratio increased there

was a trade off between orientation and position tuning.

An Asymmetry detector was developed that position tunes elongated orientation filters. By

itself, the elongated orientation filter produces good orientation tuning but poor position tuning.

Inhibiting the elongated orientation filter’s responses with the Asymmetry detector provided both

near-perfect orientation and position tuning. The result is a filter that outperforms any other filter

for the purposes of contour following.

To further comply with the requirements of contour following, thinning was investigated to

remove ambiguous edge responses. Morphological thinning and skeletonisation thinning were investigated but were unable to provide the correct edge responses as they could only be applied

within the discrete horizontal-vertical pixel layout of images. A new technique was developed that

allows thinning to work in the orientation of the edge response using elongated Gaussian filters

perpendicular to the edge orientation. This thinning approach is further refined by also thinning

across adjacent orientations and finally a removal of diagonals. The result is a multi-orientation

edge image that is representative of the true edges in the original image and is ideal for the subsequent phase of contour following. The Asymmetry edge detector is more suitable for contour following than the Sobel, Roberts, Prewitt, Kirsch, Robinson, and Laplacian operators and produces better results than just Gabor or Canny filters on their own whilst providing more accurately

thinned results than skeletonisation and morphological thinning.

A new approach for texture analysis was developed using the Asymmetry edge detector. The

purpose of low-level texture analysis is to inhibit edge responses before the contour following

stage to reduce processing overhead. Texture regions were identified using the Asymmetry edge

detector as well as an optimised SAR implementation. However, rather than simply identifying

texture regions, the approach is also able to distinguish between neighbouring textures so that


boundaries between textures can propagate up to higher-level contour processing stages allowing

the boundaries between textures to be identified and used to form regions. The boundary detection

phase uses the moving variance to detect changes in textural distribution in Asymmetry edge and

SAR features. Even though the approach is able to identify textures and boundaries between

textures more work is required to achieve reliable texture inhibition before contour processing.

Incorporating contour-end detection may improve the technique’s ability to distinguish boundaries

between textures.


Chapter 5

Contour

The previous chapter focussed on developing an edge detector that could define boundaries or

contours at each pixel for the purpose of contour extraction. In this chapter the edge points from

the Gaussian thinned Asymmetry edge detector are linked together to form whole contours. As

shown in Figure 1.4, contour extraction in this research is designed to be used for higher-level

feature extraction such as region identification. Much of the hard work in contour extraction has

been addressed with the edge detector of the previous chapter and all that remains is to use the

edge responses to form whole contours.

The challenge in contour extraction is to extract whole independent contours. That is, the

contours extracted should not be split unnecessarily but should also not be joined with other

contours. The edge features of the previous chapter aid the contour extraction process in that

multiple orientation responses are provided at each edge point. Multiple edge responses provide

two advantages. The first is that the exact edge orientation can be determined by interpolating

between adjacent orientation responses allowing more accurate orientation comparisons between

edge points. The second is that multiple edges that cross the same point can also be represented

allowing contours to co-terminate or cross the same point independently. In this chapter a new

contour extraction technique is presented based on the local processing edge linking approach [69]

that takes advantages of the edge features of the previous chapter.

Before regions can be identified, additional geometric features must be extracted. In this chapter vertex extraction is briefly discussed and a new neurophysiologically-based vertex detector is

presented which is also based on the Asymmetry edge detector.

The second half of the chapter is dedicated to contour-based image similarity techniques. Existing techniques are discussed and are found not to be suitable for comparing whole image contours.

Two new contour matching techniques are presented and their performance is compared with the

Hausdorff distance [117]. Finally, a new combined colour and contour representation is presented

that is more compact than the other representations but provides comparable results.


5.1 Contour Extraction

Reviews of contour extraction such as Gonzalez and Woods [69] generally begin with a quick

description of the local processing approach followed by a detailed analysis of the Hough transform.

In this section we will also describe both techniques and show that the local processing approach is more flexible than the Hough transform, although it needs more work before successful contours can be extracted.

5.1.1 Local Processing

The local processing method of linking edge points into contours involves analysing a small neighbourhood of pixels and linking neighbouring points that have similar orientations to the central

pixel. Gonzalez and Woods [69] identify two properties for joining edge pixels into a contour: (1)

the strength of the response of the gradient operator, and (2) the direction of the gradient.

The first property determines that two edges are similar if the magnitude of the gradient

response is similar. If ∇f(x′, y′) is the magnitude of the gradient at neighbouring point (x′, y′) and

(x, y) is the centre of the neighbourhood then the neighbour is part of the contour if

|∇f(x, y)−∇f(x′, y′)| ≤ T (5.1)

where T is the predefined magnitude difference threshold.

Using the second property, two edges are considered similar if the difference in their angles is

less than a predefined threshold A:

|α(x, y)− α(x′, y′)| < A (5.2)
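A Java sketch of the two linking tests applied to a pair of edge points is given below; treating edge orientations modulo π when comparing angles is an assumption of the sketch.

public final class LocalLinking {

    // Tests whether a neighbouring edge point (magN, angleN) may be linked to the central point
    // (magC, angleC) using the two properties of Equations 5.1 and 5.2. T is the magnitude
    // threshold and A the angle threshold in radians.
    public static boolean similar(double magC, double angleC, double magN, double angleN,
                                  double T, double A) {
        boolean magnitudeOk = Math.abs(magC - magN) <= T;   // Equation 5.1
        double diff = Math.abs(angleC - angleN) % Math.PI;  // compare orientations modulo pi (assumed)
        diff = Math.min(diff, Math.PI - diff);
        boolean angleOk = diff < A;                         // Equation 5.2
        return magnitudeOk && angleOk;
    }
}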

There are two limitations with the approach presented in Gonzalez and Woods [69]. Firstly,

the approach assumes that there is only one gradient direction per pixel when in fact two or more

contours may cross each other at the same pixel. Secondly, there is no consideration of the position

of the pixel in the neighbourhood and the relative directions of gradients. The local processing

approach is suitable for contour extraction, but more work can be done to extend the technique to

support Asymmetry edge detector responses and produce contours suitable for video retrieval.

5.1.2 Hough Transform

More research appears to have been performed on the Hough transform [118, 119, 120, 69] than on the local processing approach, due to the motivation of performing pattern recognition rather than contour representation. The local processing approach can be described

as an approach that can extract arbitrary contour shapes whereas the Hough transform extracts

contours that conform to predefined shape functions.



Figure 5.1: (a) Image containing lines of various positions and angles. (b) Hough transform of image (a) in the ρ-θ parameter space.

The Hough transform begins with a shape function to be detected in the image, such as a line:

y = ax + b (5.3)

The shape function will have parameters, such as a and b in this case. The parameters form a

parameter space that can be laid out in multiple dimensions. A straight line has two parameters

and therefore all possible lines can be described by a point in two dimensions in the parameter

space.

Every pair of edge pixels in the edge image is substituted into the shape function to determine

the parameters of the shape that passes through both edge pixels. The parameter space becomes a

histogram and the bin that represents the parameters of the line is incremented. After processing,

the value of each bin represents the number of edge points that contributed to that particular shape.

Thresholding can be used to determine significant shapes and the resulting parameter points can

be used to reconstruct the edge image with only the significant shapes.

The gradient-offset line equation of Equation 5.3 is generally not used because the gradient a approaches infinity as the line approaches 90°, making uniform histogram construction difficult.

The gradient problem can be avoided by using polar co-ordinates:

x cos θ + y sin θ = ρ (5.4)

ρ will be no greater than half the diagonal of the original image and θ will range from −90° to 90°.

An example of the Hough transform into polar co-ordinates of an image containing lines of various

angles and positions is shown in Figure 5.1. Four dense clusters are formed in the parameter space

representing the four different line equations present in the original image.
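A Java sketch of the polar-form transform is given below. It uses the common variant that votes once per edge pixel across a quantised range of θ (rather than per pair of edge pixels as described above), measures ρ from the image centre so that it is bounded by half the diagonal, and leaves the bin counts as parameters; these choices are assumptions of the sketch.

import java.util.List;

public final class HoughLines {

    // Accumulates votes in the rho-theta parameter space of Equation 5.4 for a list of edge
    // pixels, each given as {x, y}. thetaBins and rhoBins control the quantisation.
    public static int[][] accumulate(List<int[]> edgePixels, int width, int height,
                                     int thetaBins, int rhoBins) {
        double maxRho = Math.hypot(width, height) / 2.0;  // rho measured from the image centre
        int[][] acc = new int[thetaBins][rhoBins];
        for (int[] p : edgePixels) {
            double x = p[0] - width / 2.0;
            double y = p[1] - height / 2.0;
            for (int t = 0; t < thetaBins; t++) {
                double theta = -Math.PI / 2 + Math.PI * t / thetaBins;   // -90 to 90 degrees
                double rho = x * Math.cos(theta) + y * Math.sin(theta);  // Equation 5.4
                int r = (int) Math.round((rho + maxRho) / (2 * maxRho) * (rhoBins - 1));
                acc[t][r]++;                                             // vote for this line
            }
        }
        return acc;
    }
}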

Other shape functions such as the circle can be used which result in a three dimensional

parameter space due to three parameters in the shape function:

(x − a)² + (y − b)² = c²    (5.5)

The primary limitation with the Hough transform is that it searches for predefined shapes.

Any shape that is not, for example, a perfectly straight line or circle, will be misrepresented. In


addition, the parameter space can increase to multiple dimensions even for relatively simple shapes

that would easily be extracted using local processing techniques. Therefore, the Hough transform is

not suitable for producing an accurate representation of contours suitable for content-based video

retrieval.

5.2 Contour Extraction Requirements

As seen in the last section current contour extraction techniques have their limitations. Before a

new technique can be developed we need to determine the requirements of a contour extraction

technique in the context of image and video retrieval. We have identified the following three re-

quirements for contour extraction. Firstly, relatively arbitrary contours must be representable. This

is important because most natural contours do not follow a simple analytic formulation or vector

description, such as a straight line or an arc. Secondly, the edge responses of the previous chapter

are very precise and unambiguous, therefore the contours extracted should reflect the same level

of precision and non-ambiguity. Thirdly, even though the contours may be of arbitrary shape they

must not contain sharp edges, which is an indication of two contours joining. Based on these three

requirements the local processing approach of the previous section is much more suitable for video

retrieval than the Hough transform. The following sections take the local processing approach and

expand and refine it to produce the contours that are required.

5.3 Identifying Edge Points

The first stage of the local processing approach is to identify edges that occur at each pixel in the

edge image. In the local processing approach described in [69] it is assumed that there is only one

edge orientation per pixel and therefore the only criteria for identifying the presence of an edge

point is whether the magnitude is greater than a predetermined threshold (Equation 5.1). The

orientation of the edge point simply becomes the angle of the gradient vector, where the gradient

vector is composed of the individual gradients along the x and y axes:

∇f = (Gx, Gy) = (∂f/∂x, ∂f/∂y)    (5.6)

from vector analysis the angle of the gradient is:

α(x, y) = tan⁻¹(Gy / Gx)    (5.7)

However, this approach is not suitable for multi-orientation edge responses such as those provided by the technique presented in the previous chapter. The first reason is that the orientation

responses are not ‘thinned,’ that is, multiple adjacent orientations may have magnitudes greater

than the predetermined magnitude threshold. However, the orientation responses can’t simply be



Figure 5.2: (a)-(d) Four orientation response scenarios that need to be considered for computing

the true orientation of an edge. (e)-(h) The result after subtracting the minimum response from

the other two responses.

thinned because the orientation responses adjacent to the peak orientation response are required

so that the true orientation of the edge can be interpolated. So the process of selecting the peak

orientations and computing the true orientation of the edge must occur simultaneously.

The process for extracting an edge must first begin by finding the peak orientation responses at a

pixel. A peak orientation response is identified when its response is greater than the predetermined

threshold and its adjacent responses have a lower magnitude than the peak response. We have

found that ensuring that all orientation responses within 45° in either direction (±3 orientation steps)

have a lower magnitude than the central orientation reduces the effect of noise on the results. This

process of ‘thinning’ along the orientation curve is similar to orientation competition that occurs

in the visual cortex. However, it also means that edges crossing at the same point must differ by

45° to be detected.

5.4 True Orientation

Once a peak orientation has been detected the true orientation of the edge is determined. Since

there are only a discrete number of oriented edge detectors, the true orientation of the edge must

be interpolated from the adjacent responses. Figure 5.2 (a)-(d) shows four scenarios that need to

be considered when calculating the true orientation.

The first scenario (Figure 5.2 (a)) is simple to evaluate with the peak orientation being the


orientation of the edge (15°) as both adjacent orientations are zero. The second scenario (Figure 5.2

(b)) is also simple to evaluate as both adjacent orientation responses are of the same magnitude

therefore the true orientation of the edge is the bisector between both orientations, 22.5°. The

third scenario (Figure 5.2 (c)) is more complex as the true orientation of the edge is a proportional

distance between the 15° and 30° orientations. The small response at 0° also indicates that the true orientation of the edge may be slightly closer to 15° than just the 15° and 30° responses may

indicate. Figure 5.2 (d) shows that neither the peak orientation alone nor just two orientations are sufficient; all three orientations must be considered when calculating the true orientation.

The true orientation of an edge will either lie directly on an edge detector orientation or between

two edge detector orientations. So at most, only two values are required to interpolate the true

orientation. Therefore, the three orientation responses must be reduced to two. Looking at Figure

5.2 (d) we can subtract the minimum response from the other two responses to produce the

result in Figure 5.2 (h). Now if the peak orientation and one of the smaller responses of Figure

5.2 (h) is used to interpolate the true orientation then the result will be 15° because both other

responses are now zero. Figure 5.2 (g) also more accurately indicates that the true orientation is

closer to 15° than the two orientation responses indicate in Figure 5.2 (c), whilst Figures 5.2 (e) and (f) remain unchanged, as they should. The algorithm for computing the true orientation response for a peak orientation is shown in Algorithm 2, with the complete process for extracting edge points shown in Algorithm 3.

Algorithm 2 Calculate the true orientation for a peak orientation response oi.
r3 ← min(oi−1, oi+1)
if oi−1 < oi+1 then
  r2 ← oi+1
  d ← 1
else
  r2 ← oi−1
  d ← −1
end if
Subtract the minimum:
r1 ← oi − r3
r2 ← r2 − r3
θ ← (i + d·r2/(r1 + r2))·π/N    {N refers to the total number of orientation responses}
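To make the interpolation concrete, the following Java sketch (our own illustration, not the thesis implementation; the class, method, and parameter names are ours) computes the true orientation for a peak response index following Algorithm 2, assuming the N orientation responses at a pixel are held in an array whose indices wrap circularly.

// Sketch of Algorithm 2: interpolate the true edge orientation at peak index i.
// 'responses' holds the N oriented responses at one pixel; indices wrap circularly.
final class TrueOrientation {
    static double of(double[] responses, int i) {
        int n = responses.length;                      // N oriented detectors
        double left  = responses[(i - 1 + n) % n];     // o_{i-1}
        double right = responses[(i + 1) % n];         // o_{i+1}
        double r3 = Math.min(left, right);             // the minimum adjacent response
        double r2 = Math.max(left, right) - r3;        // larger adjacent response, minimum removed
        double r1 = responses[i] - r3;                 // peak response, minimum removed
        int d = (left < right) ? 1 : -1;               // direction of the larger adjacent response
        // theta = (i + d * r2 / (r1 + r2)) * pi / N, as in Algorithm 2
        return (i + d * r2 / (r1 + r2)) * Math.PI / n;
    }
}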

After computing the true orientation a complete edge point can be described in terms of location

and direction. The approach described above improves upon the standard local processing approach

with the following features:

• Supports multi-orientation input

• Mimics visual cortex orientation competition


Algorithm 3 Extracting all edge points from an edge image (a pixel may contain multiple edge points of different orientations).
for all i: pixels in edge image, i represents pixel location do
  for all oj: orientation responses at i ≥ SEED THRESHOLD, j represents orientation index do
    {An orientation can only create a new edge point if it is the largest within its neighbours}
    if |oj| ≥ |oj−a| : −3 ≤ a ≤ 3, a ≠ 0 then
      θ ← true orientation of oj    {see Algorithm 2}
      Create a new edge point p at location i with orientation θ
    end if
  end for
end for

• Is able to resolve the true orientation of an edge from more than two orientation responses

• Supports multiple edge points at the same pixel location.

5.5 Edge Linking

Once all of the edge points in the image have been determined they can be linked. The local

processing approach [69] only links points that have a similar magnitude and direction. This ap-

proach is limited for a number of reasons. Firstly, an edge may be formed between a foreground

object and a patterned background. The varying background colour may cause a variation in the

magnitude of the edge response along the contour even though there are no breaks in the edge.

Therefore, we have removed the first criterion of having a similar magnitude and only require that

the magnitude of each edge response be above the predetermined seed threshold (12), which, by

this stage, all edge points will satisfy. Secondly, even though it is important that a pair of linked

edge points have similar orientation, this criterion alone is too flexible and may result in incorrect edges

being linked. Consider Figure 5.3 for example. If only similar orientation is considered, points (a)

and (b) would be incorrectly linked to each other. The additional criterion of relative location with

respect to orientation needs to be considered.

The angle of the relative location of a neighbour to the edge point being considered is simple

to compute. The angle begins at 0° at location (1, 0) and increases in 45° increments for each

neighbour in the counter-clockwise direction. This relative location is only relevant with respect to

the orientation of the centre edge point. We have found that allowing a 45° difference (this is the

location threshold, AL) between the edge point’s orientation and the angle of relative location is

suitable for edge linking as it allows a contour to deviate one pixel to the left or right when moving

in the direction of the contour. Figure 5.4 shows the angle that is formed between the orientation

of the edge point and the angle of neighbouring relative locations.

Figure 5.3: Edge linking scenario. (a) 120°; (b) 95°; (c) 105°.

Figure 5.4: These figures show the difference between the angle formed between two neighbouring points and the orientation of the centre point.

The next step is to determine which edge orientation at the pixel location to link to, if any.

Firstly, neighbouring edge points are only considered whose difference in orientation from the

central edge point is less than 30° (this is the orientation threshold, AO). Secondly, the link

strength for each edge point must be greater than a predetermined threshold, which is the same as

the seed threshold. The link strength modulates the neighbouring edge point’s strength based on

the difference between the orientations of the neighbouring edge point and the central edge point.

A Gaussian function with a bandwidth of 45° is used to modulate the strengths of the neighbours:

l = n·e^(−(θn − θc)²/b²), b = 45°  (5.8)

where l is the resulting link strength, n is the strength of the neighbour edge point, θn is the

orientation of the neighbouring edge point, θc is the orientation of the central edge point, and b

is the bandwidth of the Gaussian function. If there is no difference between the two orientations

then the neighbouring edge point's strength will not be affected. If the difference is 45° then

the neighbouring edge point’s strength will be diminished by almost two thirds. Since the link

threshold is the same as the seed threshold a neighbouring edge point’s strength must be quite


large to overcome the modulating effect of a 45° difference in orientation.
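As an illustration of the modulation in Equation 5.8, the short Java sketch below (our own naming, not the thesis code) computes the link strength for a candidate neighbour and tests it against the link threshold of 12 used above; orientations are in radians and the bandwidth b is fixed at 45°.

// Sketch of Equation 5.8: Gaussian modulation of a neighbour's strength by the
// difference in orientation, followed by the link-threshold test.
final class LinkStrength {
    static final double BANDWIDTH = Math.toRadians(45);  // b = 45 degrees
    static final double LINK_THRESHOLD = 12;             // same value as the seed threshold

    // n: neighbour edge strength; thetaN, thetaC: neighbour and central orientations (radians)
    static double strength(double n, double thetaN, double thetaC) {
        double d = thetaN - thetaC;
        return n * Math.exp(-(d * d) / (BANDWIDTH * BANDWIDTH));
    }

    static boolean linkable(double n, double thetaN, double thetaC) {
        return strength(n, thetaN, thetaC) >= LINK_THRESHOLD;
    }
}

With a 45° orientation difference the neighbour's strength is scaled by e⁻¹ ≈ 0.37, matching the reduction by almost two thirds noted above.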

Once a neighbour has been identified for linking it must be determined whether the edge point

is already part of an existing contour. If the neighbouring edge point is already part of an existing

contour then the question arises as to why the central edge point wasn’t considered linkable to it

when that contour was being followed. The answer is because the link strength is based on the

neighbour’s strength, n, so depending on which direction the contour is being trace, either from

central edge point to neighbour, or from neighbour to central edge point, a different link strength

will be determined. This is a reasonable side-effect because a strong edge may not consider a weaker

edge worthy of being linked because of the difference in orientation. However, a weaker edge may

have been linked to another edge because of the similarity in orientation and now the weaker edge

is part of a greater contour and is able to link the stronger edge to itself. This is a basic form of

medium-level perceptual grouping similar to that which occurs in the visual cortex.

If the neighbouring point already exists in another contour then the two contours become linked

between the two edge points (but still remain independent contours). If the point doesn’t exist then

it is simply added to the existing contour being traced. The complete edge linking algorithm is

described in Algorithms 4 and 5.

5.5.1 Edge Linking Experiments

The new edge linking approach presented in this chapter was compared with the conventional local

processing approach [69]. Edge responses from the Asymmetry detector of the previous chapter

were used as input for both edge linking techniques. The Sobel edge detector was also used to

compute the edge gradient which is conventionally used in the local processing approach described

in Section 5.3, however the Sobel edge gradient was not used as input to the new edge linking

approach as the new approach requires multi-orientation input.

A threshold of 12 was applied to the Asymmetry edge responses to reduce noise whereas a

threshold of 64 was applied to the Sobel edge responses to compensate for the broader range and

thicker responses of the Sobel edge detector. A maximum angular linking deviation of only 20° was used for linking edges from the Sobel responses because larger values allowed too many spurious links. A maximum angular deviation of 30° was used for the Asymmetry responses because they

were more tightly tuned.

The conventional local processing approach assumes only one edge orientation per pixel; therefore, only one orientation was determined for each pixel from the multi-orientation Asymmetry edge

responses by selecting the orientation with the largest response. The new edge linking approach

allows multiple oriented edges at each pixel and therefore the orientations for the new edge linking

approach were computed using the true orientation algorithm described in Section 5.4.

As outlined in Section 5.2 the requirements of a good edge linking algorithm are contours

that do not contain sharp edges but may contain small variations in orientation throughout the


Algorithm 4 Follow contour starting at edge point p

θp ← orientation of edge point p

for all ni: neighbouring pixels of p in 3× 3 neighbourhood do

θi ← orientation of edge point ni

φi ← the angle of the vector p→ ni

if |θp − φi| ≤ AL AND |θp − θi| ≤ AO then

Find the strongest edge at location i that can be linked to

Begin by computing the strength of the link for each edge

for all pj : edge points at location i, j represents edge index do

Link strength lj depends on the strength of pj and the difference in orientation between

pj and p

lj ← pj·e^(−(θp − θj)²/AL²)

end for

lmax ← max(l)

plink ← edge point represented by lmax

if lmax ≥ LINK THRESHOLD then

if plink is not already part of an existing contour then

Add plink to contour

Continue following contour with edge point plink by recursively calling this algorithm

else

Link plink to this contour

end if

end if

end if

end for

Algorithm 5 Find all contours in an edge image.
for all i: pixels in edge image, i represents pixel location do

for all pj : edges at location i, j represents edge index do

if pj isn’t part of an existing contour (i.e. point at same position with difference in orienta-

tion ≤ π/6) then

Create a new contour c with edge point pj

Follow contour c starting from pj(Algorithm 4)

Add contour c to contour list

end if

end for

end for

Figure 5.5: Input images for edge linking experiments: (a) Plane test image; (b) Sobel output; and (c) new edge detection output.

contour. Edge linking techniques are difficult to evaluate quantitatively as there are many edge pixel

scenarios. However, the new edge linking approach was significantly better than the conventional approach, so a subjective analysis was sufficient. The edge linking algorithms

were evaluated using the standard plane image (Figure 5.5 (a)) and representative contours were

analysed to determine the algorithm’s strengths and weaknesses.

5.5.2 Edge Linking Results

Figure 5.5 shows the edge images used as input for the edge linking algorithms. Figure 5.6 (a)

shows the total contours extracted by the local processing approach when applied to the Sobel

edge images and Figure 5.6 (b) shows the total contours extracted by the edge linking approach

when applied to the new edge images. Figure 5.6 (b) contains more smaller contours than (a)

because of the lower threshold used. Figures 5.6 (c) to (h) show the same contour extracted using

both algorithms.

Figure 5.6 (c) shows a contour extracted using the local processing approach that forms part

of a road on the airport tarmac. As can be seen in the original image the road is bounded by

two horizontal contours. At no point along the road do they join. However, the local processing

approach is not able to extract the two contours separately and also includes part of the tail wing in


the contour. In contrast, the same contour extracted using the new edge linking approach (Figure

5.6 (d)) is one straight line that does not include the other bounding contour or any elements of

the tail wing. In addition, the advantages of the new edge detection technique can also be seen

with the contour only being one pixel thick as opposed to the two pixel thick Sobel line.

The second contour analysed (Figures 5.6 (e) and (f)) was difficult to extract accurately for

both approaches with part of the tarmac contour joining with the top of the plane. However, the

new edge linking approach was able to extract the full contour of the top of the plane up to the

tail. The conventional local processing approach was misled by a change in orientation caused by a nearby contour, causing its contour to finish only part way along the top of the aircraft.

The final contour analysed is shown in Figures 5.6 (g) and (h). The new edge linking approach

is able to successfully extract the contour that goes down the middle of the aircraft whereas the

conventional approach extracts many other contours of the aircraft including its wing and shadow.

These results show quite clearly that the local processing edge linking approach and Sobel edge

detector combination perform quite poorly with natural images. However, its poor performance

could be explained solely by the Sobel edge detector. Therefore the conventional local processing

approach was also applied to the Asymmetry edge responses. A contour from both approaches is

shown in Figure 5.7. In the background, in grey, it can be seen that both approaches extract roughly

the same total contours as they use the same seed threshold (12). However, the contour selected

to be analysed (in black) has not been extracted correctly by the conventional local processing

approach. The road contour connects with and includes some of the tail contour. In contrast, the

new edge linking approach successfully extracts the contour without containing any edge points

from the tail of the aircraft.

To give the reader a broader idea of the kinds of contours extracted by the new edge linking algorithm, Figure 5.8 shows the extracted contours filtered at a range of minimum lengths.

5.5.3 Edge Linking Discussion

There are a number of reasons for the new edge linking approach’s successful extraction of contours

from the plane image. Firstly, unlike the Sobel edge detector, the Asymmetry edge detector is

designed specifically to satisfy the requirements of contour extraction including edge linking. As a

result the Sobel edge detector performs much more poorly than the Asymmetry edge detector for

edge linking. Secondly, because there are multiple orientation inputs the true orientation of each

point may be computed without interference from other orientations at the same edge point. In

contrast the gradient angle method must combine all edge stimulus into one angle approximation.

Thirdly, the new edge linking approach considers the relative location of pixels with respect to the

orientation of each edge point reducing the number of spurious links such as the one in Figure

5.7 (a). Finally, the new edge linking approach modulates the strength of a neighbour point by

Figure 5.6: Local processing edge linking results are contained in the left column and the new edge linking approach's results are contained in the right column.

Figure 5.7: (a) The conventional local processing approach is applied to the new edge detector output and a contour is selected. (b) The same contour from the new edge linking algorithm.

the relative difference in orientation to allow strong edges more flexibility in orientation but less

flexibility in orientation for weaker edges, again reducing spurious edge links. The results in Figure

5.6 and 5.7 show that the new edge linking approach is clearly better than the conventional local

processing approach.

5.6 Vertices

The primary purpose of extracting contours within the image decomposition process presented in

Figure 1.4 is to identify regions. Before regions can be extracted, contours must be grouped into

region boundaries. Human vision is able to form regions from incomplete boundaries by filling in

missing boundaries generating illusory contours [97]. Filling in illusory contours is a non-trivial

process and requires edge and contour information surrounding the missing boundary. Biederman

[96] found that shape recognition was pre-attentive and therefore must occur early on in human

vision processing. Biederman [96] also conducted experiments that showed that vertices in line

drawings were more important than the lines between vertices for shape recognition. Therefore

when the lines were missing between vertices the brain was still able to join the vertices with

an illusory contour to perform shape recognition. The ability for the brain to perform object

recognition solely on vertices places a great deal of weight on vertices. Biederman [96] explained

that features such as vertices, which are two or more lines terminating at a common point, are non-

accidental in that they rarely occur in nature without describing some important characteristic of a

distinguishable object. Biederman [96] used three vertices that occur in two and three dimensions:

‘L,’ ‘Y,’ and ‘Arrow’ vertices; however, the T-vertex is also useful for determining occlusion where

a straight continuous line causes another to terminate (see Figure 5.9).

Vertices can be used to reinforce incomplete contours or to construct non-existent contours.

Grossberg et al. [25] demonstrated that illusory contours can be constructed by detecting contour-

ends and linking parallel contour-end detectors. Hubel and Wiesel [10] have found evidence in the

Figure 5.8: Contours extracted from the Plane image using the new edge linking approach. Each image displays only the contours with lengths greater than (a) 1, (b) 2, (c) 3, (d) 5, (e) 10, (f) 20, (g) 50, and (h) 100 pixels.

Figure 5.9: Non-accidental vertices used for shape recognition (L, Y, Arrow, and T).

visual cortex for contour-end detectors. Grossberg et al. [94] propose that dipole cells link the

contour-end responses together forming the illusory contour. In this section we present a tech-

nique for identifying vertices also based on the concept of detecting contour-ends but using the

Asymmetry edge detector.

5.6.1 Contour-ends

Contour-ends are detected by end-stopped cells in the visual cortex which are tuned to two edges

oriented 90° apart in a T formation [10]. Neurophysiological research has found that the vertical

length of the detector is tightly tuned like a simple cell while the top of the T is less tightly tuned

allowing for corners of a wide range of angles [93]. To achieve a similar result an end-stopped filter

has been designed using multiple inputs for the top of the T and one input for the centre (Figure

5.10 (a)). Similar filters to the Asymmetry edge detector are used. A narrower Asymmetry filter is

used to allow larger responses close to contour-ends. Also the filter is offset by 3 pixels to achieve

the T shape. Human vision uses 18 orientations with 10° between each oriented cell [10]. The Asymmetry edge detector uses 12 orientations (15° apart) which aren't sensitive to the direction of contour. Contour-end detection, on the other hand, must describe the angle of the contour-end in 360°, requiring 24 orientations. The end-stopped filter uses 14 filters to make up the top of the

T and one filter for the centre.

The end-stopped responses are determined by first producing the 24 oriented edge responses

for each point using the same Asymmetry edge detector as Equation 4.14 but with modified

parameters:

EA = |C| − t|A| (5.9)

where C is the response of the Canny edge detector (see Equation 4.9) with θ = nπ/12, n = 0, …, 23,

sy = 1, sx = 3, and tx = 3 to translate each filter away from the centre of the contour end, A

is the response from the asymmetry detector with sy = 1.33, sx = 1 to provide a more compact

filter, ty = 3 to centre the asymmetry detector over the Canny edge detector, and t = 2 to provide

Figure 5.10: (a) End-stopped detector; (b) Vertices extracted from the Plane image.

a tight tuning curve.

The end-stopped detector of Figure 5.10 (a) is formed by sampling Asymmetry edge detector

responses from selective orientations. The maximum response from the 14 filters at the top of the

T is determined and the minimum response between the top of the T and the centre is used as the

strength of the end-stopped response:

EC = min[ EA(θ), max_{i=3…9} EA(θ ± iπ/12) ]  (5.10)

where EA is the Asymmetry edge detector response from Equation 5.9 and θ is the orientation of

the vertical bar in Figure 5.10 (a).
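A minimal Java sketch of Equation 5.10 is given below (our own illustration, not the thesis code); it assumes the 24 oriented Asymmetry responses at a pixel are stored in an array indexed by orientation, spaced 15° apart and wrapping circularly, and combines the maximum response over the top of the T with the central response using the minimum.

// Sketch of Equation 5.10: end-stopped response at one pixel from 24 oriented responses.
// ea[k] is assumed to hold E_A at orientation index k (k = 0..23, 15 degrees apart).
final class EndStopped {
    static double response(double[] ea, int theta) {
        int n = ea.length;                               // 24 orientation responses
        double top = 0;                                  // maximum over the top of the T
        for (int i = 3; i <= 9; i++) {                   // offsets of +/- i * pi/12
            top = Math.max(top, ea[(theta + i) % n]);
            top = Math.max(top, ea[(theta - i + n) % n]);
        }
        return Math.min(ea[theta], top);                 // E_C = min(E_A(theta), max ...)
    }
}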

The Asymmetry edge detector responses are tightly tuned in orientation and position alleviating

false responses which can occur with other contour-end detectors along the length of an edge [12].

The range of angles detectable by the end-stopped detector can be controlled by the number of

filters occupying the top of the T. Because the orientation sensitivity is narrowly tuned, edge filters

can be used at 45° from the primary filter allowing acute corners to be detected.

5.6.2 Thinning End-stopped Responses

When thinning end-stopped responses it is important to consider multiple orientations to ensure

that contour-ends which will form a single vertex remain together. This is achieved by generating

an aggregate image of the component orientations. The aggregate image is blurred using a Gaussian

filter with a 2 pixel radius. The Gaussian filter ensures that contour-ends of the same vertex will

be thinned to the same point. The contour-ends are thinned by thinning the blurred aggregate.

Any points that are set to zero in the aggregate are also set to zero in the multiple orientation

responses. Thinning is performed by setting the pixel to zero if it isn’t the largest value in its


3× 3 neighbourhood. Orientation competition is performed to ensure that a contour-end response

occurs in only one orientation. A response is set to zero if either of its two adjacent orientations

has a larger value.

5.6.3 Vertex Extraction

Vertices are created for points which contain more than one end-stopped response greater than a

threshold (set to 16). The angles and strengths for each edge of the vertex are set to the orientations

and strengths of the end-stopped responses.

Results for vertex extraction on the Plane image are shown in Figure 5.10 (b). The end-stopped

detector allows vertices with acute corners to be detected, so vertices with many edges can be

easily represented.

The vertex edges are then linked to the extracted contours. For each vertex edge, every contour

point is analysed to see whether the vertex edge should be linked to it. It is possible that a vertex

exists on the middle of a contour because of limitations with the contour following algorithm,

therefore all points must be checked, not just points at the end of contours. The closest contour

point in distance and angle is assigned to each vertex edge if it falls within the minimum distance

and orientation difference. After the closest point has been determined a check is performed to see

whether there is a contour-end which is closer to the vertex by distance but may not have been

as close in orientation. If there is, then the point is replaced with the contour-end which is more

appropriate to be linked to a vertex edge.

Even though the contour-vertex grouping performed well for simple geometric images more

work needs to be done to allow the robust extraction of regions from natural images using the

contour-vertex approach. Therefore vertices have not been used as features in the remainder of

this research but show promise for higher-level object extraction.

5.7 Contour Matching

The first half of this chapter has dealt with the extraction of contours which is part of the feature

extraction phase of content-based retrieval, but before they can be used they must be represented

in an efficient form suitable for performing similarity queries. The type of representation used

depends on the method of determining the similarity between image contours.

Few techniques exist for determining the similarity between

contours in content-based retrieval systems. Existing systems either compare independent edge

points or use sophisticated region-based queries. For example, the ART MUSEUM system [46]

uses global and local distributions of edge features for determining contour similarities, however

edge distribution is a form of texture representation and does not consider the links between edge

points. Likewise, many existing content-based retrieval systems [16, 27, 4] support the querying


of texture but do not support the querying of linked edge points in the form of contours. These

systems usually support other shape-based query methods based on regions, however the regions

are extracted through image segmentation techniques as opposed to being formed from the linking

of edge points. In addition, the shape-based query methods are generally designed for the user to

specify a subset of objects in the query image that must be found in the database. As a result

shape-based queries are generally not designed for whole image comparisons.

Sclaroff and Pentland [85] devised a technique called modal matching for comparing two shapes

that is invariant to various deformations. Modal matching is a similar approach to affine-invariant

Fourier descriptors [84] but both techniques are designed for comparing object silhouettes as op-

posed to natural images that consist of many objects.

The Hausdorff distance has been used for comparing edge points in images [117]. The Hausdorff

distance is used to determine the spatial similarity between two sets of points. Given two sets of

points, A and B, the Hausdorff distance is defined as:

H(A,B) = max(h(A,B), h(B,A)) (5.11)

where

h(A,B) = max_{a∈A} min_{b∈B} ||a − b||,  (5.12)

and || · || is some underlying norm on the points of A and B such as the Euclidean distance [117].

Applied globally to an image, the Hausdorff distance ignores any links between edges. However,

if applied locally to groups of related edge points the Hausdorff distance can be used for recognising

model objects in scenes [117] and for object tracking in video sequences [121]. Since our requirement

is to determine overall image similarity the Hausdorff distance would need to be computed between

every contour in each image resulting in NM computations per pair of images where N and M

are the number of contours in each image. Even though the Hausdorff distance can compare sets

of edge points such as those in contours it does not consider the links between edge points or any

higher level features of contours such as orientation, length, or curvature.

What is required is a technique similar to the Hausdorff distance that uses contours as the

primitives rather than independent points. Before presenting a new technique for matching con-

tours, results from applying the Hausdorff distance to the test image database are presented and

used as a benchmark for comparing the new contour matching techniques.

5.8 Hausdorff Distance Experiments

By itself the Hausdorff distance does not provide a measure of image similarity but simply the

similarity between two contours. Huttenlocher et al. [117] extended the technique to support find-

ing objects in images by translating the model points across the image points. For whole image

matching there is no concept of translation to find a match but instead the images are overlaid and

the contours compared. Each contour can be considered as a ‘model’ where the closest matching


contour is being found in the other image. This is the approach we have taken for computing

the Hausdorff distance between two images. The Hausdorff distances between each contour in the

query image and its closest matching contour in the database image are summed. This approach is

not commutative since images may have different numbers of contours in different spatial arrange-

ments resulting in a different summed distance depending on which image is considered the query

image. To make the approach commutative the reverse summed distance is also computed from

the database image to the query image and both summed distances are added together. Finally,

the result is normalised by the total number of contours in both images:

H(A,B) = ( [Σ_{a∈A} min_{b∈B} H(a, b)] + [Σ_{b∈B} min_{a∈A} H(b, a)] ) / (NA + NB)  (5.13)

where A is the query image and B is the database image, a and b are contours in images A and B

respectively, and NA and NB are the number of contours in images A and B respectively.
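The following Java sketch (our own illustration; the contour and image types are minimal placeholders) implements Equations 5.11 to 5.13: the directed and symmetric Hausdorff distances between two contours, and the symmetric, contour-count-normalised image distance built from closest-contour matches in both directions.

import java.awt.geom.Point2D;
import java.util.List;

// Sketch of Equations 5.11-5.13; a contour is represented as a list of edge points.
final class HausdorffImageDistance {

    // Directed distance h(A,B) of Equation 5.12 between two sets of edge points.
    static double directed(List<Point2D> a, List<Point2D> b) {
        double worst = 0;
        for (Point2D p : a) {
            double best = Double.MAX_VALUE;
            for (Point2D q : b) best = Math.min(best, p.distance(q));
            worst = Math.max(worst, best);
        }
        return worst;
    }

    // Symmetric Hausdorff distance H(A,B) of Equation 5.11.
    static double hausdorff(List<Point2D> a, List<Point2D> b) {
        return Math.max(directed(a, b), directed(b, a));
    }

    // Image-level distance of Equation 5.13: closest-contour distances summed in both
    // directions and normalised by the total number of contours in the two images.
    static double imageDistance(List<List<Point2D>> imageA, List<List<Point2D>> imageB) {
        double sum = 0;
        for (List<Point2D> a : imageA) {
            double best = Double.MAX_VALUE;
            for (List<Point2D> b : imageB) best = Math.min(best, hausdorff(a, b));
            sum += best;
        }
        for (List<Point2D> b : imageB) {
            double best = Double.MAX_VALUE;
            for (List<Point2D> a : imageA) best = Math.min(best, hausdorff(b, a));
            sum += best;
        }
        return sum / (imageA.size() + imageB.size());
    }
}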

The test image database used for evaluating colour histograms in Chapter 3 was also used to

evaluate the Hausdorff distance. Contours were extracted from each image using the technique

presented in this chapter. The same three test images used in Chapter 3 were used as query images

and the first ten images returned in ranking order of closeness of Hausdorff distance were recorded.

5.8.1 Hausdorff Distance Results

The results of the Hausdorff distance experiments are shown in Figure 5.11. The Hausdorff distance

performed poorly with the Car image only returning one other car image and it was the last image

returned. The other 9 images returned contain a lot of texture and support the fact that the

Hausdorff distance only measures a concept of spatial ‘intersection’ as opposed to similarity in

contour shape resulting in images with dense contours or texture being considered more similar

because there are more points to ‘intersect’ with. It is also worth noting that the similarity values

returned by the Hausdorff distance measure do not contain a lot of variability for significantly

different images. In fact the only variation in the similarities of the three image queries was for

the first image returned for the Wedding image. The Hausdorff distance performed better with

the Wedding image however only four wedding images were returned (there are enough wedding

images in the database to fill the top ten results) and only two were in the top two results. The

Hausdorff distance performed well with the Bush image returning bush images in the top 7 results,

however, these results could also be explained by the Hausdorff distance’s tendency to measure

contour point similarity as opposed to contour shape similarity.

5.8.2 Hausdorff Distance Discussion

The results show that the Hausdorff distance does not perform well for two out of the three test

images. The closeness in similarity values of the returned images indicates that the Hausdorff

distance has trouble distinguishing the similarity between various images. One of the problems


Figure 5.11: Results of three image queries using the Car, Wedding, and Bush images and the

Hausdorff distance between image contours. The query images are displayed in the left column

followed by the 10 most similar images.

with the Hausdorff distance is that it requires every edge point to be stored in the database. For

the test image database it is not unusual for an image to contain 1000 contours each with 5 or

more edge points per contour. Assuming two bytes are required to store each edge point and each

contour has on average 10 edge points then 20 KB are required to represent an image. The second

and most significant problem facing the Hausdorff distance as a contour matching technique is the

processing requirements. The Java implementation running on an 800 MHz PC took 15 seconds

to compare two images. An image database with 1000 images would take over 4 hours to perform

one query.

Based on the storage and computational requirements as well as the poor querying results the

Hausdorff distance is not suitable for contour matching. In the next section a new technique for

comparing contours is presented that focuses on improving the querying results of the Hausdorff

distance by incorporating shape features of contours rather than treating edge points independently.

5.9 Contour Similarity

The Contour Similarity technique takes the approach of comparing every contour in one image

with every contour in another image. It differs from existing techniques such as the Hausdorff

distance [117] and comparisons of edge distributions because the links between edges are implicitly

used in the contour comparisons. Where the Hausdorff distance only compares spatial location of

independent edge points the Contour Similarity technique also compares the orientation, curvature,

and length of contours.

Before contours can be compared the location, orientation, curvature, and length of each contour

must be determined. As noted above the Hausdorff distance requires every edge point to be stored

in the database for comparison however since Contour Similarity operates at the contour level only

the extracted features of each contour need to be represented. The following section describes how

these features are extracted before describing the comparison algorithm.


5.9.1 Contour Representation

Contours have been represented in the literature through a variety of techniques which have been

discussed in Section 2.4.1. These techniques include tangential angles, Fourier descriptors [84], and

eigenvectors [85]. Tangential angles represent the change in curvature of uniform distances whilst

Fourier descriptors and eigenvectors represent the various spatial frequencies in the varying distance

of the shape’s outline from the centroid. Neither technique explicitly represents perceptual features

of contours and therefore comparison techniques can not be designed to use perceptual features.

For the Contour Similarity technique we have chosen four perceptual features for describing and

comparing contours:

• Centroid position (x and y)

• Length

• Prevailing orientation

• Curvature

We call the process of extracting these features Contour Summarisation and have found that

it reduces the storage requirements of summarised contours to 10% of the raw contour data. The

first two features are simple to extract. The centroid position is simply an x and y value that

represents the mean x and y positions of every edge point in the contour. The length is simply

the total number of edge points in the contour. The prevailing orientation and curvature are more

difficult to extract and are described in the next two subsections.

Prevailing Orientation The prevailing orientation is the overall orientation of the contour and

is extracted by averaging the orientations of the individual edge points. This process is slightly

more difficult than it first appears since edge point orientations only range from 0° to 180°. For example, the average of two edge points with orientations 10° and 170° is not 90°, it is 0°. This

problem only arises when there are orientations from both 90° quadrants. So the first step is to find

the average orientation of each quadrant. If all points lie in only one quadrant then the prevailing

orientation of the contour is simply the average of all orientations. However, if there are points from

both quadrants the two values need to be combined. If the difference between the averages of the

two quadrants is less than or equal to 90° then the two average values can be combined, weighted by the number of points that contributed to each quadrant. But if the averages differ by more than 90° then the first quadrant's average orientation must be shifted by 180° before the two values are combined. This may cause the final prevailing orientation to exceed 180° and will need

to be shifted back if necessary. The algorithm for calculating the prevailing orientation is shown

in Algorithm 6.


Algorithm 6 Prevailing orientation.
O1 ← 0    {orientations of edge points in the first quadrant, 0° → 90°}
O2 ← 0    {orientations of edge points in the second quadrant, 90° → 180°}
L1 ← 0    {number of edge points in the first quadrant}
L2 ← 0    {number of edge points in the second quadrant}
for all points in contour do
  if point.orientation < π/2 then
    O1 ← O1 + point.orientation
    L1 ← L1 + 1
  else
    O2 ← O2 + point.orientation
    L2 ← L2 + 1
  end if
end for
O1 ← O1/L1
O2 ← O2/L2
if L1 > 0 AND L2 > 0 then
  if O2 − O1 > π/2 then
    O1 ← O1 + π
  end if
  prevailingOrientation ← (O1·L1 + O2·L2)/(L1 + L2)
else if L1 > 0 then
  prevailingOrientation ← O1
else if L2 > 0 then
  prevailingOrientation ← O2
end if
if prevailingOrientation ≥ π then
  prevailingOrientation ← prevailingOrientation − π
end if
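For illustration, the Java sketch below (our own naming) follows Algorithm 6 with orientations supplied in radians in the range [0, π); the only departure from the listing is a guard against dividing by zero when a quadrant is empty.

// Sketch of Algorithm 6: prevailing orientation of a contour.
final class PrevailingOrientation {
    // orientations: edge-point orientations of one contour, in radians in [0, pi)
    static double of(double[] orientations) {
        double o1 = 0, o2 = 0;   // orientation sums for the two 90-degree quadrants
        int l1 = 0, l2 = 0;      // edge-point counts for the two quadrants
        for (double o : orientations) {
            if (o < Math.PI / 2) { o1 += o; l1++; } else { o2 += o; l2++; }
        }
        if (l1 > 0) o1 /= l1;                                  // quadrant averages (guarded
        if (l2 > 0) o2 /= l2;                                  // against empty quadrants)
        double prevailing;
        if (l1 > 0 && l2 > 0) {
            if (o2 - o1 > Math.PI / 2) o1 += Math.PI;          // shift first quadrant by 180 degrees
            prevailing = (o1 * l1 + o2 * l2) / (l1 + l2);      // combine, weighted by point counts
        } else if (l1 > 0) {
            prevailing = o1;
        } else {
            prevailing = o2;
        }
        if (prevailing >= Math.PI) prevailing -= Math.PI;      // wrap back into [0, pi)
        return prevailing;
    }
}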


Curvature Contour curvature is a description of how much a contour deviates from a straight

line. We could calculate how far the contour deviates from a straight line or the area formed between

the curve and the straight line but the simplest method is to calculate how much the orientations

of the edge points deviate from the prevailing orientation. The average absolute difference between

each orientation and the prevailing orientation is calculated for the entire contour. The edge linking

algorithm will only link two points if their orientations are within π/12 radians, therefore the largest

curvature occurs when a contour consists of edge points that have orientations that increment

in π/12 increments. The result is a circle or a semicircle. The average orientation deviation along a semicircle from the prevailing orientation is π/4, which is the largest possible curvature value and is

used to normalise curvature values. We have found experimentally that contours with a normalised

curvature above 0.25 can be considered curved lines.
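A corresponding Java sketch of the curvature measure is shown below (our own illustration, not the thesis code); it averages the absolute deviation of each edge-point orientation from the prevailing orientation, wrapping differences greater than 90° because orientations repeat with period π, and normalises the result by π/4.

// Sketch of the curvature measure: mean absolute deviation of edge-point orientations
// from the prevailing orientation, normalised by pi/4 (the semicircle case).
final class Curvature {
    // orientations: edge-point orientations in radians; prevailing: the prevailing orientation
    static double of(double[] orientations, double prevailing) {
        double sum = 0;
        for (double o : orientations) {
            double d = Math.abs(o - prevailing);
            if (d > Math.PI / 2) d = Math.PI - d;   // orientations repeat with period pi
            sum += d;
        }
        double mean = sum / orientations.length;     // average absolute deviation
        return mean / (Math.PI / 4);                 // normalise: 1.0 corresponds to a semicircle
    }
}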

5.9.2 Contour Similarity Algorithm

The Contour Similarity approach compares every contour in one image with every contour in

another image. The basic algorithm is as follows:

1. Each contour in the query image is compared against every contour in a database image to

find the contour with closest similarity. The closest similarity values are added to the running

total of similarity.

2. Step 1 is run again but in the other direction from database image to query image.

3. The two totals are added together to form the total similarity.

4. The total similarity is normalised by the total number of contour points (not contours) in

both query and database images.

The first step requires that a similarity value is computed for each contour pair in the two im-

ages. Individual contour similarity is unidirectional from small contour to large contour. Therefore

not all contour pairs will be compared in step 1 but will be after step 2 which repeats step 1 in

the opposite direction.

Contour similarity is the product of the similarity values computed for the four contour sum-

mary features: length, curvature, orientation, and position.

Cs = ls × cs × θs × ps (5.14)

The component similarities are described in the following subsections.
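As an illustration of steps 1 to 4, the Java sketch below (our own; the Contour placeholder holds the summarised features, and the directional contour similarity of Equation 5.14 is supplied by the caller) accumulates the closest-match similarities in both directions and normalises by the total number of contour points.

import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Sketch of the image-level Contour Similarity comparison (steps 1-4 above).
final class ContourSimilarityQuery {

    // Minimal placeholder for a summarised contour (centroid, orientation, curvature, length).
    static final class Contour {
        double x, y, orientation, curvature;
        int length;                                   // number of edge points
    }

    static double compare(List<Contour> query, List<Contour> database,
                          ToDoubleBiFunction<Contour, Contour> similarity) {
        double total = bestMatches(query, database, similarity)   // step 1: query -> database
                     + bestMatches(database, query, similarity);  // step 2: database -> query
        int points = 0;
        for (Contour c : query) points += c.length;
        for (Contour c : database) points += c.length;
        return total / points;                                    // steps 3 and 4
    }

    private static double bestMatches(List<Contour> from, List<Contour> to,
                                      ToDoubleBiFunction<Contour, Contour> similarity) {
        double sum = 0;
        for (Contour a : from) {
            double best = 0;
            for (Contour b : to) best = Math.max(best, similarity.applyAsDouble(a, b));
            sum += best;                              // the closest match joins the running total
        }
        return sum;
    }
}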

Length Similarity The length similarity calculation allows an effectual colinearity grouping to

be achieved. As mentioned above similarity is one directional, from the shorter contour to the

132

Page 155: Content-based Retrieval of Digital Video

(b)(a)

A

B

C

D

A

D

(c)

A

Dy

x

Figure 5.12: (a) Contours A, B, and C from the query image are colinear with contour D from the

database image. (b) Line segments reconstructed from only the contour summarisation information.

(c) Line segments rotated so that the longer line segment is parallel with the x axis.

longer. What we want to allow for is shorter contours to be considered similar to longer contours

as opposed to being considered very different. The purpose of this is to allow the similarities of

multiple shorter contours that line up against one longer contour (Figure 5.12 (a)) to be aggregated

to effectively give the same result as if the shorter contours had been grouped into one longer

contour and the two long contours compared. This is achieved by making the length similarity

simply the length of the shorter contour, which will be from the query image as the comparison is one-directional:

ls = lQ (5.15)

Curvature Similarity Curvature similarity is the absolute difference between curvatures sub-

tracted from 1:

cs = 1− |cQ − cD| (5.16)

Orientation Similarity The orientation similarity is calculated by first computing the orien-

tation distance. The orientation distance is the absolute difference between the two prevailing

orientations of the contours:

θd = |θQ − θD| (5.17)

The orientation distance may be larger than π/2 which is not possible with a circular range

of π, so if it is larger then it is subtracted from π. The orientation similarity is calculated by

normalising the resulting difference by π/2 and subtracting from 1:

θs = 1 − θd/(π/2)  (5.18)


Position Similarity It would be easy to think that the position similarity is simply the Euclidean

distance between the two contour centroids. However, the position similarity is the most difficult to

compute as it must not interfere with the colinearity grouping effect that allows smaller contours

to make up a larger contour. For example, in Figure 5.12 (a) contours A, B, and C are colinear

with contour D, however the Euclidean distance from their centroids indicates that A and C are

less similar than B, when in fact they all make up an equal contribution to the similarity to D

based on the colinearity grouping effect. Therefore the Euclidean distance of the centroids can not

be used to find the similarity between colinear contours. Instead, one aspect that remains constant

is the perpendicular distance between the contours.

Since only the contour summaries are stored, the perpendicular distance must be computed

from the centroid, prevailing orientation, and length of each contour. Firstly, two line segments

must be constructed that are centred at the centroid of the contour and extend half the length

from the centroid in opposite directions with the prevailing orientation of the contour (Figure 5.12

(b)). The following equations compute both points of each line segment from the query (Q) and

database (D) images:

xQ1 = xQ − (lQ/2)·cos θQ
yQ1 = yQ − (lQ/2)·sin θQ
xQ2 = xQ + (lQ/2)·cos θQ
yQ2 = yQ + (lQ/2)·sin θQ
xD1 = xD − (lD/2)·cos θD
yD1 = yD − (lD/2)·sin θD
xD2 = xD + (lD/2)·cos θD
yD2 = yD + (lD/2)·sin θD

The next step is to rotate both line segments around the centroid of the longer contour (D) as

in Figure 5.12 (c). This is achieved by first shifting all points so that they are relative to the longer

contour’s centroid:

xQ1 = xQ1 − xD

yQ1 = yQ1 − yD

xQ2 = xQ2 − xD

yQ2 = yQ2 − yD

xD1 = xD1 − xD

yD1 = yD1 − yD

xD2 = xD2 − xD


yD2 = yD2 − yD

Next all four points must be rotated by the negative prevailing orientation of the longer contour:

xQ1 = xQ1·cos(−θD) − yQ1·sin(−θD)
yQ1 = xQ1·sin(−θD) + yQ1·cos(−θD)
xQ2 = xQ2·cos(−θD) − yQ2·sin(−θD)
yQ2 = xQ2·sin(−θD) + yQ2·cos(−θD)
xD1 = xD1·cos(−θD) − yD1·sin(−θD)
yD1 = xD1·sin(−θD) + yD1·cos(−θD)
xD2 = xD2·cos(−θD) − yD2·sin(−θD)
yD2 = xD2·sin(−θD) + yD2·cos(−θD)

The calculations for the longer contour can be greatly simplified. Instead of computing its

rotated line segment and then inverse rotating back to the x axis, the longer contour can simply

be reconstructed oriented along the x axis:

xD1 = −lD/2, yD1 = 0, xD2 = lD/2, yD2 = 0

Figure 5.12 (c) shows the y distance between the two line segments. However, there is no

guarantee that line segment A will be parallel to line segment D therefore the y distance is taken

as the midpoint between the y positions of each end of line segment A:

∆y = (yQ1 + yQ2)/2  (5.19)

As was noted before the x distance can not simply be the distance between the two line segment

centroids as that does not take into account overlap. For example, if line segment D was half the

length than it is then there would be no difference in the distance between centroids yet there

is a great difference in overlap. Therefore, the best way to determine the horizontal distance of

line segment A from D is to measure the amount of extension from the end of D. The following

equations measure the extensions on both sides of line segment D:

elow = max(xQ1, xQ2)−max(xD1, xD2)

ehigh = min(xD1, xD2)−min(xQ1, xQ2)

Since we know that A is shorter than D, A can only extend beyond one end of D. The extension is the larger of elow and ehigh and becomes the horizontal distance, ∆x:

∆x = max(elow, ehigh) (5.20)


If A does not extend over the end of D on either side ∆x will be negative. In this case ∆x

should be set to zero. The overall distance is the Euclidean combination of ∆x and ∆y:

pd = √(∆x² + ∆y²)  (5.21)

The position distance must be normalised to a value between zero and one which is done

using the diagonal image size. Since the diagonal image size is large compared to many contour

extension lengths a linear normalisation would not allow small differences between contours to be

easily distinguished so a two step non-linear normalisation is performed consisting of two linear

normalisations. The first range of values maps to the 0 → 0.5 domain whilst the second range

of values maps to the 0.5 → 1 domain. The first range of values is from 0 to half the length of

the longer contour and the second range of values is from half the length of the longer contour to

infinity. The first range normalisation is represented by the following equation:

pd = 0.5 · (2pd/lD)  (5.22)

Whilst the second range normalisation is represented by the following equation:

pd = 0.5 + 0.5 · (pd − lD/2)/√(W² + H²)  (5.23)

where W and H represent the image width and height. The result of the normalisation is that positionally distant contours are not penalised too heavily, providing a mild degree of positional independence for contours that are not close together. The final positional similarity is:

ps = 1− pd (5.24)
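Putting the above steps together, the Java sketch below (our own illustration; the method and parameter names are ours) computes the position similarity from the two contour summaries, assuming the query contour is the shorter of the pair. It performs the shift and rotation of the query segment end points, clamps ∆x at zero when the query segment does not extend past either end of D, and applies the two-step normalisation of Equations 5.22 and 5.23.

// Sketch of the position similarity of Section 5.9.2, computed from the contour
// summaries alone (centroid, prevailing orientation, length).
final class PositionSimilarity {

    // Query (shorter) and database (longer) contour summaries: centroid (x, y),
    // prevailing orientation theta (radians) and length; w, h are the image dimensions.
    static double of(double qx, double qy, double qTheta, double qLen,
                     double dx, double dy, double dTheta, double dLen,
                     double w, double h) {
        // Query segment end points, shifted relative to the longer contour's centroid and
        // rotated by -thetaD so that the longer segment lies along the x axis.
        double[] p1 = rotate(qx - (qLen / 2) * Math.cos(qTheta) - dx,
                             qy - (qLen / 2) * Math.sin(qTheta) - dy, -dTheta);
        double[] p2 = rotate(qx + (qLen / 2) * Math.cos(qTheta) - dx,
                             qy + (qLen / 2) * Math.sin(qTheta) - dy, -dTheta);

        double deltaY = (p1[1] + p2[1]) / 2;                    // Equation 5.19

        // Extension beyond either end of the longer segment, which now runs from
        // -dLen/2 to +dLen/2 along the x axis (Equation 5.20, clamped at zero).
        double eLow  = Math.max(p1[0], p2[0]) - dLen / 2;
        double eHigh = -dLen / 2 - Math.min(p1[0], p2[0]);
        double deltaX = Math.max(0, Math.max(eLow, eHigh));

        double pd = Math.hypot(deltaX, deltaY);                 // Equation 5.21

        double norm = (pd <= dLen / 2)
                ? 0.5 * 2 * pd / dLen                             // Equation 5.22
                : 0.5 + 0.5 * (pd - dLen / 2) / Math.hypot(w, h); // Equation 5.23
        return 1 - norm;                                          // Equation 5.24
    }

    private static double[] rotate(double x, double y, double angle) {
        return new double[] { x * Math.cos(angle) - y * Math.sin(angle),
                              x * Math.sin(angle) + y * Math.cos(angle) };
    }
}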

5.9.3 Contour Similarity Experiments and Results

The contour similarity experiments used the same database and query images as the Hausdorff

distance experiments. The same three search images of Car, Wedding, and Bush were also used.

The results for the Contour Similarity experiments are shown in Figure 5.13 with the label

‘Contour Similarity’. For the Car query image the contour similarity approach successfully returned

both car images as the top two results which is clearly better than the Hausdorff distance. Of note

is that the Contour Similarity approach reverses the order of the Car images compared with the

colour histogram results in Figure 3.4 which could be explained by the emphasis on contours rather

than colour. Also of note is the similarity value 59 given to both Car images which indicates a lack

of ability in distinguishing the differences between images. However, there is a much greater range

in similarity values for the Contour Similarity approach compared to the Hausdorff distance.

The Contour Similarity approach also performed well with the Wedding image as it successfully

returned all wedding photos. The Contour Similarity approach also performed well against the


colour histogram results of Chapter 3 successfully returning all wedding photos that contain people

in them whereas some colour histogram results contained a wedding photo which does not contain

people (for example, HSV (18, 2, 2)F and HSV (6, 3, 3) in Figure 3.5).

The Contour Similarity approach performed poorly with the Bush image, which could only

be explained by its relatively strong weighting on position, even though the position similarity is

normalised to reduce this effect. The most similar photo to the Bush image was successfully returned

by the Hausdorff distance in the second position, however the Contour Similarity approach does

not return this image at all in the top ten. It should be noted that this test image contains a lot

of texture information and indicates that the Contour Similarity approach is not as well suited for

such queries.

5.9.4 Contour Similarity Discussion

The results show that the Contour Similarity approach performs quite well for comparing image

contours and performs significantly better than the Hausdorff distance and also compares well

with the colour histogram approaches except for images that are characterised primarily by texture.

However, like the Hausdorff distance, one of the major problems of the Contour Similarity approach

is its representation and querying overheads. It is not unlikely for 1000 contours to be extracted

from an image. Assuming five bytes are required to store the summarised features for each contour

then 5 KB are required for each image. This is four times less than the 20 KB required for the

Hausdorff distance but is still too large for a content-based retrieval system.

In terms of computational requirements, the Contour Similarity technique requires every con-

tour in each image to be compared. With 1000 contours in each image, 1,000,000 comparisons

would be required. The current Java implementation uses 89 arithmetic operations, 9 conditions,

and 4 trigonometric functions for each individual contour similarity calculation and takes approx-

imately half a second to compare two images. If a database contained only 1000 images then it

would take 8 minutes to compute a query. This is significantly better than the 4 hours required by

the Hausdorff distance but is still too slow for a content-based retrieval system.

The computational impact of the contour similarity approach can be reduced by caching simi-

larity results. A half matrix of pairwise image similarities can be stored, allowing a near instantaneous look

up. A 1000 image database would require 500,000 similarities to be stored, which is only 10% of

the storage requirements of the raw summarised contour data. However, the storage requirements

increase by n² as the number of images increases. A 10,000 image database requires 50,000,000

similarities to be stored which is five times more than the summarised contour data.

Like the Hausdorff distance, the computational and storage requirements of the Contour Sim-

ilarity approach make it unsuitable for today’s content-based retrieval systems. Two areas where

the approach can be improved is in providing a more compact representation and also more efficient

image comparison. In the next section a new technique for representing and comparing contours

Figure 5.13: Contour histogram and contour similarity results for the Car, Wedding, and Bush images. For each query image the rows show, in order, Histogram 4,2,2; Histogram 4,2,2 F; Histogram 4,2,2,2,2; Histogram 4,2,2,2,2 F; Histogram 8,4,2,2,2 F; and Contour Similarity. Histogram results indicate the number of bins for each dimension (orientation, length, curvature, x, and y) and whether a fuzzy histogram was used.


based on fuzzy histograms is presented and compared with the Hausdorff distance and Contour

Similarity approaches.

5.10 Contour Histograms

The primary problem of the Contour Similarity technique presented in the previous section is that

it uses a variable length representation. As a result the data can not be efficiently indexed using a

fixed-sized feature vector. In this section we present a novel technique of representing contours in

a fixed-sized feature vector using the fuzzy histograms introduced in Chapter 3.

It is not uncommon for edge and texture distribution to be represented in content-based retrieval

systems [16, 4, 46, 27] but the representation of contour distribution is rarely used. The difference

between contour distribution and edge and texture distribution is that contour distribution is a

representation of a higher-level feature.

A histogram consists of axes and each axis represents one feature. Section 5.9.1 identified the

four features of contours being the x and y centroid positions, length, prevailing orientation, and

curvature. Each feature becomes an axis in a five-dimensional histogram (the centroid is actually

two features, x and y). Each axis must be quantised into a number of bins to reflect the distribution

of features across each axis. The number of bins can affect the representation overhead as well as

the querying overhead. Selecting 5 bins per axis for a 5 dimensional histogram would result in a

total of 3,125 bins which would require approximately 3 KB of storage space, almost as much as

the raw summarised contours, and would require 3,125 comparisons for every image. Therefore the

number of bins and range of each bin must be carefully determined so that the total number of

bins is minimised without adversely affecting the matching results.

One of the problems with using a very small number of bins per axis is that histogram matching

techniques do not consider adjacent bins. So for example, if the length axis was divided into two

bins representing short and long contours and two contours fell just either side of the dividing

value between a long and short contour then the histogram matching technique would determine

that the two contours were completely different based on length. This problem has been addressed

in Chapter 3 through the novel approach of fuzzy histograms where instead of each bin being

incremented by one, an amount is added to adjacent bins proportional to the closeness of the value

to the centre of each bin.

Another problem facing contour distribution representation is that images generally consist of

many smaller contours and fewer long contours. A histogram matching technique would therefore

place more importance on the smaller contours and the longer contours would have little impact

on the matching results. However, this is the reverse of how human perception works, where longer

contours are given greater significance than shorter contours which generally represent texture

rather than shape. The solution to this problem is to increment each bin by the number of pixels

in the contour rather than by one for each contour. The result is that longer contours have equal

weighting with the shorter contours. Finally, the size of the source images may differ, so the contour histograms are normalised by the total number of pixels in the image, which should be proportional to the number of contours that can be extracted from an image of that size.

Table 5.1: Bin parameters.
Axis         min0      centre0  max0      min1   centre1  max1
orientation  i − π/8   i        i + π/8   (one bin for each i = 0, π/4, π/2, 3π/4)
length       0         5        10        10     15       ∞
curvature    0         0.1      0.25      0.25   0.4      ∞
x            0         0.25     0.5       0.5    0.75     1
y            0         0.25     0.5       0.5    0.75     1
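The construction described above can be sketched as follows (illustrative code only, not the implementation used in this thesis; the bin centres are hypothetical values loosely based on Table 5.1, and orientation wrap-around is ignored for brevity). Each contour votes with a weight equal to its length in pixels, the vote is spread over the two nearest bins on every axis, and the finished histogram is normalised by the image size:

```python
# Sketch: building a five-dimensional fuzzy contour histogram.
import itertools
import numpy as np

# Hypothetical bin centres (orientation, length, curvature, x, y).
BIN_CENTRES = [
    np.array([0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]),  # orientation
    np.array([5.0, 15.0]),                                 # length
    np.array([0.1, 0.4]),                                  # curvature
    np.array([0.25, 0.75]),                                # x centroid
    np.array([0.25, 0.75]),                                # y centroid
]

def fuzzy_weights(value, centres):
    """(bin index, weight) pairs for the two bin centres nearest to value."""
    if value <= centres[0]:
        return [(0, 1.0)]
    if value >= centres[-1]:
        return [(len(centres) - 1, 1.0)]
    hi = int(np.searchsorted(centres, value))
    lo = hi - 1
    w_hi = (value - centres[lo]) / (centres[hi] - centres[lo])
    return [(lo, 1.0 - w_hi), (hi, w_hi)]

def contour_histogram(contours, image_pixels):
    """contours: iterable of (orientation, length, curvature, x, y, n_pixels)."""
    hist = np.zeros([len(c) for c in BIN_CENTRES])
    for *features, n_pixels in contours:
        per_axis = [fuzzy_weights(f, c) for f, c in zip(features, BIN_CENTRES)]
        for combo in itertools.product(*per_axis):
            idx = tuple(b for b, _ in combo)
            weight = np.prod([w for _, w in combo])
            hist[idx] += weight * n_pixels     # weight each vote by contour length
    return hist / image_pixels                 # normalise by the image size
```

In this sketch a contour whose length falls between two bin centres contributes to both bins, which is the behaviour the fuzzy histogram is intended to provide.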

5.10.1 Contour Histogram Experiments

There were two purposes of the contour histogram experiments. The first was to determine the

ideal contour histogram construction by evaluating fuzzy and non-fuzzy histograms with different

numbers of bins. The second purpose was to determine how well the contour histogram represen-

tation performed against the Hausdorff distance and Contour Similarity approaches and whether

it would be suitable for use in a content-based retrieval system.

Since the total number of bins can increase rapidly with five axes the number of bins per axis

was kept as low as possible. For the orientation axis 4 and 8 bins were evaluated, with 4 bins

providing an orientation granularity of 45° and 8 bins providing an orientation granularity of 22.5°. For the length axis granularities of 2 and 4 bins were evaluated. The remaining axes were

limited to two bins each. Two bins were sufficient for the centroid position axes as it allows for

positions in the four quadrants of an image to be represented. We also evaluated the performance

of a contour histogram that did not represent the position of contours at all, thereby making it a

translation invariant representation. By removing both position axes the size of the histogram can

be reduced by a factor of 4. The curvature axis used two bins allowing contours to be classified

as straight or curved. The bins’ ranges and centres (used for the fuzzy histograms) are shown in

Table 5.1.

The image database used in the Hausdorff distance and Contour Similarity experiments was

used to evaluate the contour histogram performance. Like the colour histogram experiments of

Chapter 3 the histogram intersection method was used to compare histograms. The three search

images of Car, Wedding, and Bush were also used.


5.10.2 Contour Histogram Results

The results for the various contour histogram experiments are shown in Figure 5.13. For the Car

query, the other two car pictures were only returned as the top two results for histograms that

included axes that represent contour location showing that location is an important feature for

performing contour queries. However, the fuzzy histogram (4,2,2F) performed relatively well with

no location information returning the two car images in the top three results. No improvement

was gained in increasing the number of bins in the orientation and length axes from (4,2,2,2,2) to

(8,4,2,2,2). There was little difference between the contour histograms and the Contour Similarity

approach except that the order of the car images for the contour histogram approach is the same

as the colour histogram approaches presented in Chapter 3. Also the contour histograms provided

a greater dynamic range of similarity measures with values ranging from 67 to 83 for the top two

car images as opposed to 59 being given to both car images by the Contour Similarity technique.

The Wedding query performed poorly when the contour location axes were used, with two

non-wedding photos entering the top ten results indicating that sometimes contour location is

important for contour similarity whilst other times it is not. Interestingly these problems were

fixed when a fuzzy histogram was used. Once again there was no benefit in increasing the number

of orientation or length bins.

The Bush query was difficult to evaluate as many of the images returned contain some ‘bush’

in them. The (4,2,2F) fuzzy histogram performed better than the (4,2,2) non-fuzzy histogram with

the third image being more representative of bush. Including the centroid (4,2,2,2,2) didn’t seem

to improve the results however applying the fuzzy histogram improved the fifth result image which

contains more bush than the non-fuzzy histogram. Increasing the number of bins in the orientation

and length axes returned the same images but two of them were arguably in better positions. Com-

pared with the Contour Similarity approach all contour histograms performed better returning the

bush image compared with the car images returned by the Contour Similarity approach. However,

the contour histograms did not perform as well as the Hausdorff distance which was able to return

the bush image at position 2 rather than 4.

5.10.3 Contour Histogram Discussion

The contour histogram experiments indicate that incorporating the contour centroid can improve

results but only if the fuzzy histogram is also used. The benefits shown by using the fuzzy histogram

are explained by the low number of bins used to represent each axis which is where the fuzzy

histogram technique excels.

Increasing the number of bins used to represent the contour orientation and length did not

improve the results significantly. The total number of bins required for the (8,4,2,2,2) histogram

is 256, which is considerably larger than the 64 bins required for the (4,2,2,2,2) histogram. Being four times smaller, the 64 bin histogram also requires four times less storage and processing.

Figure 5.14: Combined fuzzy HSV (3,2,2F) and fuzzy contour (4,2,2F) histogram results for the Car, Wedding, and Bush images.

Compared with the Hausdorff distance the contour histograms performed significantly bet-

ter except for the Bush image. Compared with the Contour Similarity approach the (4,2,2,2,2F)

histogram provided better results, especially with the Bush query image, and significantly lower

representation and querying overhead. Assuming one byte is used to represent each histogram bin

then only 64 bytes are required to represent the contours in an image using contour histograms

compared with the 5 KB for the Contour Similarity method. In addition only 64 comparisons are

required per image compared with 1,000,000. Also since the histogram representation is a fixed

sized feature vector it could also be indexed using a multi-dimensional indexing technique such as

R-trees [28].

Based on these results the (4,2,2,2,2F) fuzzy histogram combined with histogram intersection

provides the most efficient form of representation and querying of contours in a content-based

retrieval system for comparing whole images.

5.11 Combined Contour and Colour Histograms

In this section we look at combining the contour and colour approaches from this chapter and

Chapter 3. The similarities are combined using multiplication:

S = S_{colour} \times S_{contour}    (5.25)
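A minimal sketch of this combination, assuming both similarities are produced by histogram intersection (illustrative code only; the helper names are hypothetical):

```python
# Sketch: combined colour and contour similarity (Equation 5.25).
import numpy as np

def histogram_intersection(h_query, h_target):
    """Normalised histogram intersection similarity (higher means more similar)."""
    return np.minimum(h_query, h_target).sum() / h_target.sum()

def combined_similarity(colour_q, colour_t, contour_q, contour_t):
    """S = S_colour x S_contour."""
    return (histogram_intersection(colour_q, colour_t) *
            histogram_intersection(contour_q, contour_t))
```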

As in the other experiments the same image database and query images are used. The smallest

colour histogram that gave the best results from Chapter 3 is the HSV (6,3,3F) fuzzy histogram

which uses 54 bins. The best contour histogram from this chapter is the (4,2,2,2,2F) fuzzy histogram

with 64 bins. However, in Figure 5.14 two smaller histograms are combined, the HSV (3,2,2F) fuzzy

colour histogram and the (4,2,2F) fuzzy contour histogram.


The results are as good if not better than the best individual colour and contour histogram

results and certainly better than the component histograms that make up the combined result.

However, the combined number of bins is only 28 which is less than either the 54 bins of the best

colour histogram or 64 bins of the best contour histogram. The result is lower storage requirements,

more efficient query computation, and better ordering of results.

The benefits of the combined histograms approach can be attributed to both colour and contour

information being used, which human perception also uses, but also to the fuzzy histogram approach

introduced in Chapter 3. The fuzzy histograms allow fewer bins to be used, down to

a minimum of two per axis. As can be seen in these results four axes only need two bins whilst

the remaining two axes use three and four bins. Since the number of bins per axis are multiplied

together to achieve the total number of bins, a reduction in the number of bins per axis can benefit

the storage and computational requirements significantly.

5.12 Conclusion

This chapter has taken the tuned edge results of the last chapter and investigated the best way

to extract contours from this edge information and to represent them in a content-based retrieval

system. An edge linking scheme has been devised that can take advantage of the multi-oriented

edge results of the last chapter and produce better contours than the conventional local processing

approach whether combined with the new edge detector or a conventional edge detector such as

the Sobel operator. The novel edge linking scheme begins to show the advantages of the single

pixel, non-ambiguous, multi-orientation edge detector of the last chapter, and fulfils the goals in

extracting the desired contours. The new edge linker takes advantage of multi-oriented edge input

by allowing contours of different orientations to cross at the same pixel and also takes into account

the relative location of pixels with respect to the orientation of adjacent edge points.

This chapter also addressed the issues of representing contours as they contain variable sized

high level information in contrast to the fixed sized feature vectors required by common content-

based retrieval systems. Two novel approaches were investigated. One attempted to reduce the high-

level information into a fixed size feature vector using fuzzy histograms whilst the other attempted

to compare all contours using a brute force method. Both approaches require a summarisation of

contour features. We introduced four contour features and techniques for determining them. Only

four features are required because the edge linker and edge detector can ensure that contours do

not contain sharp angles and therefore can be represented as straight or slightly curved lines.

The contour histogram approach benefits from the fuzzy histograms presented in Chapter

3 allowing small, fast histograms to be constructed that provide good results. The brute force

Contour Similarity matching approach is novel in that it allows contours to be compared whilst

preserving collinearity grouping effects that occur in human perception. Both techniques performed significantly better than the existing Hausdorff distance; however, contour histograms were much


more suitable for content-based retrieval systems due to the reduced computational and storage

requirements.

Finally, the colour and contour histogram approaches were combined allowing even fewer total

bins to be used than the best individual colour and contour histograms. The result is an extremely

compact feature vector of only 28 histogram bins that performs as well, if not better, than either

individual colour or contour histogram of 54 and 64 bins respectively.


Chapter 6

Video

The feature extraction techniques presented up to this point deal with the spatial characteristics of

images. Video, which is composed of individual images, adds the temporal dimension. In this chap-

ter the temporal aspects of video are identified and techniques for extracting these characteristics

are presented and evaluated.

Video may exhibit the following temporal properties:

• Animation (Motion, Deformation, Lighting/intensity changes, Special effects)

• Temporal structure (Frames, Shots, Cuts, Scenes, Acts, Episodes)

Video generally consists of a scene of physical objects whether the objects are real objects

captured by a video camera or artificial objects drawn by an artist such as those in cartoons and

3D animations. The primary difference between using images and videos in representing a scene of

physical objects is that video can portray the movement of objects. In addition to motion, some

objects may also deform such as a ball bouncing or a person performing acrobatics. Other changes

include changes in lighting if lights are turned on, off, or dimmed and intensity changes that may

occur by walking into shadows. Each of these effects can only be seen over a period of time and

to provide the illusion of smooth motion should occur over many frames with a small time period

between each frame. These properties can all be classified as animation properties of video. One

additional form of animation is special effects. A special effect might be used between two camera

shots. Some examples of special effects include fades, wipes, and a rotating 3D cube. These effects

have no real parallel in the physical world but commonly occur in video sequences to bridge two

shots.

The second type of temporal characteristic that a video may exhibit is temporal structure.

Since time is one-dimensional, and a video sequence such as a story or documentary is intended to be viewed in one direction, there is a specific ordering of

the contents of the video sequence. The simplest ordering is obvious and that is that each frame


follows the previous frame. However, a video sequence can be composed of a multi-level hierarchy

of temporal objects. The temporal video hierarchy is shown in Figure 6.1.

This chapter will focus primarily on the structural aspects of video sequences as opposed to

animation properties.

6.1 Video Structure

The temporal structure of a video is an important aspect for video retrieval as it provides a

logical hierarchy that allows the user to drill down to find the target object. Figure 6.1 classifies

the hierarchical levels as either being syntactic or semantic. Syntactic levels should be extractable

automatically with little domain knowledge. Semantic levels on the other hand require a knowledge

base or annotation of the content for the levels to be appropriately constructed. An act or episode

for example has little to do with the physical characteristics of the video and requires a semantic

knowledge of film scripts. Syntactic levels may also require some domain knowledge but it is

generally small. For example, for a cut effect that fades between one shot and the next the CBR

system must be aware that the fade effect does exist and may occur in the video sequence. However,

the domain knowledge required for the syntactic levels is usually limited to a relatively fixed set

and can more easily be generalised than for that of the semantic levels and is therefore the focus

of this chapter. In this section the syntactic temporal levels of Figure 6.1 are discussed in more

detail.

6.1.1 Frames

All videos consist of frames and are designed to be played back at a preset number of frames

per second (fps). Extracting individual frames is not a challenge for video retrieval research as

video decoders are designed to be able to present each frame from the video sequence. Broadcast

quality videos generally have quite high frame rates, ranging from 24-60 fps, resulting in a large

number of frames. For example, a 30 minute film encoded at 30 fps will contain 54,000 frames.

Since the time difference between frames is small, often the content between two consecutive frames

is very similar. The small time period is used simply to provide the smoothest effect of motion. A

larger time period could be used, such as 0.5s, but the smoothness of the motion would be lost.

Since a video retrieval system is primarily interested in the contents of frames rather than the

smoothness of presentation, a video retrieval system does not need to represent all of the frames

individually from the original video sequence. Instead frames can be sampled with a larger time

period or alternatively higher level structural aspects can be extracted such as those described in

the following subsections.

Figure 6.1: Temporal structure of a video sequence (from top to bottom: Video, Episodes, Acts, Scenes, Shots and Cuts, Camera Operations, Frames; the levels are classified as either semantic or syntactic).

Figure 6.2: Optical flow. (a) Pan and (b) Zoom.

6.1.2 Camera Operations

Camera operations include panning, tilting, dollying, and zooming. One single camera shot may

consist of a number of camera operations following one after the other. Camera operations result

in global optical flow that can be detected using optical flow analysis. Optical flow analysis results

in an array of motion vectors for the video. Different motion vector patterns are generated by

different camera operations such as those shown in Figure 6.2. Determining the type of camera

operation from the motion vectors is relatively simple; however, its performance can be affected by

motion within the scene.

6.1.3 Shots and Cuts

A shot is a single continuous camera recording. A shot may consist of multiple camera operations.

Shots are separated by cuts between two shots. There are two types of cuts, abrupt and gradual.

An abrupt cut simply has a frame from the previous shot followed by a frame from the next shot.

Gradual cuts involve a transition from one shot to the next over a number of frames. A gradual

cut involves some special effect such as a fade, dissolve, or wipe where most frames contain some

pixels from each shot. A gradual cut may also contain other objects that are not part of either shot

to add to the effect. Shots are generally extracted by detecting the cuts between shots. Abrupt

cuts are relatively easy to detect as almost all of the pixels are likely to change in value between

two frames. Gradual cuts are more difficult to detect as they occur over a number of frames so the

amount of change between frames is smaller and each frame is likely to contain pixels from both

shots.


6.1.4 Scenes

A scene is a physical or virtual location. A scene may consist of many camera shots of the same

scene but from different locations and orientations. The change between two scenes can not be

detected by simply analysing individual pixels as such a change in pixel value may only be a

cut between two shots of the same scene. Scene change detection requires features from frames

that will change less between shots and more between scenes. Alternatively, rather than looking

for the changes between scenes, scenes can be constructed by grouping shots that have similar

characteristics. An interesting aspect of a scene is that it may appear again later on in the video.

This allows for two levels of scenes, those that consist of adjacent shots and those that consist of

disparate shots.

6.2 Video Retrieval Requirements

For video retrieval, the more levels that can be represented from Figure 6.1 the easier it will be for

the user to browse or query an entire video sequence. However, for the purposes of this research we

are primarily interested in the non-semantic levels which can be automatically extracted without

human intervention. In addition, there is no research challenge in extracting frames as the frame

is the basic unit that a video decoder provides. Therefore, the levels that we are interested in

extracting are camera operations, shots, cuts, and scenes, where scenes may be further subdivided

into groupings of adjacent shots and disparate shots.

For each level of Figure 6.1 a frame from the video can be extracted that best represents the

grouping, this frame is known as a representative frame or R-frame. R-frames are used for thumb-

nails when browsing video sequences. R-frames can also be indexed and placed into a content-based

retrieval database for content-based querying. Therefore, the second challenge after extracting the

video structure is to extract appropriate R-frames. The camera operation level consists of one

discrete camera operation, such as a pan from left to right. In this case a suitable R-frame that

could represent the camera operation would involve both the start and end frames. However, there

is no need to include both start and end frames if multiple camera operations are concatenated

against each other. In this case it is suitable to simply store the first frame of every discrete camera

operation.

For shots, one R-frame must be selected from the R-frames of the camera operations. Usually

it is acceptable to choose the middle R-frame. Cuts don’t usually need to be presented to the user

and hence don’t require R-frames to be extracted. Scenes, like shots can simply choose the middle

R-frame of its composing shots however depending on the techniques used to identify a scene it

may be best to choose an R-frame that best represents the features of the scene.

For this research, time has not allowed for the identification of camera operations therefore we

will focus on extracting shots and scenes and the R-frames for those groupings. In the following


sections we analyse and compare techniques for extracting shots and scenes and their associated

R-frames.

6.3 Shot Identification

Shots are bounded by cuts, therefore shot identification involves identifying the cuts between shots.

Detecting cuts involves comparing the pixels between adjacent frames. In this section a number

of cut detection methods are presented along with two new techniques. These are compared and

evaluated.

6.3.1 Template Matching

Template matching is the oldest and easiest method for detecting cuts between frames [55]. It

involves comparing the values of corresponding pixels between adjacent frames. Techniques such

as absolute difference and mean squared error (MSE) can be used. A graph of intensity values for

the absolute difference between frames for the first 2 minutes of the film ‘Spy Game’ is shown in

Figure 6.3(a). The movie is sampled at 15 fps resulting in 1800 frames. For comparison, the ground

truth cuts are displayed below each graph. The absolute difference between pixels is summed

between frames using the following formula:

d(I_i, I_j) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} |I_i(x, y) - I_j(x, y)|    (6.1)

where M and N are the dimensions of the image I. Figure 6.3(b) shows the intensity values for

the differences between frames computed using the MSE difference:

d(I_i, I_j) = \frac{1}{M \times N} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} \left( I_i(x, y) - I_j(x, y) \right)^2    (6.2)
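The two measures can be sketched as follows (illustrative NumPy code operating on greyscale frames, not the original implementation):

```python
# Sketch: template matching frame differences (Equations 6.1 and 6.2).
import numpy as np

def absolute_difference(frame_i, frame_j):
    """Sum of absolute pixel differences between two frames (Equation 6.1)."""
    return np.abs(frame_i.astype(np.int32) - frame_j.astype(np.int32)).sum()

def mse_difference(frame_i, frame_j):
    """Mean squared error between two frames (Equation 6.2)."""
    diff = frame_i.astype(np.float64) - frame_j.astype(np.float64)
    return (diff ** 2).mean()

# A cut is declared wherever the difference between adjacent frames exceeds a threshold.
```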

Both Figures 6.3(a) and (b) are very similar with the absolute difference method being slightly

less sensitive to inter-frame noise. Cut detection is performed by thresholding the intensity values.

However, it can be seen from Figure 6.3 that selecting a robust threshold value would be difficult

as some cuts could be missed or erroneous cuts could be included. This is because pixel-based cut

detection techniques are very sensitive to noise and motion [55]. Comparing the average colour

of large blocks of pixels would reduce the effect of noise and motion. Nagasaka and Tanaka [122]

partitioned frames into 4 × 4 equal sized windows and used the difference between the average

colour of each block. Results of the block-based template matching method are shown in Figure

6.3(c). It can be seen that the block method is less affected by motion however the cuts are also

harder to identify and no fixed threshold is able to identify them.


Figure 6.3: Intensity graphs for the difference between frames using template matching. (a) Abso-

lute difference between frames, (b) MSE difference between frames, (c) Difference between average

colour of 80× 60 pixel blocks.


6.3.2 Histogram

To lessen the impact of motion, the overall distribution of colour within a frame can be compared

rather than pixel values. Methods for representing colour distributions have been presented in

Chapter 3. In particular a common method for representing colour distributions is the colour

histogram. In Chapter 3 we focussed on the colour space and bin sizes however in this section we

will primarily focus on histogram comparison methods.

Common methods for comparing histograms include the absolute difference, Euclidean distance,

χ2 distance, and intersection. The Euclidean distance has the same form as the mean squared error

function and is often used to compare feature vectors in a multidimensional space. The Euclidean

distance is applied to colour histograms using the following formula:

d^2_{RGB}(I_i, I_j) = \sum_{k=1}^{n} \left[ \left( H^r_i(k) - H^r_j(k) \right)^2 + \left( H^g_i(k) - H^g_j(k) \right)^2 + \left( H^b_i(k) - H^b_j(k) \right)^2 \right]    (6.3)

An RGB colour histogram with eight bins on each axis for a total of 512 bins was used to detect

cuts in the ‘Spy Game’ video sequence. Applying the Euclidean distance to the colour histogram

produces the graph shown in Figure 6.4(a). It can be seen that the colour histogram approach is

much less sensitive to motion than the template matching techniques. The Euclidean distance is

not always suitable for comparing histograms as the histogram space is not Euclidean. Nagasaka

and Tanaka [122] used the χ2-test equation for histogram comparison:

d(H_i, H_j) = \sum_{k=1}^{n} \frac{\left( H_i(k) - H_j(k) \right)^2}{H_j(k)}    (6.4)

However, Zhang et al. [123] found that not only did the χ2 distance increase the difference between

camera breaks, it also increased the difference between frames containing motion. The χ2 distance

applied to the ‘Spy Game’ sequence is shown in Figure 6.4(b). As can be seen in the graph,

some camera breaks are represented very distinctly whilst others are barely visible along with the

non-camera break frame differences.

The histogram intersection technique [21] is a comparison measure designed specifically for histograms.

Histogram intersection is the sum of the minimum value of every corresponding pair of bins in each

histogram:

d(I_i, I_j) = \frac{\sum_{k=1}^{n} \min\left( H_i(k), H_j(k) \right)}{\sum_{k=1}^{n} H_j(k)}    (6.5)

Since the intersection provides a measure of similarity rather than difference the complement

of the intersection must be found. Two identical histograms will have an intersection value that is

equal to the total number of pixels in a frame. Therefore the maximum intersection value is the

number of frame pixels and the complement to the intersection can be found by subtracting the

intersection from the number of frame pixels. Figure 6.4(c) shows that the intersection comparison

method is able to provide more even cut peaks than the other methods.
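The three comparison measures can be sketched as follows (illustrative code only; the histograms are treated here as flat vectors rather than separate r, g, and b channels, the small epsilon guarding against empty bins is an added assumption, and the intersection is complemented after normalisation rather than against the raw pixel count):

```python
# Sketch: histogram comparison measures for cut detection (Equations 6.3-6.5).
import numpy as np

def euclidean_distance(h_i, h_j):
    """Sum of squared bin differences (Equation 6.3, flattened histogram)."""
    return ((h_i - h_j) ** 2).sum()

def chi_squared_distance(h_i, h_j, eps=1e-9):
    """Chi-squared distance between histograms (Equation 6.4)."""
    return ((h_i - h_j) ** 2 / (h_j + eps)).sum()

def intersection_difference(h_i, h_j):
    """Complement of normalised histogram intersection (Equation 6.5)."""
    return 1.0 - np.minimum(h_i, h_j).sum() / h_j.sum()
```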


Figure 6.4: Intensity graphs for the difference between frames using histogram matching. (a) Eu-

clidean difference between frames, (b) χ2 difference between frames, (c) Histogram intersection.


6.3.3 Optical Flow

The biggest challenge in cut detection is misdetection, that is, incorrectly identifying a cut between

two frames when one doesn’t exist. Misdetection most commonly occurs due to motion, whether

it be caused by the camera or objects within the scene. One technique to avoid misdetection is

to identify the motion within the scene. Optical flow analysis extracts motion vectors between

frames. Figure 6.2 gives an example of the types of motion vectors that can be caused by camera

operations. Optical flow can be used to detect cuts by identifying a cut when there is a change in

the consistency of optical flow vectors along with a change in the colour between frames.

Optical flow analysis is often performed by breaking an image up into blocks and determining

the motion vector for each block. A block from the first frame is compared with blocks within a

fixed sized neighbourhood in the next frame. The block in the second frame with the minimum

MSE between the two blocks is chosen as the destination of the block in the first frame. The two

block locations are used to calculate the motion vector. For cut detection, both the minimum MSE

and motion vector can be used to detect a cut. Adding all minimum MSEs for a frame results in

a value that, when low, indicates that both frames are very similar in content whether there is

motion between the two frames or not.

A brute force optical flow analysis technique was implemented where the source block is com-

pared with every possible block location within a fixed-sized window in the next frame. A block size

of 8×8 pixels was used with a search window of 16×16 pixels resulting in 16,384 computations per

8 × 8 block. There are more efficient motion compensation techniques such as logarithmic search

however these are less accurate [124].
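An illustrative sketch of this brute force block matching (not the original implementation; the 16 × 16 search window is interpreted here as 16 candidate offsets per axis, which reproduces the 16,384 computations per block quoted above):

```python
# Sketch: brute force optical flow frame difference using 8x8 blocks.
import numpy as np

def min_mse_for_block(prev, curr, bx, by, block=8, search=8):
    """Minimum MSE for the block at (bx, by) over a 16 x 16 set of candidate offsets."""
    h, w = prev.shape
    src = prev[by:by + block, bx:bx + block].astype(np.float64)
    best = np.inf
    for dy in range(-search, search):        # 16 candidate vertical offsets
        for dx in range(-search, search):    # 16 candidate horizontal offsets
            x, y = bx + dx, by + dy
            if 0 <= x and 0 <= y and x + block <= w and y + block <= h:
                cand = curr[y:y + block, x:x + block].astype(np.float64)
                best = min(best, ((src - cand) ** 2).mean())
    return best

def optical_flow_difference(prev, curr, block=8, search=8):
    """Sum of per-block minimum MSEs; a low value means the frames match well."""
    h, w = prev.shape
    return sum(min_mse_for_block(prev, curr, bx, by, block, search)
               for by in range(0, h - block + 1, block)
               for bx in range(0, w - block + 1, block))
```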

Motion compensation is more accurate when every frame is used so that large movements are

not missed. For the best results, a 30 fps sampling rate was used for the optical flow analysis test

as opposed to the 15 fps used for the other techniques. The results of the minimum MSE between

frames from the optical flow analysis for the ‘Spy Game’ test sequence are shown in Figure 6.5.

Figure 6.5 shows that the optical flow analysis produced less noise but also produced more

variation in the peaks. The advantage of using optical flow analysis is that the motion vectors can

also be used to extract camera operations. However, optical flow analysis is often very processor

intensive and must be performed on every frame for reliable results. A cheaper way of analysing

motion within the video is to use the motion vectors present in the compressed sequence which is

discussed in the next section.

6.3.4 Compressed Sequences

Since video data is of a high bandwidth and is usually long in duration some form of lossy compres-

sion is almost always used for storing video sequences. In addition, the highest quality compression

algorithms are generally chosen that will allow the video data to be decoded at the correct playback

rate. Therefore the shortest time for all of the frames to be acquired from the video sequence will

be not much shorter than the playback time of the video sequence itself. Since schemes such as template matching are relatively simple and fast it would be better if the video frames did not have to be decoded but instead the cut detection algorithms could operate directly on the compressed data.

Figure 6.5: Intensity graph for the difference between frames using optical flow analysis.

A commonly used video format is MPEG [125] which is used for DVD, VCD, and other appli-

cations. MPEG and similar formats store three types of frames: I-frames, P-frames, and B-frames.

Each frame is encoded in a different way and for a different purpose. I-frames are coded indepen-

dently of other frames and use a compression scheme similar to JPEG where the image is divided

into 8 × 8 pixel blocks which are then DCT coded. I-frames are used by MPEG to synchronise

the decoder. P-frames code the difference between the current frame and the previous frame using

motion vectors. Each motion vector represents the translation of a 16×16 pixel block between two

frames. B-frames are similar to P-frames but perform bi-directional prediction including forward

prediction to the next frame. P-frames and B-frames generally appear more regularly in a video

sequence because they provide greater compression.

DCT Coefficients

DCT coefficients represent the spatial frequency of an image block. The DC coefficient is generally

stored using differential pulse code modulation (DPCM) whilst the AC coefficients are quantised,

zig-zagged, and entropy coded. The DC coefficient can be used by itself for performing template

matching [126]. Even though the spatial resolution is reduced by using the DC coefficient perfor-

mance is not affected. In fact the lower spatial resolution makes it less sensitive to object motion.

Yeo and Liu [126] also tested this method using the colour histogram and found it to be less


sensitive to object motion but more expensive to compute.

Another approach by Arman et al. [47] used a subset of AC coefficients for comparison between

each block. The advantage of this technique is that the texture between frames can be analysed.

However, more processing is involved because the AC coefficients must be decoded.

Motion Vectors

The motion vectors of the P and B-frames can also be used to detect scene changes in a similar

way to optical flow but the computationally intensive optical flow analysis can be avoided. Zhang

et al. [45] used a count of nonzero motion vectors to detect scene changes. Motion vectors will be

coded if a suitable trajectory is found, however, if none is found then the motion vector isn’t coded

for that block and some other scheme is used such as DPCM. Therefore, a cut can be detected

when there are very few valid motion vectors between frames.

Compression Issues

Using compressed sequences for cut detection can be fast, however there are also a number of

problems with using them. Firstly, the ability for a scene change to be detected depends largely on

how the video was stored. MPEG is a flexible format which allows for a variable number of I, P,

and B frames to be stored. In fact some video sequences may consist of only I-frames which makes

it difficult for schemes which are dependent on motion vectors. Schemes that use DC coefficients

are also affected by reduced temporal resolution because of the lower occurrence of I-frames in the

video sequence which may result in a cut being undetected. Finally, motion compensation schemes

are designed to efficiently code the video sequence rather than accurately represent the motion

within a video sequence. This makes motion vectors unreliable. Furthermore motion compensation

tends to be unreliable and unpredictable for gradual transitions [55].

6.3.5 X-ray

Optical flow analysis can be useful for detecting cuts as well as the camera operation however it

is exceedingly slow. Tonomura et al. [18] have proposed a method for detecting cuts and camera

operations using a technique called X-rays that is faster than traditional optical flow analysis.

X-ray images simplify the motion estimation search by only requiring the search to be performed

in one dimension. An X-ray image is a representation of the movement within an image along the

x and y axes. As shown in Figure 6.6 the X-ray image consists of two subimages which represent

motion along the x axis and motion along the y axis. Optical flow analysis can then be performed

on these images in only one dimension, reducing the complexity from W × H × N² to (W + H)N

where N is the width of the search window and W and H are the width and height of the images.

Figure 6.6: (a) X-ray image of ‘Spy Game.’ (b) Fast X-ray image of ‘Spy Game.’

X-ray image analysis is a combination of template matching and optical flow analysis (Figure

6.7). In the first stage of the process the difference image between two adjacent frames is calculated.

Then the difference image is processed separately for the x and y axes. For the x axis, the pixels

in each column are summed into a single row and stored as the next column in the x-t X-ray

image. Similarly, the y axis is processed by summing the rows and storing the result in a column

which becomes the next column in the y-t X-ray image. The two images are then combined to

form the final X-ray image. An example of an X-ray image for the first two minutes of the ‘Spy

Game’ sequence at 15 fps is shown in Figure 6.6 (a). The black bars above and below the y axis

X-ray indicate the black bars above and below the movie frame due to the letterboxing effect of

widescreen movies presented in 4:3 aspect ratio. The short vertical white lines at the bottom of

the X-ray image are the subtitles of the film whilst the vertical black lines under the X-ray image

represent the ground truth cuts.
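The construction of the X-ray columns for one pair of adjacent greyscale frames can be sketched as follows (illustrative code only):

```python
# Sketch: one time step of the X-ray image construction.
import numpy as np

def xray_columns(prev, curr):
    """Return the next column of the x-t X-ray image and of the y-t X-ray image."""
    diff = np.abs(curr.astype(np.int32) - prev.astype(np.int32))
    x_column = diff.sum(axis=0)   # each column of pixels summed into a single row (length W)
    y_column = diff.sum(axis=1)   # each row of pixels summed into a single column (length H)
    return x_column, y_column

# Stacking these columns over time yields the x-t and y-t X-ray images of Figure 6.6(a).
```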

X-ray images are not useful unless optical flow analysis is performed. For example if only the

X-ray lines were compared then the aggregate of each pixel in an X-ray would form the average

difference between adjacent images which is identical to basic template matching. This can be seen

in Figure 6.8 (a) where the only difference to Figure 6.3 (b) is that the X-ray results are quantised

due to being stored in an 8-bit per channel image.

Figure 6.7: X-ray process (y-t and x-t sliced images, edge detection, and summation along each axis produce the X-ray images).

6.3.6 Fast X-ray

Rather than using X-ray images to analyse camera motion we modified the technique to produce

an enhanced template matching method that is faster than the conventional X-ray technique. Our

enhancement is to compute the average value for every column and row in a movie frame before

finding the difference between two frames. The difference then only needs to be performed between

two frames on W +H pixels. Since the average is calculated for every column and row, the technique

has similarities to block-based template matching techniques but is able to separate out horizontal

and vertical motion. The other major deviation from the standard X-ray technique is the use of

MSE between frames rather than absolute difference.
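A minimal sketch of the Fast X-ray frame difference (illustrative only; the variable names are hypothetical):

```python
# Sketch: Fast X-ray difference between two adjacent greyscale frames.
import numpy as np

def fast_xray_profile(frame):
    """Average of every column and every row of a frame: W + H values in total."""
    f = frame.astype(np.float64)
    return np.concatenate([f.mean(axis=0), f.mean(axis=1)])

def fast_xray_difference(prev, curr):
    """MSE between the column/row average profiles of two adjacent frames."""
    return ((fast_xray_profile(curr) - fast_xray_profile(prev)) ** 2).mean()
```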

The resulting X-ray image is shown in Figure 6.6(b). The higher intensity values are due to

using the MSE rather than the absolute difference. The frame difference results for cut detection

are shown in Figure 6.8 (b). As can be seen the peaks that represent cuts have less variation in

height than any of the other techniques.

Figure 6.8: Intensity graphs for the difference between frames using X-rays. (a) Aggregate X-ray difference, (b) Fast X-ray difference, (c) Peak detection of Fast X-ray intensities.

Table 6.1: Peak detection convolution kernel
-0.5   1.0   -0.5

One problem with the intensity graphs is that it is difficult to see the difference between peaks

that represent cuts and overall intensity differences within a frame that don’t represent cuts. Since

an abrupt cut should be represented by an abnormally high peak it should be possible to use signal

processing techniques to identify these high frequency spikes amongst the low frequency noise. A

small linear filter kernel (shown in Table 6.1) was convolved with the intensity data to enhance the peaks. To maintain the importance of high peaks, the original peak data is multiplied again with the convolved peak data. The results of applying the peak detector are shown in Figure 6.8

(c). Even though the peak heights vary more than the raw data they are clearly separated from the

non-peak data. A threshold line can easily be drawn through the peaks without hitting non-peak

data.
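The peak detection step can be sketched as follows (illustrative code, not the original implementation):

```python
# Sketch: peak detection applied to a 1-D sequence of frame differences.
import numpy as np

def detect_peaks(differences):
    """Enhance abrupt-cut peaks amongst low frequency noise."""
    kernel = np.array([-0.5, 1.0, -0.5])                      # Table 6.1
    convolved = np.convolve(differences, kernel, mode="same")
    return np.asarray(differences) * convolved                # re-weight by the original peaks

# Cuts are then declared wherever the enhanced signal exceeds a threshold.
```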

6.3.7 Colour + Contour

So far in this chapter we have investigated using low-level features for video segmentation. However,

the physiologically-based feature extraction process of Figure 1.4 implies that temporal segmenta-

tion relies on higher level features such as contours. Using higher level features such as contours

allows for more variation between frames caused by motion since even with a lot of motion the

moving contours will remain much the same. Using the colour representation of Chapter 3 and

the contour representation of Chapter 5, higher level features were extracted from video frames of

the ‘Spy Game’ sequence. Since contour processing is a slow process (approximately 1 minute per

frame) the video was sampled at the lower rate of 5 fps. The colour + contour differences between

frames are shown in Figure 6.9(a). The graph shows that the peaks that represent cuts are not

easily distinguished from the peaks that represent motion. Applying the peak detector produces

the graph in Figure 6.9(b). The peaks are now more easily distinguished from the noise.

Figure 6.9: Intensity graphs for the difference between frames using Colour + Contour Histograms. (a) Colour + Contour, (b) Colour + Contour with peak detector applied.

6.3.8 Experiments

It is difficult to compare cut detection techniques by merely looking at the intensity difference

graphs. Cut detection techniques were evaluated quantitatively by counting the number of correct

cuts identified, missed cuts, and incorrectly identified cuts. To compute these metrics a ground

truth record must be available. The ground truth record was constructed by manually analysing

the first 2 minutes of the ‘Spy Game’ video sequence and recording the time of each cut between

shots. The cut time was taken as the time of the first frame in the next shot. There was one gradual

scene change which is between the first and second shots. The remaining cuts are all abrupt with

the exception that some include an additional fade frame between the two shots. For fade cuts the

cut time was taken as the time of the fade frame.

The following metrics were recorded: hits, false hits, and misses. A hit was recorded when the


difference between two frames was greater than the predetermined threshold and the cut time

was within 0.6 of the frame period from the ground truth cut time. The margin of error of 0.6

of the frame period allowed for rounding errors as well as sampling quantisation caused by the

lower sampling rates of the cut detection techniques compared with the source video’s frame rate

of 30 fps. A false hit was recorded when the difference between two frames was greater than

the predetermined threshold but the cut time was not within 0.6 of the frame period from the

ground truth cut time. A miss was recorded if the difference between two frames was below the

predetermined threshold but the frame time was more than 0.5 of a frame period past the last

undetected ground truth cut time. Once again the 0.5 of a frame period margin of error allows for

rounding errors and sampling quantisation effects.

The thresholds were determined by finding a threshold that results in roughly equal values


of false hits and misses, since increasing the threshold produces fewer false hits and more misses whilst reducing the threshold produces more false hits and fewer misses. For a real world application

it would be more suitable to minimise misses so that no shots are missed but for the purposes

of evaluation equalising the false hits and misses provides a basis for comparison. The optimal

threshold was determined using a binary search.
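An illustrative sketch of this threshold search (the evaluate callback and its return values are assumptions, not part of the thesis):

```python
# Sketch: binary search for a threshold that equalises false hits and misses.
def find_threshold(evaluate, lo, hi, iterations=20):
    """evaluate(threshold) is assumed to return (hits, false_hits, misses)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        _, false_hits, misses = evaluate(mid)
        if false_hits > misses:
            lo = mid     # too many false hits: raise the threshold
        elif misses > false_hits:
            hi = mid     # too many misses: lower the threshold
        else:
            return mid
    return (lo + hi) / 2.0
```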

6.3.9 Results

The hit or miss experiments were conducted on all of the techniques discussed in this chapter

except for techniques that work directly on compressed data. Table 6.2 shows the results. For each

technique the time to process the data was recorded along with the threshold, hits, false hits, and

misses. The number of misses are not displayed in the table since the number of misses is the

number of hits subtracted from the 32 ground truth cuts of the 2 minute sequence. The number

of false hits is also not displayed since the thresholds selected ensure that the number of false hits

is the same as the number of misses.

The best results recorded were for the new Fast X-ray technique with peak detection applied

correctly identifying 27 out of the 32 cuts. The peak detection technique substantially improved

the raw Fast X-ray results which only identified 19 cuts. Since the peak detection was able to make

a substantial difference to the results it was also applied to the histogram χ2 and intersection

techniques. Both histogram techniques with peak detection did not improve as dramatically as the

Fast X-ray technique but the improvement was enough to place both techniques in second and

third place.

The colour and contour technique used four orientation, two length, and two curvature bins (422F) because this configuration produced the best results in Chapter 5. Since the results were quite

poor (only 1 cut was detected) the number of bins on each axis was increased to eight orientation,

four length, and two curvature bins with the addition of two x and y bins each (84222F). There

was no improvement in the result. Disabling the fuzzy histogram also did not improve the results

(84222). However, applying the peak detection technique improved the results to be comparable

with the template and histogram matching techniques. The HVC colour space was used instead

of the HSV colour space and was also able to marginally improve the results.

6.3.10 Discussion

The first two minutes of the ‘Spy Game’ video sequence is full of motion and is a challenging test

sequence to use. The template matching techniques did not perform too well returning as many

correct cuts as incorrect cuts. The block-based template matching technique performed better

due to the blocks reducing the influence of motion. The histogram techniques performed better

than the template matching techniques with the Euclidean distance performing the poorest. Even

though the χ2 intensity graph of Figure 6.4 appears more inconsistent than the intersection graph

it provided the same number of hits.

Table 6.2: Video cut detection results.
Technique                      Rate     Time      Threshold  Hits
Template Absolute              15 fps   00:02:28  21         14
Template MSE                   15 fps   00:02:29  41         14
Template Block                 15 fps   00:02:20  11         17
Histogram Euclidean            15 fps   00:01:39  1.3        16
Histogram χ2                   15 fps   00:01:39  0.043      20
Histogram Intersection         15 fps   00:01:39  0.13       20
Optical Flow                   30 fps   01:21:09  540        13
X-ray                          15 fps   00:01:53  20         13
Fast X-ray                     15 fps   00:01:30  125        19
Fast X-ray Peak                15 fps   00:01:30  6800       27
Histogram χ2 Peak              15 fps   00:01:39  0.0013     21
Histogram Intersection Peak    15 fps   00:01:39  0.01       25
HSV + Contour 422F             5 fps    05:50:00  0.045      1
HSV + Contour 422F Peak        5 fps    05:50:00  0.00013    15
HSV + Contour 84222F           5 fps    05:50:00  0.041      1
HSV + Contour 84222F Peak      5 fps    05:50:00  0.00014    19
HSV + Contour 84222            5 fps    05:50:00  0.041      1
HSV + Contour 84222 Peak       5 fps    05:50:00  0.00014    19
HVC + Contour 422F             5 fps    05:50:00  0.047      1
HVC + Contour 422F Peak        5 fps    05:50:00  0.00014    17
HVC + Contour 84222F           5 fps    05:50:00  0.043      2
HVC + Contour 84222F Peak      5 fps    05:50:00  0.00015    19
HVC + Contour 84222            5 fps    05:50:00  0.033      2
HVC + Contour 84222 Peak       5 fps    05:50:00  0.000075   20

The optical flow technique performed more poorly than most other techniques. This could be

due to the 16×16 pixel window size used for motion estimation. Better results may be obtained by

using a larger search window. However, the already slow optical flow analysis technique which takes

over an hour to analyse the frames would be further slowed down by an increased window size.

Therefore optical flow analysis should only be performed if detailed camera operation information

is also required.

As noted earlier the X-ray technique without modification is essentially a template matching

technique using absolute differences which is reflected in the results. The slight difference in number

of hits between the X-ray technique and the absolute difference template matching technique is

due to the quantisation that occurs during X-ray processing. The Fast X-ray technique provided a

substantial improvement over the standard X-ray technique and was better than all of the other

template matching techniques. The improvement comes from treating the x and y axes independently: fine differences can be detected because each column and row profile is only a single pixel thick, yet the measure remains largely resistant to motion because of the averaging across an entire row or column. Applying the peak detection filter improves the results substantially again, providing

the best results in the experiments conducted. The application of the peak detection filter to the

histogram techniques did not improve their results as much as the Fast X-ray technique. This could

be due to the fact that the histogram techniques are already quite impervious to motion and that

their poor results are more due to the limitations of the colour distribution representation.

The colour + contour results performed much more poorly than expected. Without the peak

detection filter it was only able to detect two of the 32 cuts. Applying the peak detection filter

improved the results but the immense contour processing time of almost 6 hours makes colour +

contour an undesirable option for cut detection. Increasing the number of bins per axis, including

the x and y axes, and removing the antialiasing of the fuzzy histogram was able to improve the

results slightly. The improvement in results is probably due to the x and y axes providing some

indication of contour location which is helpful in distinguishing between frames of different shots.

Since only two bins were used for each location axis an indication of location is provided without fine

location differences causing a problem. Removing the antialiasing of the fuzzy histogram probably

improved results because, by default, the colour + contour comparison technique has only a small

dynamic range in its similarity metric, by removing the antialiasing the difference between frames

would become slightly more distinguishable. The colour + contour technique was handicapped by

only having a sampling rate of 5 fps compared with 15 fps used for the other techniques, however

an increased sampling rate of 15 fps would have resulted in a total processing time of almost 18

hours.

These results show that, for now, the colour and contour information extracted using the

techniques in this thesis is not suitable for cut detection. However, the improved Fast X-ray with

peak detection filter is able to detect cuts more reliably than existing techniques.


6.4 Scene Identification

A scene is a sequence of shots of the same location. Detecting a scene involves grouping shots that

have similar content together. Reliable scene extraction can be difficult since semantic knowledge

may be required to identify that two shots are of the same scene. If a complete three dimensional

reconstruction of a shot is possible, and if the 3D data from multiple shots was correlated then

forming scenes would be more reliable. Existing methods for identifying scenes are primarily based

on colour distribution since the lighting and environment colours are similar for different shots

within a scene. However, variances can still occur due to different camera angles and using contour

information may improve the reliability of extracting scenes.

Scene extraction is usually performed on representative frames extracted from a shot. There

are two methods for segregating the R-frames into scenes. The first is similar to the cut detection

methods where a sharp difference between two adjacent shots indicates a scene change. The second

method is to group shots together based on similarity. In this section we will focus on the first

method to evaluate the improvement of including shape information. Clustering methods will be

analysed in the next two chapters as a fundamental part of a CBVR user interface.

6.4.1 Experiments

Scene extraction was performed on representative frames of shots extracted from the first hour of

the ‘Spy Game’ video. Shots were extracted using the Fast X-ray with peak detection technique

because it provided the best results in the shot detection experiments. A threshold of 6800 was

used. 1,530 shots were extracted giving an average shot duration of 2.35 seconds which is consistent

with the editing style of the movie. Even though the Fast X-ray technique performed better than

any other cut detection technique it still produces some false hits and misses. However, for the

purpose of scene extraction a false hit will only provide more R-frames for the same shot, whilst

a miss has probably occurred because the colour information is so similar between two shots that

there is a good chance that they are part of the same scene anyway.

The R-frames were analysed using two feature extraction techniques, colour + contour and

colour histogram. The colour + contour feature extraction technique uses fuzzy histograms and

the HVC 322 and contour 422 histograms. The colour histogram feature extraction technique uses

more bins with a HVC 644 configuration and does not use fuzzy histograms. R-frame differences

were computed using histogram intersection. The R-frame differences were considered a boundary

between scenes if the difference was above a certain threshold.

Evaluation of the scene boundary data was slightly different to the technique used to evaluate

cut detection methods. Where cuts between shots occur at a specific frame (for an abrupt cut),

there is often some ambiguity between which scene a shot should belong to. For example, a series

of shots showing a person walking from one room, through a series of corridors, to another room

could be considered one scene, two scenes where the boundary is somewhere in the corridor, three


scenes where each room and the corridor are individual scenes, or even more if the corridor is to

be split into multiple scenes. Since there is some ambiguity rather than a specific time point for

scene boundaries the hit or miss evaluation was applied using a modified approach that allows for

scene boundaries to occur at different time points to the ground truth data based on proximity.

The proximity hit or miss algorithm that was developed for this evaluation is shown in Algo-

rithm 7. A hit is recorded if the difference between two adjacent R-frames is greater than a certain

threshold. The hit is linked with the closest ground truth scene boundary. If the closest ground

truth scene boundary has already been linked to a previous scene boundary detection then it can

be stolen if the current detection is closer in time than the previous detection to the ground truth

scene boundary time. If a ground truth scene boundary is stolen then the previous detection is

counted as a false hit. If the next ground truth scene boundary is closer than the current ground

truth scene boundary then the hit occurs with the next ground truth scene boundary and a miss

is recorded for the current ground truth scene boundary. For every hit the proximity of the time of

the detection from the ground truth time is recorded and tallied. The final proximity value is nor-

malised by the number of hits providing an indication of the average time deviation of detections

from ground truth scene boundaries.

6.4.2 Results

The intensity differences for the colour and colour + contour techniques with the peak detection

filter applied are shown in Figure 6.10. The 1,529 differences were compared with the 208 ground

truth scene boundaries. The proximity hit or miss results are shown in Table 6.3. The results show

that neither technique performed very well which indicates that a grouping technique may perform

better than the adjacent difference technique being used. The colour + contour method performs

better than the colour histogram method, if only marginally, in all aspects including number of

hits, false hits, and proximity.

6.4.3 Discussion

The results show that even though incorporating contour information improves results, the results
themselves are so poor that a different approach to identifying scenes should be investigated
altogether. Corridoni and Del Bimbo [127] detected scenes based on a semantic model of

shots within scenes. They detected scenes that follow the shot/reverse-shot (SRS) filming
convention. SRS scenes have a high correlation between the last frame of the first shot and the first


Algorithm 7 Proximity hit or miss scene detection evaluation technique.

i ← 0                                 {index of ground truth scene boundary}
p ← 0                                 {current proximity}
P ← 0                                 {total proximity}
for all shot : shots extracted from video sequence do
    {determine proximity of the current shot to the three neighbouring ground truth scene boundaries}
    previousProximity ← |shot.endTime − Gi−1|        {Gi: time of ground truth scene boundary i}
    currentProximity ← |shot.endTime − Gi|
    nextProximity ← |shot.endTime − Gi+1|
    if currentProximity ≤ nextProximity then
        if previousProximity < currentProximity AND previousProximity < p then
            {the current scene boundary is closer to the last hit ground truth boundary than it is to
             the current ground truth boundary, and is also closer to it than the last scene boundary
             recorded as the hit}
            Increment f               {record the last hit scene boundary as a false hit}
            Decrement h               {remove the last hit scene boundary}
            Decrement i
            P ← P − p                 {remove the last proximity}
            currentProximity ← previousProximity
        end if
        p ← currentProximity
        P ← P + p                     {increment total proximity with this hit}
    else
        {the next ground truth boundary is closer than the current ground truth boundary, therefore
         the current ground truth boundary must be counted as a miss}
        Increment m
        Increment i
        {record that there is a hit with the next ground truth boundary}
        p ← currentProximity
        P ← P + p
    end if
    Increment h
    Increment i
end for
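For concreteness, a runnable sketch of this evaluation follows. It mirrors Algorithm 7, but the representation of the inputs (detected boundary times and ground truth boundary times as arrays of seconds) and the handling of array bounds are assumptions that the pseudocode leaves implicit.

    public class ProximityEvaluator {

        // Mirrors Algorithm 7: links each detected boundary to its closest ground truth boundary,
        // allowing a later, closer detection to "steal" a ground truth boundary from an earlier hit.
        // Returns {hits, falseHits, misses, averageProximitySeconds}.
        static double[] evaluate(double[] detections, double[] groundTruth) {
            int i = 0;                 // index of ground truth scene boundary
            int h = 0, f = 0, m = 0;   // hits, false hits, misses
            double p = 0.0;            // proximity of the last hit
            double P = 0.0;            // total proximity

            for (double t : detections) {
                if (i >= groundTruth.length) { f++; continue; }   // no ground truth left: false hit (assumption)

                double prev = (i > 0) ? Math.abs(t - groundTruth[i - 1]) : Double.MAX_VALUE;
                double curr = Math.abs(t - groundTruth[i]);
                double next = (i + 1 < groundTruth.length) ? Math.abs(t - groundTruth[i + 1]) : Double.MAX_VALUE;

                if (curr <= next) {
                    if (prev < curr && prev < p) {
                        // This detection is closer to the previously hit boundary: steal it and
                        // count the earlier detection as a false hit.
                        f++; h--; i--;
                        P -= p;
                        curr = prev;
                    }
                    p = curr;
                    P += p;
                } else {
                    // The next ground truth boundary is closer: the current boundary is a miss,
                    // and the hit is recorded against the next boundary.
                    m++; i++;
                    p = curr;   // tallied as in Algorithm 7
                    P += p;
                }
                h++; i++;
            }
            double avgProximity = (h > 0) ? P / h : 0.0;
            return new double[] { h, f, m, avgProximity };
        }
    }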


Figure 6.10: Intensity graphs of the difference between R-frames over time using (a) Colour
Histograms, and (b) Colour + Contour Histograms. Both graphs have had the peak detector applied.

frame of the last shot. Cumulative colour histograms and cross-correlation were used to compare

the frames between shots. If the similarity exceeded a threshold then the two shots and all inter-

vening shots were labelled as an SRS scene. An error rate of 21% was reported for three video

sequences which is considerably better than the error rate of 90% for the techniques presented

in Table 6.3, however it should be noted that different video sequences were used. The primary

limitation of the SRS scene detector is that it is only useful for detecting scenes that conform

to the SRS structure which only accounted for 31% of the scenes in the tested video sequences.

Corridoni and Del Bimbo detected the remaining shots based on the correlation of camera motion

between shots where an error rate of 15% was reported, however, even though the technique for

determining camera motion is presented, it is not clear how shots were segmented into scenes.

Another approach for extracting scenes uses clustering rather than segmentation. Yeung et al.

[128] clustered shots into a multi-level hierarchy for the purposes of visualisation. Quantifying the

performance of such techniques is difficult since they depend on visual organisation as opposed to

distinct memberships to a set of scenes. Using clustering for scene identification and visualisation

will be investigated in the next two chapters.


6.5 Conclusion

The purpose of this chapter has been to extract a temporal structure of a video sequence as

opposed to the spatial representations of the previous chapters. A number of syntactic and semantic

levels were identified in a video sequence, however, only shots and scenes were investigated in this

chapter. The concept of shot detection is relatively simple in that most shots are bounded by

abrupt cuts. A number of existing cut detection techniques were investigated and evaluated as well

as two new approaches to cut detection involving Fast X-rays and Colour + Contour. The Fast

X-ray technique is a template matching technique similar to Tonomura’s [18] X-ray cut detection

method. A peak detection filter was designed and applied to the Fast X-ray results to enhance the

peaks that represent boundaries between shots. The combination of the Fast X-ray technique and

peak detection filter produced the best results of all of the methods investigated and was also the

fastest.

The colour and contour information, extracted using techniques developed in the previous

chapters, was also used as a method for cut detection. Unfortunately the colour and contour

information was not able to detect cuts as reliably as much simpler template matching techniques,
despite requiring large amounts of processing power. The intention of this research is to model

human perception as closely as possible, however, with the existing techniques the Fast X-ray

performs significantly better than the perceptually-based colour and contour features. This could

be explained by the fact that the colour and contour extraction methods currently do not process

low-level contour motion which is detected early on in human vision processing [129, 130, 10].

Fortunately, cuts are relatively simple to detect using the Fast X-ray method and therefore the

perceptually-based features can be used for higher level scene-based processing.

Scenes are generally extracted using colour distribution features. In this chapter perceptually-

based contour information was also included to evaluate whether there is any performance improve-

ment. The results show that contour does improve the results; however, with the scene extraction
technique employed in this chapter the results were very poor whether contour information was

included or not. This is due to two reasons. The first is that scene boundaries are much more

difficult to detect than shot boundaries and often require very high-level information about scene

contents. Even though the contour information extracted from the previous chapter is a higher

level of representation than colour distribution, the current frame comparison technique does not

compare individual contours and also does not perform any higher-level feature extraction such as

a complete three dimensional scene decomposition. The second reason for the poor performance

achieved in this chapter is due to the scene boundary approach of extracting scenes rather than

using a clustering approach. Scene boundary approaches can perform poorly because the entire

contents of a scene which may consist of many shots are determined by the difference between only

two frames for each boundary. Clustering techniques that consider the similarities between shots

may provide more reliable scene extraction. Since clustering techniques can also be used for visual

presentation the evaluation of their performance will be left to the next two chapters.


Chapter 7

User Interaction

The user’s interaction with a content-based retrieval system is one of the most important compo-

nents of the system. If the user is unable to communicate their query to the system or if the system

is unable to effectively communicate the results back to the user, then the performance of other

components of the system, such as feature extraction and indexing, becomes impeded by the user

interface. Existing content-based image retrieval systems require the user to enter query parameters

which the retrieval system uses to return a list of the most similar images in the database. Video

retrieval systems on the other hand generally use a browsing interface to browse the structure of

a single movie. In this chapter the limitations of existing content-based retrieval user interfaces

are investigated and a broader analysis of user interaction methods is presented that encompasses

techniques that have been used for purposes other than content-based retrieval. Using the resulting

taxonomy, a user interaction framework is presented that identifies user interaction features that

are important for content-based retrieval systems and four novel user interfaces are presented that

incorporate these techniques of interaction. The new user interfaces are analysed within the new

user interaction framework and compared with the existing techniques.

7.1 Existing Content-based Retrieval User Interfaces

An analysis of the current state of user interfaces for content-based retrieval systems was presented

in Chapter 2. From this analysis it can be seen that even though there is significant diversity in

the user interfaces being used for content-based retrieval systems, each user interface suffers from

major drawbacks. For query-result user interfaces the major drawback is the skill requirements of

the user to map their visual query intentions to a widget-based graphical user interface. Browsing

user interfaces overcome this problem by allowing the query to be implicit in the location within the

browsing space. However, browsing user interfaces have primarily been applied to video sequences

rather than image databases and even though there are many innovative approaches each on their

own does not allow the user to explore all of the available aspects of a video sequence. There


is also a lack of cohesion in current content-based retrieval user interfaces with a vastly different

approach being used for video sequences compared to image databases.

The browsing-style user interface is well suited for browsing video sequences, primarily because

of the implicit structure within a video sequence and also because the user is not required to enter

query parameters. Our view is that the browsing user interface is the ideal primary user interface

for both image and video databases as it overcomes the problems of query-result user interfaces.

However, the browsing approaches explored for video sequences are also limited. Some represent

the hierarchical nature of video sequences whilst others represent temporal aspects but there is no

user interface that represents all of the characteristics of a video sequence.

Our purpose is to develop an improved primary user interface for both image and video content-

based retrieval systems that unifies many of the existing approaches and also overcomes many of

their limitations. Since our approach is a browsing user interface, techniques from the general field

of user interaction can be explored without being limited to content-based retrieval user interfaces.

The remainder of this chapter investigates the basic requirements of any user interface, browsing

user interfaces, and more specifically content-based retrieval user interfaces. Existing browsing

user interfaces are presented in a taxonomy and new user interfaces are presented that satisfy the

requirements identified.

7.2 User Interface Requirements

A user interface should be:

• Responsive

• Intuitive

• Efficient

Responsiveness There are a number of factors that can affect the user interface response time

including network, storage, and memory bandwidth limitations, processing overhead, and rendering

time. Bandwidth can limit response times if a large amount of data needs to be processed. This is

often the case in browsing user interfaces where a subset of a large information set must be accessed

and displayed. Each node may contain a thumbnail which must be accessed from the storage device

and cached in memory. There may also be additional processing overhead to compute the location

of objects on screen. A clustered layout may require thousands of iterations before the data is

in a presentable form. Rendering time can also affect responsiveness if complex rendering effects

such as translucency, shadows, antialiasing, texturing, and three dimensional rendering are used. A

browsing user interface for a content-based retrieval system will most likely incur large bandwidth,

processing, and rendering overheads. For this research we are not too concerned by these factors

but instead are more concerned with the form of user interaction. Techniques such as caching,


indexing, and other rendering optimisations can be used to reduce the effects of these overheads.

In addition, computer hardware improves over time, reducing overall response times.

Intuitiveness A user interface is intuitive when only a small amount of time is required for the

user to become adept at using the new interface. In the early days of computing where most users

had little prior experience with computers, user interfaces could be entirely different to each other

and would use explicit textual instructions or graphical representations of real world objects to

improve the usability of a system. Today, most users have used a personal computer and have

some prior experience with a graphical desktop operating system. Therefore, any system that uses

standard desktop widgets or a web page has a fair degree of intuitiveness built-in. Users will also be

familiar with browsing user interfaces as desktop systems provide scrollable windows of icons and

hierarchical navigation tools. However, content-based retrieval user interfaces may still be quite

different to existing desktop user interfaces. A Hierarchical Video Magnifier [49], VideoSpaceIcon

[18], or spatial query [22] may be quite foreign to the typical desktop user. Therefore, such user

interfaces should make it apparent what each aspect of the user interface represents, how to interact

with them, where the user currently is, and where they can go. Icons, animations, widgets, and

textual descriptions can assist in communicating these aspects to the user.

Efficiency The efficiency of a user interface is determined by how much mental and physical

power must be exerted by the user to achieve a certain task or set of tasks. Mental power is

exerted when the user must ‘figure out’ what to do next. Physical power is exerted when the

mouse needs to be moved or clicked or when the user is required to type or press keys on the

keyboard. Browsing user interfaces generally employ a point, click, and drag model. The number

of mouse actions required to navigate through a hierarchical model is dependent on the number

of levels in the hierarchy and by how certain the user is that the target object is in the path they

have chosen. The number of children of each node will also impact the mental exertion required;
more children will require greater mental power to determine where the target object is. Grouping

similar objects together on the other hand will require less mental and physical exertion by the

user.

7.2.1 Browsing User Interface Requirements

A browsing user interface fundamentally implies that the user is navigating spatially whether it be

in one, two, or three-dimensions and whether the grouping is hierarchical or planar. Research into

browsing large information spaces has found that users should be provided with both detail and

context simultaneously [34]. Detail may include a reduced image of the object being browsed, the

name of the object, temporal characteristics, and relationships with other objects. Context on the

other hand provides an indication of where the user is in the information space, where they came

from, and where they can go. Visualisation techniques to achieve simultaneous display of context


and detail will be discussed in the next section but first we need to determine the requirements

that are specific to content-based video retrieval systems.

7.2.2 Content-based Video Retrieval User Interface Requirements

Christel et al. [50] identified two forms of interacting with video sequences: ‘finding’ and ‘gisting’.

Finding is the process of searching for particular portions of a video sequence. Gisting on the other

hand is a method of presenting the contents of a video database to the user in such a way that

they very quickly get the ‘gist’ of the contents of the video or video database. A browsing type user

interface can successfully allow both forms of interaction. To begin with, the user is provided with

a starting point view of the data set. The starting point should provide the user with an indication,

or the ‘gist’, of the contents of the video database even to the extent that the user may be able to

determine whether their target object will be in the database. If the user determines that the target

object is most likely in the database then they go into ‘finding’ mode looking for the most likely

cluster of objects to investigate in more detail to find the target object. A user interface must be

responsive and efficient, therefore the ordering of the data must allow the user to reach the target

in as little time as possible with minimal mental and physical exertion. In content-based retrieval

systems this is best achieved by ordering the objects by the features that are most useful for the

user’s search. A content-based retrieval system can also benefit from slightly larger thumbnails

than is typically used for computer icons as the user is interested in the detail of the content of

multimedia objects. The thumbnails may also represent other aspects of multimedia objects such

as temporal information.

7.3 Visualisation Techniques

The requirements of content-based retrieval browsing user interfaces identified in the previous

section allow existing user interaction techniques that have yet to be applied to content-based

retrieval systems to be analysed for suitability in this domain. In this section the analysis of user

interfaces is extended to cover the broader category of information space visualisation techniques

and their usefulness to content-based retrieval systems.

7.3.1 2D Techniques

Two dimensional visualisation techniques present the information space in a planar format similar

to a street directory where the user can pan or scroll through the information space horizontally

and vertically. While a planar format provides the user with detail, it provides little context as to

where the user is within the complete information space. Pad++ [131] overcame this problem by

allowing the user to rapidly zoom in and out of a two-dimensional display of directories and files.

Woodruff et al. [132] refined zooming to preset zoom levels thereby avoiding the need for the user


to fiddle with zoom controls to achieve the best level of detail for the data they are viewing.

Zooming allows the user to have context at one level and detail at another, however, the user

cannot have simultaneous context+detail. Lieberman [133] proposed the macroscope which overlays

multiple levels of detail so that the user has context+detail. The layers are drawn transparently so

that higher levels can be seen through the lower levels. Unfortunately, it can be difficult separating

the individual levels as they draw over each other. The system has been applied to maps and the

Macintosh Finder.

7.3.2 Distortion-oriented Techniques

Distortion-oriented techniques distort the two-dimensional display so that more data can be dis-

played at the extremities of the screen providing simultaneous context+detail. Distortion-oriented

techniques were first proposed by Furnas [34] in the form of the fisheye view. The fisheye view

gives the same effect as if a photo had been taken of the information space with a fish eye lens.

Sarkar and Brown [134] later generalised Furnas’ fisheye views and applied them to planar and

hierarchical graphs.

Mackinlay et al. [135] proposed the perspective wall which displays a wall in three dimensions

with sides angled away from the user so that more detail can be displayed at the centre and less at

the sides. The perspective wall is essentially the same as the fisheye lens except that the gradient

scale at the edges of the screen is linear and only the horizontal aspect of the display is used for

displaying context. The perspective wall was designed for structured data such as timelines. For

unstructured data such as text documents Robertson and Mackinlay [136] proposed the document

lens which can be considered as a type of two-dimensional perspective wall. The user can scroll a

large rectangle of data horizontally and vertically. As with the perspective wall there is a rectangle

in the centre which is undistorted whilst panels surrounding the centre are angled away from the

viewer providing continuously lower detail but greater context. Leung and Apperley [137] assert

that the document lens makes better use of screen real estate than the perspective wall since the

vertical portions of the display are also used to display context. Another technique similar to the

document lens is the table lens [138] which provides a similar type of distortion for tabular data.

Lamping et al. [139] proposed a distortion technique for viewing large hierarchies. Their lens

uses hyperbolic geometry so that objects become infinitesimally small as they reach the limits of

the viewing area. The root of the hierarchy is initially displayed in the centre of the browser and

each child is given a wedge to display itself and its children. The user can navigate through the

hierarchy simply by scrolling the display. Lamping et al. [139] claim that the hyperbolic distortion

can display ten times the number of nodes than a traditional uniform display.

Hovestadt et al. [140] also proposed a hyperbolic user interface for CAD environments. Their

system could use different base corpuses such as hyperboloids, cones, and spheres. An untrans-

formed region can be displayed in the centre of the display for editing undistorted CAD documents.


Whilst developing a Java virtual machine for the Palm Computing platform Taivalsaari [141]

found a need for a user interface to browse the class files of a Java program. The Palm user interface

is very small and there are generally a large number of class files associated with Java programs.

Taivalsaari [141] proposed the event horizon user interface. There are a number of parallels between

the problems faced by Taivalsaari and the problems of desktop computer screen sizes in relation

to the large amounts of data that need to be navigated. The event horizon user interface could be

seen as the opposite of a fisheye view. In a fish eye view the zoom is at the centre and context

is provided at the extents of the display. In contrast, the event horizon user interface disappears

towards a dot in the centre of the screen called the event horizon (or sink). Towards the extents

of the user interface more detail is displayed. Unlike the fisheye view, panning is not possible with

the event horizon user interface. One zooming action allows the user to navigate through the entire

space.

The event horizon interface can be likened to a large tube with files located on the inside surface

of the tube. To see more icons the user can either move forwards or backwards through the tube.

To handle hierarchical structures conventional folders (sinks) can be added to an existing sink.

Tapping a sink makes it the active sink. A problem with this interface model is that visualising the

hierarchical directory structure is difficult, although this may not be an issue for browsing query

results.

Each of the distortion techniques presented in this section is based on magnification of the

data. Leung and Apperley [137] provided a taxonomy of presentation techniques and described

each in terms of their magnification function. The magnification function describes the level of

magnification at a certain position on the display. Even though a fisheye type magnification may

appear more natural in that there are no sharp discontinuities in the magnification it will often

distort the detail portion of the display. Since no context is required in the detail portion of the

display the distortion only makes the detail more difficult to view. Other techniques such as the

perspective wall and event horizon user interfaces have parallels with real three dimensional objects

which can also be used for visualising data as discussed in the next section.

7.3.3 3D Techniques

Most distortion-oriented techniques can also be interpreted as 3D techniques. In the same way all

3D techniques can also be interpreted as distortion techniques, as a 3D scene rendered to a 2D screen

uses a projection function. The projection function is essentially just another magnification function

that is used to describe distortion user interfaces. The advantage of using the third dimension is that it

allows for greater flexibility and can provide a familiar environment for the user.

Panning, zooming, and context+detail properties provided by distortion and 2D visualisation

techniques are provided implicitly by a 3D user interface through movement in the 3D space. Pan-

ning can be achieved by moving sideways or vertically, zooming can be achieved by moving forwards

or backwards or by changing the zoom on the camera, and context+detail are implicitly provided


by the perspective projection of 3D renderers. Therefore, the navigation properties required by

browsing user interfaces are afforded by 3D user interfaces through navigation techniques that the

user is familiar with in the real world.

The team from Xerox PARC were pioneers in three-dimensional user interfaces through the

development of the Information Visualiser [142], which contained the Perspective Wall [135] and

Document Lens [136] techniques already discussed, as well as the Cone Tree [35]. The Cone Tree

is a three-dimensional structure designed to navigate hierarchies. Each node is represented by a

cone of child nodes. Even though the structure can be rotated to view nodes that are behind the

structure the user can not see the entirety of the structure without interacting with it.

Lucas and Schneider [143] allowed the user to lay out the documents in a three dimensional

Workspace. The user can easily remember where documents are by their spatial location. Custom

layout in a content-based retrieval system is less useful than an automatic layout, however, allowing

the user to customise aspects of the layout may allow for more efficient interaction. Card et al. [144]

developed a similar system for browsing web pages called WebBook and WebForager. WebForager

allows the user to place web pages in a three-dimensional environment. Web pages can also be

viewed in a WebBook and kept on a bookshelf which represents tertiary storage. An important

characteristic of these forms of user interfaces is that documents, books, and storage are represented

by real world three dimensional objects thereby providing a more intuitive environment for the

user. However, as noted earlier, satisfying the user interface requirement of intuitiveness through

similar representations to real world objects is not a significant issue as most users have a basic

level of familiarity with standard graphical user interface widgets.

Robertson et al. [145] extended and simplified the WebForager metaphor for organising Internet

Explorer favourites with the Data Mountain user interface. Data Mountain allows users to arrange

web page thumbnails on a sloping landscape to take advantage of the user’s spatial memory.

Experimental results showed that it was easier for users to arrange their web pages and remember

where they were if they came back a month later with Data Mountain compared with Internet

Explorer.

Data Mountain is similar to Cartia’s ThemeScape [146] which presents a two-dimensional view

of a landscape. ThemeScape analyses a database of documents and displays them as clusters in a

landscape. Mountain peaks represent many documents with similar keywords. ThemeScape could

be extended for use in a content-based retrieval system by representing similarities between images

and video sequences rather than textual phrases.

Another three dimensional user interface is Task Gallery by Robertson et al. [147] which

presents the user with a room with a windowing user interface on the walls of the room. The

interface allows more information to be displayed than conventional displays. The room metaphor

of Task Gallery is similar to the tube metaphor of the Event Horizon [141] user interface.

Three-dimensional visualisation techniques provide a more familiar navigation experience for

users than two-dimensional or distortion-oriented user interfaces whilst providing the same benefits


of simultaneous context+detail. Three-dimensional techniques for visualising large multimedia data

sets have been restricted by computing performance. However, over the last few years dedicated

high performance rendering cards have become readily available with large memory capacities

capable of simultaneously displaying thousands of small thumbnails making three-dimensional

visualisation more suitable than either two-dimensional or distortion-oriented techniques.

7.3.4 Hypermedia Maps

An alternative approach for information visualisation is to represent relationships between mul-

timedia documents as hyperlinks. Zizi and Beaudouin-Lafon [148] explored interactive dynamic

maps for browsing text documents in two dimensions. They found that displaying all of the links

between documents cluttered the display and made it difficult for the user to see the relationships

of the currently selected document. They solved this problem by only showing the links of the

currently focused document.

Chen and Czerwinski [149] explored spatial hypertext in three dimensions using latent semantic

indexing (LSI). The user interface could also display search results by adding columns which extend

from the nodes. Longer columns represented more relevant documents.

For multimedia information spaces the use of hyperlinks may be of little value as similarities

between closely located objects would be seen visually through image thumbnails. Similarities that
cannot be seen through the thumbnails themselves can instead be implied by the spatial proximity
of surrounding objects.

7.4 Taxonomy

Before a new user interface can be designed for a content-based video retrieval system the strengths

and weaknesses of existing user interfaces must be identified. A number of techniques for browsing

information spaces and video sequences have been presented in this chapter and Chapter 2, each of

these exhibit one or more features that are beneficial for a content-based video retrieval system. The

following eight features have been identified as being beneficial when browsing a video information

space:

Panorama: The user interface provides an overview of the spatial layout of the shot.

Motion: Whether the user interface provides an indication of motion within the scene, including

object and camera motion.

Distortion: Whether distortion techniques are used to achieve context+detail.

3D: Whether 3D is used. Not necessarily immersive, for example Video Icon [53] uses three di-

mensional cues to indicate the duration of a shot. 3D interfaces use distortion to achieve the


3D effect, however, interfaces that are 3D are not classified as being distortion techniques in

this taxonomy unless they use 3D especially to achieve distortion (e.g. [135, 136]).

Shots: Whether the user interface implicitly or explicitly segregates video based on scene changes

or camera shots.

Hierarchical: Whether the interface is designed for displaying hierarchical data. Interfaces of this

type will be useful for displaying video data which is inherently hierarchical.

Clustering: Automatically indicates the relationship between objects based on content. For ex-

ample, by placing similar objects close to each other or drawing a line between them.

Video: Whether the interface is currently being applied to video (including images).

A taxonomy of all of the user interface techniques presented in this chapter and in Chapter 2 was
constructed using these eight attributes. Table 7.1 shows which attributes each visualisation
technique exhibits.

7.4.1 Analysis

In our user interface reviews in this chapter we have discovered that the most important feature

of a user interface to browse large data sets is context+detail. A number of techniques have been

used to provide context+detail including hierarchical graphics, 3D user interfaces, and distortions

such as the fish eye lens.

Another important feature for viewing video databases is the concept of gisting or getting the

overall gist of the video [50]. This is achievable if the user interface allows representative objects

to be viewed before drilling down further. Hierarchical user interfaces are generally the best for

providing representative images at each level. Also other techniques such as motion indicators or

panoramas which display the overall scene with a single image or 3D object are useful for gisting.

Another important concept for browsing large data sets is for objects which are similar to be

spatially located near each other. Finally, a system must be able to effectively browse both video

and image databases. This means that the user interface must support the hierarchical structure

of videos and secondly that images must be viewable in a hierarchical approach similar to videos

even though there is no inherent hierarchical structure in a collection of images.

Therefore the four main requirements that have been identified for a content-based video re-

trieval system are context+detail, gisting, clustering, and integration of images and video. The eight

attributes of the taxonomy in Table 7.1 can be associated with the four main requirements of the

user interface. Table 7.2 shows the relationship between content-based video retrieval user interface

requirements and the taxonomy attributes. Associating the taxonomy attributes to user interface

requirements is important because more than one type of attribute may satisfy a user interface


Table 7.1: Video browsing taxonomy. Video browsing techniques are above the double line whilst
information space browsing techniques are below the double line. (P)anorama, (M)otion,
(D)istortion, 3D, (S)hots, (H)ierarchical, (C)lustering, (V)ideo.

PaperVideo [18]: • •
IMPACT [48]: • • • •
Rframes [47]: • • • •
Hierarchical Video Magnifier [49]: • •
Key-frame Hierarchical Video Browser [17]: • • •
VideoStreamer micro-viewer [51]: • •
Video Skims [50]: • • •
Mosaicking [18, 52]: • • •
Video Icon [53]: • •
Video Streamer [51]: • • • •
VideoSpaceIcon [18]: • • • •
========================================================
Fisheye [34]: •
Perspective Wall [135]: • •
Document Lens [136]: • •
Hyperbolic [139, 140]: • •
Event Horizon [141]: •
Data Mountain [145]: •
Cone Tree [35]: • •
Pad++ [131]: (none)
Goal-Directed Zoom [132]: (none)
Macroscope [133]: •
Workscape [143]: •
WebForager [144]: •
TaskGallery [147]: •
Interactive Dynamic Maps [148]: • •
Spatial Hypermedia Maps [149]: • •
ThemeScape [146]: •

Table 7.2: Requirements for a content-based video retrieval user interface and their relationship
with the taxonomy attributes.

Requirement                   Taxonomy Attributes
Context+detail                Hierarchical, 3D, Distortion
Gisting                       Hierarchical, Panorama, Motion
Clustering                    Clustering
Video and Image Integration   Shots, Video


requirement, such as hierarchical, 3D, and distortion-based attributes fulfilling the requirement of

context+detail.

Table 7.1 shows that of the 27 user interfaces, 2 do not support any of the attributes, 6 only

support 1 attribute, 11 only support 2 attributes, 3 support 3 attributes, and 4 support 4 attributes.

The user interfaces above the double line are user interfaces designed to browse video. Of these none

support the content-based video retrieval requirement of clustering. Therefore, of these existing

video browsing user interfaces none fulfil more than three of the four user interface requirements.

The lack of clustering support became a driving factor in the development of two of the user

interfaces presented in the next section.

7.5 New Video User Interfaces

The taxonomy of Table 7.1 can be used as a guide for developing content-based retrieval user inter-

faces. Features of various user interfaces can be combined to produce an improved user experience.

In the following sections four user interfaces are presented that attempt to address some of the

weaknesses that exist in current video retrieval user interfaces.

7.6 MountainView

Where image database user interfaces are lacking is in providing an adequate starting point for

the user. The user requires a great deal of skill in presenting the initial query to the content-

based retrieval system. In contrast, video user interfaces provide a good starting point, but only
for browsing one video; neither individual images nor multiple videos benefit from existing video
browsing user interfaces. Our initial goal was to provide a good starting point for the user regardless

of whether they are browsing one video, many videos, or individual images. Since a collection of

images has no implicit hierarchy a user interface was conceived that uses spatial clustering to

represent groupings and similarities between independent images. Spatial clustering can also be

useful in video sequences, where groups of similar images represent frames of the same scene. Since

image databases and video sequences can consist of many hundreds of thousands of images, only

a few representative images can be displayed at a time on the screen. Therefore dense clusters of

images were replaced by a single image. To indicate the density of a cluster the terrain was elevated

providing mountain peaks, with taller and broader mountains indicating denser clusters.

A concept rendering of MountainView is shown in Figure 7.1. Each peak represents a scene in

the video sequence. Arrows between the peaks indicate temporal relationships between scenes (a

similar approach has been used by [128]). When zoomed out only one image per peak is displayed.

When the user selects a peak, the camera zooms down (Figure 7.1 (b)) and the user is able to

see more images on the peak and nearby peaks and hence more temporal relationships can also

be viewed. Animation is used for navigation, dynamically changing the size of thumbnails, and for


fading arrows in and out.

The user interface was implemented in Java and OpenGL and applied to the test image database

(Figure 7.2). The details of the clustering scheme used are presented in the next chapter. Densities

are calculated by dividing the landscape area into a grid. Each grid element is assigned a density

based on the objects contained within it. For each object within the grid element the distance

from the object to the centre of the grid element is subtracted from the diagonal length of the grid

element and added to the density tally for that grid element:

D(x, y) = Σ_{i=1..N} (l − |OiG|)    (7.1)

where D is the density at (x, y), N is the number of objects within grid element G, l is the length

of the grid element’s diagonal, and |OiG| is the distance from the centre of object Oi to the centre

of grid element G. The density is used to set the elevation of the point in the landscape. A peak

occurs if its neighbouring points have a lower elevation. The image used to display at the peak is

the one that is closest to the peak.
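A minimal sketch of this density calculation (Equation 7.1) is given below; the grid resolution, the world coordinate convention, and the return type are illustrative assumptions rather than details of the MountainView implementation.

    public class DensityGrid {

        // Computes the elevation map of Equation 7.1: each object contributes
        // (diagonal length of its grid element - distance to the element centre).
        static double[][] computeDensity(double[][] objectPositions, double worldSize, int gridSize) {
            double cell = worldSize / gridSize;                   // side length of a grid element
            double diagonal = Math.sqrt(2.0) * cell;              // l in Equation 7.1
            double[][] density = new double[gridSize][gridSize];

            for (double[] pos : objectPositions) {                // pos = {x, y} in world units
                int gx = Math.min(gridSize - 1, Math.max(0, (int) (pos[0] / cell)));
                int gy = Math.min(gridSize - 1, Math.max(0, (int) (pos[1] / cell)));
                double cx = (gx + 0.5) * cell;                    // centre of the grid element
                double cy = (gy + 0.5) * cell;
                double dist = Math.hypot(pos[0] - cx, pos[1] - cy);
                density[gy][gx] += diagonal - dist;               // contribution of object O_i
            }
            return density;                                       // used as terrain elevation
        }
    }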

The major drawback of the MountainView user interface was that it did not work well at

different scales. When the user interface was zoomed out the user could easily get the overall

gist of the database or video, however after zooming into a peak, navigation became difficult

for two reasons. Firstly, the peak inherently occludes objects that are on the other side of the

peak requiring the user to rotate the user interface to see the other objects. Secondly, further

clusters of objects within the peak were difficult to represent since they are represented as further

subpeaks. When many images exist in the database, many levels of subpeaks exist and are difficult

to represent in the proposed mountainous user interface. Spatial clustering techniques also have

difficulty representing micro clusters.

The weaknesses of the MountainView user interface led to considering using hierarchical clus-

tering techniques rather than solely spatial clustering techniques.

7.7 Disc Tree and Goldleaf

A user interface based on hierarchical clustering was considered for two reasons. Firstly, multimedia

objects can often be decomposed into a hierarchy of subobjects, such as multimedia presentations,

video sequences, and even images using segmentation techniques. Secondly, the spatial clustering

used in MountainView did not perform well at multiple scales.

Our initial foray into hierarchical user interfaces did not consider automatic grouping or layout

of the data, instead, the initial intention was to determine a suitable form of hierarchical navigation

for multimedia data. A prototype user interface was developed in a 3D editing studio to explore the

possibility of representing hierarchical clusters of multimedia data. Disc Tree was designed so that

at every level of the hierarchy the immediate parents were also visible (Figure 7.3). Transparency


Figure 7.1: MountainView concept rendering.


Figure 7.2: MountainView user interface.

and animation were also used to allow greater visibility to higher layers in the hierarchy. The image

set used was from a clip art database that had all of its images categorised by a named hierarchy

which was adopted as the hierarchy in Disc Tree.

To test the performance of navigating such a hierarchical user interface, a separate two di-

mensional user interface called Goldleaf [1] was implemented. Goldleaf was designed for browsing

hierarchical file systems that also contained multimedia data. The user interface was designed to

use as much of the screen real estate as possible. Figure 7.4 (a) shows the root of a file system on a

Windows PC. The user is able to see three levels of the hierarchy with labels and five levels of the

hierarchy without labels simultaneously. Folders are arranged radially around the parent folder and

files are shown within the folder (Figure 7.4 (b)). The same clip art database used for Disc Tree was

used for Goldleaf and the images were displayed as thumbnails in the user interface. Goldleaf also

supports the display of HTML and text document thumbnails. Users are able to navigate multiple

levels at a time through a single click and thumbnails are arranged in an innovative stacked layout

where the thumbnail under the mouse pointer is revealed dynamically. Animation is used for any

changes in the user interface.

Experiments were conducted with Goldleaf comparing it with a standard hierarchical

browser, Microsoft Windows Explorer [1]. User studies comparing both user interfaces showed that

Goldleaf required fewer clicks and less time to find items, showed greater improvement with repeat


Figure 7.3: Disc Tree user interface.

usage, and was found to be more enjoyable to use. The experiments conducted confirmed that a

hierarchical clustered layout was an efficient form of navigation for the user. What both Goldleaf

and Disc Tree lacked was the ability to automatically cluster the data hierarchically based on its

content.

7.8 DomeWorld

The lessons learned from MountainView, Disc Tree, and Goldleaf were combined to produce the

final user interface. MountainView was flexible in that no distinct groups needed to be formed but

was difficult to navigate as distinct clusters were difficult to see. Disc Tree and Goldleaf on the other

hand were able to distinctly display groups but did not show the individual relationships between

subgroups; instead, subgroups were merely arranged radially around their parent. What is needed is

a nested user interface to represent hierarchical clusters that is also able to show the individual

relationships between all levels of the hierarchy simultaneously. The result is a flat landscape similar

to MountainView that is filled with thumbnails. Groups are indicated by translucent domes that

encase subgroups and/or images, giving the user interface its name, DomeWorld (see Figure 7.5). Each

dome has its own thumbnail that is representative of all of its component thumbnails. Selecting a

thumbnail zooms down to that level (Figure 7.5 (b)). The grouping of images is also represented

by a circular shadow that grows progressively darker the deeper the group is within the hierarchy.

Figure 7.5 (a) shows that four levels of images are easily visible. A modified agglomeration technique


Figure 7.4: Goldleaf user interface.


is used to hierarchically cluster the images which is discussed in the next chapter on clustering. The

following subsections describe the layout, thumbnails, rendering, and interaction in more detail.

7.8.1 Layout

The agglomeration clustering method used in the DomeWorld user interface only specifies the

hierarchical grouping and not a spatial layout. Since DomeWorld is a nesting of domes, which are

circles, the problem is how to lay the subcircles out within the parent circle without overlapping

other circles. One of the goals of the DomeWorld user interface was to make maximum use of the

screen real estate, so circle sizes need to be calculated to occupy as much of the parent disc as

possible.

The proposed layout technique is to lay out clusters radially in a circle, adjusting child disc

radii to occupy as much of the parent disc as possible (see Figure 7.6). As the number of child

discs increase, their radius will correspondingly need to decrease. Since the child discs are laid out

against the perimeter of the parent disc, when the child disc radius decreases sufficiently one of

the child discs will also fit in the centre (see Figure 7.5 (a)).

Figure 7.6 shows how the child disc radii are calculated. The sum of the child disc radius b and

the distance from the centre of the parent disc a must be less than or equal to the radius of the

parent disc r:

r = a + b (7.2)

Also, the angle formed between two child disc centres and the parent disc centre must be φ

which is 2π divided by the number of children, n:

φ = 2π/n    (7.3)

φ is also related to a and b:

sin(φ/2) = b/a    (7.4)

Substituting equation 7.4 into equation 7.2 allows the optimal values of a and b to be calculated
to fill as much of the parent as possible:

b = r·sin(φ/2) / (sin(φ/2) + 1)    (7.5)

b can also be multiplied by a scaling factor s to reduce the child radius to allow a small amount

of space between child discs. s is currently set to 0.95.

When the child radius is less than or equal to one third of the parent radius then there is

enough room to also inset a child into the centre of the parent disc. This occurs when there are

six children. Therefore, when a parent has seven children the seventh can be placed in the centre

and the remaining six arranged as if there were only six children. This is done for all parents with

seven or more children.
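The layout computation can be sketched as follows. The radius formula follows Equations 7.2 to 7.5 with the scaling factor s = 0.95 mentioned above; the returned centre coordinates and the way the optional central child is handled are assumptions about one reasonable realisation, not the DomeWorld code itself.

    public class CircleLayout {

        // Child disc radius from Equation 7.5, reduced by the scaling factor s to leave a small gap.
        static double childRadius(double parentRadius, int n, double s) {
            double phi = 2.0 * Math.PI / n;                       // Equation 7.3
            double sinHalf = Math.sin(phi / 2.0);                 // Equation 7.4: sin(phi/2) = b / a
            return s * parentRadius * sinHalf / (sinHalf + 1.0);  // Equation 7.5
        }

        // Lays out n children around the parent perimeter; with seven or more children the
        // first child is placed in the centre and the rest arranged as if there were n - 1.
        static double[][] layout(double cx, double cy, double parentRadius, int n) {
            boolean centreChild = n >= 7;
            int ring = centreChild ? n - 1 : n;
            double b = childRadius(parentRadius, ring, 0.95);
            double a = parentRadius - b;                          // Equation 7.2: r = a + b
            double[][] centres = new double[n][3];                // {x, y, radius} per child

            int k = 0;
            if (centreChild) {
                centres[k++] = new double[] { cx, cy, b };
            }
            for (int i = 0; i < ring; i++) {
                double angle = 2.0 * Math.PI * i / ring;
                centres[k++] = new double[] { cx + a * Math.cos(angle), cy + a * Math.sin(angle), b };
            }
            return centres;
        }
    }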


Figure 7.5: DomeWorld user interface.


Figure 7.6: Circle layout, showing the parent radius r, the child radius b, the distance a from the
parent centre, and the angle φ.

7.8.2 Representative Objects

The DomeWorld layout requires that a representative object be selected for each cluster. The

representative object is chosen as the child representative object which is most similar to all of the

other child representative objects, that is, the sum of similarities is greatest. Representative objects

are recursively determined from the lowest levels to the highest. Therefore a high level node only

selects a representative object of the representative objects of its immediate children reducing the

calculations required to compare all children.
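A sketch of this recursive selection is shown below; the tree structure and the abstract similarity function are assumed simplifications (any of the histogram similarity measures from the earlier chapters could be plugged in).

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiFunction;

    public class RepresentativeSelector {

        static class Node {
            List<Node> children = new ArrayList<>();
            double[] feature;          // feature vector of a leaf image (null for internal nodes)
            double[] representative;   // chosen representative feature for this subtree
        }

        // Recursively picks, at each node, the child representative whose summed similarity
        // to the other child representatives is greatest.
        static double[] selectRepresentative(Node node, BiFunction<double[], double[], Double> similarity) {
            if (node.children.isEmpty()) {
                node.representative = node.feature;
                return node.representative;
            }
            List<double[]> reps = new ArrayList<>();
            for (Node child : node.children) {
                reps.add(selectRepresentative(child, similarity));
            }
            double bestSum = Double.NEGATIVE_INFINITY;
            for (double[] candidate : reps) {
                double sum = 0.0;
                for (double[] other : reps) {
                    if (other != candidate) {
                        sum += similarity.apply(candidate, other);
                    }
                }
                if (sum > bestSum) {
                    bestSum = sum;
                    node.representative = candidate;
                }
            }
            return node.representative;
        }
    }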

Representative objects are displayed at the top of domes. The vertical position of a represen-

tative object is the height of the dome and the horizontal position is the centre of the dome. The

width of the representative object is set to one quarter of the width of the dome.

Representative objects are used so that users can decide whether the object they are after is in

a particular subtree. Therefore no representative object is required for the root of the tree as the

user will always begin there. Also, representative objects for lowest level domes are optional as the

children thumbnails will probably be as visible as the representative object thumbnail.

7.8.3 Rendering

A software-only 3D rendering engine has been implemented to support the effects required by

the DomeWorld user interface. Two issues made a software-based engine more suitable than a

hardware-based engine. The first was the fact that many images may be in the database and

hardware renderers require textures to be present in video memory. If they aren’t then they are

swapped between main memory and video memory which can be very slow. Our software engine


leaves all textures in main memory and only draws the rendered textures to video memory, greatly

reducing bandwidth.

The second reason was to provide the best quality dome rendition. Hardware renderers are

polygon-based which would mean that the dome would need to be subdivided into many smaller

polygons. This would affect the viewing quality of the domes and increase the memory requirements
due to the additional polygons. A software renderer has been implemented to draw the domes without
segmenting them into polygons.

All textures and object models are retained in main memory and rendered to a buffer in main

memory and then copied to a buffer in video memory before being page-flipped and displayed.

Translucent Dome Rendering The goal of software-based translucent dome rendering was to

provide a containment effect which was as realistic as possible with minimal distractions such as

the polygons visible in normal hardware rendering. The approach taken was to scan every point

to determine whether a screen point was within the dome and if so then lighting and translucency

were calculated before rendering the final point.

A dome is simply an ellipsoid cut in half, so the rendering technique begins initially with the

ellipsoid equation:

r² = x²/w + y²/h + z²/d    (7.6)

where r is the radius and w, h, and d are the proportions of the ellipsoid.

Rather than rotating the equation itself, the x, y, and z values were transformed into the

basic ellipsoid equation (7.6). The position begins initially as the screen position (xs, ys, zs).

The point must be converted to a co-ordinate in 3D space by reversing the perspective transform

which involves multiplying x and y by z and dividing by the zoom factor. The point must then be

translated by the camera offset and rotated in the opposite direction to the camera orientation.

The point is then also offset by the ellipsoid’s position and rotated in the opposite direction to its

orientation.

The transformation equations for x, y, and z can then be inserted into equation 7.6. However,

to simplify the process the camera translation and rotation can be factored into the ellipsoid’s

position and orientation simplifying the resulting formula.

The goal of the formula is to determine the z value for a point on screen. The equation needs

to be rearranged into quadratic form using z as the unknown variable:

az² + bz + c = 0    (7.7)

The resulting values for a, b, and c become:

a = z²·E1    (7.8)

b = −2xt·z·E1 + 2y·z²·E2 − 2z·yt·E2 + 2z²·E3 − 2z·zt·E3    (7.9)

c = xt²·E1 − 2y·z·xt·E2 + 2xt·yt·E2 − 2z·xt·E3 + 2xt·zt·E3    (7.10)
    + y²·z²·E4 − 2y·z·yt·E4 + yt²·E4 + 2y·z²·E5 − 2y·z·zt·E5    (7.11)
    − 2z·yt·E5 + 2yt·zt·E5 + z²·E6 − 2z·zt·E6 + zt²·E6 − r²    (7.12)

where xt, yt, and zt are the translation values and Ei, i = 1, …, 6, are the rotation coefficients:

E1 = x1²/w + y1²/h + z1²/d
E2 = x1·x2/w + y1·y2/h + z1·z2/d
E3 = x1·x3/w + y1·y3/h + z1·z3/d
E4 = x2²/w + y2²/h + z2²/d
E5 = x2·x3/w + y2·y3/h + z2·z3/d
E6 = x3²/w + y3²/h + z3²/d

and xi, yi, and zi are:

x1 = cos b cos c

x2 = sin a sin b cos c− cos a sin c

x3 = cos a sin b cos c + sin a sin c

y1 = cos b sin c

y2 = sin a sin b sin c + cos a cos c

y3 = cos a sin b sin c− sin a cos c

z1 = − sin b

z2 = sin a cos b

z3 = cos a cos b

where a, b, and c are the rotation angles for the x, y, and z axes respectively.

If a root exists for the equation then the pixel is inside the ellipsoid and can be drawn. Rather

than scan the entire display the renderer calculates the centre of the ellipsoid on the screen and

begins testing pixels extending horizontally until either a point is reached that has no roots or the

bounds of the screen are reached. This process continues for each line above and below the ellipsoid

centre until a line is reached whose first point has no roots.
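A simplified sketch of this per-pixel root test and centre-outward scan is given below. The computation of the quadratic coefficients themselves (Equations 7.8 to 7.12) is abstracted behind an assumed helper interface, and the drawing callback is likewise a placeholder rather than part of the actual renderer.

    import java.util.function.BiConsumer;

    public class DomeScan {

        // Quadratic coefficients {a, b, c} of Equation 7.7 for a given screen pixel (assumed helper).
        interface Coefficients { double[] at(int x, int y); }

        // True if a*z^2 + b*z + c = 0 has a real root, i.e. the pixel lies within the dome's outline.
        static boolean hitsDome(double[] k) {
            return k[1] * k[1] - 4.0 * k[0] * k[2] >= 0.0;
        }

        // Draws one scanline: from startX outward in both directions until the dome is exited.
        static boolean drawLine(int startX, int y, int width, Coefficients coeffs,
                                BiConsumer<Integer, Integer> drawPixel) {
            if (!hitsDome(coeffs.at(startX, y))) return false;   // first point misses: stop scanning lines
            for (int x = startX; x < width && hitsDome(coeffs.at(x, y)); x++) drawPixel.accept(x, y);
            for (int x = startX - 1; x >= 0 && hitsDome(coeffs.at(x, y)); x--) drawPixel.accept(x, y);
            return true;
        }

        // Scans outward from the projected ellipsoid centre, line by line, until a line whose
        // first point has no real root is reached (as described in the text).
        static void scanDome(int centreX, int centreY, int width, int height,
                             Coefficients coeffs, BiConsumer<Integer, Integer> drawPixel) {
            for (int y = centreY; y < height && drawLine(centreX, y, width, coeffs, drawPixel); y++) { }
            for (int y = centreY - 1; y >= 0 && drawLine(centreX, y, width, coeffs, drawPixel); y--) { }
        }
    }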

A number of optimisations were applied to speed up dome rendering. Firstly, xi, yi, zi, Ei,

and c only need to be calculated once each time the dome is rendered, leaving only a and b to

be calculated for each pixel. Equations 7.8 and 7.9 can be further factored to allow for only the x

value to change as each pixel is rendered from left to right. The result is that the root of equation

7.7 can be determined with only four products and five additions per pixel.

The amount of shading to apply for the translucency effect is based on the normal of the

surface of the ellipsoid relative to the camera. The greater the angle away from the camera the


more shading that is applied. The result is that the borders of the dome tend towards opaqueness

whereas the centre of the dome tends towards transparency.

The angle between the normal vector and the camera vector is determined using the dot product

of two vectors:

d = cx·nx + cy·ny + cz·nz    (7.13)

where c is the camera vector and n is the normal vector that determine the dot product d. The
final alpha value is the dot product normalised by the magnitudes of the camera and normal vectors:

α = d / (|c| |n|)    (7.14)
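A small sketch of this shading calculation (Equations 7.13 and 7.14) follows; clamping the result into the range [0, 1] is an added assumption rather than something stated in the text.

    public class DomeShading {

        // Alpha from the angle between the camera vector and the surface normal:
        // the dot product normalised by both magnitudes (Equations 7.13 and 7.14).
        static double alpha(double[] camera, double[] normal) {
            double dot = camera[0] * normal[0] + camera[1] * normal[1] + camera[2] * normal[2];
            double magC = Math.sqrt(camera[0] * camera[0] + camera[1] * camera[1] + camera[2] * camera[2]);
            double magN = Math.sqrt(normal[0] * normal[0] + normal[1] * normal[1] + normal[2] * normal[2]);
            double a = dot / (magC * magN);
            return Math.max(0.0, Math.min(1.0, a));   // clamp to [0, 1] (assumption)
        }
    }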

This form of rendering is quite computationally intensive due to the square root functions used

for determining the root of equation 7.7 and the magnitude of the normal vector. Therefore, an

outline dome rendering technique was investigated as an alternative rendering approach.

Outline Dome Rendering Since the translucent form of dome rendering requires two square

root operations for every pixel rendered, the outline dome rendering technique was investigated, as

it only renders pixels around the horizon of each dome. Outline rendering is similar to translucent

dome rendering except that the algorithm first finds the extents of the dome horizontally by

testing all pixels outwards from the centre point until a pixel is reached that has no roots. Then

the algorithm extends vertically in both directions but this time walks inwards until a point is

found that has roots. Outline rendering is much faster than translucent rendering because very few

points are tested and also very few points are drawn.

Thumbnail Rendering Thumbnails were rendered as billboards, meaning that they always face the screen and require no rotation, only translation and scaling, allowing very

fast rendering. To improve scaling performance textures were mip-mapped to 3 levels.

z-Ordering z-Ordering was optimised specifically for the dome layout. The sky and infinite plane

will always be behind all other objects and are drawn first. Then for each dome the ground disc is

drawn first followed by all children then the dome is drawn followed by the representative object.

Therefore the only z-Ordering required is that of the children of the immediate dome. Since children

are stored in linked lists a simple bubble sort is used for z-sorting.
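The exact linked-list implementation is not given here; the sketch below (Python, using a plain list of child records) illustrates the depth-based bubble sort that is assumed.

from collections import namedtuple

Child = namedtuple("Child", "name depth")

def z_sort(children):
    # Bubble sort the children by camera-space depth, farthest first, so they
    # are painted back to front; adequate for the short child lists of a dome.
    n = len(children)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if children[j].depth < children[j + 1].depth:
                children[j], children[j + 1] = children[j + 1], children[j]
                swapped = True
        if not swapped:
            break
    return children

print(z_sort([Child("a", 2.0), Child("b", 5.0), Child("c", 1.0)]))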

Optimisations The target platform was an 800MHz PowerPC processor with a 133MHz main

memory bus and 66MHz video memory bus. These bus speeds limited the screen resolution and

pixel depth that could be used to achieve acceptable frame rates. 16 bit video was selected over 32

bit video as it allowed frames to be blitted about 50% faster, due to the 66MHz video bus. Initially,

only 640 × 480 resolution was used for full screen playback due to the increased time required to

render a frame at 1024 × 768 resolution and also because of the increased bandwidth required to


copy each frame to the video buffer. However, a compromise was reached which allowed frames to

be drawn quickly and also at high resolution. This was achieved by drawing frames at a quarter

of the resolution, 512× 384, during animation and rendering at higher resolution when animation

had stopped. The result was very effective, as high resolution is not required for the animation effect but is required when the user must see the details of small thumbnails to choose the next dome. Rendering

is approximately twice as fast at the lower resolution.

During animation, domes are drawn only as outlines and when the animation has stopped the

final frame is still rendered as an outline but at the higher resolution. Then, one more frame is

rendered, rendering the domes translucently. On our target platform the final frame took approx-

imately 0.5 seconds to render.

The dome drawing routine requires one square root function to be called for each pixel tested

for drawing outlines and two square root functions to be called for each pixel drawn for translucent

domes. The standard square root function can require up to 200 processor cycles to accurately

determine the square root of a number. Since accuracy is not a primary concern with the user

interface, faster square root implementations were investigated. It was discovered that the PowerPC

processor contains a floating point square root estimate instruction, frsqrte, that can be called

iteratively to approximate the square root [150]. The square root instruction actually returns the

inverse of the square root which is perfect for determining the translucency at a point (equation

7.14). It was found that even just one call to the square root instruction was sufficient to provide

the translucency effect, thereby reducing the complexity of the square root function by a factor of

over 100 times. The one division required was also replaced by a divide estimate instruction, fres

[150], increasing performance further still.
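The frsqrte and fres instructions are specific to the PowerPC; the idea can be sketched in Python as a single Newton-Raphson refinement of a coarse reciprocal square root estimate (the estimate value below is illustrative, not what the hardware actually returns).

def refine_rsqrt(x, estimate):
    # One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
    return estimate * (1.5 - 0.5 * x * estimate * estimate)

print(refine_rsqrt(9.0, 0.34))  # ~0.3331, close to 1/sqrt(9) ~ 0.3333 after one step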

7.8.4 Interaction

The user interface must allow the user to explore the dome tree. This is achieved through a point

and click interface. Domes themselves are not clickable as the user must be able to navigate a

number of levels deep through a few domes. Instead representative images and discs are selectable.

When the user selects a representative image or disc the camera navigates towards the dome so

that the dome fills the view. The camera’s final x, y, and z position is modified based on the

position and radius of the dome:

x_C = x_D

y_C = y_D + 1.2 r_D

z_C = z_D + 1.8 r_D

The final orientation of the camera is set to (−π/6, 0, 0). Animation is used so that the user does

not become disoriented. Inter-dome navigation occurs within a duration of one second. Position


Figure 7.7: VideoBrowser user interface.

and orientation values are linearly interpolated for each time increment:

position = start + destination × (time / duration) (7.15)
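A sketch of the fly-to animation is given below. Equation 7.15 is written here in the conventional lerp form, start + (destination − start) × time/duration, which is assumed to be what the equation abbreviates; the dome position and radius arguments are illustrative.

def camera_target(dome_pos, dome_radius):
    # Final camera position relative to the selected dome (section 7.8.4).
    x, y, z = dome_pos
    return (x, y + 1.2 * dome_radius, z + 1.8 * dome_radius)

def interpolate(start, destination, t, duration=1.0):
    # Per-frame linear interpolation over the one-second navigation animation.
    f = min(max(t / duration, 0.0), 1.0)
    return tuple(s + (d - s) * f for s, d in zip(start, destination))

print(interpolate((0.0, 5.0, 20.0), camera_target((10.0, 0.0, -5.0), 2.0), 0.5))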

7.9 VideoBrowser

The final user interface investigated was not intended as a primary browsing user interface like the other user interfaces discussed in the preceding sections; instead, the motivation for the VideoBrowser user interface was a gap revealed among individual video browsers in the taxonomy of Table 7.1. The key-frame hierarchical video browser of Zhang et al. [17] was innovative in being

able to display multiple levels of hierarchy of a video sequence simultaneously. The VideoStreamer

micro-viewer [51] on the other hand was innovative in applying the fish eye technique of distortion

horizontally across one row of key frames to provide context and detail within the same level of

hierarchy. The VideoBrowser user interface aims to combine the benefits of both user interfaces to

provide a video browser that incorporates even greater context than either browser individually.

The VideoBrowser is shown in Figure 7.7.

The user can slide the slider controls to scroll through the key frames in each level. The key

frames scroll smoothly, gradually increasing in width as they approach the centre of the window.

The effect is as if the images are on a cylinder that is being rotated. Selecting a frame on one

level modifies the contents of the levels below. The VideoBrowser allows the user to quickly find

a frame by efficiently drilling down through a hierarchy of frames and by having the ability to see


Table 7.3: Features of new video browsing user interfaces. (P)anorama, (M)otion, (D)istortion, 3D, (S)hots, (H)ierarchical, (C)lustering, (V)ideo.

Name          P   M   D   3D  S   H   C   V
MountainView              •   •       •   •
Goldleaf              •           •
DomeWorld                 •   •   •   •   •
VideoBrowser          •       •   •       •

five frames simultaneously on each level through the fish eye distortion technique.

7.10 Evaluation

Of the four user interfaces presented in the previous sections, time permitted only the Goldleaf user interface to be evaluated through user testing [1]. Since only a handful of other user interfaces have been evaluated through user testing and the methods used are varied, the best approach to compare the new user interfaces with the existing user interfaces is through the taxonomy of Table

7.1. The four new user interfaces were classified using the same taxonomy and the results are shown

in Table 7.3.

Table 7.3 shows that the MountainView and VideoBrowser user interfaces support four at-

tributes each. However, only the MountainView user interface satisfies all four of the user interface

requirements of Table 7.2 whereas the VideoBrowser user interface does not support clustering

or the integration of images and video. The Goldleaf user interface supports context+detail and

gisting but does not support automatic clustering and is not designed for video retrieval. Dome-

World supports five of the eight attributes which is more than any other user interface investigated.

Like MountainView it also fulfils all four user interface requirements but provides a hierarchical

organisation fulfilling both the context+detail and gisting requirements. However, at this stage

DomeWorld does not provide any shot level information such as motion indicators or panoramic

representations.

Based on the taxonomy analysis the DomeWorld user interface integrates more features than

existing systems and satisfies all four user interface requirements. The next four sections evaluate

the user interfaces developed during this research subjectively within the framework of the four

user interface requirements of context+detail, gisting, clustering, and image and video integration.

7.10.1 Context+Detail Analysis

All of the proposed user interfaces support context+detail but in different ways. The VideoBrowser

user interface supports context+detail through a hierarchical representation and magnification,

Goldleaf uses a hierarchical organisation and distortion, MountainView uses a three-dimensional


user interface, and DomeWorld uses a three-dimensional user interface and hierarchical clustering.

Each of the proposed user interfaces adequately supports context+detail. An important aspect

of context is for the user to know where they are. Both VideoBrowser and Goldleaf provide an

indication of the parents of the currently selected object. MountainView does not show parents as

such but relies on the user being able to see the landscape from their current location. DomeWorld

relies on the same technique and additionally allows the user to see parents through enveloping domes and

shadows formed on the ground.

One of the problems with using distortion to provide context+detail is that it is not natural

for users to be looking through a fisheye lens. 3D techniques which provide the same benefits as

distortion techniques are a more natural way for users to navigate structures. A 3D user interface

requires more powerful hardware for effective operation and constrained interaction methods so

that users don’t become easily disoriented. MountainView and DomeWorld solve the interaction

problem by allowing the user to navigate simply by clicking on destination objects and animating

towards them. In addition, simple rotation and movement are available through the keyboard if

necessary.

7.10.2 Gisting Analysis

Gisting is similar to context+detail but also includes other indicators such as motion. None of the

implemented solutions currently support intra-shot gisting. Instead, inter-shot gisting is provided

through clustering and representative objects. Goldleaf was not designed for video browsing and

does not inherently support any video features. The VideoBrowser user interface has the comple-

mentary weakness in that it can only show the sub-shots of one top-level video object at a time

making it difficult for the user to see what the other top-level objects are composed of. DomeWorld

and MountainView allow shots to be clustered into scenes and scenes into higher-level groupings

through the similarity-based clustering techniques employed. In DomeWorld users can easily see

child objects because of the thumbnails and translucent domes. However, further work would in-

volve the incorporation of intra-shot summaries that indicate camera and object motion through

the use of arrows or panoramas. As both user interfaces are three dimensional a 3D summary such

as VideoSpaceIcon [18] could be employed.

7.10.3 Clustering Analysis

Only DomeWorld and MountainView support automatic clustering. The agglomeration form of

clustering in DomeWorld is able to provide more distinctly visible clusters than the spring-based

form of clustering used in MountainView; therefore, it is easier for a user to determine which cluster

an object is a part of. Hierarchical clustering allows for a more dense packing of objects. A non-

hierarchical form of clustering can only use spatial distance as a measure of similarity; however, in DomeWorld the domes encase objects, which implies similarity between the encased objects and a distinct


difference from objects that are not encased. Therefore, objects may be spatially close together, but the user can easily see that they are sufficiently different because a dome demarcates a boundary between them, resulting in a denser packing of the user interface and a more efficient

use of the screen real estate. In addition, the user is able to see more objects simultaneously and

see the overall structure more clearly. The major limitation with DomeWorld’s clustering layout at

the moment is that it does not organise child nodes by similarity. Future work may involve using

a spring-based clustering technique between child nodes.

7.10.4 Video and Image Integration

The Goldleaf user interface was not designed for video retrieval and the VideoBrowser was not

designed for image retrieval; however, both could be extended to support both video and images.

MountainView and DomeWorld are designed to support both video and images. However, the

DomeWorld user interface is able to represent the video hierarchy more clearly because of its

inherent hierarchical nature. DomeWorld could be extended to provide more intra-shot information

such as motion and panoramas [18]. Screen shots of DomeWorld displaying all 1,530 shots extracted

from the ‘Spy Game’ movie using the Fast X-ray technique presented in the previous chapter are

shown in Figure 7.8.

7.11 Conclusion

In Chapter 2 a number of content-based video retrieval user interfaces were analysed to determine

their merits and weaknesses. It was found that most content-based image retrieval user interfaces

use the query-results technique whilst content-based video retrieval user interfaces use browsing

techniques. Due to the problems associated with the query-results approach a broader investigation

of browsing user interfaces was conducted to provide a single browsing user interface that could

satisfy the requirements of both content-based image and video retrieval systems. A taxonomy

was formed of eight attributes that were beneficial to content-based video retrieval interaction. An

analysis of 27 existing user interfaces found that techniques from non-content-based video retrieval

user interfaces could be integrated with existing video retrieval techniques. The results of the

analysis were used as a guide to design four new user interfaces that addressed the weaknesses of

existing video retrieval user interfaces in different ways.

The new user interfaces were evaluated within the same taxonomy used to evaluate the existing

user interfaces. It was found that the DomeWorld user interface was the only user interface out of

the existing and new user interfaces to provide five of the eight possible user interface taxonomy

attributes.

The four requirements of a user interface for a CBVR system were identified as context+detail,

gisting, clustering, and video and image integration. Only the MountainView and DomeWorld user

Figure 7.8: DomeWorld presenting the ‘Spy Game’ movie. (a) Overview; (b) After selecting the far right dome.

interfaces satisfied all four requirements. A subjective analysis found that DomeWorld provided the

best support for all of the required attributes. However, it was identified that DomeWorld could

be improved in the areas of gisting and clustering. Future work would involve providing intra-shot

information such as motion indicators and panoramas and using weighted spring clustering within

nodes.

The proposed DomeWorld user interface provides an advancement in the area of image and

video database visualisation by integrating many of the features required for such user interfaces.

The technique of translucent domes is innovative and allows the user to easily see the embedded

hierarchy whilst still being able to clearly see the children a number of levels deep. The technique

also allows the user to see the 2D spatial relationships between objects.

The clustering techniques used for both the MountainView and DomeWorld user interfaces are

discussed in the following chapter.


Chapter 8

Clustering

Clustering is the reorganisation of data. In content-based retrieval clustering is used for two different

purposes. The first is to reorganise data for more efficient access and querying through indexes.

The second is for organising data for presentation to allow for more efficient visual cognition of the

information space.

In the previous chapter on user interaction a differentiation was made between query-result and

browsing user interfaces and it was found that the browsing user interface overcomes many of the

limitations associated with query-result user interfaces especially when dealing with visual data.

Query-result user interfaces generally use clustering as an indexing technique to provide a more

efficient means of accessing and querying data. Browsing user interfaces on the other hand use

the visual layout of data as both the query and the result and hence use clustering techniques to

arrange the data spatially. The new user interfaces presented in the previous chapter are browsing

user interfaces and two of these, MountainView and DomeWorld, support automatic clustering of

the data. The MountainView user interface requires a purely spatial clustering whereas DomeWorld

requires a hierarchical clustering technique that can also be integrated with a spatial clustering

technique.

In this chapter clustering techniques are investigated for content-based video retrieval systems

primarily to support the clustering requirements of the MountainView and DomeWorld user in-

terfaces. In addition, the benefits of clustering techniques for indexing data for query-result user

interfaces are also discussed.

8.1 Clustering

The purpose of clustering is to group similar objects together. Groups formed may either be

discrete groups where each object belongs to only one group, or they may be inferred groups where

there is no explicit group ownership of objects but rather the user must use their own powers of


perception to infer the grouping. Spatial clustering is a form of inferred grouping where there are

no distinct boundaries surrounding groups but the user can infer the presence of a group by the

spatial relationship between objects. Hierarchical clustering is a form of discrete grouping where

each object will belong to all parent groups between the object and the root.

There are two approaches to performing clustering. The first is to consider the data set as a whole

and to begin to organise it into meaningful groups. The second is to add one object from the data

set at a time to the clustering space allowing the clustering space to adjust dynamically to each

object introduced. The dynamic form of clustering is often used in indexing techniques where the

contents of the data set may change regularly. Spatial clustering techniques often use the first form

of clustering, beginning with the whole data set laid out in potentially random positions in space

and iteratively adjusting the object positions until certain layout criteria have been met.

Discrete clustering techniques are based on either subdivision or agglomeration [151] and can

use the ‘one object at a time’ approach or the ‘whole data set’ approach. If subdivision is used

when objects are to be added one at a time to the clustering space, then each one will initially go

into the same cluster; once the cluster becomes too large or there is sufficient dissimilarity within

the objects the cluster is subdivided forming two or more clusters and one or more parent nodes.

This process continues until all of the objects have been added. If the clustering technique begins

with all of the objects in the clustering space then the objects will start by being in one cluster

which is iteratively subdivided until all of the clusters support the criteria of cluster size and

intra-cluster similarity. Subdivision is essentially a top-down approach whereas agglomeration is a

bottom-up approach. With agglomeration all objects begin by being in their own cluster. Clusters

are merged based on similarity between clusters. Alternatively one object can be added at a time

to the clustering space, being in its own cluster, and the object is merged into other clusters if

possible.

Since clustering is the grouping of objects by similarity, the similarity measure must be de-

termined. For some data sets the similarity between two objects is one value that cannot be decomposed any further; for other multidimensional data sets there may be a valid similarity value

for each feature axis that can be combined together to form a single similarity value. Indexing

schemes often use the component similarity value for a feature axis to determine a subdivision as

it allows the data space to be easily split linearly across one dimension [28, 29]. However, using

a composite similarity value to determine subdivisions or agglomerations can be more meaningful

for visual presentation.

The preceding paragraphs described the possible properties of clustering techniques. These

properties each represent a feature of a clustering technique. The five features, namely structure, population, grouping, similarity measure, and layout, are shown in Table 8.1 together with the possible properties for each feature. MountainView and DomeWorld require certain clustering properties, which narrows the types of clustering techniques that can be used for each user interface.

The MountainView user interface requires a clustering technique with a structure that is inferred,

uses a combined similarity measure, and a spatial layout. The DomeWorld user interface requires


Table 8.1: Clustering properties.

Feature              Properties
Structure            Discrete
                     Inferred
Population           Dynamic - Dynamically add one object at a time
                     Whole - Consider initial data set as a whole
Grouping             Subdivision
                     Agglomeration
Similarity Measure   Composite - Single similarity measure
                     Component - Individual value for each feature axis
Layout               Spatial
                     Hierarchical

a clustering technique with a structure that is discrete (with the possibility of also being inferred),

uses a composite similarity measure, and a hierarchical layout (although the spatial components

of the hierarchical layout are also important). The commonality between the two user interfaces

is that they both require a composite similarity measure as opposed to the similarity measure

being based on only one individual feature axis. The basis for this requirement will be discussed

further in the next section on spatial clustering. The attributes that do not directly impact the

requirements of the two user interfaces are whether the clustering approach is dynamic or whole

or whether the clustering technique forms groups through subdivision or agglomeration.

DomeWorld and MountainView both require clustering techniques for visual presentation rather

than indexing. Clustering techniques used for visual presentation must satisfy the following require-

ments:

• Similar objects must be close to each other;

• Objects must only occlude other objects of lesser importance;

• Clusters of similar objects must be easily distinguished from other clusters.

The next two sections discuss spatial and hierarchical clustering techniques in detail to deter-

mine the best clustering technique for both user interfaces.

8.2 Weighted Springs Spatial Clustering

The primary technique used for spatial clustering is weighted springs, which is the technique focussed on in this section; however, one technique that has not been applied to visual presentation but remains spatial in nature is the Grid File [152]. The Grid File was designed as an indexing


scheme similar to the techniques presented in the next section on hierarchical clustering. The Grid

File begins as a multidimensional hypercube with each dimension reflecting a feature axis. The

axes are subdivided non-linearly but are not indexed hierarchically. As a bucket becomes overfull

the grid is split along one dimension resulting in a split in all regions that intersect the division

line. If the split causes an over segmentation of the grid then ‘buddy’ buckets can be merged. In

terms of its indexing performance, Nievergelt et al. [152] did not compare the Grid File with other indexing schemes and only performed experiments with a low number of dimensions; therefore, it

is difficult to determine its usefulness for content-based retrieval systems for more than 20 dimen-

sions. However, from a visual presentation perspective the grid file provides a means of identifying

dense clusters through grid regions that contain many splits. Since visual representation optimally

occurs in two or three dimensions, the axes for the Grid File would need to be formed by reducing

the number of feature axes in the data set to two or three. Even though the Grid File may provide

an interesting alternative to conventional spatial clustering techniques its primary limitation is that

the axes must be composed of feature axes. The result is that the visual axes actually have mean-

ing. In contrast spatial clustering techniques such as weighted springs allow the similarity between

two objects to be represented in any direction which is more natural than confining meanings of

similarity to one axis.

Weighted springs clustering is based on graph drawing principles. An undirected graph consists

of a set of vertices that are connected via edges. Graph drawing generally has a number of goals

to provide an aesthetically pleasing layout, such as [153]:

1. Distribute the vertices evenly in the frame;

2. Minimise edge crossings;

3. Make edge lengths uniform;

4. Reflect inherent symmetry;

5. Conform to the frame.

None of these goals are particularly useful for content-based video retrieval and in particular the

MountainView user interface. Where the above principles require vertices to be evenly distributed,

MountainView requires some objects to overlap whilst others to be distant to show distinct clusters.

Likewise there is no requirement for edge lengths to be uniform in MountainView. Edge crossings

are not an issue as every object is connected to every other object. Symmetry may be of some

importance but only in the context of making good use of the screen real estate and would be

better stated as having a uniform distribution of clusters. Finally, conformance to the frame has

only limited usefulness in the MountainView user interface where the user will fly through the

scene in three dimensions. Therefore a new set of graph drawing goals has been identified for

CBVR clustering user interfaces:

1. Similar objects should be close together and possibly overlapping to form clusters;


2. Clusters should be sufficiently far apart to be distinguishable;

3. Clusters should be uniformly distributed over the display;

4. The distance between objects should indicate similarity.

The first three goals suggest that there may not be a linear mapping between similarity and

object distance. Instead, objects that are considered similar may appear closer together than their real similarity suggests, whilst objects that are considered dissimilar may appear farther apart than their real similarity suggests. In addition, the third goal indicates that dissimilar objects that are in different

clusters may not be as far away as their similarity indicates or could be closer to maintain uniform

distribution over the display. Therefore it is important to also include the fourth goal that distance

indicates similarity, even though the relationship may be non-linear.

Even though the goals for graph drawing are different to those of content-based video retrieval,

existing graph drawing techniques can still be adapted. Graph drawing techniques often convert the

elements of a graph into real world objects and simulate the physics of the environment to satisfy

the graph drawing goals. The most popular form of physical simulation is the weighted springs

approach [33] although other techniques such as simulated annealing [154] may also be used. Even

though graph drawing techniques attempt to simulate the real world, most techniques actually

use modified physical formulas. Fruchterman and Reingold [153] explain that the use of unrealistic

models is not an issue since they are being applied in an unrealistic space. This freedom from

the physical world can create a great deal of variation in techniques as researchers can essentially

invent their own formulas. This flexibility is well suited to our problem which has different goals

to graph drawing.

The weighted springs approach [33] to spatial clustering is by far the most widely used form of

spatial clustering; however, there are many variations in how the weighted springs operate. Weighted

springs clustering is achieved by placing a spring between every pair of objects in the data set. The

characteristics of the spring are determined by the similarity between the two objects. Once all of

the springs have been placed and the objects placed in their initial positions the physical system is

simulated until the objects come to a resting point. The equilibrium is the optimal representation

of objects based on similarity that can be achieved in the number of dimensions used, which is

usually two. However, even though the resting place may be the optimal representation of objects

based on similarity, it may not be the best way to represent distinguishable clusters.

The other issue that faces weighted springs techniques is the complexity of the physical sim-

ulation. Every object is connected to every other object by a spring; therefore, the computational complexity is N^2 with respect to the number of objects. In addition, the larger the number of

objects the longer it takes for the system to reach stability. An unstable system requires very fine

time increments to be used during the simulation. If fine time increments are not used then the

resulting positions of objects for the next iteration may be overestimated, resulting in even less

stability in the next iteration and the process continues until the system approaches a point of


complete instability. The result is that an increased number of objects requires a greater length of

time to reach stability, involves N^2 computations, and finer-grained simulation. All of these factors

lead to a significant amount of computing power required for the system to reach equilibrium.

Therefore the two main issues facing the weighted springs approach are:

• The time it takes for the system to reach equilibrium;

• The layout of items into distinctive clusters.

Several variations have been proposed to address these issues and are discussed in the following

sections. However, we will begin by outlining the basic weighted springs approach before discussing

the variations. Finally, a new weighted springs approach is presented that is an improvement on

existing techniques designed for the MountainView user interface.

8.2.1 Hooke’s Weighted Springs Approach

Eades [33] proposed the weighted springs approach based on Hooke’s Law; however, Eades did not use Hooke’s formulas. In this section the theoretical approach to using weighted springs is presented based on Hooke’s Law, whilst Eades’ modifications are presented in the next section.

Hooke’s weighted springs approach simply places a ‘physical’ spring between every item in the data

set and simulates the physical system until an equilibrium is met. The goal of spatial clustering

is to bring similar objects close together whilst separating dissimilar objects. Therefore a spring

between two similar objects should be tight whilst a spring between dissimilar objects should be

loose. The tightness or looseness of a spring is determined by its resting length. The resting length

of the spring and the distance between the two objects determine the force that is applied to the

two objects according to Hooke’s Law.

The resting length of a spring is inversely related to the similarity, since the resting length

of the spring increases as the similarity decreases. The attractive or repulsive force on an object

from one spring is proportional to the difference in length of the spring from its resting length:

F = −k(l_c − l_r)(P_1 − P_2)/l_c (8.1)

where l_c is the current length of the spring (which is the distance between the two objects), k is the spring stiffness which is set to 1, and P_1 and P_2 are the positions of the two objects. The vector from P_1 to P_2 is normalised by the current length of the spring to produce a unit vector

to indicate the direction of the force.

The forces applied to an object by all of its springs are summed to determine the resulting force

on the object. The acceleration of the object can then be calculated from the aggregate force and

its mass:

a = F/m (8.2)

where a is the acceleration and m is the object mass which is assigned a default value of 1.


The acceleration can be used to determine how much the object’s velocity will change over a

given time interval:

∆v = a∆t (8.3)

The time interval used will determine how many iterations of simulation must be processed and

also the stability of the system. Without surface friction the system would oscillate indefinitely.

To reach an equilibrium, surface friction must be introduced. Friction is implemented by simply

multiplying the velocity (v) by a friction coefficient (f) which has been set to 0.5. The velocity is

then used to calculate the object’s new position.
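A minimal Python sketch of one iteration of this simulation, following equations 8.1-8.3 with the friction coefficient of 0.5, is shown below; the data structures and the time step are assumptions made for the illustration.

import random

def spring_step(positions, velocities, rest_lengths, dt=0.05, k=1.0, m=1.0, friction=0.5):
    # One iteration: sum the spring forces on each object (equation 8.1),
    # convert force to acceleration (equation 8.2), update the velocity
    # (equation 8.3), apply friction, then move the object.
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            lc = max((dx * dx + dy * dy) ** 0.5, 1e-9)
            f = -k * (lc - rest_lengths[i][j])      # Hooke's Law magnitude
            fx, fy = f * dx / lc, f * dy / lc       # along the unit vector P1 - P2
            forces[i][0] += fx; forces[i][1] += fy
            forces[j][0] -= fx; forces[j][1] -= fy
    for i in range(n):
        velocities[i][0] = (velocities[i][0] + forces[i][0] / m * dt) * friction
        velocities[i][1] = (velocities[i][1] + forces[i][1] / m * dt) * friction
        positions[i][0] += velocities[i][0] * dt
        positions[i][1] += velocities[i][1] * dt

# Objects 0 and 1 are similar (short resting length); object 2 is dissimilar to both.
rest = [[0.0, 0.2, 1.0], [0.2, 0.0, 1.0], [1.0, 1.0, 0.0]]
pos = [[random.random(), random.random()] for _ in range(3)]
vel = [[0.0, 0.0] for _ in range(3)]
for _ in range(500):
    spring_step(pos, vel, rest)
print(pos)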

A number of different factors can be used to determine when the system has reached an equi-

librium. One trigger for stopping the simulation is if all objects move by less than a predefined

distance threshold. Alternatively, the potential energy of the system can be used to determine

when the simulation should be stopped [155]. An equilibrium is reached when the potential energy

is minimised, therefore when the reduction in potential energy begins to plateau the simulation

can be stopped. The potential energy of the system, E, is proportional to the difference between

each spring’s resting and current lengths:

E = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (1/2) k_{ij} (|P_i − P_j| − l_{ij})^2 (8.4)

An alternative to viewing the objects as a physical system of weighted springs is to use simulated

annealing [154]. Annealing is the process whereby crystals are formed out of solution through the

reduction of temperature. The process must occur slowly otherwise the crystal structures will not

form correctly. By lowering the ‘temperature’ of the system slowly it can be ensured that the

correct global minimum is found. Therefore simulated annealing can be applied to Kamada and Kawai’s [155] energy reduction technique to ensure that the correct global minimum is found.

8.2.2 Logarithmic

Eades [33] made a number of modifications to the physical simulation of springs. Firstly, attrac-

tive and repulsive forces were computed separately and only attractive forces were considered for

neighbouring objects whilst repelling forces were considered for all objects. This was done to re-

duce the number of computations. Secondly, Hooke’s Law was modified to the following individual

attractive and repulsive formulas:

F_a = l_r log l_c (8.5)

F_r = l_r / l_c^2 (8.6)

The motivation behind these formulas is to uniformly distribute vertices across the frame. The

logarithm of l_c tapers off as l_c increases, thereby reducing the attractive force of more distant


objects. Likewise, dividing l_r by l_c^2 reduces the effect of the repulsive force as l_c increases. The result is that objects can more freely arrange locally without the effect of distant objects, as the goal is to uniformly distribute vertices. Fruchterman and Reingold [153] also used a similar technique

but eliminated the logarithm as it was inefficient to compute.

F_a = l_c^2 / l_r (8.7)

F_r = −l_r^2 / l_c (8.8)
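For comparison, the force laws discussed so far can be sketched as simple Python functions of the current length l_c and resting length l_r (constants and direction vectors are omitted for brevity):

import math

def hooke(lc, lr, k=1.0):
    # Hooke's Law (equation 8.1): force proportional to the extension of the spring.
    return -k * (lc - lr)

def eades_attractive(lc, lr):
    # Equation 8.5: attraction grows only logarithmically with distance.
    return lr * math.log(lc)

def eades_repulsive(lc, lr):
    # Equation 8.6: repulsion falls off with the square of the distance.
    return lr / (lc * lc)

def fr_attractive(lc, lr):
    # Equation 8.7 (Fruchterman and Reingold): avoids the logarithm.
    return (lc * lc) / lr

def fr_repulsive(lc, lr):
    # Equation 8.8.
    return -(lr * lr) / lc

for lc in (0.5, 1.0, 2.0, 4.0):
    print(lc, hooke(lc, 1.0), eades_attractive(lc, 1.0), fr_attractive(lc, 1.0))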

One advantage of tapering off forces as the distance between objects increases is that the

constraints placed on the system are reduced allowing the system to stabilise faster. The problem

with this approach is that a content-based video retrieval user interface does not require all objects

to be uniformly arranged across the frame as there wouldn’t be enough room in the frame for all

of the objects in the database. However, at the cluster level tapering off force strength may allow

clusters to be uniformly distributed across the frame.

8.2.3 Summing Attractive and Repulsive Forces Individually

One of the problems with the weighted springs approach is the computational complexity of calculating N^2 forces for many iterations. Fruchterman and Reingold [153] proposed that since

the repulsive forces are small for far away objects, the repulsive forces should only be used for

neighbouring objects. Since it is simple to determine whether a spring would cause an attractive

or a repulsive force due to its relative length compared to its resting length, the reduction in re-

pulsive force calculations reduces the computational requirements of each iteration. A grid is used

to determine whether objects are within neighbouring grid cells. Both forces need to be computed

for objects lying in neighbouring grid cells but only the attractive force for objects lying outside

the neighbouring grid cells. The attractive and repulsive forces were computed using the following

equations:

F_rep = −k (l_r^2 / l_c)(P_1 − P_2) (8.9)

F_attr = −k (l_c^2 / l_r)(P_1 − P_2) (8.10)

The assumption of this technique is that most far away objects will exert a repulsive force; however, this really depends on the characteristics of the similarity measure. For instance, the

similarity measure may not result in similarity values that vary proportionally to an ideal visual

representation. Therefore, the similarity values need to be pre-processed so that larger similarity

values appear even farther away.


8.2.4 Energy-based Placement

As discussed earlier one trigger for stopping the simulation is when the potential energy of the

system reaches a minimum. Another approach to placing objects is to minimise the energy directly

using the Newton-Raphson method [155]. The advantage of this approach is that it is able to reach

a minimum more quickly than the standard force simulation. Kamada and Kawai’s [155] method is

similar to a multidimensional scaling method defined by Kruskal and Wish [156] and both have

been generalised by Cohen [157]. Simplifications to the objective function defined by Kamada and

Kawai allow exact optimisation in time that is polynomial in the number of vertices [158].

8.2.5 Inserting Dummy Vertices

Clusters are more easily identified when the objects within a cluster are close together and when

separate clusters are farther apart. One approach to shrinking the size of a cluster is to insert

dummy objects into the centre of a cluster of objects and attach tight springs to all of the objects

in the cluster to draw them closer together and farther away from other clusters [158]. The challenge

with this approach is in being able to successfully identify clusters in the clustering space.

8.2.6 MountainView Clustering

MountainView requires a clustering approach that allows dense clusters to be visible as mountain

peaks. The generation of the peaks was left to the user interface; however, the clustering technique

needed to produce sufficiently distinguishable clusters so that the mountain peaks did not overlap

unnecessarily. Except for the Grid File [152], which had not been applied to visual presentation,

the weighted springs approach was essentially the only method to begin with. The Hooke’s Law

weighted springs technique was implemented and applied to the image database used for the colour

and contour experiments. The result is shown in Figure 8.1.

As can be seen from Figure 8.1 the weighted springs technique successfully arranges objects by

similarity. However, it can also be seen that there are no distinguishable clusters as the objects

are arranged relatively evenly across the space. This problem is primarily due to the nature of the

similarity values between objects. For example, objects that would be considered within a group

generally have feature distances below 0.3 whilst objects considered outside the group generally

have feature distances between 0.3 and 1.0. So a dissimilar object may only be twice the distance

away from a similar object. When there are many clusters in the display it is not possible for them

to be easily distinguishable if a dissimilar object is at most only twice as far from the centre of a cluster as the cluster’s own members are.

Eades’ [33] logarithmic approach does not improve upon the basic approach in terms of pro-

ducing clusters (see Figure 8.2 (a)); however, it is effective in evenly spacing objects, which is a requirement for graph drawing but not for content-based video retrieval.

Figure 8.1: Basic weighted springs implementation based on Hooke’s Law.

Fruchterman and Reingold’s [153] approach produces slightly more visible clusters but not sufficient for content-based

video retrieval (Figure 8.2 (b)).

A solution to the problem of producing more distinguishable clusters is to apply a function

to the feature distances. A cubic function was applied to the feature distances to increase the

distance between dissimilar objects and reduce the distance between similar objects. The results

for the cubic function are shown in Figure 8.3.

The cubic function produces a better clustering than the other methods, but when zoomed in,

it is difficult to distinguish between clusters in the dense cluster at the bottom right of Figure

8.3 (b). Since it has been identified heuristically that objects that are considered similar have a

feature distance less than 0.3, a condition was added to increase the feature distance of objects

that are considered dissimilar. This was achieved by doubling the feature distance for feature

distances greater than 0.3. The results are shown in Figure 8.4. Comparing Figure 8.4 with Figure

8.3 there doesn’t appear to be much improvement in the clustering. This is because the system

is still trying to represent feature distances as accurately as possible giving equal importance to

attractive and repulsive forces. For dissimilar objects the attractive force is much less important

Figure 8.2: (a) Eades’ logarithmic weighted springs implementation [33], (b) Fruchterman and Reingold’s weighted springs implementation [153].

Figure 8.3: Weighted springs with feature distance cubed. (a) Entire data set, (b) zoomed in.

than the repulsive force. However, some attractive force is still required to prevent clusters from

floating away. The attractive and repulsive forces were calculated independently and the attractive

force was reduced 1000 times between dissimilar objects. The results are shown in Figure 8.5. The

clusters are much more distinguishable with this method than the other methods. Zooming in also

shows that subclusters are more easily distinguishable making this weighted springs approach the

most suitable for spatial clustering in CBVR and the MountainView user interface. A screenshot of

the clustering applied to the MountainView user interface can be seen in Figure 7.2 of the previous

chapter.
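A minimal sketch of the distance shaping described in this section is given below (the exact ordering of the cubing and doubling steps, and the way the reduced attraction is applied, are assumptions made for the illustration):

def shaped_rest_length(d, threshold=0.3):
    # Cube the feature distance to pull similar objects together and push
    # dissimilar objects apart, then double it when the raw distance exceeds
    # the 0.3 similarity threshold identified heuristically above.
    shaped = d ** 3
    if d > threshold:
        shaped *= 2.0
    return shaped

def attractive_scale(d, threshold=0.3):
    # Attractive forces between dissimilar objects are reduced 1000 times so
    # that repulsion dominates while clusters are still loosely held in place.
    return 1.0 if d <= threshold else 1.0 / 1000.0

print(shaped_rest_length(0.2), shaped_rest_length(0.8), attractive_scale(0.8))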

8.3 Hierarchical Clustering

The previous section on spatial clustering shows that presenting easily distinguishable clusters

is difficult when spatial techniques are used. The new method of weighted springs presented in

the previous section is able to present more easily distinguishable clusters than previous methods

however the distinction between clusters is not uniform across the space and begins to deteriorate

in subclusters. The limitations of spatial clustering techniques led us to formulate a completely

different user interface called DomeWorld which is described in the previous chapter. The goal

of DomeWorld was for the clustering scheme to provide uniform discrete clusters of images at

multiple levels. Since the structure is discrete it allows us to formulate quantitative goals. The

intention of browsing content-based retrieval user interfaces is to provide the user with a starting

point that provides an overview of where they can go in the database. Therefore the top level

of the structure should contain n substructures with each uniformly subdividing the information

space so as to optimise the user’s decision making process. The number of substructures, n, should

be small enough so that the user can make a navigational decision with not much more than a

glance at the user interface, but also large enough so that the height of the tree is minimised.

After the user decides which substructure to continue the query with the user interface zooms into

the substructure presenting it in a similar way to the top level. The user once more must make a

decision for which substructure to proceed the query with. Since the user’s decision making power

will be the same at the top level as the lower levels, all nodes in the tree should ideally have a

branching factor of n substructures. In this section we investigate hierarchical clustering techniques

that can satisfy these requirements.

The main drive behind clustering techniques in content-based retrieval systems has been to

improve query times through the use of indexes. Querying in a content-based retrieval system

almost always involves finding the most similar objects to the query parameters specified. Without

an index, every object in the data set must be evaluated to determine whether it should be in

the result set. A content-based retrieval system may contain hundreds of thousands of images. A

content-based video retrieval system or an internet image search engine may contain hundreds of

millions of images. Without an index, gigabytes of memory need to be accessed for every query.

An hierarchical index used with a content-based retrieval system provides the same benefits that

Figure 8.4: Cubic weighted springs with feature distances doubled if the feature distance is greater than 0.3. (a) Entire data set, (b) zoomed in.

Figure 8.5: Weighted springs with relaxed springs for large feature distances. (a) Entire data set, (b) zoomed in.

B-trees [31] provide for conventional databases. The query system compares the nodes at each level

in the hierarchy looking for the most similar key until it reaches a leaf node. Only the items in the

leaf node and potentially the surrounding leaf nodes need to be evaluated to determine the most

similar objects. Given a suitable branching factor the amount of memory that needs to be accessed

per query is minimal.

The benefits of hierarchical indexes are clear for querying but our primary interest for this

research is in visual presentation. Even so, hierarchical indexing schemes may be useful for visual

presentation as they are designed to produce a structure that is efficient for a computer system

to locate objects, with the difference that the user’s visual processing would do the searching

rather than the computer system. In this section a number of indexing schemes are investigated for

suitability in a content-based retrieval user interface as well as general hierarchical schemes such

as agglomeration which were not designed with the specific purpose of indexing in mind.

8.3.1 Multidimensional Indexing

Indexing schemes used to improve query times in content-based retrieval systems use multidimen-

sional indexes. A multidimensional index assumes that each feature vector represents a point in

a multidimensional space. The index partitions the space hierarchically into a B-tree-like struc-

ture for optimal query execution. Multidimensional indexes assume a Euclidean space where each

element of the feature vector represents a location along one axis. The feature distance between

two objects is simply the Euclidean distance, which can be a problem for databases where the

Euclidean distance is not a suitable similarity measure. For example, the feature distance between

two histograms is often better represented by the histogram intersection rather than Euclidean

distance [21].

Multidimensional indexes are essentially B-trees [31] that have been extended to more than

one dimension. This is achieved by encapsulating groups of objects within geometrical structures

such as cuboids, spheres, or an axis partition. The first attempt at constructing a multidimensional

index was the KD-tree (k-dimensional tree) [159]. The KD-tree is a binary tree where each node in

the hierarchy represents a perpendicular split along one dimension. The KD-tree was not designed

to be dynamic and did not allow insertions or deletions like conventional B-trees. However, the

KD-tree was extended to support insertions and deletions, and further refined to support other

more optimal geometric structures. These multidimensional indexing techniques are discussed in

the following subsections.

KDB-tree

The KDB-tree [160] is based on the KD-tree with the ability of dynamically inserting and deleting

nodes using a technique similar to the B+-tree [32] hence the name k-dimensional B+-tree. When

performing k nearest neighbours using a KDB-tree, logarithmic search behaviour can be observed


if the number of dimensions is small and the size of the database is large [161]. However, for a

larger number of dimensions almost every record must be examined. A nearest neighbour search

must compare every entry in the current page to find the k nearest neighbours and may also need

to search neighbouring pages. Hence, for large dimensions many neighbouring pages may need to

be searched. Since the similarity function is rotation invariant, Sproull [161] proposed that a multi-

dimensional tree be partitioned by arbitrarily oriented partition planes. Much better performance

was observed with arbitrary partition planes although more time was required to determine the

planes when constructing the tree [161].

Since the purpose of this research is to find a clustering technique suitable for browsing a video

database, factors such as insertion time and retrieval time are less important. Instead, the structure

of the tree is more important. The primary limiting factor of the KDB-tree is that it is a binary

tree. Therefore, the user must make a decision between only two nodes at each level resulting in a

tall tree and many binary decisions before the target object is found.

R-tree

The R-tree [28], like the KDB-tree, uses a B+-tree mechanism for insertions and deletions; however,

it is different in its structure. The primary structural difference is that the R-tree supports more

than two children per node. Unlike the KDB-tree where a node represents one side of a partitioned

axis, a node in an R-tree is a bounded multidimensional rectangle. The original motivation behind

the R-tree was to index rectangles instead of point data for use in CAD systems [28]. Content-based

retrieval systems only need point data, however, there are advantages in having the enclosing nodes

as multidimensional rectangles such as supporting more than two children per node. A minimum

(m) and maximum (M) number of nodes can be specified for the R-tree. Keeping the height of

the structure low and balanced is beneficial for an hierarchical user interface. Guttman [28] found

that m = M/2 provided the best storage utilisation. Beckmann et al. [41] also found that the best

retrieval performance was gained when m was set to 40% of M . Computer retrieval performance

for a database index is a useful measure of the user’s performance: the user is able to analyse the current node quickly, whilst further time is required to navigate to another node, which is analogous to the caching effect that tree nodes provide for database indexes.

Nearest neighbour queries can be performed on an R-tree using a branch and bound search

algorithm [162]. Nearest neighbour queries are complicated by the elongated shape of rectangles

and their overlap.

If the content-based retrieval system is primarily static the index can be optimised as a back-

ground task. For primarily static R-trees an optimal packing algorithm can be used [163]. A packing

algorithm would be useful for browsing user interfaces where changes to the database are infre-

quent. Roussopolous and Leifker [163] developed a packing algorithm for R-trees which fills nodes

as full as possible using a recursive algorithm which chooses nearest neighbours for each node. The

packed R-tree was shown to be much more efficient than one created with Guttman’s [28] insert


algorithm.

One of the more successful variants of the R-tree is the R*-tree [41]. The primary features of

the R*-tree are forced reinsertions and a different mechanism for splitting. Rather than just using

area as the minimising criteria for an optimal split, Beckmann et al. [41] also minimised the margin

and overlap values. Beckmann et al. found that the R*-tree performed better than all other R-tree

variants. The R*-tree is currently being used in the QBIC system [16] as a feature index.

SS-tree

One of the major problems with R-trees is that calculations involved in nearest neighbour queries

can be complex. White and Jain [29] have developed the SS-tree which is derived from the R-

tree and is designed to improve the performance of nearest neighbour queries. An SS-tree uses

minimum bounding spheres rather than minimum bounding rectangles for each node. An interesting

aspect of the SS-tree is that the spherical structures are similar to the encapsulating circles and

domes required for DomeWorld. Nearest neighbour searches are greatly simplified using the SS-

tree because the calculations involve simple subtractions between the centroids and radii of nodes.

Katayama and Satoh [30] have shown that the SS-tree performs much better than kDB- and R*-

trees especially on real data sets. This may indicate that DomeWorld’s use of encapsulating circles

may also allow the user to more efficiently find their target object.

The SS-tree was implemented and images from the test database were inserted into the tree

resulting in the clusters of Figure 8.6. The clusters were laid out for rendering by setting a fixed

radius for the root cluster and setting the diameter of the subclusters to 60% of the arc that the

subclusters reside in. The maximum number of nodes, M , was set to 12 and the result is a relatively

uniform clustering of the nodes. However, some images, such as the car images, have been split

across three clusters. This is due to the inability of the Euclidean distance measure to accurately

represent feature distance between histograms.
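The layout rule can be sketched as follows (a hypothetical reading in which the subclusters are spaced evenly around the parent circle and each subcluster’s diameter is 60% of the arc allotted to it):

import math

def layout_subclusters(parent_centre, parent_radius, n_children):
    # Space the children evenly around the parent circle; each child circle's
    # diameter is taken as 60% of the arc length allotted to it.
    cx, cy = parent_centre
    arc = 2.0 * math.pi * parent_radius / n_children
    child_radius = 0.6 * arc / 2.0
    placed = []
    for i in range(n_children):
        angle = 2.0 * math.pi * i / n_children
        centre = (cx + parent_radius * math.cos(angle),
                  cy + parent_radius * math.sin(angle))
        placed.append((centre, child_radius))
    return placed

print(layout_subclusters((0.0, 0.0), 10.0, 12))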

SR-tree

Katayama and Satoh [30] have shown that the performance improvement of SS-trees over R-trees

is because of the reduced diameter of spherical nodes compared to rectangular nodes. However SS-

trees occupy more volume than R-trees which will increase overlap and hence reduce performance

for nearest neighbour queries in high dimensional data sets. Katayama and Satoh [30] developed

the SR-tree which uses both minimum bounding spheres and minimum bounding rectangles. The

spheres reduce the diameter whilst the rectangles reduce the volume. The combined result is that

nodes become more disjoint and performance of nearest neighbour queries is increased. The SR-tree

performs better than both the R-tree and SS-tree when processing queries but tree updates are

more complex [30].


Figure 8.6: SS-tree structure.


Multidimensional Indexing Limitations

There are two limitations with the multidimensional indexing schemes presented in the previous

sections:

• Only fixed-size feature vectors are supported;

• Feature vectors are assumed to exist in a Euclidean space.

The first limitation may not be too much of a concern as we have shown in the previous chapters

that fixed sized feature vectors can perform as well as variable sized feature vectors. The second

limitation, however, significantly impedes the similarity techniques that have been found to produce the best results, namely histogram intersection and the combination of individual feature similarities, such as contour and colour, through multiplication. Such similarity measures do not map to a Euclidean space, and the preceding indexing techniques cannot support multidimensional spaces based

on intersection or multiplication. These limitations led us to more generic hierarchical clustering

techniques.
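As a concrete illustration of the non-Euclidean similarity measures referred to above, the sketch below (a minimal example with made-up four-bin histograms; it is not the thesis implementation) computes normalised histogram intersection for two feature histograms and combines the colour and contour similarities by multiplication.

def histogram_intersection(h1, h2):
    """Normalised histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    overlap = sum(min(a, b) for a, b in zip(h1, h2))
    return overlap / min(sum(h1), sum(h2))

def combined_similarity(colour_sim, contour_sim):
    # Individual feature similarities are combined multiplicatively, so an image
    # must score well on both colour and contour to rank highly overall.
    return colour_sim * contour_sim

colour_a, colour_b = [10, 5, 0, 1], [8, 6, 1, 1]
contour_a, contour_b = [2, 2, 4, 0], [1, 3, 4, 0]
similarity = combined_similarity(histogram_intersection(colour_a, colour_b),
                                 histogram_intersection(contour_a, contour_b))
print(round(similarity, 3))   # 0.766

Neither the intersection nor the product behaves like a distance between fixed coordinates in a Euclidean space, which is why the preceding index structures cannot be applied directly.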

8.3.2 Agglomeration

Agglomeration clusters hierarchically through a bottom-up approach. The algorithm begins with each object in its own cluster and iteratively combines the two clusters that form the smallest combined cluster until only one cluster remains [164]. In a multi-dimensional feature space, cluster size can be calculated as the volume of the cluster. However, in feature spaces that cannot be represented as points in a multi-dimensional space, cluster size is calculated as the maximum feature distance

between objects. The primary limitation of the agglomeration method is that it produces a binary

tree which is unsuitable for a browsing user interface.
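A minimal sketch of the standard agglomeration procedure described above, using the maximum pairwise feature distance among a cluster's members as the cluster size; the one-dimensional data and distance function are placeholders for illustration only.

import itertools

def cluster_size(members, dist):
    """Size of a cluster: the maximum feature distance between any two of its members."""
    if len(members) < 2:
        return 0.0
    return max(dist(a, b) for a, b in itertools.combinations(members, 2))

def agglomerate(objects, dist):
    """Standard agglomeration: start with singleton clusters and repeatedly merge the
    pair whose combined cluster is smallest, producing a binary tree bottom-up."""
    # Each entry pairs a member list with its subtree; a subtree is an object or a (left, right) tuple.
    clusters = [([obj], obj) for obj in objects]
    while len(clusters) > 1:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_size(clusters[ij[0]][0] + clusters[ij[1]][0], dist))
        (mem_i, tree_i), (mem_j, tree_j) = clusters[i], clusters[j]
        merged = (mem_i + mem_j, (tree_i, tree_j))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]    # root of the (usually unbalanced) binary tree

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], dist=lambda a, b: abs(a - b)))

Because each merge always creates a new parent node, the result is necessarily a binary tree, which is the limitation noted above.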

8.3.3 Hierarchical Divisive Methods

Hierarchical divisive methods are the logical opposites of agglomerative methods [151]. All objects

belong to one cluster initially and this cluster is iteratively subdivided. There are two approaches to

subdividing clusters: monothetic and polythetic. Monothetic techniques require that objects have

similar values on one attribute, whereas polythetic techniques allow objects to have similar values

on any attribute. Monothetic divisive techniques would create clusters whose members all share, for instance, the same colour, whereas polythetic techniques would create some clusters grouped by colour and others grouped by similar shape. One example of a hierarchical divisive method that is suitable

for content-based retrieval systems is the feature index tree.


Feature Index Tree

A feature index tree constructs a binary tree where each node contains a reference feature which

represents features of the subnodes [165]. The reference features are used to navigate through the

tree to find a match. A tree is constructed by starting with the entire data set and determining

its reference feature. The collection is split in half creating two child nodes, one contains elements

most similar to the reference feature and the other contains elements least similar to the reference

feature. The process continues splitting collections until the size of a collection is 1. The problem

with this approach is that the binary tree becomes skewed and unbalanced, and therefore would not be suitable for browsing user interfaces. A better approach would be to use one of the packing techniques proposed by Roussopoulos and Leifker [163]. Grosky and Mehrotra [165] have extended

the feature index tree so that it can have multi-way nodes suitable for secondary storage.
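For illustration, a sketch of the feature index tree construction described above; it assumes (purely as an example) that the reference feature is the centroid of the collection and that similarity is measured by Euclidean distance, which may differ from the choices made in [165].

def centroid(features):
    dims = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dims)]

def sq_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_feature_index_tree(features):
    """Each node holds a reference feature; the collection is split into the half most
    similar to the reference and the half least similar, until single elements remain."""
    if len(features) == 1:
        return {"reference": features[0]}
    reference = centroid(features)              # assumed choice of reference feature
    ordered = sorted(features, key=lambda f: sq_distance(f, reference))
    half = len(ordered) // 2
    return {"reference": reference,
            "most_similar": build_feature_index_tree(ordered[:half]),
            "least_similar": build_feature_index_tree(ordered[half:])}

tree = build_feature_index_tree([[0, 0], [0, 1], [5, 5], [6, 5], [9, 9]])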

8.3.4 Other Methods

Other methods for clustering exist but do not provide a hierarchical organisation. Therefore such

clustering techniques must also employ a hierarchical technique to form the first pass clusters into

a hierarchy. Other non-hierarchical clustering techniques include iterative partitioning, density

search, factor analytic, clumping, graph theoretic, and k-way partition [151].

8.3.5 DomeWorld Clustering

The existing methods for hierarchical clustering all have their weaknesses. Some methods only

produce binary trees, whilst others create unbalanced trees, and others use a Euclidean feature

distance. Multidimensional indexing schemes such as the SS-tree [29] of Figure 8.6 are able to

produce balanced trees but the Euclidean distance measure causes similar objects to be scattered

throughout different clusters. Therefore, a technique is required that can use non-Euclidean feature

distances. Conventional clustering techniques such as agglomeration [164] and newer techniques

such as the feature index tree [165] support a non-Euclidean feature distance, but both produce binary

trees.

Since multidimensional indexing schemes require a Euclidean space they cannot be used. The binary trees of the agglomeration technique are not suitable for a browsing user interface; however, for DomeWorld we have extended the agglomeration technique to support the creation of multi-way

trees.

When the binary agglomeration method is used to produce a tree from the test database of

350 photos it results in a binary tree 19 levels high (see Table 8.2). Since a balanced binary tree of 350 elements would only be 9 levels high (⌈log2 350⌉ = 9), the agglomeration method clearly produces highly unbalanced trees. Agglomeration forms skewed trees because similar objects tend

to be ‘linked’. That is, the two most similar objects form the smaller cluster. The next most similar


object might also join that cluster forming the parent branch. The next most similar object might

join the previously added object forming its parent branch. Therefore, a cluster of similar objects

results in a skewed tree where the most similar objects are at the bottom whilst the least similar

are towards the top. This characteristic of the agglomeration tree makes it difficult to reorganise

the binary tree into an m-way structure. Instead the agglomeration process has been modified to

support the creation of a tree with a branching factor greater than two. Humans can handle larger branching factors of between 7 and 10 branches per node. A tree with a branching factor of 7

would allow the 350 photos to be represented in just 4 levels.

The problem with the standard agglomeration method is in how clusters are grouped. In the

standard method grouping two clusters always produces a new level in the tree. This results in

a binary tree. The grouping algorithm has been modified to allow one subtree to be merged into

another subtree under some circumstances rather than create a new level in the tree. Subtree merging allows for a greater branching factor and fewer levels in the tree.

The grouping algorithm has been modified to depend on the heights of the subtrees being grouped, with the aim of keeping the tree as low as possible. For two subtrees with heights hA and hB the following grouping rules apply (a sketch of the grouping step is given after the list):

• If hA ≠ hB then the subtree with the lower height is added as a direct child of the higher subtree, see Figure 8.7 (a).

• If hA = hB then a new node is created and both subtrees are added as children of the new node, which is the same as the standard agglomeration approach, see Figure 8.7 (b).

• If one subtree (A) contains only one object and the other subtree (B) has a height greater than one, then subtree A will be merged into the subtree of B that contains the object most similar to A, see Figure 8.7 (c).
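The sketch below illustrates the three grouping rules; the Tree class, its helper methods, and the rule ordering (the more specific rule (c) is tested first) are assumptions made for the sake of a compact example rather than the exact implementation.

class Tree:
    """A minimal m-way tree node: leaves hold an object, internal nodes hold children."""
    def __init__(self, obj=None, children=None):
        self.obj = obj
        self.children = children or []

    def height(self):
        return 0 if not self.children else 1 + max(c.height() for c in self.children)

    def leaves(self):
        if not self.children:
            return [self.obj]
        return [o for c in self.children for o in c.leaves()]

def group(a, b, dist):
    """Group two subtrees chosen by the agglomeration step using the modified rules."""
    ha, hb = a.height(), b.height()

    # Rule (c): a single object is merged into the subtree of the taller tree that
    # contains its most similar object, instead of adding a new level.
    single, other = (a, b) if ha == 0 else ((b, a) if hb == 0 else (None, None))
    if single is not None and other.height() > 1:
        target = min(other.children,
                     key=lambda c: min(dist(single.obj, o) for o in c.leaves()))
        if target.children:
            target.children.append(single)
        else:                              # nearest subtree is itself a leaf: pair them up
            other.children.remove(target)
            other.children.append(Tree(children=[target, single]))
        return other

    # Rule (a): unequal heights, so the lower subtree becomes a direct child of the higher one.
    if ha != hb:
        low, high = (a, b) if ha < hb else (b, a)
        high.children.append(low)
        return high

    # Rule (b): equal heights, so create a new parent node, as in standard agglomeration.
    return Tree(children=[a, b])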

The new grouping rules result in a tree that is now only 5 levels high and has a branching

factor of 2.86 (see Table 8.2). A tree 5 levels high is substantially better than 19 levels and much

closer to the ideal 4 levels. However, a branching factor of 2.86 is still quite low and not much

better than a binary tree. A closer examination of the tree showed that there were many subtrees

containing only two children and that these were non-leaf nodes. Of the 68 subtrees produced, 22

had a height greater than 1 and only two children.

To reduce the number of subtrees containing only two children an additional pass was applied to

the tree to collapse non-leaf node subtrees that only have two children. Collapsing involves taking

the two children of the current subtree and making them direct children of the subtree’s parent.

Collapsing increases the branching factor to 3.39, which still does not appear to be very large. One reason for this is that subtrees with a height of 1 are not affected by the collapsing algorithm. Considering only non-leaf nodes, the branching factor of the tree has actually increased from 2.89 to 4.78, which indicates that the collapsing algorithm is quite successful in increasing the branching factor.



Figure 8.7: DomeWorld agglomeration grouping rules.


Table 8.2: Agglomeration implementation comparison of height and branching factor (BF).

Technique        Height   BF: All Levels   BF: Level > 1
Agglomeration    19       2                2
Merging          5        2.86             2.89
Collapsing       5        3.39             4.78

Collapsing did not provide any reduction in the height of the 5-level tree. However, only 16 of the 350 objects (4.6%) are 5 levels deep. The remaining objects are at 4 levels or fewer, resulting in an overall lower tree. Therefore, for over 95% of the objects, the user needs to make at most 4 navigation decisions.
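A sketch of the collapsing pass, reusing the hypothetical Tree structure from the grouping sketch above: non-leaf subtrees of height greater than one that have only two children are removed by promoting their children to the parent.

def collapse(node):
    """Collapse two-child, non-leaf subtrees by promoting their children to the parent,
    removing unnecessary intermediate levels and increasing the branching factor."""
    new_children = []
    for child in node.children:
        collapse(child)                           # collapse bottom-up
        if len(child.children) == 2 and child.height() > 1:
            new_children.extend(child.children)   # promote the grandchildren
        else:
            new_children.append(child)
    node.children = new_children

Subtrees of height 1 are deliberately left untouched, which is why the branching factor over all levels rises less than the branching factor measured over non-leaf nodes in Table 8.2.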

The resulting layout for the DomeWorld agglomeration approach is shown in Figure 8.8. The

figure also incorporates the representative objects and the layout algorithm discussed in the previous chapter. The layout for the sample database begins with 9 primary clusters. The overall

tree structure is low enough and broad enough to allow most object thumbnails to be seen even

from the top level.

A clustering layout displaying all 1,530 shots extracted from the ‘Spy Game’ video using the

Fast X-ray shot detection method from Chapter 6 is shown in Figure 8.9. Figure 8.9 appears less balanced than Figure 8.8; however, a closer inspection of the subtrees indicates that the lack of

balance is justified. For example, the bottom left subtree contains the shots from the only black

and white scene from the first hour of the movie.

8.4 Clustering Comparison

It is difficult to compare clustering schemes for data browsing empirically. One of the goals for

the proposed image browser was to use the available screen real estate efficiently whilst providing

clearly identifiable cluster boundaries. Comparing the weighted springs layout in Figure 8.5 and

the agglomeration layout in Figure 8.8 it can be seen that the agglomeration layout is able to

make more efficient use of space because clusters are not defined by their distance but by their

membership within a circle. The discrete membership provided by agglomeration allows clusters to

be clearly identified but also allows similar objects to still be grouped together because of its hier-

archical nature. The weighted springs implementation still has some advantages in that a discrete

membership is not required and if the chosen features do not provide a discrete membership their

similarity is still evident in the resulting clustering layout. The weighted springs implementation

however does not lend itself easily to scaleable viewing where as the agglomeration scheme can

scale well because of its hierarchical nature.


Figure 8.8: DomeWorld agglomeration clustering technique.


8.5 Summary

In this chapter clustering techniques were investigated for suitability to provide the clustering re-

quired for the MountainView and DomeWorld user interfaces of the previous chapter. Each user

interface required a different approach to clustering, being spatial or hierarchical. Existing clus-

tering approaches were found to be limited for a number of reasons. Existing spatial clustering

techniques were not designed to provide easily distinguishable clusters, but instead were designed

to present aesthetically pleasing graphs. Existing hierarchical clustering techniques were either designed for indexing purposes, and therefore used a Euclidean feature space, or produced unbalanced binary trees. Existing spatial and hierarchical clustering methods were extended

to support the requirements of browsing content-based retrieval user interfaces. A new weighted

springs approach was developed that provides more easily distinguishable clusters through provid-

ing a non-linear feature distance and tapering spring forces as the feature distance increases. A new

agglomeration approach was also developed that extends the existing agglomeration method, which is limited by its unbalanced binary tree, to provide the balance and efficiency of indexing schemes such as the SS-tree.

As noted in the previous chapter, over the course of this research it was realised that the spatial

clustering of MountainView may not be sufficient for user interface browsing. The DomeWorld

hierarchical agglomeration technique produces a structure that allows the user to view 1530 objects

simultaneously whilst successfully grouping them into clusters of similar objects and retaining a

reasonably balanced tree with a large branching factor. The hierarchical grouping allows the user

to choose between a small number of subtrees at each stage and the relatively balanced tree with

a large branching factor reduces the height of the tree resulting in fewer navigational steps for the

user before reaching the target object.


Figure 8.9: DomeWorld agglomeration clustering of ‘Spy Game’ shots.


Chapter 9

Conclusions and Future Work

The aim of this research has been to improve the state of the field of CBVR in the areas of feature

extraction, representation, and user interaction. The preceding chapters presented the results of

the progress made in these three areas of CBVR during this research. In this chapter the results of our research are drawn together, and conclusions and directions for future work are presented.

In Chapter 1 the three requirements of CBVR were presented:

1. Users must be able to communicate their query through an interface;

2. Relationships between queries and content must be understood;

3. The system must be able to automatically decompose content for requirement (2).

In addition, it was identified that existing CBVR systems exhibited limitations in their ability

to fulfil these requirements and more research was required to improve the features extracted, the

query mechanism, and the interaction of the CBVR system with the user. Each of these three

areas is equally important to the CBVR system and a limitation in only one component will affect

the capabilities of the entire system. Therefore, this research focussed on all three areas allowing

each component to benefit from the contributions made in the other components. The next three

sections present how the outcomes of each area of research contributed to our original goals.

9.1 Feature Extraction

The first stage of a CBVR system involves feature extraction. The performance of the remaining

components of the system is highly dependent on this first stage. Ideally, a CBVR system would use

the same features that the human brain uses for determining video, image, and object similarity.

The literature review of human vision research in Appendix A showed that not enough is currently

known about the function of the brain to emulate its feature extraction and matching abilities.


More is known about low-level vision processing and progressively less is known as the neural

signals progress to higher-level processing stages. As a result, existing video and image processing

techniques may model human vision for low-level processing but diverge to using analytical and

computational approaches for medium and high-level processing.

Low-level processing techniques that closely model human vision include the Canny [13] and

Gabor [12] filters which exhibit similar receptive fields to simple cells in the visual cortex [92]. An

analysis of edge detectors presented in Chapter 4 led to potentially the most significant contribution

of this thesis. It was found that the orientation and positional tuning performance of existing edge

detectors exhibited noise in non-aligned scenarios irrespective of aspect ratio. An analysis of non-

aligned scenarios highlighted the lack of stimulus symmetry across the lateral axis of the edge

detector. A new filter was designed to detect stimulus asymmetry and was used to inhibit noise

in the orientation and positional tuning curve of the standard Canny filter. The result was that

the new Asymmetry edge detector was able to produce the ideal single tuning curve response for

an aligned edge, and a double response for a non-aligned edge allowing the true orientation of

non-aligned edges to be interpolated.

Even though there is no direct evidence of an asymmetry filter in the neurophysiological lit-

erature there is evidence of differently shaped and oriented receptive fields and inhibitory action

[92]. There is also evidence for positional and orientation specificity in the first stages of vision

processing with reduced positional dependence but increased shape dependence as the signals flow

along the visual pathway [10]. Therefore, if there were an asymmetry detector, it would most probably occur in layer III or IV of V1.

The asymmetry filter is simple to implement as it is based on the Canny filter; however, the improvement in contour extraction is dramatic. The precision in representing the position and

orientation of edges allows assumptions to be made in later processing stages such as thinning and

edge linking. The combination of Asymmetry edge detection, Gaussian multi-orientation thinning,

diagonal removal, and multi-orientation edge linking provide a robust contour extraction technique.

The edge linking technique is fine-tuned so that contour following stops at a sharp bend in the

contour resulting in near linear and circular contours without false links between contours that

are part of the object and contours that are part of the background. The improvement in contour

extraction allows for greater representation of the shape features of an image and therefore allows

better retrieval of images and video based on shape information, improving the ability of CBVR

systems to satisfy requirement (3). Therefore a change in the way low-level edge detection is per-

formed (which has remained largely the same for over 20 years [166]) has resulted in improvements

in the higher-level stages of processing.


9.2 Representation

Colour representation has been extensively researched in the area of CBIR. Colour is an important

and reliable feature for performing image queries. Much of the research in CBIR has focussed on

the colour space and the form of distribution representation. The focus of our research was not to

make any new discoveries in the area of colour representation but instead to use it as a reliable

feature that can be integrated into a complete CBVR system. However, our initial experiments

with colour histograms showed their sensitivity to subtle changes in colour due to slightly different

colours being placed in different bins. One solution to the problem is to greatly increase the number

of bins and allow the distribution of colours to occupy many more bins, increasing the chances that corresponding bins would contain similar values; this was the approach taken by Smith and Chang [27] with their colour sets. Storage and processing restrictions, however, make smaller histograms preferable to larger ones, making the increased bin count solution less attractive. An alternative is to compare not only corresponding bins but also their neighbours. This

approach was taken in the QBIC system [16] but increases query complexity.

Our solution to the problem involved changing the way pixels were assigned to bins as opposed

to changing the structure of the histogram or the histogram comparison method. The solution

known as anti-aliased histograms involves only adding a portion of a unit to a bin depending on

how close the pixel’s colour value is to the bin’s centre colour relative to the other adjacent bins.

The result is a form of anti-aliasing which is commonly used in computer graphics to present

smooth vector graphics on low resolution displays. The anti-aliased histograms are actually formed

using fuzzy logic where the optimal fuzzy membership function appears to be a linear function.

The anti-aliased or fuzzy histograms allowed fewer bins to be used and produced better results than histograms with more bins.
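A minimal one-dimensional sketch of the anti-aliased binning idea (the bin count and value range here are arbitrary, and the thesis applies the same principle per dimension of its colour and contour histograms): each value's unit vote is split linearly between the two nearest bin centres.

def fuzzy_histogram(values, num_bins, lo=0.0, hi=1.0):
    """Anti-aliased 1-D histogram: each value contributes fractionally to the two nearest
    bins in proportion to its distance from their centres (linear membership function)."""
    hist = [0.0] * num_bins
    bin_width = (hi - lo) / num_bins
    for v in values:
        pos = (v - lo) / bin_width - 0.5       # position measured in bin-centre units
        left = int(pos) if pos >= 0 else -1
        frac = pos - left                      # 0.0 at the left centre, 1.0 at the right centre
        if 0 <= left < num_bins:
            hist[left] += 1.0 - frac           # portion of the unit given to the left bin
        if 0 <= left + 1 < num_bins:
            hist[left + 1] += frac             # remainder given to the right bin
    return hist

# Two nearly identical values fall into different crisp bins, but their fuzzy
# histograms are nearly identical, so their intersection remains high.
print(fuzzy_histogram([0.249], num_bins=4))    # ~[0.504, 0.496, 0.0, 0.0]
print(fuzzy_histogram([0.251], num_bins=4))    # ~[0.496, 0.504, 0.0, 0.0]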

Even though the purpose of this research has been to provide a complete structural decompo-

sition of video for video retrieval, the benefits of fuzzy histograms can also be applied to structural

features. Contours were summarised and represented by their location, curvature, length, and orien-

tation allowing the distribution of contours to be indexed with fuzzy histograms. Fuzzy histograms

were evaluated against two brute force methods of comparing all of the individual contours between

two images. The fuzzy histograms approach performed just as well as the brute force approaches

and used considerably less storage and processing power.

When the colour and contour fuzzy histograms were combined, even fewer bins were required than for either individual technique to provide the same results. The total number of bins required

for the combined colour and contour representation was only 28 bins. Fuzzy histograms allow for

an efficient representation of either colour, contour, or combined features by providing a smaller

representation and lower query complexity, whilst still satisfying CBVR requirement (2).


9.3 User Interaction

The third major focus of this research has been on the way the user interacts with a CBVR system.

Our review of existing CBIR and CBVR user interfaces found that both had useful but disjoint

feature sets. The problem with existing CBIR user interfaces is that they focus on a query-oriented

user interface that ignores the temporal hierarchy of videos and requires the user to have special

skills in either selecting query attributes, drawing a sketch, or presenting a pre-existing image.

CBVR user interfaces on the other hand support the browsing of the temporal hierarchy of video

sequences but do not integrate the ability to view video objects by content. As a result a number of

user interfaces were designed to overcome these problems. The most successful at integrating the

features necessary for CBVR was the DomeWorld user interface that organises the video objects

hierarchically by content. DomeWorld is presented in three dimensions which provides both context

and detail in a natural environment, presents a hierarchy that can be used for either individual

images or video, and does not require the user to input query parameters, draw a sketch, or present

a pre-existing image. DomeWorld’s user interface greatly simplifies the CBVR system’s ability to

satisfy CBVR requirement (1).

9.4 Future Work

The field of CBVR is still a relatively new area of research and a great deal of work lies ahead

to produce a system that matches the abilities of the human brain and is also able to effectively

communicate with the user. There are many aspects of CBVR that still need to be addressed; however, the following three remain the priorities for a usable CBVR system: structural decomposition;

spatial representation and querying; and the user interface.

9.4.1 Structural Decomposition

In this research the low-level processing of edges has been greatly improved paving the way for

more complete medium and high-level processing stages. This research has focussed on the contour

representation of images which is considered a medium-level feature. In addition to contours,

contour-ends and vertices were also extracted. It is envisioned that in higher-level processing stages

the contours and vertices would be combined to form complete regions. Such combinations of

contours and vertices would employ laws of perceptual grouping. A greater integration of colour

and texture could also be used in the formation of regions. Beyond a 2D decomposition, a 2.5D or full 3D decomposition of a scene can be formed. A 2.5D decomposition would use features that indicate occlusion (such as T vertices) to construct a z-ordering of 2D regions. Shape-from-contour and shape-from-texture techniques can be used to determine the three-dimensional surface of a region. The final and most difficult stage is to construct a complete 3D

representation of the scene. It is the most difficult stage because the information received through


the camera lens is not always sufficient for a complete 3D reconstruction. Correlating information

across frames would be beneficial especially when there is camera and object movement which

allow the three dimensional characteristics of the scene to become more apparent. However, to

begin with, the problem of reliable grouping of contours and vertices into regions based on laws of

perceptual grouping is a challenging problem in itself and is required before the higher level stages

can work reliably.

9.4.2 Spatial Representation

Assuming a complete 3D representation of a scene there are still great challenges in efficiently

representing the contents of a scene for storage and retrieval. A scene contains a hierarchy of objects

with each object consisting of regions that contain colour, texture, shape, and motion information.

There are existing techniques that represent regions in an image using 2D strings [22], however,

representing an object hierarchy is much more difficult. In this research we attempted to simplify

the problem of shape representation by using fuzzy histograms to represent all of the contours in an image; however, this is an oversimplified approach and does not allow comparisons of individual groupings of contours that form objects. It is possible that, with a complete decomposition into 3D objects, a series of fuzzy histograms could be used to represent distributions of aspects of the 3D objects in the scene; however, a distribution representation is not able to represent the individual relationships between objects. Perhaps more challenging than the problem of storing an object

hierarchy is comparing the object hierarchy in the query phase. Image similarity no longer consists

of simply comparing corresponding elements of a feature vector but involves a structural comparison

between two hierarchies where the similarity of the two images may depend on the similarity of

two subhierarchies that occur in different places of the entire hierarchy in the two images. A hint to

the solution of this problem may be found in the indexing and clustering techniques investigated in

Chapter 8 where an image is represented by a hierarchical clustering formed using techniques such

as the R-tree [28]. New methods of comparison that can support hierarchical rather than flat data must be developed, beyond the traditional Euclidean distance, χ² distance, and histogram intersection techniques [21]. A major problem with hierarchical comparison techniques is the increased

computational complexity required. However, such complexity may become a necessity for CBVR

systems.

9.4.3 User Interface

The user interface of CBIR systems is challenging enough when the system has access to a complete 3D decomposition of an image, because there are many types of queries that the user can present to the system; adding the temporal dimension of video makes the problem even more challenging. In this research we have presented the DomeWorld user interface that allows the

user to take a browsing approach to CBIR and CBVR rather than the standard query-result

user interface employed by existing CBIR systems. However, even though the DomeWorld user


interface provides a complete ordering of the image space, it does not currently support specific

queries that a user may have. For example, the user may want images that contain red shoes.

Currently DomeWorld only provides a global ordering on the entire contents of images as opposed

to an ordering based on certain subsets of features in images. Also DomeWorld does not explicitly

support queries based on the spatial relationships between objects in images; however, this support could be provided transparently by the retrieval engine. Therefore DomeWorld could be extended

to provide user customisation of the types of features that are used to order the space. Reordering the feature space requires a significant amount of processing power, which increases with the number of images in the database.

In the temporal domain the DomeWorld user interface could represent more of the temporal

aspects of video objects such as motion, camera operations, and temporal relationships with other

objects. The temporal arrows of the MountainView user interface (Figure 7.1) would assist the

user in identifying the temporal relationship of shots if they were clustered into scenes. Since

DomeWorld is hierarchical, arrows could be drawn at different levels providing global as well as

local temporal relationships between shots and scenes. Temporal arrows would also aid the user in

their orientation if the feature space was reordered based on a feature they had selected.

9.5 Content-based Retrieval of Digital Video

Existing CBIR and CBVR systems have matured to the point of being commercially deployable

[42, 16]. In the last 10 years the computational complexity of CBIR and CBVR systems has not

increased significantly and any increased overheads are primarily due to the increased numbers of

objects that must be indexed. For example, rather than tens of thousands of objects being indexed,

today tens of millions of objects are being indexed. Therefore the increased complexity is largely

due to the increased scale of deployment rather than the increased complexity of feature extraction,

representation, or user interaction techniques. As the previous section's discussion of possible future directions for CBVR research showed, each component of a CBVR system faces large increases in computational complexity. A complete 3D decomposition based on neurophysiological evidence and

perceptual laws is no small feat and is much more complex than the colour histogram, edge detec-

tion, and texture processing approaches used today. Representing such a 3D decomposition breaks

the existing mould of a fixed length feature vector and simple vector comparisons and requires

hierarchical storage. Comparing two 3D structures becomes even more complex especially when

the number of objects in the database is in the hundreds of thousands or millions. Presenting the

user with a representation of the feature space for browsing also has issues of complexity especially

if the user is able to change the order of the feature space. Therefore future work in the field

of CBVR will result in great increases in computational complexity exceeding the capabilities of

existing computer hardware. However, once a technique has been discovered there are two methods

to reduce the impact of computational complexity. The first is to optimise the algorithms, trading

off completeness for speed. The second is the fact that computational power appears to double


every 18 months. Time itself will allow complex techniques to eventually become commercially

deployable. Additionally, since image processing is largely a parallel task, many of the complex

aspects of CBVR can be executed more quickly on many simpler execution units than on one high

speed processor. Parallel execution units could be developed specifically for CBVR, although this

may not be necessary as existing CPUs and GPUs are beginning to exhibit increased parallelism

through SIMD (Single Instruction Multiple Data) units and parallel pipelines.

This research has taken steps along this path of improving the features extracted, the rep-

resentation, and the user interface, even at the cost of increased complexity to provide a more

complete experience. This research has contributed to the field of CBVR by improving low and

medium level feature extraction through Asymmetry edge detection and other techniques, pro-

viding a greater representation of colour and shape data through contour summaries and fuzzy

histograms, and providing a three dimensional unified CBIR and CBVR browsing user interface

called DomeWorld.


Appendix A

Human Vision

In this appendix a review of current literature on human vision processing is presented. Current

vision processing understanding can be divided into three areas:

• Inverse Optics: Based purely on how light reflects off objects and is subsequently detected.

• Psychophysics: Based on psychophysical experiments which treat humans as “black boxes”

and determine response parameters for different types of stimuli (such as finding the three

dimensions of texture).

• Neurophysiology: Based on neurophysiological experiments which determine the neural path-

ways in the brain (such as edge detection based on Gabor filters).

Inverse optics requires no knowledge of how the brain processes images. Objects can be distin-

guished by discontinuities in physical properties such as luminance, colour, and texture. However,

there are many problems with this approach. The first is that the images we are working with are

two-dimensional and there are many 3D scenes that can generate the same two-dimensional image.

How do we know which is correct? One way is to use motion information which would narrow the

number of possibilities. The second problem is that the discontinuities required to detect objects

may not be present, or only partially present, as in occluded objects. More possibilities arise for

the structure of the original scene. Finally, and most importantly, any assumptions that an inverse

optics system makes may not be the same as those made by a human. Hence the extracted structure will not represent the structure seen by a human observer, even when the human's perception is 'wrong', as is the case with optical illusions.

Psychophysics provides a high-level description of what humans ‘see’. Psychophysics doesn’t

necessarily provide an indication of the internal functioning of such a system but allows us to

determine what percepts may occur. An example is the Gestalt laws of perceptual grouping. They

tell us what phenomena may occur with regard to grouping based on similarity, continuity, and so on, but



Figure A.1: (a) Kanizsa triangle; (b) Optic nerve pathway.

not how such a system would be implemented, nor whether some of these phenomena are products of a single system rather than multiple systems. However, an approach based on psychophysics can

provide more human-like interpretations of images. For example, the occlusion problem where an

object outline is no longer fully visible can be solved using the Gestalt law of continuity which will

link broken lines into a single outline.

Psychophysics can also provide an indication of how vision functions through illusions. These

illusions generally occur because of visual functions necessary to resolve ambiguities in an image.

An example is the Kanizsa triangle, an image of three black discs, each with a wedge removed, on a white background. An illusory white triangle emerges from the equally white background (Figure A.1 (a)). These illusions could

be used to verify the performance of a feature extraction system.

Neurophysiology involves determining the structure and function of neurones in the brain.

Researchers probe cat and monkey visual cortices with microelectrodes to determine the subsystems

of vision and the types of neurones therein. If the structure and function were known of all the

neurones in the brain then the job of building a visual processing system would be greatly simplified.

Unfortunately very little is known about the structure and function of neurones within the brain

beyond the primary visual cortex (which is essentially an edge detector). Some recent experiments

have been able to shed some light on how higher level processing may occur but not enough

information is available to confidently build a human vision ‘replica’. Furthermore, even if more

was known about the function of human vision, implementing it may be difficult. Currently most

computers are fast but largely serial. In contrast, the individual neurones in the brain are relatively slow, but their immense number allows for efficient parallel operation. Such parallelism is

not seen today even in supercomputers. However, it is not difficult to believe that computers will

become increasingly parallel as applications and the market demand them. For now, a system which

can process images in a reasonable time frame may need to draw from both neurophysiological


evidence and also existing feature extraction techniques.

This appendix provides a review of the neurophysiological architecture of the vision system and

high-level vision processing theories. Most data is obtained from monkeys and cats since primate

and feline visual cortices are similar to the human visual cortex. Neurophysiological evidence is

limited and even less is known about the structure and operation of the visual system as we progress

along the visual pathway. At these levels we must rely on high-level vision processing theories.

A.1 Visual System Overview

The human visual system consists of a transformation from light energy to electrical energy, de-

tecting spatial and temporal change, recognising objects, understanding the structure of objects,

and perceiving motion. For a visual understanding system the most important components are understanding the structure and motion of objects; however, the perception of structure and motion is based on a long and complex visual pathway.

The first stage of the visual system involves the detection of light by photoreceptors in the retina

of the eye (Figure A.1 (b)). These photoreceptors synapse to a number of intermediate neurones

in the retina for some initial processing. In many respects these photoreceptors are likened to the

CCD of a digital camera used in the image acquisition phase of image and video retrieval. The

neural signals flow from the photoreceptors along the optic nerve to the optic chiasm. At the optic

chiasm fibres from the left hemisphere of both eyes continue to the left side of the brain while fibres

from the right hemisphere continue to the right side. The fibres synapse in the lateral geniculate

nucleus (LGN) before continuing on to layer IVc of the primary visual cortex (V1).

From V1 two pathways exist: the ventral pathway processes colour and texture, and the dorsal pathway processes structure and motion (Figure A.2 (a)). The diagram in Figure A.2 (b) is taken

from [75] and shows the relationships between each stage of the visual pathway. Each component

of the diagram will be explained in the following sections.

A.2 Retina – Colour and luminance reception

The eye is a device which focuses light onto the photoreceptors of the retina. The retina con-

sists of two types of photoreceptors: rods and cones. Rods detect luminance whilst cones detect

chrominance. The distribution of photoreceptors in the retina is non-uniform. In the periphery

the photoreceptors are widely spaced and consist mostly of rods. Towards the centre of the retina

the concentration of photoreceptors increases. At the point of highest concentration is the fovea

which consists only of cones (which can also detect luminance but are not as sensitive as rods). The

fovea is about the size of this “o” [75]. It contains 50,000 cones of the total 5 million within the

retina. In contrast there are 120 million rods in the retina. This is quite different to the uniform



Figure A.2: (a) Location of vision processing subsystems; (b) Visual pathway.


Figure A.3: Opponent colour.

layout of CCD elements in digital cameras and shows that the human vision system is designed

for interaction using techniques such as attention to efficiently comprehend a scene.

There are three types of cone receptors that respond to short (S), medium (M), and long

(L) wavelengths. The S, M, and L cone receptors correspond roughly to the blue, green, and red channels of the RGB colour space, which is often used for raw image representation. Figure A.3

shows how responses from different photoreceptors are combined by ganglion cells (see Section

A.2.2) to produce opponent colour responses. The neurones respond to blue-yellow, red-green, and

black-white. This form of colour coding is similar to the YUV colour space which is often used for

the broadcast and compression of image and video content (see Section 3.1).



Figure A.4: Ganglion cell receptive field. (a) On-centre off-surround, (b) Off-centre on-surround.

A.2.1 Retinal Neurones

There are both vertical and horizontal connections in the retina. The vertical connections continue

up to the optic nerve while horizontal connections allow for communication among photoreceptors

and neurones. Horizontal cells allow communication among photoreceptors and also interconnect

bipolar cells. They can also communicate with each other. Amacrine cells allow communication

between bipolar cells and between ganglion cells and are also connected with each other [167].

Information flows from the photoreceptors to bipolar cells to ganglion cells.

A.2.2 Ganglion Cells – Non-directional Edge Detectors

Ganglion cells have been shown to detect luminance discontinuities, motion [129], and spatial

frequencies [168]. Some ganglion cells act as simple edge detectors by having a centre-surround

receptive field. An example is shown in Figure A.4. These ganglion cells are either on-centre, off-

surround or off-centre, on-surround.

For an on-centre, off-surround ganglion cell an excitatory response will be generated when the

on-centre region is stimulated, or an inhibitory response will be generated when the off-surround

is stimulated. The cell will also respond to lines and edges.

There are three types of ganglion cells which have been identified. X cells (also called P-cells

[75]) generally have small receptive fields and are important for perceiving detail [167]. Y cells

(also called M-cells [75]) have larger receptive fields and do not differentiate between colours. They

are probably used for the perception of motion [167]. Less is known about W cells which have no

centre-surround receptive fields. They exhibit a wider range of attributes than X or Y cells but

seem to respond best to moving stimuli [167]. Separate pathways for colour, detail, and motion

continue throughout the visual system.

One problem associated with photoreceptors is the time it takes for them to respond to a visual

stimulus. However, we can perceive motion quite well even with such a long latency (30-100 ms). Berry et al. [129] have found evidence that ganglion cells in the salamander and rabbit respond to a

moving bar even before it reaches the centre of the receptive field. This can be explained by the

size of a ganglion cell’s receptive field. Because the field is large it can begin firing even before the

stimulus reaches the centre of the cell.


A.3 Lateral Geniculate Nucleus

Before reaching the lateral geniculate nucleus (LGN), the optic nerve from the ganglion cells must

pass through the optic chiasm where fibres from the left hemisphere of each eye continue to the

left side of the brain and fibres from the right hemisphere continue to the right side of the brain.

There is an LGN in each hemisphere of the brain and each processes its respective half of the visual field.

The LGN acts as a regulator and is influenced by incoming signals from the retina, other

neurones in the LGN, neurones elsewhere in the thalamus, and signals from the brain stem and

cortex [75].

The LGN is layered, separating inputs from each eye and also the responses from the different

types of ganglion cells. There are 6 layers and each layer receives input from one eye only [75]. The

ipsilateral eye (the eye on the same side of the body as the LGN) sends neural signals to layers

2, 3, and 5 of the LGN. The contralateral eye (the eye on the opposite side of the body as the

LGN) sends neural signals to layers 1, 4, and 6 of the LGN [75]. The layout of the LGN across

each layer is a retinotopic map which means that neighbouring neurones in the LGN receive input

from neighbouring photoreceptors in the retina. The cells in each layer are also aligned with those in adjacent layers.

It has been shown that signals travelling “downward” from the visual cortex outnumber those

travelling “upward” from the retina [75]. Signals from the visual cortex may be used to predict

motion and to enhance the contrast of non-accidental features. Sillito et al. [98, 130] have found that there is a circuit from layer VI in the visual cortex to the LGN in the cat. Some cells in the visual

cortex are length tuned and only respond to stimulus of specific lengths. This has been thought to

be a product of hierarchical processing in the visual cortex. However Murphy and Sillito [98] have

found X and Y cells in the LGN which are length tuned and that the length tuning comes from layer

VI of the visual cortex. Also Sillito et al. [130] found that LGN cells from a cat could predict the

motion of a stimulus through inputs from layer VI. The feedback time should be between 5 and 10 ms,

short enough to be useful. These results show that the LGN can be influenced by responses from

layer VI possibly to improve the responses of cells stimulated by perceptually significant stimuli.

A.4 Primary Visual Cortex (V1, Area 17)

The visual cortex includes Areas 17, 18, and 19. Area 17 is also known as the primary visual cortex

or V1. Area 17 is organised in layers which are labelled with roman numerals. There are 6 layers

and LGN fibres terminate in layer IV (Figure A.5 (a)). Like the LGN, the primary visual cortex

has a retinotopic arrangement. The retinotopic arrangement supports the notion of using local

masks and edge detectors for low-level feature extraction for content-based retrieval. However, more cortical neurones are dedicated to areas including the fovea and fewer towards the periphery.



Figure A.5: (a) Primary visual cortex from Hubel and Wiesel [10]; (b) Hypercolumns.

This non-uniform arrangement is called cortical magnification.

Layer IV consists of 3 regions labelled a, b, and c. LGN fibres terminate at layer IVc, and hence

cells in layer IVc exhibit similar receptive fields to LGN and ganglion cells. Inputs from the left

and right eyes are still kept separate at this point and are kept separate for all of the layers of

Area 17. This arrangement is advantageous for feature extraction as it indicates that stereo vision

is not necessary for perception.

Hubel et al. [169] have shown that the primary visual cortex is organised in columns of ocular dominance, where a column of cells, spanning all layers, will respond predominantly to the left eye and another column to the right eye. These columns are adjacent to each other and are

approximately the same size.

Cells in all layers except layer IVc have elongated receptive fields and will only respond to stimuli at a particular orientation. Hubel and Wiesel [10] have shown that V1 is also organised into orientation columns at 10° intervals. A complete set of columns including both ocular dom-

inance columns and all of the orientation columns is called a hypercolumn [10]. All neurones in a

hypercolumn process input from roughly the same receptive field. A diagram of the hypercolumn

is shown in Figure A.5 (b). The arrangement of columns of multi-orientation detectors is similar

to the use of oriented Gabor [60] or Canny [13] operators for local edge detection.
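As a computational analogue of these oriented detectors, the sketch below (parameter values are arbitrary and assumed purely for illustration) builds a small bank of real-valued Gabor kernels at evenly spaced orientations, in the spirit of the oriented Gabor operators cited above.

import numpy as np

def gabor_kernel(size, orientation, wavelength=4.0, sigma=2.0, gamma=0.5):
    """Real-valued Gabor kernel: a sinusoidal carrier windowed by an elongated Gaussian,
    tuned to one orientation and spatial frequency, loosely like a simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinate frame to the preferred orientation.
    x_r = x * np.cos(orientation) + y * np.sin(orientation)
    y_r = -x * np.sin(orientation) + y * np.cos(orientation)
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_r / wavelength)
    return envelope * carrier

# One detector per orientation column, at 10-degree steps as in the hypercolumn description.
bank = [gabor_kernel(9, np.deg2rad(theta)) for theta in range(0, 180, 10)]
print(len(bank), bank[0].shape)    # 18 filters of size 9 x 9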

Orientation selective cells in hypercolumns do not differentiate between colours. However, re-

searchers have also found cells contained within hypercolumns that are not orientation selective

but respond to opponent colours. These cells are different to ganglion opponent colour cells in

that they have centre-surround receptive fields (Figure A.6). Colour cells within hypercolumns

are organised separately from orientation cells into columns called blobs [75]. Not only are colour



Figure A.6: Blob cell receptive fields.


Figure A.7: Simple cells.

cells separate from orientation cells but blue-yellow cells are organised in separate columns from

red-green cells [75] (Figure A.5 (b)).

Two types of cells with oriented receptive fields have been found in the primary visual cortex:

simple and complex. These cells are described in the following sections.

A.4.1 Simple Cells – Line and Bar Detectors

Simple cells are found in layer IV and the upper half of layer VI, both of which receive input from the LGN [170, 171]. Simple cells have elongated receptive fields which detect either lines or edges

[172]. Edge detectors consist of excitatory regions on one side and inhibitory on the other (Figure

A.7a & b). Line or bar detectors consist of either an excitatory centre and inhibitory flanks (Figure

A.7c) or inhibitory centre and excitatory flanks (Figure A.7d) [92]. Simple cells respond well to stationary stimuli but can usually be more strongly activated by moving stimuli (1-2°/sec). They are sensitive to position and a slight change in position of the stimulus can cause the cell to stop responding [10]. Receptive fields range from small (0.6° [168]) in layer IVc to large (16°) in

layer VI, indicating different roles for different layers [170].

Simple cells have a preferred orientation and may not respond at all to a stimulus rotated 20° from the simple cell’s preferred orientation [75]. This is quite different to conventional edge detectors, which have broader orientation responses. An orientation tuning curve shows a cell’s response to stimuli at varying orientations (Figure A.8). Even though there are cells which respond to a stimulus at 10° intervals, the orientation tuning curve of Figure A.8 shows that each cell will still



Figure A.8: Orientation tuning curve.

show a response to a stimulus 10° from the preferred orientation. It is possible that having distributed

responses between cells allows the exact angle of the stimulus to be more accurately represented.

Psychophysical experiments confirm the receptive fields of simple cells. Polat and Tyler [173]

found that elongated gratings with a height of 4 cycles and a width of only one cycle produced the

highest contrast sensitivity.

A.4.2 Complex Cells – Movement Detectors

Complex cells exhibit similar orientation tuning curves to simple cells but have larger receptive

fields and are less specific to the position of a stimulus. It has been proposed by Hubel and Wiesel

[10] that complex cells receive input from simple cells. Evidence for the connection of simple cells

to complex cells has been presented by Gilbert and Wiesel [170] where complex cells in layers 2+3

receive input from simple cells in layer 4 and complex cells throughout layer VI probably receive

input from simple cells in the upper half of layer VI. Complex cells are also found in layer V, where they receive input from layer 2+3 complex cells.

Sweeping a stimulus over a complex cell receptive field (5-15°/sec) usually evokes a sustained discharge [10]. For complex cells that respond to lines, increasing the line’s thickness beyond some value far less than the width of the receptive field renders it ineffective [93]. This can be explained by simple

cell line detectors which are also sensitive to line width [168].

A.4.3 End-inhibited Cells

End-inhibited cells go by many names. Some researchers have called them end-stopped, length-

tuned, hypercomplex, and patch-suppressed. The primary feature of end-inhibited cells is that

increasing the length of a stimulus can reduce the response of a cell [93]. End-inhibition has

been found in both simple and complex cells [170]. This can be explained by an inhibitory region


extending from the end of the excitatory region. End-inhibition may occur at one or both ends of

a cell, indicating detection of ends of lines and small line segments respectively (alternatively edge

detector-type cells can detect corners and tongues).

A.4.4 Spatial Frequency Tuning

Researchers have found cells in the primary visual cortex which respond to different bar widths.

The question is whether cells respond best to bars or to spatial frequencies. Albrecht et al. [174] determined that simple and complex cells were narrowly tuned for particular spatial frequencies and would also respond to bars that contained these frequencies, which supports the use of filters

such as the Gabor [60] and Canny [13] edge detectors.

Maffei and Fiorentini [168] analysed the spatial tuning of ganglion, LGN, simple, and complex

cells. They found that cells further along in the visual pathway became more narrowly tuned to

certain spatial frequencies. However, a broad range of responses was maintained from ganglion cells

to complex cells for use at higher levels.

A.5 V2 and V3 (Areas 18 and 19) – Line-end and Corner

Detection

Hubel and Wiesel [93] conducted pioneering research in determining the structure of areas V2 and

V3 of the cat. Outside V1 there are no simple cells. V2 consists mainly of complex cells (90%)

whereas only 42% are complex in V3 [93]. The remaining cells are hypercomplex. Complex cells in

V3 are similar to those in V2 suggesting that V3 cells receive projections from V2 complex cells.

The receptive fields of complex cells are larger than those in V1 and increase in size towards the periphery (2 to 32 (degrees of arc)²).

The more interesting type of cells in V2 and V3 are hypercomplex cells. Hubel and Wiesel [93] distinguished between two types of hypercomplex cells: lower order and higher order. Lower order

hypercomplex cells respond to only one direction of motion and to either a line-end or a corner.

Higher order hypercomplex cells appear to be less common than lower order hypercomplex cells

and are characterised by responses to stimulus orientations and directions of movement at 90° intervals. Such specificity can only be achieved with a vast number of parallel processors and may not be implementable on serial computers even if they were significantly faster than what is currently

available.

Neural recordings by Hubel and Wiesel [93] indicate that the function of V2 and V3 appears

to be in detecting complex features such as corners and line-ends. However, other researchers have

found evidence that V2 is used to signal illusory contours. Both psychophysical and neurophysi-

ological data exist to support this claim [112, 175]. Neural recordings from V2 by von der Heydt

et al. [175] have found cells which respond to real edges and lines and also to illusory edges and


lines. More recently Grosof et al. [111] found cells in V1 which also responded to illusory contours.

However, the illusions were generated by abutting gratings, which may be a simpler form of illusory contour,

detectable at the lower levels of visual cortex.

Soriano et al. [112] performed psychophysical experiments using abutting gratings and con-

cluded that the illusory contour must be generated within V2 and is part of the magnocellular

pathway (motion-structure pathway). Furthermore, they proposed that the end-stopped receptive

fields activated by grating lines must be about 6° long and 2° wide and the cells integrating end-stopped cell responses must be 5° long and less than 1° wide. They also found that the grat-

ing lines either side of the illusory contour may differ in orientation by 20°, suggesting a possible

interaction between end-stopped cells [112].

Shipley and Kellman [176] performed psychophysical experiments using the Kanizsa square to

determine whether illusory contours are triggered by the length of the real contour or by the ratio

between real and illusory contours. They found that as the ratio between real contour and total

contour length increased so did the clarity of the illusory contour, independent of stimulus size. The

relationship between the contour length ratio and illusory contour clarity was linear, indicating a

possible summation mechanism.

From these results it appears that V2 and V3 can detect complex features such as corners

and line-ends and also illusory features generated by corners and line-ends. The mechanisms that

generate illusory contours may be used by cells further along the magnocellular pathway to deter-

mine motion and structure. The illusory contours may also be used by the parvocellular layer to

determine form as V4 receives input from both V2 and V3 [75] (Figure A.2 (b)).

A.6 V4 and Inferotemporal Cortex (IT) - Shape, Colour,

and Texture Detection

As we move up the visual pathway, neurones respond to more and more complex stimuli. Beyond

V2 and V3, the pathway splits into form-colour processing and motion-structure processing. V4 and

inferotemporal cortex (IT) have been found to process shape, colour, and texture [95]. In addition

to becoming more specific to complex stimuli, neurones higher up in the visual pathway tend to

become less specific to orientation, size, and position on the retina [177, 178]. They appear to follow

a pattern similar to each processing stage of the visual pathway as cells in V4 and posterior IT

respond to specific shapes, textures, and colours whilst cells in anterior IT respond to combinations

of shape, colour, and texture and become less dependent on size and position of stimulus.

Another pattern that continues from LGN through to V4 and IT is columnar organisation.

Fujita et al. [179] have found that columns of neurones in anterior IT respond to variations in the

same type of stimulus. In addition, Tanaka et al. [6] found that columns in anterior IT consist of

cells which respond to basic shapes, colours, and textures and also cells which combine these basic


features to detect complex combinations in a similar manner to higher order hypercomplex cells

combining inputs from lower order hypercomplex and complex cells in V3 [93].

Researchers believe that IT is used for object recognition [75]. However, the shapes that anterior

IT cells respond to, even though complex, are not complex enough to represent objects we readily

recognise. An explanation may be distributed processing where the combined responses of a number

of IT cells represent an object. The responses of the IT cells project up to the prefrontal cortex

to the central executive system which handles memory [8]. Top-down signals from the prefrontal

cortex can activate IT [7] and even the primary visual cortex in memory retrieval [180].

V4 and IT appear to represent the end of form-colour-texture processing in the ventral pathway

whose main use is object recognition and recall. In fact, some neurones can be trained to

recognise views of particular objects [75]. The significance of this pathway to content-based retrieval

may be minimal as we are concerned with feature extraction rather than object recognition or recall.

However, it appears that V4 and IT process texture and colour, which remain a challenge in

content-based retrieval. The more relevant stages of the visual pathway may be those that process

structure and motion. Even so, the necessity for understanding V4 and IT remains because the

structure and motion pathway interacts with the colour and texture pathway [5].

A.7 Medial Temporal Area (MT) - Global and Local Motion

Detection

The parietal pathway is responsible for processing structure and motion. Orientation selective cells

of V1 can only detect motion orthogonal to the orientation of the cell [181]. Further along in the

visual pathway these component responses are integrated to form pattern responses which indicate

the direction of motion for a group of V1 cells [182]. Pattern cells are found in the Medial Temporal

Area and are characterised by wide receptive fields [183], broad spatial frequency tuning, broad

temporal frequency tuning, and sensitivity to low contrasts [182].

The Medial Temporal Area provides the brain with information about local and global motion.

In addition, the Medial Temporal Area is thought to contribute to recovering structure from motion. However, few experiments

have been conducted to determine how structure is processed in the parietal pathway.

A.8 High Level Vision Processing Theories

Neurophysiological recordings give us a lot of information about how the brain processes vision.

However, as we move up the visual pathway the types of stimulus that neurones respond to become

more complex and varied making it difficult to derive the “big picture” of how the vision processing

subsystems function. Various researchers have proposed theories of how these higher level systems

may function.


A.8.1 Primal Sketch

Marr [56] proposed that images are processed in a number of stages from raw primal sketch to full

primal sketch to 2.5-D sketch. In reality, Marr’s primal sketch theory is a low-level to intermediate-

level processing theory because it does not explain the full three-dimensional representation of an

image. Marr’s theory up to this point has been widely accepted; however, little evidence supports his theories on the 2.5-D sketch. Marr proposed that at each point in an image the vision system represents the surface’s relative depth and orientation. The 2.5-D sketch does not represent boundaries between three-dimensional objects. Other researchers claim that segmentation into surfaces occurs before local depth and surface orientation are computed [96].

A.8.2 Recognition-By-Components

Biederman [96] proposed that recognition of complex objects occurs through identification of

generalised-cone components called geons. He proposed that there are 36 of these geons and each

can be described by readily detectable properties of edges such as curvature, collinearity, sym-

metry, parallelism, and cotermination. Using psychophysical experiments Biederman [184] showed

that surface features were not required to recognise objects and that edge features facilitated

recognition as quickly as surface features.

By presenting objects composed simply of the features of geons, subjects were able to recognise objects with exposure durations of only 50-100 ms [96]. Reducing the number of geons in an object increased the identification error; however, 90% accuracy could still be achieved when only four geons of six- and nine-geon objects were displayed. Biederman [96] also tested the significance of vertices

and midsegments in preattentive object recognition. He found that with an exposure duration of

100ms up to 25% removal of either vertices or midsegments resulted in only 10% identification error

showing that detection of geons is robust to missing features. As the percentage of contour deletion

increased to 65% a noticeable difference in identification errors occurred between vertex deletion

and midsegment deletion. Deletion of vertices resulted in 54% identification error whereas deletion

of midsegments resulted in only 31% error. These results indicate the significance of vertices in

visual processing and recognition. As the exposure duration increased from 200ms to 750ms the

error began to decrease again to approximately 10%. Therefore contour filling in can occur within

1s but uses a more complex process than that which is used for 100ms exposures.

A.8.3 High Level Theory for Seeing and Imagining

Kosslyn [5] has presented a theory for how the brain sees and imagines images. The theory centres

around separate subsystems for encoding the shape of an object and for encoding the relative posi-

tions between objects. The idea of separate subsystems for shape and structure is not new, however,

Kosslyn expands the existing architecture. Firstly, Kosslyn proposes that IT which consists of large

receptive fields will only respond to objects if they are attended to. The “visual buffer” is proposed


[Diagram: subsystems including the retinotopic map, spatiotopic map construction, attention window/visual buffer, shape encoding, categorical and coordinate location encoding, categorical relations access/interpretation, attention shifting, part realignment, position alteration, visual memory activation, associative memory, search controller, and the central executive system.]

Figure A.9: Kosslyn’s [5] high level theory for seeing and imagining.

to exist within V4 where the shape is encoded from the attention window passing from V4 to IT.

Secondly, Kosslyn proposes that there are two structure subsystems. One which represents categor-

ical relationships such as “top/bottom,” “side of,” and “connected to the end,” and another which

represents coordinate relationships which represents objects relative to a single position which is

useful for navigation. The subsystems proposed by Kosslyn and their interactions are detailed in

Figure A.9.

A.8.4 Features of Similarity

Most of what has been discussed so far involves the extraction of visual features from images.

Another challenging problem in the content-based retrieval of images involves either searching for

images containing similar objects to one presented or reorganising the information space based

around similarity between images (or objects therein). As is discussed in Chapter 8, existing sys-

tems represent objects as a point in a multi-dimensional feature space. An opposing view pro-

posed by Tversky [9] suggests that similarity is based on feature sets rather than feature metrics.

Similarity metrics can be derived from the intersection and differences between two feature sets.

Psychophysical experiments have confirmed Tversky’s theory.

Using feature set theory subsets can be derived which account for most of the similarity variance


within the object space using an additive measure based on the intersection between feature sets.

Alternatively the differences between objects can be analysed to determine a hierarchical feature

tree. Both forms of analysis would be useful for indexing multimedia data and also organising the

information space for browsing.
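As an illustrative sketch only (the weighting parameters and the use of simple set cardinality as the salience measure below are assumptions for illustration, not values taken from Tversky [9]), a feature-contrast similarity score could be computed along the following lines:

```python
def tversky_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5):
    """Feature-contrast similarity between two feature sets.

    score = theta*|A & B| - alpha*|A - B| - beta*|B - A|,
    using set cardinality as the salience measure f().
    """
    a, b = set(a), set(b)
    return (theta * len(a & b)
            - alpha * len(a - b)
            - beta * len(b - a))


# Example: two images described by keyword-like feature sets.
print(tversky_similarity({"red", "round", "shiny"}, {"red", "round", "matte"}))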

A.8.5 Motion Processing Models

Chey et al. [185] have proposed a motion processing model which consists of 5 levels including

photoreceptors, ganglion cells, simple cells, complex cells, and MT cells. The model provides a

representation of local motion speeds which can be used by higher level systems for extracting 3D

form and motion. Level 1 consists of change sensitive units which respond to changes in luminance

over time simulating photoreceptors. Level 2 consists of transient cells which integrate responses

from change sensitive units over time simulating Y (or M) ganglion cells. Due to their time averaging

properties transient cell responses may overlap with neighbouring transient cells. Ganglion cells

in the retina have been found to have this property which also explains how ganglion cells can

predict motion faster than photoreceptors can respond to changes in luminance [129]. These cells

are not speed selective. Level 3 contains self-similar short-range filters with varying widths which

allow them to be sensitive to different speeds. In level 4 competition occurs across neighbouring

short-range filters to deblur the activity profiles. Finally, level 5 consists of competition across

spatial scales to provide finer speed tuning. Using their model, Chey et al. [185] produced results similar to those of psychophysical experiments.

An alternative approach to computing local optical flow is through using spatio-temporal filters.

Simoncelli and Adelson [186] showed that spatio-temporal filters could accurately detect translatory

motion and also showed their equivalence to gradient methods.
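To make the gradient-method connection concrete, the following sketch estimates local translation by least squares from spatio-temporal derivatives; it is a generic gradient-based estimator under assumed window and derivative choices, not Simoncelli and Adelson's [186] implementation.

```python
import numpy as np


def local_flow(prev, curr, y, x, win=7):
    """Least-squares estimate of local translation from spatio-temporal
    gradients (a generic gradient-method sketch)."""
    h = win // 2
    p = prev[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    c = curr[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    Ix = np.gradient(c, axis=1).ravel()   # spatial derivative in x
    Iy = np.gradient(c, axis=0).ravel()   # spatial derivative in y
    It = (c - p).ravel()                  # temporal derivative
    A = np.stack([Ix, Iy], axis=1)
    v, *_ = np.linalg.lstsq(A, -It, rcond=None)  # solve A v = -It
    return v                              # (vx, vy) in pixels per frame
```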

A problem with low-level and intermediate-level motion processing systems is that they can

only detect component motion. A plaid made of two gratings of differing orientations would be

detected by two different cells representing the perceived motion of each grating. The true motion

of the plaid would not be detected. An approach proposed by Kawakami and Okamoto [187]

uses the Hough Transform to simulate both component and plaid cells. Because of the aperture

problem simple cells are not directionally sensitive and hence can only indicate motion but not

the direction of motion. Directionally sensitive simple cells correlate input from non-directionally

sensitive simple cells for an accurate representation of motion. Simulation results show that the

model can accurately detect motion induced by lines, dots, random-dot textures, and circles, and is

robust to noise.

Beardsley and Vaina [188] proposed a back-propagation neural network for integrating direc-

tionally sensitive neurone responses to determine planar, rotational, and radial motion. Rotational

and radial motion can be described by the angle of motion relative to the line from the centre of the

view. An angle of 90° would represent rotational motion whereas an angle of 0° would represent

radial motion. Other angles would represent expanding or contracting spiral motion. Integrating


the angle of motion in the visual field allows the global motion to be determined.

A.9 Conclusions

This appendix has provided a review of the neural structure of the visual pathway based on

neurophysiological evidence. The review highlights that there are a number of parallel processing

streams within the visual pathway. Two parallel streams are the form-colour-texture stream (ventral

pathway) and structure-motion stream (parietal pathway). It appears that the ventral pathway is

used to recognise complex shapes and to activate associative memory for retrieval. A content-based

retrieval system is not concerned with recognition as it assumes no prior knowledge. Therefore the

ventral pathway could be ignored. However, two of the greatest problems in object extraction

are based around texture and illumination. It appears that both of these are handled within the

ventral pathway. Even so, colour and texture processing in the ventral pathway does not appear to

be used for discounting the illuminant or for determining surface orientation from texture. Rather

it appears to recognise objects which have particular colours and texture.

An ideal computer implementation of human vision would model the exact neural architecture

of the visual pathway. From the retina to the primary visual cortex quite a lot is known about

this architecture primarily because of its organised, repeating nature and ability to respond to

simple features. Higher up the visual pathway less is known about the roles and layout of all of

the neurones. Theories can be proposed based on recordings from these areas, however a complete

description is not possible. Hence a full computer simulation of the visual pathway neural network

is currently not possible. What is possible is to implement what is currently known which includes

most parts of the retina to primary visual cortex and some parts of V2, V3, and MT. To fill in the

missing subsystems we need to draw on high-level processing theories and computational models.

Such models fill in the gaps for subsystems where only part of the neural organisation is known,

such as V2, V3, and MT.

How three-dimensional objects are extracted and represented within the brain based on colour,

texture, shading, and motion is still unknown. Some high-level vision processing theories suggest

interactions between subsystems indicating “what” the brain must do but do not describe “how”

it is done. At this point we must draw on conventional image processing techniques discussed in

Section 2.3 to fill in the gaps.


Appendix B

Texture

This appendix provides a more detailed literature review of texture identification and segmenta-

tion techniques than is provided in Chapter 2 and Chapter 4. Texture representation techniques

are explored that represent texture with three dimensions. After texture is represented it can be

segmented into texture regions. The second half of this appendix explores techniques for texture

segmentation.

B.1 Texture Representation

Psychological studies have found that texture can be described in three dimensions [39, 68]. The

three components of the two-dimensional Wold decomposition also correlate with those found

from psychological studies [36]. The Wold decomposition identifies the deterministic and inde-

terministic components of a signal. The 2D Wold decomposition further decomposes the deter-

ministic component into harmonic and evanescent components. Therefore, texture representation

techniques can be classified as identifying one or more of the harmonic, evanescent, and indeter-

ministic components. Picard and Liu [61] performed the Wold decomposition by first identifying

the deterministic component. The deterministic component is then separated into its harmonic and

evanescent components. The deterministic component is then removed from the texture and the

remaining information is used to represent the indeterministic component. The following sections

describe techniques for representing the harmonic, evanescent, and indeterministic components.

B.1.1 Harmonic Component

The harmonic component essentially describes the spatial frequency of a texture. Techniques to

identify the harmonic component usually involve transforming the texture from the time domain

into the frequency domain using techniques such as the Fourier transform and wavelets. Picard


Figure B.1: The autocorrelation function of Brodatz textures D15 (a) and D68 (b).

and Liu [61] used the autocorrelation function of the image to identify the harmonic component.

The autocorrelation function is computed by the inverse Fourier transform of the image power

spectral density function (see Figure B.1). The peaks in the autocorrelation function are used to

determine the harmonic component.
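A minimal sketch of this computation, assuming a non-constant greyscale image array and leaving peak detection to a separate step, is given below; it follows the Wiener–Khinchin relation rather than any particular implementation used by Picard and Liu [61].

```python
import numpy as np


def autocorrelation(image):
    """Autocorrelation via the inverse FFT of the power spectral density."""
    img = image.astype(float) - image.mean()
    F = np.fft.fft2(img)
    psd = np.abs(F) ** 2                  # power spectral density
    ac = np.real(np.fft.ifft2(psd))
    ac /= ac.flat[0]                      # normalise so the zero lag equals 1
    return np.fft.fftshift(ac)            # put the zero lag at the centre
```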

Ma and Manjunath [60] used Gabor wavelets to represent multiple spatial frequencies of tex-

tures. Wavelets are generally combined with a hierarchical decomposition to produce coefficients

representing scales to the power of 2. Figure B.2 (a) shows that at each scale two sets of coef-

ficients are produced, one that represents the high frequencies and another that represents the

low frequencies. The wavelet is recursively applied to the low frequency coefficients at each level

inherently transforming the data for larger spatial frequencies.

In two dimensions, the wavelet decomposition can be computed by performing two, one-

dimensional wavelet transforms. The layout of the resulting coefficients along with a decomposition

of a sample image is shown in Figure B.2 (b). The lowest subband represents low frequencies in

the image while the other subbands represent horizontal, vertical, and diagonal high frequencies.

Each subband contains information about the spatial frequencies in the image. This information

can be used to compare textures. For image retrieval the wavelet coefficients must be stored in a

compact feature vector. One technique used to form a compact feature vector is to store only the

mean magnitude and variance for each subband [189]. For a wavelet decomposition performed to

3 levels, only 10 elements are required in the feature vector.
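A sketch of how such a subband feature vector could be assembled is shown below. It uses the PyWavelets package with an arbitrarily chosen wavelet, and the exact length of the resulting vector depends on which subbands and statistics are retained, so it should not be read as the configuration used in [189].

```python
import numpy as np
import pywt


def wavelet_texture_features(image, wavelet="haar", levels=3):
    """Mean absolute magnitude and variance of each wavelet subband."""
    coeffs = pywt.wavedec2(image.astype(float), wavelet, level=levels)
    subbands = [coeffs[0]]                # approximation subband
    for detail in coeffs[1:]:             # (horizontal, vertical, diagonal)
        subbands.extend(detail)
    features = []
    for band in subbands:
        features.append(np.mean(np.abs(band)))
        features.append(np.var(band))
    return np.array(features)
```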

Tamura [39] used the term ‘coarseness’ to describe the harmonic component of texture. To

describe coarseness Tamura used a technique proposed by Rosenfeld [105] describing texture in

terms of edgeness per unit area. Rosenfeld’s coarseness measure has been used in the QBIC system

[19].


[Diagram: analysis filter bank in which a low-pass filter H0 and a high-pass filter H1 are each followed by downsampling by 2, applied recursively.]

Figure B.2: (a) Hierarchical decomposition, (b) Wavelet decomposition of an image.

B.1.2 Evanescent Component

The evanescent component essentially describes the dominant orientations of the texture. Many techniques

used to represent the harmonic component also inherently represent the evanescent component as

is usually the case when oriented edge detectors and wavelets are used. Techniques that employ

oriented wavelets or edge detectors are able to represent the evanescent component directly in the

individual image responses for each orientation. The dominant orientation is simply the response

image with the largest magnitude. This can be seen in Figure 4.23 and also in Figure B.2. Picard and

Liu [61] used a different approach to determine the evanescent component from the autocorrelation

function. The evanescent component was estimated by fitting a line in the 2D spectral domain using

oriented bandpass filters. Oriented bandpass filters are essentially wavelets but the difference here

is that they are applied to data that is already in the spectral domain rather than the time domain.
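As a sketch of extracting the dominant orientation directly from oriented filter responses (the Gabor frequency, bandwidth, and number of orientations below are arbitrary assumptions), the per-pixel orientation can be taken as the filter giving the largest magnitude response:

```python
import numpy as np
from scipy.signal import fftconvolve


def gabor_kernel(theta, freq=0.2, sigma=4.0, size=21):
    """Real (even) Gabor kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return (np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
            * np.cos(2.0 * np.pi * freq * xr))


def dominant_orientation(image, n_orientations=8):
    """Per-pixel dominant orientation: the filter with the largest response."""
    thetas = np.linspace(0.0, np.pi, n_orientations, endpoint=False)
    responses = np.stack([
        np.abs(fftconvolve(image.astype(float), gabor_kernel(t), mode="same"))
        for t in thetas])
    return thetas[np.argmax(responses, axis=0)]   # orientation map in radians
```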

B.1.3 Indeterministic Component

As discussed in Section 4.7.4 many different statistical models can be used to represent the indeter-

ministic component. The Gibbs random field (GRF) model has been studied extensively by Picard

[65] for analysing and synthesising textures. The GRF model has been shown to be equivalent

to the Markov random field model in certain circumstances and technically more accurate [65].

However, Picard found that the estimation was sensitive to initial conditions and was only suitable for homogeneous fields, unlike natural textures. More recent work by Picard [61] has focused on the

Wold decomposition and SAR models.

Autoregressive models are used to classify textures by extracting parameters which describe the

dependency of a pixel on its neighbours. Francos et al. [36] first proposed to use an AR model to

extract the indeterministic component of a Wold decomposition after the removal of the harmonic


Figure B.3: Co-occurrence matrix. Brodatz textures D15 (a) and D68 (b).

and evanescent components. The AR parameters are determined using a 2-D Levinson algorithm

[62]. The AR model was extended by Mao and Jain [64] to operate at multiple resolutions. The

multiresolution simultaneous autoregressive (MRSAR) model was able to perform significantly

better than other models using 4 resolutions and 2 variates. Picard and Liu [61] extended the

work by Francos et al. and used Mao and Jain’s MRSAR model which was computationally more

efficient but not as accurate. Francos et al. [63] further extended their original work by using a

maximum likelihood (ML) approach. In this technique they used an ARMA model to model the

indeterministic component. Due to the nature of the ML algorithm this approach has been the

most accurate to date although it is computationally expensive.
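The essential idea of these autoregressive models, describing each pixel as a linear combination of its neighbours plus a residual, can be sketched with an ordinary least-squares fit over a small causal neighbourhood; this is only an illustration and not the 2-D Levinson, MRSAR, or ML estimators cited above.

```python
import numpy as np


def ar_texture_parameters(image):
    """Least-squares fit of a four-neighbour causal AR model:
    x[i, j] ~ a1*x[i, j-1] + a2*x[i-1, j] + a3*x[i-1, j-1] + a4*x[i-1, j+1]."""
    img = image.astype(float)
    y = img[1:, 1:-1].ravel()                     # pixels being predicted
    X = np.stack([img[1:, :-2].ravel(),           # left neighbour
                  img[:-1, 1:-1].ravel(),         # top neighbour
                  img[:-1, :-2].ravel(),          # top-left neighbour
                  img[:-1, 2:].ravel()], axis=1)  # top-right neighbour
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual_var = np.var(y - X @ params)         # leftover "noise" energy
    return params, residual_var                   # AR coefficients + residual
```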

Another technique that can be used to represent the indeterministic component is the co-occurrence matrix [14]. A greyscale co-occurrence matrix is an N×N matrix where N represents the number

of grey levels in an image and element cij represents the number of times grey level i neighbours

grey level j (Figure B.3). Rather than storing the entire matrix, features of the matrix can be

extracted such as the maximum probability, moments, and entropy [14]. Even though the co-occurrence matrix is useful for representation, it is primarily used to represent an entire image and

is not as useful for identifying features such as texture borders.
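A minimal sketch of building a co-occurrence matrix for a single displacement and extracting two of the scalar features mentioned above might look as follows (the quantisation to 16 grey levels and the (0, 1) displacement are arbitrary choices, and a non-constant image is assumed):

```python
import numpy as np


def cooccurrence_features(image, levels=16, dy=0, dx=1):
    """Grey-level co-occurrence matrix for one displacement, plus the
    maximum probability and entropy features."""
    img = (image.astype(float) / image.max() * (levels - 1)).astype(int)
    glcm = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            glcm[img[y, x], img[y + dy, x + dx]] += 1
    glcm /= glcm.sum()                            # normalise to probabilities
    max_prob = glcm.max()
    entropy = -np.sum(glcm[glcm > 0] * np.log2(glcm[glcm > 0]))
    return glcm, max_prob, entropy
```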

A technique that has also been used to model texture is fractal dimension. Using fractals to

model texture is natural as naturally occurring textures often exhibit complexity and contain detail at multiple scales. Fractal dimension is a measure which describes how the length of a contour increases

as the smallest measuring distance (ε) decreases. Naturally occurring objects are not ideal but are

semi-fractal [67], making a fractal measure more useful than simple range or variance measures [66].

There are a number of techniques that can be used to measure the fractal dimension of a textured

surface of an image such as the Hurst operator [66] and the box-counting method [67]. Both


methods calculate the range between the highest and lowest pixel values within a neighbourhood.

The range is plotted against the distance on a log-log graph. The slope of the graph is used as

a measure of fractal dimension. Fractal dimension is a more accurate descriptor of texture than

range or variance. However, it is not substantially different from a range descriptor as it only gives

a measure of how the range changes over a distance and still lacks the ability to describe texture

in terms of its structure.
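The slope-of-a-log-log-plot idea can be sketched as follows for a single pixel, using the range of values in square neighbourhoods of increasing size; the neighbourhood sizes are arbitrary and this is not a faithful reproduction of the Hurst operator [66] or the differential box-counting method [67].

```python
import numpy as np


def fractal_slope(image, y, x, max_dist=8):
    """Slope of log(range) versus log(distance) over growing square
    neighbourhoods centred on pixel (y, x)."""
    img = image.astype(float)
    dists, ranges = [], []
    for d in range(1, max_dist + 1):
        patch = img[max(0, y - d):y + d + 1, max(0, x - d):x + d + 1]
        dists.append(d)
        ranges.append(patch.max() - patch.min() + 1e-9)   # avoid log(0)
    slope, _ = np.polyfit(np.log(dists), np.log(ranges), 1)
    return slope
```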

B.2 Texture Segmentation

Texture segmentation involves classifying each pixel in an image as belonging to one texture region.

Texture segmentation begins with identifying the texture attributes at each pixel in the image

using techniques discussed in the previous section. The pixels are then grouped into regions based

on their texture attributes and possibly relative location. The most common technique to detect

texture regions in an image is the k-means clustering technique [64], which often produces accurate segmentations when provided with good texture representation methods. The problem with this technique, however, is that it requires the number of textures to be known before processing begins.

This is not possible for natural images. Lu and Chung [190] proposed a technique for finding peaks

in the texture histogram to determine the number of clusters before applying k-means clustering

to provide unsupervised segmentation.
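A sketch of the clustering step, assuming per-pixel texture features have already been computed by one of the representation techniques above (scikit-learn is used here purely for brevity), is:

```python
import numpy as np
from sklearn.cluster import KMeans


def segment_texture(feature_maps, n_textures):
    """Cluster per-pixel texture features into n_textures regions.

    feature_maps: array of shape (n_features, height, width), e.g. the
    magnitude responses of a filter bank. The number of textures must be
    supplied, which is the limitation discussed above.
    """
    n_feat, h, w = feature_maps.shape
    X = feature_maps.reshape(n_feat, -1).T        # one row of features per pixel
    labels = KMeans(n_clusters=n_textures, n_init=10).fit_predict(X)
    return labels.reshape(h, w)                   # per-pixel texture label
```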

Cesmeli [116] used Gabor filters to describe the texture features and a technique called LEGION

(Locally Excitatory and Globally Inhibitory Oscillator Networks) to identify regions of homoge-

neous texture. A single neural oscillator is used for every pixel and is connected to its immediate

eight nearest neighbours with excitatory connections whilst only one global inhibitor is used for

the entire image. The weight of the excitatory connection is based on the similarity between the

texture features at each pixel. The phase of the oscillators determines which texture the pixel

belongs to.

Kruizinga and Petkov [191] attempted to model grating cells by simulating a cell which had

three parallel simple cells as input. Their technique was relatively successful in segmenting textures

but only through supervised k-means clustering.

Smith and Chang [104] proposed a segmentation of texture using a quadtree decomposition

where blocks of heterogeneous texture were segmented into subblocks. The disadvantage with this

approach is that the texture regions are imprecise and exhibit blocking artefacts; however, this was

not a concern for the application of the technique in spatial texture retrieval.


Bibliography

[1] J. Faichney and R. Gonzalez, “Goldleaf hierarchical document browser,” in Australian User

Interface Conference, January 2001.

[2] J. Faichney and R. Gonzalez, “Combined colour and contour representation using anti-aliased

histograms,” in 6th International Conference on Signal Processing, pp. 735–739, August 2002.

[3] J. Faichney and R. Gonzalez, “Asymmetry analysis for tuning orientation and position sen-

sitivity in contour and vertex detection,” in The IASTED International Conference on Com-

puter Graphics and Imaging, August 2004.

[4] Y. Gong, Intelligent Image Databases Towards Advanced Image Retrieval. Kluwer Academic

Publishers, 1998.

[5] S. M. Kosslyn, “Seeing and imagining in the cerebral hemispheres: A computational ap-

proach,” Psychological Review, vol. 94, no. 2, pp. 148–175, 1987.

[6] K. Tanaka, H.-A. Saito, Y. Fukada, and M. Moriya, “Coding visual images of objects in

the inferotemporal cortex of the macaque monkey,” Journal of Neurophysiology, vol. 66,

pp. 170–189, July 1991.

[7] H. Tomita, M. Ohbayashi, K. Nakahara, I. Hasegawa, and Y. Miyashita, “Top-down signal

from prefrontal cortex in executive control of memory retrieval,” Nature, vol. 401, pp. 699–

703, 1999.

[8] M. D’Esposito, J. A. Detre, D. C. Alsop, R. K. Shin, S. Atlas, and M. Grossman, “The neural

basis of the central executive system of working memory,” Nature, vol. 378, pp. 279–281,

1995.

[9] A. Tversky, “Features of similarity,” Psychological Review, vol. 84, no. 4, pp. 327–352, 1977.

[10] D. H. Hubel and T. N. Wiesel, “Functional architecture of macaque monkey visual cortex,”

Proceedings of the Royal Society of London B, vol. 198, pp. 1–59, 1977.

[11] S. Grossberg, “Figure-ground separation by visual cortex,” Tech. Rep. Technical Report

CAS/CNS-TR-96-018, Boston University, 1996.


[12] F. Heitger, L. Rosenthaler, R. von der Heydt, E. Peterhans, and O. Kubler, “Simulation

of neural contour mechanisms: from simple to end-stopped cells,” Vision Research, vol. 32,

no. 5, pp. 962–981, 1992.

[13] J. Canny, “A computational approach to edge detection,” IEEE Transactions on PAMI,

vol. PAMI-8, pp. 679–698, November 1986.

[14] R. M. Haralick, “Statistical and structural approaches to texture,” Proceedings of the IEEE,

vol. 67, pp. 786–804, May 1979.

[15] M. Stricker and M. Orengo, “Similarity of color images,” in Proceedings of Storage and

Retrieval for Image and Video Databases III, vol. 2420, pp. 381–392, 1995.

[16] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner,

D. Lee, D. Petkovic, D. Steele, and P. Yanker, “Query by image and video content: The

QBIC system,” IEEE Computer, pp. 23–32, September 1995.

[17] H. J. Zhang, C. Y. Low, S. W. Smoliar, and J. H. Wu, “Video parsing, retrieval and browsing:

An integrated and content-based solution,” in ACM Multimedia 95, pp. 15–24, 1995.

[18] Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki, “Structured video computing,” IEEE

Multimedia, vol. 1, pp. 34–43, Fall 1994.

[19] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker,

C. Faloutsos, and G. Taubin, “The QBIC project: Querying image by content using color, tex-

ture, and shape,” in SPIE Proceedings Storage and Retrieval for Image and Video Databases,

vol. 1908, pp. 173–187, 1993.

[20] A. Pentland, R. W. Picard, and S. Sclaroff, “Photobook: Tools for content-based manipu-

lation of image databases,” in Proc. SPIE Conf. Storage & Retrieval for Image and Video

Databases II, pp. 34–47, 1994.

[21] M. J. Swain and D. H. Ballard, “Color indexing,” International Journal of Computer Vision,

vol. 7, no. 1, pp. 11–32, 1991.

[22] J. R. Smith and S.-F. Chang, “Integrated spatial and feature image query,” Multimedia

System Journal, vol. 7, pp. 129–140, March 1999.

[23] Y. Tonomura and A. Akutsu, “A structured video handling technique for multimedia sys-

tems,” IEICE Transactions on Information and Systems, vol. E78-D, pp. 764–777, June

1995.

[24] A. J. Stewart and M. S. Langer, “Toward accurate recovery of shape from shading under

diffuse lighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19,

pp. 1020–1025, September 1997.


[25] S. Grossberg and E. Mingolla, “Neural dynamics of form perception: Boundary completion,

illusory figures, and neon color spreading,” Psychological Review, vol. 92, no. 2, pp. 173–211,

1985.

[26] F. Heitger and R. von der Heydt, “A computational model of neural contour processing:

Figure-ground segregation and illusory contours,” in ICCV, pp. 32–40, 1993.

[27] J. R. Smith and S.-F. Chang, “Automated image retrieval using color and texture,” Tech.

Rep. 414-95-20, Columbia University, July 1995.

[28] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proceedings of

ACM SIGMOD, pp. 47–57, 1984.

[29] D. A. White and R. Jain, “Similarity indexing with the SS-tree,” in Proceedings of the Twelth

International Conference on Data Engineering, pp. 516–523, 1996.

[30] N. Katayama and S. Satoh, “SR-tree: An index structure for high-dimensional nearest neigh-

bor queries,” in Proceedings of ACM SIGMOD 97, pp. 369–380, 1997.

[31] R. Bayer and E. McCreight, “Organization and maintenance of large ordered indexes,” Acta

Informatica, vol. 1, pp. 173–189, 1972.

[32] D. Comer, “The ubiquitous B-tree,” ACM Computing Surveys, vol. 11, pp. 121–137, June

1979.

[33] P. Eades, “A heuristic for graph drawing,” Congressus Numerantium, no. 42, pp. 149–160,

1984.

[34] G. W. Furnas, “Generalised fisheye views,” in Proc. ACM SIGCHI ’86 Conf. on Human

Factors in Computing Systems, pp. 16–23, 1986.

[35] G. G. Robertson, J. D. Mackinlay, and S. K. Card, “Cone trees: Animated 3d visualizations

of hierarchical information,” in ACM SIGCHI’91, pp. 189–194, April 1991.

[36] J. M. Francos, A. Z. Meiri, and B. Porat, “A unified texture model based on a 2-D wold-like

decomposition,” IEEE Transactions on Signal Processing, vol. 41, pp. 2665–2678, August

1993.

[37] J. Smith, “Integrated spatial and feature image systems: Retrieval,” 1997.

[38] R. C. Veltkamp and M. Tanase, Content-Based Image and Video Retrieval, ch. A Survey of

Content-Based Image Retrieval Systems, pp. 47–101. Kluwer, 2002.

[39] H. Tamura, S. Mori, and T. Yamawaki, “Textural features corresponding to visual percep-

tion,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 8, pp. 460–473, June 1978.

[40] J. R. Smith and S.-F. Chang, “VisualSEEk: A fully automated content-based image query

system,” in ACM Multimedia, pp. 87–98, 1996.


[41] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: An efficient and

robust access method for points and rectangles,” in Proceedings of ACM SIGMOD, pp. 322–

331, May 1990.

[42] J. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Gorowitz, R. Humphrey, R. Jain, and C. Shu,

“Virage image search engine: an open framework for image management,” in Proceedings of

the SPIE, Storage and Retrieval for Image and Video Databases IV, (San Jose, CA), pp. 76–

87, SPIE, February 1996.

[43] W.-Y. Ma and B. S. Manjunath, “NeTra: A toolbox for navigating large image databases,”

Multimedia Systems, no. 7, pp. 184–198, 1999.

[44] D. Ponceleon, S. Srinivasan, A. Amir, D. Petkovic, and D. Diklic, “Key to effective video

retrieval: Effective cataloging and browsing,” in ACM Multimedia ‘98, pp. 99–107, ACM,

1998.

[45] H. J. Zhang, J. Y. A. Wang, and Y. Altunbasak, “Content-based video retrieval and com-

pression: A unified solution,” in Proc. IEEE Int. Conf. Image Processing, pp. 13–16, 1997.

[46] T. Kato, “Database architecture for content-based image retrieval,” in SPIE Image Storage

and Retrieval Systems, vol. 1662, pp. 112–123, 1992.

[47] F. Arman, R. Depommier, A. Hsu, and M.-Y. Chiu, “Content-based browsing of video se-

quences,” in ACM Multimedia ’94, pp. 97–103, 1994.

[48] H. Ueda, T. Miyatake, and S. Yoshizawa, “IMPACT: An interactive natural-motion-picture

dedicated multimedia authoring system,” in ACM CHI’91, pp. 343–350, 1991.

[49] M. Mills, J. Cohen, and Y. Y. Wong, “A magnifier tool for video data,” in Proceedings of

CHI ’92, pp. 93–98, 1992.

[50] M. G. Christel, M. A. Smith, and D. B. Winkler, “Evolving video skims into useful multimedia

abstractions,” in ACM CHI’98, pp. 171–178, April 1998.

[51] E. Elliott and G. Davenport, “Video streamer,” in ACM CHI ’94 Conference Companion,

pp. 65–66, 1994.

[52] M. Irani, H. Sawhney, R. Kumar, and P. Anandan, “Interactive content-based video indexing

and browsing,” in First IEEE Workshop on Multimedia Signal Processing, 1997.

[53] Y. Tonomura and S. Abe, “Content oriented visual interface using video icons for visual

database systems,” Journal of Visual Languages and Computing, vol. 1, pp. 183–198, 1990.

[54] R. Gonzalez, “Hypermedia data modeling, coding, and semiotics,” Proceedings of the IEEE,

vol. 85, pp. 1111–1140, July 1997.

[55] H. Jiang, A. S. Helal, A. K. Elmagarmid, and A. Joshi, “Scene change detection techniques

for video database systems,” Multimedia Systems, vol. 6, pp. 186–195, 1998.


[56] D. Marr, Vision: A computational investigation into the human representation and processing

of visual information. New York: W. H. Freeman, 1982.

[57] R. Kirsch, “Computer determination of the constituent structure of biological images,” Com-

put. Biomed. Res., vol. 4, pp. 315–328, 1971.

[58] W. Frei and C. C. Chen, “Fast boundary detection: A generalization and a new algorithm,”

IEEE Trans. Computers, vol. C-26, no. 10, pp. 988–998, 1977.

[59] G. S. Robinson, “Detection and coding of edges using directional masks,” Tech. Rep. 660,

University of Southern California, Image Processing Institute, 1976.

[60] W. Y. Ma and B. S. Manjunath, “Texture features and learning similarity,” in Proceedings

of IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 425–430, June 1996.

[61] R. W. Picard and F. Liu, “A new Wold ordering for image similarity,” in Proceedings of IEEE

International Conference on Acoustics, Speech and Signal Processing, pp. 129–132, 1994.

[62] T. L. Marzetta, “Two-dimensional linear prediction: Autocorrelation array, minimum-phase

prediction error filters, and reflection coefficient arrays,” IEEE Transactions on Acoustics,

Speech, and Signal Processing, vol. ASSP-28, pp. 725–733, December 1980.

[63] J. M. Francos, A. Narasimhan, and J. W. Woods, “Maximum likelihood parameter estima-

tion of textures using a wold-decomposition based model,” IEEE Transactions on Image

Processing, vol. 4, pp. 1655–1666, December 1995.

[64] J. Mao and A. K. Jain, “Texture classification and segmentation using multiresolution si-

multaneous autoregressive models,” Pattern Recognition, vol. 25, no. 2, pp. 173–188, 1992.

[65] R. W. Picard, “Structure patterns from random fields,” in Proceedings of the 26th Annual

Asimolar Conference on Signals, Systems, and Computers, pp. 1011–1015, October 1992.

[66] J. C. Russ, “Processing images with a local hurst operator to reveal textural differences,”

Journal of Computer-Assisted Microscopy, vol. 2, no. 4, pp. 249–257, 1990.

[67] N. Sarkar and B. B. Chaudhuri, “An efficient differential box-counting approach to compute

fractal dimension of image,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 24,

pp. 115–120, January 1994.

[68] A. R. Rao and G. L. Lohse, “Towards a texture naming system: Identifying relevant dimen-

sions of texture,” in IEEE Proceedings of Visualization ’93, pp. 220–228, October 1993.

[69] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Addison-Wesley, 1992.

[70] K.-S. Cheng, J.-S. Lin, and C.-W. Mao, “The application of competitive hopfield neural

network to medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 15,

pp. 560–567, August 1996.


[71] X. Q. Li, Z. W. Zhao, H. D. Cheng, C. M. Huang, and R. W. Harris, “A fuzzy logic approach

to image segmentation,” in IEEE Proceedings of Conference on Image Processing, pp. 337–

341, October 1994.

[72] H. Atmaca, M. Bulut, and D. Demir, “Histogram based fuzzy Kohonen clustering network

for image segmentation,” in IEEE Proceedings of the International Conference on Image

Processing, pp. 951–954, September 1996.

[73] B. Bhanu, S. Lee, and S. Das, “Adaptive image segmentation using genetic and hybrid search

methods,” IEEE Transactions on Aerospace and Electronic Systems, vol. 31, no. 4, 1995.

[74] A. Wardhani and R. Gonzalez, “Perceptual grouping of natural images for CBIR,” in IEEE International Symposium on Signal Processing and its Applications, vol. 2, pp. 923–926,

1999.

[75] E. B. Goldstein, Sensation & Perception. Brooks/Cole Publishing Company, 1999.

[76] C. Fuchs and W. Förstner, “Polymorphic grouping for image segmentation,” in Proceedings

of the 5th International IEEE Conference on Computer Vision, pp. 175–182, 1995.

[77] K. Rao, “A computer vision system to detect 3-D rectangular solids,” in IEEE Workshop on

Applications of Computer Vision, pp. 27–32, 1996.

[78] A. P. Witkin, “Recovering surface shape and orientation from texture,” Artificial Intelligence,

vol. 17, pp. 17–45, 1981.

[79] J. Y. Jau and R. T. Chin, “Shape from texture using the Wigner distribution,” Computer

vision, graphics, and image processing, vol. 52, pp. 248–263, 1990.

[80] M. Brady and A. Yuille, “An extremum principle for shape from contour,” IEEE Transactions

on Pattern Analysis and Machine Intelligence, vol. PAMI-6, pp. 288–301, May 1984.

[81] L. S. Davis, L. Janos, and S. M. Dunn, “Efficient recovery of shape from texture,” IEEE

Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-5, pp. 485–492,

September 1983.

[82] R. Bajcsy and L. Lieberman, “Texture gradient as a depth cue,” Computer Graphics and

Image Processing, vol. 5, pp. 52–67, 1976.

[83] T. A. C. M. Claasen and W. F. G. Mecklenbrauker, “The Wigner distribution – a tool for

time-frequency signal analysis,” Phillips Journal of Research, vol. 35, pp. 217–250, 1980.

[84] K. Arbter, W. E. Snyder, H. Burkhardt, and G. Hirzinger, “Application of affine-invariant

fourier descriptors to recognition of 3-D objects,” IEEE Transactions on Pattern Analysis

and Machine Intelligence, vol. 12, pp. 640–647, July 1990.

[85] S. Sclaroff and A. P. Pentland, “Modal matching for correspondence and recognition,” IEEE

Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 545–561, June 1995.


[86] S.-K. Chang, Q.-Y. Shi, and C.-W. Yan, “Iconic indexing by 2-D strings,” IEEE Transactions

on Pattern Analysis and Machine Intelligence, vol. PAMI-9, pp. 413–428, May 1987.

[87] R. Z. Liang, S. Venkatesh, and D. Kieronska, “Video indexing by spatial representation,” in

Proceedings of the Third Australian and New Zealand Conference on Intelligent Information

Systems, pp. 99–104, November 1995.

[88] V. N. Gudivada and V. V. Raghavan, “Design and evaluation of algorithms for image retrieval

by spatial similarity,” ACM Transactions on Information Systems, vol. 13, pp. 115–144, April

1995.

[89] E. A. El-kwae and M. R. Kabuka, “A robust framework for content-based retrieval by spatial

similarity in image databases,” ACM Transactions on Information Systems, vol. 17, pp. 174–

198, April 1999.

[90] C.-S. Li, J. R. Smith, L. D. Bergman, and V. Castelli, “Sequential processing for content-

based retrieval of composite objects,” in Proc. SPIE Conf. Storage & Retrieval for Image

and Video Databases VI, January 1998.

[91] J. M. Martínez, ed., MPEG-7 Overview (version 9). ISO/IEC JTC1/SC29/WG11 N5525,

March 2003.

[92] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,”

Journal of Physiology, vol. 148, pp. 574–591, 1959.

[93] D. H. Hubel and T. N. Wiesel, “Receptive fields and functional architecture in two nonstriate

visual areas (18 and 19) of the cat,” Journal of Neurophysiology, vol. 28, pp. 229–289, 1965.

[94] S. Grossberg and E. Mingolla, “Neural dynamics of perceptual grouping: Textures, bound-

aries, and emergent segmentations,” Perception and Psychophysics, vol. 38, no. 2, pp. 141–

171, 1985.

[95] E. Kobatake and K. Tanaka, “Neuronal selectivities to complex object features in the ventral

visual pathway of the macaque cerebral cortex,” Journal of Neurophysiology, vol. 71, pp. 856–

867, March 1994.

[96] I. Biederman, “Recognition-by-components: A theory of human image understanding,” Psy-

chological Review, vol. 94, no. 2, pp. 115–147, 1987.

[97] A. Gove, S. Grossberg, and E. Mingolla, “Brightness perception, illusory contours, and cor-

ticogeniculate feedback,” Visual Neuroscience, vol. 12, pp. 1027–1052, 1995.

[98] P. Murphy and A. M. Sillito, “Corticofugal feedback influences the generation of length

tuning in the visual pathway,” Nature, vol. 329, pp. 727–729, 1987.

[99] S. Grossberg, E. Mingolla, and J. Williamson, “Synthetic aperture radar processing by a

multiple scale neural system for boundary and surface representation,” Neural Networks,

vol. 8, pp. 1005–1028, 1995.


[100] D. Walters, “Selection of image primitives for general-purpose visual processing,” Computer

Vision, Graphics, and Image Processing, vol. 37, pp. 261–298, 1987.

[101] M. Miyahara and Y. Yoshida, “Mathematical transform of (R,G,B) color data to Mun-

sell (H,V,C) color data,” in SPIE Visual Communications and Image Processing, vol. 1001,

pp. 650–657, 1988.

[102] J. Taylor, G. Murch, and P. McManus, “TekHVC: A uniform perceptual color system for

display users,” in Proceedings of the SID (Soc. for Info. Display), 1989.

[103] “http://www.onthenet.com.au/˜jolon/photodatabase.zip.”

[104] J. R. Smith and S.-F. Chang, “Quad-tree segmentation for texture-based image query,” in

Multimedia 94, pp. 279–286, 1994.

[105] A. Rosenfeld and E. B. Troy, “Visual texture analysis,” tech. rep., Computer Science Center,

University of Maryland, June 1970.

[106] G. Avrahami and V. Pratt, “Sub-pixel edge detection in character digitization,” in Raster

Imaging and Digital Typography II–Papers from the second RIDT meeting, pp. 54–64, 1991.

[107] T. P. Weldon, W. E. Higgins, and D. F. Dunn, “Gabor filter design for multiple texture

segmentation,” Optical Engineering SPIE, vol. 35, pp. 2852–2863, October 1996.

[108] P. Brodatz, Textures: A Photographic Album for Artists and Designers. New York: Dover,

1966.

[109] A. M. Sillito, K. L. Grieve, H. E. Jones, J. Cudeiro, and J. Davis, “Visual cortical mechanisms

detecting focal orientation discontinuities,” Nature, vol. 378, pp. 492–496, 1995.

[110] M. E. Larkum, J. J. Zhu, and B. Sakmann, “A new cellular mechanism for coupling inputs

arriving at different cortical layers,” Nature, vol. 398, pp. 338–341, 1999.

[111] D. H. Grosof, R. M. Shapley, and M. J. Hawken, “Macaque V1 neurons can signal ‘illusory’

contours,” Nature, vol. 365, pp. 550–552, 1993.

[112] M. Soriano, L. Spillman, and M. Bach, “The abutting grating illusion,” Vision Research,

vol. 36, pp. 109–116, 1996.

[113] N. A. Stillings, Cognitive Science An Introduction. MIT Press, 1995.

[114] J. Beck, A. Sutter, and R. Ivry, “Spatial frequency channels and perceptual grouping in

texture segmentation,” Computer Vision, Graphics, and Image Processing, vol. 37, pp. 299–

325, 1987.

[115] V. A. F. Lamme and H. Spekreijse, “Neuronal synchrony does not represent texture segre-

gation,” Nature, pp. 362–366, 1998.


[116] E. Cesmeli, “Texture segmentation using gabor filters and LEGION,” in Online Proceedings

of the 1996 Midwest Artificial Intelligence and Cognitive Science Conference, 1996.

[117] D. Huttenlocher, D. Klanderman, and A. Rucklige, “Comparing images using the Hausdorff

distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 850–

863, September 1993.

[118] P. V. C. Hough, “A Method and Means for Recognizing Complex Patterns.” US Patent:

3,069,654, December 1962.

[119] R. O. Duda and P. E. Hart, “Use of the Hough Transformation to Detect Lines and Curves

in Pictures,” Communication of the Association for Computing Machinery, vol. 15, no. 1,

pp. 11–15, 1972.

[120] J. Illingworth and J. Kittler, “A Survey of the Hough Transform,” Computer Vision, Graph-

ics, and Image Processing, vol. 44, pp. 87–116, 1988.

[121] T. Meier and K. N. Ngan, “Automatic video sequence segmentation using object tracking,”

in 1997 IEEE TENCON - Speech and Image Technologies for Computing and Telecommuni-

cations, pp. 283–286, IEEE, 1997.

[122] A. Nagasaka and Y. Tanaka, “Automatic video indexing and full-video search for object

appearances,” in Second Working Conference on Visual Database Systems, pp. 119–133,

IFIP WG, October 1991.

[123] H. Zhang, A. Kankanhalli, and S. W. Smoliar, “Automatic partitioning of full-motion video,”

Multimedia Systems, vol. 1, pp. 10–28, 1993.

[124] B. Furht, S. W. Smoliar, and H. Zhang, Video and Image Processing in Multimedia Systems.

Kluwer, 1995.

[125] D. Le Gall, “MPEG: A video compression standard for multimedia applications,” Communi-

cations of the ACM, vol. 34, pp. 46–58, April 1991.

[126] B.-L. Yeo and B. Liu, “On the extraction of DC sequences from MPEG compressed video,”

in The International Conference on Image Processing, vol. 2, pp. 260–263, 1995.

[127] J. M. Corridoni and A. D. Bimbo, “Structured digital video indexing,” in Proceedings of

IEEE International Conference on Pattern Recognition, pp. 125–129, 1996.

[128] M. Yeung and B.-L. Yeo, “Segmentation of video by clustering and graph analysis,” Computer

Vision and Image Understanding, vol. 71, pp. 94–109, July 1998.

[129] M. J. Berry, I. H. Brivanlou, T. A. Jordan, and M. Meister, “Anticipation of moving stimuli

by the retina,” Nature, vol. 398, pp. 334–338, March 1999.


[130] A. M. Sillito, H. E. Jones, G. L. Gerstein, and D. C. West, “Feature-linked synchronization

of thalamic relay cell firing induced by feedback from the visual cortex,” Nature, vol. 369,

pp. 479–482, 1994.

[131] B. B. Bederson and J. D. Hollan, “Pad++: A zooming graphical interface for exploring

alternate interface physics,” in ACM UIST ’94, pp. 17–26, 1994.

[132] A. Woodruff, J. Landay, and M. Stonebraker, “Goal-directed zoom,” in ACM CHI’98,

pp. 305–306, April 1998.

[133] H. Lieberman, “Powers of ten thousand: Navigating in large information spaces,” in ACM

UIST’94, pp. 15–16, November 1994.

[134] M. Sarkar and M. H. Brown, “Graphical fisheye views of graphs,” in ACM CHI’92, pp. 83–91,

May 1992.

[135] J. D. Mackinlay, G. G. Robertson, and S. K. Card, “The perspective wall: Detail and context

smoothly integrated,” in Proceedings of CHI ’91 Human Factors in Computing Systems,

pp. 173–179, 1991.

[136] G. G. Robertson and J. D. Mackinlay, “The document lens,” in ACM UIST’93, pp. 101–108,

November 1993.

[137] Y. K. Leung and M. D. Apperley, “A review and taxonomy of distortion-oriented presentation

techniques,” ACM Transactions on Computer-Human Interaction, vol. 1, pp. 126–160, June

1994.

[138] R. Rao and S. K. Card, “The table lens: Merging graphical and symbolic representations in

an interactive focus+context visualization for tabular information,” in Proceedings of ACM

CHI’94, pp. 318–482, 1994.

[139] J. Lamping, R. Rao, and P. Pirolli, “A focus+context technique based on hyperbolic geometry

for visualizing large hierarchies,” in ACM CHI’95, pp. 401–408, 1995.

[140] V. Hovestadt, O. Gramberg, and O. Deussen, “Hyperbolic user interfaces for computer aided

architectural design,” in ACM CHI’95 Conference Companion, pp. 304–305, May 1995.

[141] A. Taivalsaari, “The event horizon user interface model for small devices,” Tech. Rep. SMLI

TR-99-74, Sun Microsystems, March 1999.

[142] S. K. Card, G. G. Robertson, and J. D. Mackinlay, “The information visualizer, an informa-

tion workspace,” in ACM SIGCHI’91, pp. 181–188, April 1991.

[143] P. Lucas and L. Schneider, “Workscape: A scriptable document management environment,”

in ACM CHI’94 Conference Companion, pp. 9–10, April 1994.

[144] S. K. Card, G. G. Robertson, and W. York, “The WebBook and the WebForager: An Infor-

mation Workspace for the World-Wide Web,” in ACM CHI’96, pp. 111–117, 1996.


[145] G. Robertson, M. Czerwinski, K. Larson, D. C. Robbins, D. Thiel, and M. van Dantzich,

“Data mountain: Using spatial memory for document management,” in ACM UIST’98,

pp. 153–162, 1998.

[146] I. Greenberg, “Facing up to new interfaces,” IEEE Computer, pp. 14–16, April 1999.

[147] G. Robertson, M. van Dantzich, and D. C. Robbins, “Task gallery.”

http://www.research.microsoft.com/research/ui/TaskGallery.

[148] M. Zizi and M. Beaudouin-Lafon, “Hypermedia exploration with interactive dynamic maps,”

International Journal of Human-Computer Studies, vol. 37, pp. 441–464, 1995.

[149] C. Chen and M. Czerwinski, “From latent semantics to spatial hypertext – an integrated

approach,” in The Proceedings of the 9th ACM Conference on Hypertext and Hypermedia,

pp. 77–86, 1998.

[150] Motorola, PowerPC Microprocessor Family: The Programming Environments For 32-bit Mi-

croprocessors. Motorola, 1997.

[151] M. S. Aldenderfer and R. K. Blashfield, Cluster Analysis. Newbury Park, California: Sage

Publications, Inc., 1984.

[152] J. Nievergelt, H. Hinterberger, and K. C. Sevcik, “The grid file: An adaptable, symmetric

multikey file search,” ACM Transactions on Database Systems, vol. 9, pp. 38–71, March 1984.

[153] T. M. J. Fruchterman and E. M. Reingold, “Graph drawing by force-directed placement,”

Software – Practice and Experience, vol. 21, no. 11, pp. 1129–1164, 1991.

[154] R. Davidson and D. Harel, “Drawing graphs nicely using simulated annealing,” ACM Trans-

actions on Graphics, vol. 15, October 1996.

[155] T. Kamada and S. Kawai, “An algorithm for drawing general undirected graphs,” Information

Processing Letters, vol. 31, pp. 7–15, 1989.

[156] J. B. Kruskal and M. Wish, “Multidimensional scaling,” Tech. Rep. Sage University Paper

Series on Quantitative Applications in Social Sciences 07-011, Sage University, 1978.

[157] J. D. Cohen, “Drawing graphs to convey proximity: An incremental arrangement method,”

ACM Transactions on Computer-Human Interaction, vol. 4, no. 3, pp. 197–229, 1997.

[158] M. Kaufmann and D. Wagner, Drawing Graphs: Methods and Models. New York: Springer-

Verlag, 2001.

[159] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Com-

munications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[160] J. T. Robinson, “The K-D-B-tree: A search structure for large multidimensional dynamic

indexes,” in Proceedings of ACM SIGMOD 1981, pp. 10–18, 1981.


[161] R. F. Sproull, “Refinements to nearest-neighbor searching in k-dimensional trees,” Algorithmica, vol. 6, pp. 579–589, 1991.

[162] N. Roussopoulos, S. Kelley, and F. Vincent, “Nearest neighbor queries,” in Proceedings of ACM SIGMOD 1995, pp. 71–79, 1995.

[163] N. Roussopoulos and D. Leifker, “Direct spatial search on pictorial databases using packed R-trees,” in Proceedings of ACM SIGMOD 1985, pp. 17–31, 1985.

[164] A. Girgensohn, J. Boreczky, and L. Wilcox, “Keyframe-based user interfaces for digital video,” IEEE Computer, pp. 61–67, September 2001.

[165] W. I. Grosky and R. Mehrotra, “Index-based object recognition in pictorial data management,” Computer Vision, Graphics, and Image Processing, vol. 52, pp. 416–436, 1990.

[166] I. D. G. Macleod, Picture Language Machines, ch. On finding structure in pictures, p. 231. Academic, 1970.

[167] M. W. Matlin and H. J. Foley, Sensation and Perception. Needham Heights, MA: Simon & Schuster, Inc., 1991.

[168] L. Maffei and A. Fiorentini, “The visual cortex as a spatial frequency analyser,” Vision Research, vol. 13, pp. 1255–1267, 1973.

[169] D. H. Hubel, T. N. Wiesel, and M. P. Stryker, “Anatomical demonstration of orientation columns in macaque monkey,” Journal of Comparative Neurology, vol. 177, pp. 361–380, 1978.

[170] C. D. Gilbert and T. N. Wiesel, “Morphology and intracortical projections of functionally characterised neurones in the cat visual cortex,” Nature, vol. 280, pp. 120–125, 1979.

[171] R. C. Reid and J.-M. Alonso, “Specificity of monosynaptic connections from thalamus to visual cortex,” Nature, vol. 378, pp. 281–284, 1995.

[172] D. Ferster, S. Chung, and H. Wheat, “Orientation selectivity of thalamic input to simple cells of cat visual cortex,” Nature, vol. 380, pp. 249–252, 1996.

[173] U. Polat and C. W. Tyler, “What pattern the eye sees best,” Vision Research, vol. 39, pp. 887–895, 1999.

[174] D. G. Albrecht, R. L. De Valois, and L. G. Thorell, “Visual cortical neurons: Are bars or gratings the optimal stimuli?,” Science, vol. 207, pp. 88–90, 1980.

[175] R. von der Heydt, E. Peterhans, and G. Baumgartner, “Illusory contours and cortical neuron responses,” Science, vol. 224, pp. 1260–1262, 1984.

[176] T. F. Shipley and P. J. Kellman, “Strength of visual interpolation depends on the ratio of physically specified to total edge length,” Perception & Psychophysics, vol. 52, pp. 97–106, 1992.


[177] C. G. Gross and D. B. Bender, “Visual receptive fields of neurons in inferotemporal cortex of the monkey,” Science, vol. 166, pp. 1303–1306, 1969.

[178] M. Ito, H. Tamura, I. Fujita, and K. Tanaka, “Size and position invariance of neuronal responses in monkey inferotemporal cortex,” Journal of Neurophysiology, vol. 73, pp. 218–226, January 1995.

[179] I. Fujita, K. Tanaka, M. Ito, and K. Cheng, “Columns for visual features of objects in monkey inferotemporal cortex,” Nature, vol. 360, pp. 343–346, 1992.

[180] S. M. Kosslyn, W. L. Thompson, I. J. Kim, and N. M. Alpert, “Topographical representations of mental images in primary visual cortex,” Nature, vol. 378, pp. 496–498, 1995.

[181] W. T. Newsome, A. Mikami, and R. H. Wurtz, “Motion selectivity in macaque visual cortex. III. Psychophysics and physiology of apparent motion,” Journal of Neurophysiology, vol. 55, pp. 1340–1351, June 1986.

[182] J. A. Movshon and W. T. Newsome, “Visual response properties of striate cortical neurons projecting to area MT in macaque monkeys,” The Journal of Neuroscience, vol. 16, pp. 7733–7741, December 1996.

[183] W. T. Newsome and E. B. Pare, “A selective impairment of motion perception following lesions of the middle temporal visual area (MT),” The Journal of Neuroscience, vol. 8, pp. 2201–2211, June 1988.

[184] I. Biederman and G. Ju, “Surface versus edge-based determinants of visual recognition,” Cognitive Psychology, vol. 20, pp. 38–64, 1988.

[185] J. Chey, S. Grossberg, and E. Mingolla, “Neural dynamics of motion processing and speed discrimination,” Vision Research, vol. 38, pp. 2769–2786, 1997.

[186] E. P. Simoncelli and E. H. Adelson, “Computing optical flow distributions using spatio-temporal filters,” Tech. Rep. MIT Media Laboratory Vision and Modeling Technical Report #165, MIT, March 1991.

[187] S. Kawakami and H. Okamoto, “A cell model for the detection of local image motion on the magnocellular pathway of the visual cortex,” Vision Research, vol. 36, no. 1, pp. 117–147, 1995.

[188] S. A. Beardsley and L. M. Vaina, “Computational modelling of optic flow selectivity in MSTd neurons,” Network: Computation in Neural Systems, pp. 467–493, 1998.

[189] J. R. Smith and S.-F. Chang, “Quad-tree segmentation for texture-based image query,” in ACM Multimedia 94, pp. 279–286, 1994.

[190] C.-S. Lu and P.-C. Chung, “Wold features for unsupervised texture segmentation,” in IEEE International Conference on Pattern Recognition, pp. 1689–1693, August 1998.


[191] P. Kruizinga and N. Petkov, “Grating cell operator features for oriented texture segmentation,” in IEEE International Conference on Pattern Recognition, pp. 1010–1014, August 1998.
