1. INTRODUCTION
This chapter introduces the basics of video and image processing. An image or
video is stored in a computer only as a set of pixels with RGB values; the computer
knows nothing about the meaning of these pixel values. The content of an image is
quite clear to a person, but not to a computer. For example, it is a piece of cake for
you to recognize yourself in an image or video, even in a crowd, but this is extremely
difficult for a computer. Preprocessing helps the computer understand the content of
an image or video. What is this so-called content? Here, content means features of
the image or video or of their objects, such as color, texture, resolution, and motion.
An object can be viewed as a meaningful component of an image or video picture; a
moving car, a flying bird, or a person are all objects. There are many techniques for
image and video processing. This chapter starts with an introduction to general image
processing techniques and then discusses video processing techniques. Image
processing is introduced first because image processing techniques can also be
applied to video if each picture of a video is treated as a still image.
Example: Color
2. BACKGROUND
A few years ago, the problems of representation and retrieval of visual media
were confined to specialized image databases (geographical, medical, pilot
experiments in computerized slide libraries), in the professional applications of the
audiovisual industries (production, broadcasting and archives), and in computerized
training or education. The present development of multimedia technology and
information highways has put content processing of visual media at the core of key
application domains: digital and interactive video, large distributed digital libraries,
multimedia publishing. Though the most important investments have been targeted at
the information infrastructure (networks, servers, coding and compression, delivery
models, multimedia systems architecture), a growing number of researchers have
realized that content processing will be a key asset in putting together successful
applications. The need for content processing techniques has been made evident from
a variety of angles, ranging from achieving better quality in compression, allowing
user choice of programs in video-on-demand, achieving better productivity in video
production, providing access to large still image databases or integrating still images
and video in multimedia publishing and cooperative work.
Example: Texture
The content of an image includes resolution, color, intensity, and texture. Image
resolution is simply the size of the image in terms of display pixels. Color is
represented in the computer using the RGB color model: for each pixel on the screen,
there are three bytes (the R, G, and B color components) to represent its color, and
each color component is in the range 0 to 255. Intensity is the gray-level information
of a pixel, represented by one byte, so the intensity value is also in the range 0 to 255.
Texture characterizes local variations of image color or intensity. Although
texture-based methods have been widely used in computer vision and graphics, there
is no single commonly accepted definition of texture; each texture analysis method
defines texture according to its own model. Here we consider texture as a descriptor
of local color or intensity variation: image regions that are detected to have a similar
texture share a similar pattern of local variation of color or intensity.
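To make the relation between RGB color and gray-level intensity concrete, here is a
minimal sketch in Python. It assumes 8-bit component values and uses the common
luminance weights; the function name is illustrative, not part of any system described
in this chapter.

    def rgb_to_intensity(r, g, b):
        # Convert an 8-bit RGB pixel to a single 8-bit gray-level intensity.
        # The 0.299/0.587/0.114 weights are the usual luminance weights; a plain
        # average (r + g + b) / 3 is another common choice.
        intensity = 0.299 * r + 0.587 * g + 0.114 * b
        return int(round(min(max(intensity, 0), 255)))

    print(rgb_to_intensity(128, 128, 128))  # a mid-gray pixel -> 128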
3. COMMON IMAGE PROCESSING TECHNIQUES
3.1 Dithering:
Dithering is a process of using a pattern of solid dots to simulate shades of gray.
Different shapes and patterns of dots have been employed in this process, but the
effect is the same. When viewed from a great enough distance that the dots are not
discernible, the pattern appears as a solid shade of gray.
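As an illustration, the following sketch implements one simple form of dithering,
ordered dithering with a small Bayer threshold matrix. It assumes NumPy is available
and an 8-bit grayscale input; the names and matrix size are illustrative choices only.

    import numpy as np

    # 2x2 Bayer threshold matrix, normalized to [0, 1); larger matrices give finer patterns.
    BAYER_2X2 = np.array([[0, 2],
                          [3, 1]]) / 4.0

    def ordered_dither(gray):
        # Turn an 8-bit grayscale image into a black/white dot pattern.
        h, w = gray.shape
        tiled = np.tile(BAYER_2X2, (h // 2 + 1, w // 2 + 1))[:h, :w]
        return (gray / 255.0 > tiled).astype(np.uint8) * 255

    # A horizontal gray ramp becomes a dot pattern whose density follows the ramp.
    ramp = np.tile(np.linspace(0, 255, 64, dtype=np.uint8), (64, 1))
    dots = ordered_dither(ramp)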
3.2 Erosion:
Erosion is the process of eliminating all the boundary points from an object, leaving
the object smaller in area by one pixel all around its perimeter. If it narrows to less
than three pixels thick at any point, it will become disconnected (into two objects) at
that point. It is useful for removing from a segmented image objects that are too small
to be of interest. Shrinking is a special kind of erosion in that single-pixel objects are
left intact. This is useful when the total object count must be preserved. Thinning is
another special kind of erosion. It is implemented in a two-step process. The first step
will mark all candidate pixels for removal. The second step actually removes those
candidates that can be removed without destroying object connectivity.
3.3 Dilation:
Dilation is the process of incorporating into the object all the background pixels that
touch it, leaving it larger in area by that amount. If two objects are separated by less
than three pixels at any point, they will become connected (merged into one object) at
that point. It is useful for filling small holes in segmented objects. Thickening is a
special kind of dilation. It is implemented in a two-step process. The first step marks
all the candidate pixels for addition. The second step adds those candidates that can be
added without merging objects.
3.4 Opening:
The process of erosion followed by dilation is called opening. It has the effect of
eliminating small and thin objects, breaking objects at thin points, and generally
smoothing the boundaries of larger objects without significantly changing their area.
3.5 Closing:
The process of dilation followed by erosion is called closing. It has the effect of filling
small and thin holes in objects, connecting nearby objects, and generally smoothing
the boundaries of objects without significantly changing their area.
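The sketch below illustrates these four operations on a tiny binary image. It assumes
NumPy and SciPy (scipy.ndimage) are available; the example image is made up
purely for illustration.

    import numpy as np
    from scipy import ndimage

    # A 6x6 square object with a one-pixel hole in it.
    img = np.zeros((10, 10), dtype=bool)
    img[2:8, 2:8] = True
    img[4, 4] = False

    eroded  = ndimage.binary_erosion(img)   # boundary pixels removed: object shrinks
    dilated = ndimage.binary_dilation(img)  # background pixels touching the object added
    opened  = ndimage.binary_opening(img)   # erosion then dilation: removes small/thin parts
    closed  = ndimage.binary_closing(img)   # dilation then erosion: fills the small hole

    print(img.sum(), eroded.sum(), dilated.sum(), opened.sum(), closed.sum())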
3.6 Filtering:
Image filtering can be used for noise reduction, image sharpening, and image
smoothing. Applying a low-pass or high-pass filter to an image smooths or sharpens
it, respectively. A low-pass filter reduces the amplitude of high-frequency
components. Simple low-pass filters apply local averaging: the gray level at each
pixel is replaced with the average of the gray levels in a square or rectangular
neighborhood. A Gaussian low-pass filter attenuates high frequencies smoothly and is
often applied in the frequency domain via the Fourier transform. A high-pass filter
increases the relative amplitude of high-frequency components.
3.7 Segmentation:
Segmentation divides an image into regions: sets of pixels that are adjacent or
touching and that share common features such as color, intensity, or texture. When a
human observer views a scene, his or her visual system segments the scene
automatically; the process is so fast and efficient that one sees not a complex scene
but a collection of objects. A computer, however, must laboriously isolate the objects
in an image by breaking the image into sets of pixels, each of which is the image of
one object.
Image segmentation can be approached in three ways. The first is the region
approach, in which each pixel is assigned to a particular object or region. In the
boundary approach, only the boundaries that exist between the regions are located.
The third is the edge approach, in which edge pixels are identified and then linked
together to form the required boundaries.
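As a small illustration of the region approach, the following sketch thresholds a
grayscale image and labels the connected regions of the result. It assumes NumPy and
SciPy; the threshold value and the example image are arbitrary.

    import numpy as np
    from scipy import ndimage

    def segment_by_threshold(gray, threshold=128):
        # Region approach: pixels above the threshold form the foreground,
        # and connected groups of foreground pixels become separate regions.
        foreground = gray > threshold
        labels, num_regions = ndimage.label(foreground)
        return labels, num_regions

    # Two separate bright blobs yield two labeled regions.
    img = np.zeros((20, 20), dtype=np.uint8)
    img[2:6, 2:6] = 200
    img[12:18, 10:16] = 220
    labels, n = segment_by_threshold(img)
    print(n)  # -> 2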
3.8 Object Recognition:
The most difficult part of image processing is object recognition. Although there are
many image segmentation algorithms that can segment an image into regions with
some continuous feature, it is still very difficult to recognize objects from these
regions. There are several reasons for this. First, image segmentation is an ill-posed
task, and there is always some degree of uncertainty in the segmentation result.
Second, an object may contain several regions, and how to connect different regions
is another problem. At present, no algorithm can segment general images into objects
automatically with high accuracy. When there is some prior knowledge about the
foreground objects or the background scene, the accuracy of object recognition can
be quite good. Usually the image is first segmented into regions according to the
pattern of color or texture, and separate regions are then grouped to form objects. The
grouping process is important for the success of object recognition. Fully automatic
grouping is possible only when prior knowledge about the foreground objects or
background scene exists; in other cases, human interaction may be required to
achieve good object recognition accuracy.
4. BASIS OF VIDEO PROCESSING
4.1 Content of Digital Video
Generally speaking, there is much similarity between digital video and images. Each
picture of a video can be treated as a still image, so all the techniques applicable to
images can also be applied to video pictures. However, there are still differences. The
most significant difference is that video has temporal information and uses motion
estimation for compression. A video is a meaningful group of pictures that tells a
story or something else. Video pictures can be grouped into shots: a video shot is a
set of pictures taken in one continuous camera operation, i.e., between two camera
breaks. Within each shot, there can be one or more key pictures. A key picture is
representative of the content of a video shot; a long video shot may need multiple
key pictures. Video processing usually segments the video into separate shots, selects
key pictures from these shots, and then generates features of these key pictures. The
features (color, texture, objects) of the key pictures are what is searched in a video
query.
Video processing includes shot detection, key picture selection, feature generation,
and object extraction.
4.1.1 Shot Detection:
Shot detection is the process of detecting camera shots. A camera shot consists of one
or more pictures taken in one continuous camera operation. The general approach to
shot detection is to define a difference metric: if the difference between two pictures
is above a threshold, a shot boundary is declared between them. One such algorithm
uses binary search over the picture sequence to locate shot boundaries, which makes
it very fast while still achieving good detection performance.
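A minimal sketch of this idea, assuming NumPy and grayscale frames: the difference
metric is the L1 distance between gray-level histograms, and a boundary is declared
wherever consecutive frames differ by more than a threshold. Both the metric and the
threshold value are illustrative choices.

    import numpy as np

    def histogram_difference(frame_a, frame_b, bins=64):
        # Difference metric: L1 distance between normalized gray-level histograms.
        h_a, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
        h_b, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
        h_a = h_a / h_a.sum()
        h_b = h_b / h_b.sum()
        return float(np.abs(h_a - h_b).sum())

    def detect_shot_boundaries(frames, threshold=0.5):
        # Declare a shot boundary wherever consecutive frames differ more than the threshold.
        boundaries = []
        for i in range(1, len(frames)):
            if histogram_difference(frames[i - 1], frames[i]) > threshold:
                boundaries.append(i)
        return boundaries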
4.1.2 Key Picture Selection:
After shot detection, each shot is represented by at least one key picture. The choice
of key picture can be as simple as picking a particular picture in the shot: the first, the
last, or the middle one. However, in situations such as a long shot, no single picture
can represent the content of the entire shot. QBIC (Query by Image Content) uses a
synthesized key picture created by seamlessly mosaicking all the pictures in a given
shot using the computed motion transformation of the dominant background; this
picture is an authentic depiction of all the background captured in the whole shot. In
a CBIR (Content-Based Image Retrieval) system, key picture selection is often a
simple process that chooses the first and last pictures of a shot as key pictures.
4.1.3 Feature Generation:
After key picture selection, features of the key pictures, such as color, texture, and
intensity, are stored as indexes of the video shot. Users can then perform traditional
keyword searches as well as content-based queries by specifying a color, intensity, or
texture pattern. Only the generated features are searched, so retrieval can run in real
time.
4.1.4 Object Extraction:
During shot detection and key picture selection, the objects in the video are also
extracted, using image segmentation techniques or motion information.
Segmentation-based techniques rely mainly on image segmentation, and objects are
recognized and tracked by segmentation projection. Motion-based techniques use
motion vectors to distinguish objects from the background and to keep track of their
motion. Object extraction is a very difficult problem, and the new standard being
developed addresses how to extract objects in the video and encode them separately
into different layers. Ideally this process would not be manual, but it is also
unrealistic to expect it to be fully automatic.
5. CURRENT RESEARCH ON VIDEO INDEXING AND RETRIEVAL
Video indexing and retrieval is a very active research area. In the field of digital
video, computer-assisted content-based indexing is a critical technology and currently
a bottleneck in the productive use of video resources. Only an indexed video can
effectively support retrieval and distribution in video editing, production, video-on-
demand and multimedia information systems. To achieve this, we need algorithms
and systems that can store and retrieve video in a way that allows flexible and
efficient search based on content. This chapter discusses some important aspects of
the state of the art in video indexing and retrieval. It is organized as follows:
Video Parsing
Video Indexing and Retrieval
Object Recognition and Motion Tracking
5.1 Video Parsing
The first step of video processing is video parsing. Video parsing is the process of
segmenting a video stream into generic shots. These shots are the elementary index
units in a video database, just as words are in a text database. Each of these shots is
then represented by one or more key pictures, and only these key pictures are stored
in the video database. Video parsing involves several tasks, including shot detection
and key picture selection.
5.1.1 Shot Detection in video parsing:
The first step of video parsing is shot detection. Shot detection algorithms usually fall
into two classes: (1) those based on global representations, such as color/intensity
histograms, without any local information, and (2) those based on measuring local
differences, such as intensity changes. The former are relatively insensitive to motion
but can miss shot boundaries when scenes look quite different yet have similar
distributions. The latter are sensitive to moving objects and camera motion. Some
systems combine the advantages of the two classes by using a mixed method; QBIC
is one such system.
5.1.2 Key Picture Selection in video parsing:
The next step after shot detection is key picture selection. Each shot has at least one
key picture, which should best represent the visual content of the shot. The number of
key pictures for each shot can be constant or adaptive to the shot content. The first
picture is selected as a key picture, and subsequent pictures are compared against this
candidate. A two-threshold technique, similar to the one described above, is applied
to identify a picture significantly different from the candidate; this new picture is
considered another key picture, and the subsequent pictures are compared against the
new candidate. Users can control the density of key pictures by adjusting the two
threshold values.
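A minimal sketch of this adaptive selection, reusing the histogram_difference helper
from the shot-detection sketch above; the single-threshold simplification and the
threshold value are assumptions for illustration.

    def select_key_pictures(shot_frames, threshold=0.4):
        # Start with the first frame as a key picture, then add a new key picture
        # whenever a frame differs significantly from the current candidate.
        if len(shot_frames) == 0:
            return []
        keys = [0]
        candidate = shot_frames[0]
        for i in range(1, len(shot_frames)):
            if histogram_difference(candidate, shot_frames[i]) > threshold:
                keys.append(i)
                candidate = shot_frames[i]
        return keys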
5.1.3 Feature Generation in video parsing:
As at the shot level, after key picture selection the features of the key pictures, such
as color, texture, and intensity, are stored as indexes of the video shot. Users can
perform traditional keyword searches as well as content-based queries by specifying
a color, intensity, or texture pattern. Only the generated features are searched, so
retrieval can run in real time.
5.2 Video Indexing and Retrieval
After each object in a video shot has been segmented and tracked, its features, such
as color, texture, and motion, can be obtained and stored in a feature database. The
resulting database consists of simple (feature, value) pairs, and the actual query is
performed on this feature database. For each feature, there is a function that
calculates the distance between the query object and the tracked objects in the video
database. The total distance is a weighted sum of these per-feature distances; if the
total distance is below a certain threshold, the object is returned as a possible match.
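A small sketch of this matching scheme, with invented feature names and simple
absolute differences standing in for the real per-feature distance functions:

    def weighted_distance(query, candidate, weights):
        # Total distance = weighted sum of per-feature distances. Here a simple
        # absolute difference stands in for the real color/texture/motion distances.
        return sum(w * abs(query[name] - candidate[name]) for name, w in weights.items())

    def retrieve(query, database, weights, threshold=1.0):
        # Return every stored object whose total distance to the query is below the threshold.
        return [item for item in database if weighted_distance(query, item, weights) < threshold]

    # Hypothetical example: features are already reduced to single numbers per object.
    weights = {"color": 0.5, "texture": 0.3, "motion": 0.2}
    db = [{"color": 0.2, "texture": 0.7, "motion": 0.1},
          {"color": 0.9, "texture": 0.8, "motion": 0.9}]
    print(retrieve({"color": 0.25, "texture": 0.6, "motion": 0.1}, db, weights, threshold=0.3))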
There are also a number of image search and retrieval systems, such as the Yahoo
Image Surfer Category List, WebSeer, WebSEEk, VisualSEEk, UCB's "query all
images", Lycos, and MIT's Photobook. Some of them are based mainly on keyword
searching: the images are first assigned one or more keywords manually and
categorized into groups such as photos, arts, people, animals, and plants, and users
can then browse the category that interests them.
5.2.1 Examples of some image processing systems:
Two examples are the Yahoo Image Surfer Category List (YISCL) and Lycos. The
YISCL system also provides a visual search function based on color distribution
matching. UCB's "query all images" presents several interesting ideas such as
"blobworld" and "body plans". Blobworld is a region-based representation: while it
does not operate completely in the "thing" domain, it recognizes the nature of images
as combinations of objects, and querying and learning in blobworld are more
meaningful than they are with simple "stuff" representations. The
Expectation-Maximization (EM) algorithm is used to perform automatic
segmentation based on image features, and after segmentation each region is shown
as an elliptical blob. A body plan is an algorithm for image segmentation; for
example, a body plan that is a sophisticated model of the way a horse is put together
makes the program capable of recognizing horses in different aspects. MIT's
Photobook allows users to perform texture modeling, face recognition, shape
matching, brain matching, and interactive segmentation and annotation. Web Seek
allows users to draw a query that depicts the spatial relations between objects.
5.3 Object Recognition and Motion Tracking:
This is an important topic. In video, object recognition can be more reliable than in
still images because more information is available, the most valuable being motion
vectors. The motion vectors of a moving object have intrinsic patterns that conform
to a motion model. Several papers address object recognition using an affine motion
model, which allows the transformation that occurred between images to be
determined.
6. BASIC REQUIREMENTS
6.1 Video Data Modeling
In a conventional database management system (DBMS), access to data is based on
distinct attributes of well-defined data developed for a specific application. For
unstructured data such as audio, video, or graphics, similar attributes can be defined.
A means for extracting information contained in the unstructured data is required.
Next, this information must be appropriately modeled in order to support both user
queries for content and data models for storage.
Fig: First Stage in Video Data Adaptation: Data Modeling
From a structural perspective, a motion picture can be modeled as data
consisting of a finite-length sequence of synchronized audio and still images. This
model is a simple instance of the more general models for heterogeneous multimedia
data objects. Davenport described the fundamental film component as the shot: a
contiguously recorded audio/image sequence. To this basic component, attributes
such as content, perspective, and context can be assigned and later used to formulate
specific queries on a collection of shots. Such a model is appropriate for providing
multiple views on the final data schema and has been suggested by Lippman and
Bender. Smith and Davenport use a technique called stratification for aggregating
collections of shots by contextual descriptions called strata. These strata provide
access to frames over a temporal span rather than to individual frames or shot
endpoints. The technique is used primarily for editing and creating movies from
source shots; it also provides quick query access and a view of desired blocks of
video. Because of the linearity of the medium, a coherent description of an item
cannot be obtained directly, but the stratification method lumps the related
information together. The linear integrity of the raw footage is erased, resulting in
contextual information that relates the shot to its environment. Rowe has developed a
video-on-demand system for video data browsing. In this system the data are modeled
based on a survey of what users would query for, and three types of indices were
identified to satisfy user queries. The first is a textual bibliographic index, which
includes information about the video and the individuals involved in making it. The
second is a textual structural index of the hierarchy of a movie, i.e., its segments,
scenes, and shots. The third is a content index, which includes keyword indices for
the audio track, object indices for significant objects, and key images in the video
that represent important events.
The above model does not utilize the semantics associated with video data.
Different video data types have different semantics associated with them. We must
take advantage of this fact and model video data based on the semantics associated
with each data type.
6.2 Video indexing
Video annotation or indexing is the process of attaching content-based labels to
video. Video indexing is the process of extracting from the video data the temporal
location of a feature and its value.
6.2.1 Need of video indexing:
Indexing video data is essential for providing content-based access. Indexing has
typically been viewed either from a manual annotation perspective or from an image
sequence processing perspective. The indexing effort is directly proportional to the
granularity of video access, so as applications demand finer-grained access to video,
automation of the indexing process becomes essential. Given the current state of the
art in computer vision, pattern recognition, and image processing, reliable and
efficient automation is possible only for low-level video indices, such as cuts and
image motion properties.
Existing work on content-based video access and video indexing can be grouped
into three main categories:
6.2.1.1 High level indexing:
The work by Davis is an excellent instance of high-level indexing. This approach
uses a set of predefined index terms for annotating video, organized according to
high-level ontological categories such as action, time, and space.
High-level indexing techniques are designed primarily from the perspective of
manual indexing or annotation. This approach is suitable for dealing with small
quantities of new video and for accessing previously annotated databases.
6.2.1.2 Low level indexing:
These techniques provide access to video based on properties such as color and
texture, and can be classified under the label of low-level indexing.
The driving idea behind this group of techniques is to extract data features from
the video, organize the features according to some distance metric, and use
similarity-based matching to retrieve the video. Their primary limitation is the lack of
semantics attached to the features.
6.2.1.3 Domain specific indexing:
These techniques use the high-level structure of video to constrain low-level video
feature extraction and processing. They are effective in their intended domain of
application; their primary limitation is a narrow range of applicability.
Figure 1 shows a conceptual architecture for content-based video indexing and search.
Figure 1: Parsing, representation and organization of video content
6.3 Video Data Management
Here we want to know how to extract content from segmented video shots and then
index it effectively so that users can retrieve and browse large video collections.
Management of sequential video streams includes three steps: parsing, content
extraction and indexing, and retrieval and browsing.
Video parsing is the process of detecting scene changes, or the boundaries
between camera shots, in a video stream; the stream is segmented into generic clips.
These clips are the elemental index units in a video database, just like words in a text
database. Each of these clips is then represented visually by its key frames. To reduce
the requirements for mass storage, only these key frames are stored in the database.
There are two types of transitions: abrupt transitions (camera breaks) and gradual
transitions, e.g., fade-in, fade-out, dissolve, and wipe.
Indexing tags video clips when the system inserts them into the database. The tag
includes information based on a knowledge model that guides the classification
according to the semantic primitives of the images; indexing is thus driven by the
image itself and by any semantic descriptors provided by the model. Two types of
indices, text-based and image-based, are needed. The text-based index is typed in by
a human operator, based on the key frames, using a content logger. The image-based
index is constructed automatically from the image features extracted from the key
frames.
In retrieval and browsing, users can access the database through queries based on
text and/or visual examples, or browse it by interacting with displays of meaningful
icons; users can also browse the results of a retrieval query. It is important that both
retrieval and browsing appeal to the user's visual intuition. In a visual query, users
want to find video shots that look similar to a given example. In a concept query,
users want to find video shots by the presence of specific objects or events. Visual
queries can be realized by directly comparing low-level visual features such as color,
texture, shape, and temporal variance of video shots or of their representative frames
(i.e., key frames). The concept query, on the other hand, depends on object detection,
tracking, and recognition. Since fully automatic object extraction is still impossible,
some degree of user interaction is necessary in this process, but manual indexing
labour can be greatly reduced with the help of video analysis techniques.
7. IMAGE PROCESSING TECHNIQUES FOR VIDEO CONTENT EXTRACTION
The increase in the diversity and availability of electronic information has led to
additional processing requirements for retrieving relevant and useful data: the
accessibility problem. This problem is even more acute for audiovisual information,
where huge amounts of data have to be searched, indexed, and processed. Most
solutions to this type of problem point towards a common need: to extract relevant
information features for a given content domain, a process that involves two difficult
tasks: deciding what is relevant and extracting it. In fact, while content extraction
techniques are reasonably developed for text, video data is still essentially opaque.
Despite its obvious advantages as a communication medium, the lack of suitable
processing and communication platforms has delayed its introduction in a
generalized way. This situation is changing, and new video-based applications are
being developed.
7.1 Toolkit overview
videoCEL is a library for video content extraction. Its components extract relevant
features of video data and can be reused by different applications. The object model
includes components for video data modelling and tools for processing and
extracting video content, but currently the video processing is restricted to images.
At the data modelling level, the most significant concepts are the following:
Images, for representing the frame data: a numerical matrix whose values can be
colors, color map entries, etc.;
ColorMaps, which map entries into a color space, allowing an additional indexation
level;
ImageDisplayConverters and ImageIOHandlers, which convert images to and from
the platform-specific formats.
The object model of videoCEL is a subset of a more complete model, which also
includes concepts such as shots, shot sequences, and views. These concepts are
modelled in a distinct toolkit that provides functionality for indexing, browsing, and
playing annotated video segments.
A shot object is a discrete sequence of images with a set of temporal attributes
such as frame rate and duration and represents a video segment. A shot sequence
object groups several shots using some semantic criteria. Views are used to visualize
and browse shots and shot sequences.
7.2 Temporal segmentation tools
One of the most important tasks in video analysis is to specify a set of units in which
the video temporal sequence can be organized. The different video transitions are
important for video content identification and for defining the semantics of the video
language, which makes their detection one of the primary goals to be achieved. The
basic assumption of the transition detection procedures is that video segments are
spatially and temporally continuous, so the boundary images must undergo
significant content changes, which depend on the transition type and can be
measured. The original problem is thus reduced to the search for suitable difference
quantification metrics, whose maxima identify, with high probability, the temporal
locations of transitions.
7.3 Cut detection
The process of detecting cuts is quite simple, mainly because the changes in content
are very visible and always occur instantaneously between consecutive frames. The
implemented algorithm simply uses one of the quantification metrics, and a cut is
declared when the difference is above a certain threshold; its success is therefore
highly dependent on the suitability of the metric. The results obtained by applying
this procedure with some of our metrics are presented next. The threshold selection
was made empirically, trying to maximize the success of the detection (minimizing
both false and missed detections). The captured video segment belongs to an outdoor
news report, so its transitions are not very "artistic" (mainly cuts).
There are several well-known strategies that usually improve this detection. For
instance, adaptive thresholds make the threshold more flexible, allowing the
algorithm to adapt to diverse video content. An approach used with some success in
previous work, to compensate for the shortcomings of individual metrics, is simply to
take a weighted average of the differences obtained with two or more metrics.
Pre-processing the images with noise filters or lower-resolution operators is also
quite usual, reducing both image noise and processing complexity. Treating image
regions separately, in order to eliminate some of the more extreme values, noticeably
increases the detection accuracy, especially when only a few objects are moving in
the captured scene.
7.4 Gradual transition detection
Gradual transitions, such as fades, dissolves, and wipes, cause more gradual changes
that evolve over several images. Although the resulting differences are less distinct
from the average values, and can be similar to those caused by camera operations,
there are several successful procedures, which were adapted and are currently
supported by the toolkit.
7.4.1 Twin-Comparison algorithm :
This algorithm was developed after verifying that, although the first and last frames
of a transition are quite different, consecutive images remain very similar. Thus, as in
cut detection, this procedure uses one of the difference metrics but has two thresholds
instead of one: a higher one for cuts, and a lower one for gradual transitions. While
this algorithm only detects gradual transitions and distinguishes them from cuts, there
are other approaches which also classify fades, dissolves, and wipes.
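A rough sketch of the two-threshold idea, reusing histogram_difference from the
shot-detection sketch above; the threshold values and the exact accumulation rule are
illustrative assumptions, not the toolkit's actual implementation.

    def twin_comparison(frames, t_cut=0.6, t_gradual=0.2):
        # Two thresholds: a high one declares cuts directly, a lower one marks the
        # possible start of a gradual transition, which is confirmed when the
        # accumulated difference since the start frame exceeds the cut threshold.
        transitions = []           # (start_index, end_index, kind)
        start = None
        for i in range(1, len(frames)):
            d = histogram_difference(frames[i - 1], frames[i])
            if d > t_cut:
                transitions.append((i - 1, i, "cut"))
                start = None
            elif d > t_gradual:
                if start is None:
                    start = i - 1
                if histogram_difference(frames[start], frames[i]) > t_cut:
                    transitions.append((start, i, "gradual"))
                    start = None
            else:
                start = None       # change died out: discard the candidate start
        return transitions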
7.4.2 Edge-Comparison algorithm:
This algorithm analyses both edge change fractions, exiting and entering. Distinct
gradual transitions generate characteristic variations of these values. For instance, a
fade in always generates an increase in the entering edge fraction; conversely, a fade
out causes an increase in the exiting edge fraction; a dissolve has the same effect as a
fade out followed by a fade in.
7.5 Camera operation detection
As distinct transitions give different meanings to adjacent video segments, the
possible camera operations are also relevant for content identification. For example,
that information can be used to build salient stills and to select key frames or
segments for video representation. All the methods that detect and classify camera
operations start from the following observation: each operation generates
characteristic global changes in the captured objects and background. For example,
when a pan happens they move horizontally in the direction opposite to the camera
motion; tilts behave similarly but along the vertical axis; zooms generate convergent
or divergent motion.
7.5.1 X-ray based method:
This approach essentially produces fingerprints of the global motion flow. After the
edges are extracted, each image is reduced to its horizontal and vertical projections,
one column and one row, which roughly represent the horizontal and vertical global
motions and are usually referred to as the x-ray images.
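A minimal sketch of computing such projections with NumPy; the function name is
illustrative and assumes a binary or grayscale edge image as input.

    import numpy as np

    def xray_projections(edge_image):
        # Reduce an edge image to its two "x-ray" projections: summing along rows
        # gives one column (tracks vertical motion), summing along columns gives
        # one row (tracks horizontal motion).
        column = edge_image.sum(axis=1)
        row = edge_image.sum(axis=0)
        return column, row

    # Comparing the x-rays of consecutive frames (e.g., by the shift that best aligns
    # them) hints at pans (row shifts), tilts (column shifts), or zooms (both spread).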
7.6 Lighting conditions characterization
Light effects are often mentioned in the grammar of cinema language, as their
contribution is essential to the overall meaning of the video content. The lighting
conditions can easily be extracted by observing the distribution of the light-intensity
histogram: statistics such as its mode and mean are valuable in characterizing its
distribution type and spread. These features also allow the lighting variations to be
quantified once the similarity of the images has been determined.
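A small sketch of such a characterization, assuming NumPy and an 8-bit grayscale
frame; the particular statistics returned are an illustrative choice.

    import numpy as np

    def lighting_statistics(gray):
        # Summarize the lighting conditions of a frame from its gray-level histogram.
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        return {
            "mode": int(hist.argmax()),     # most frequent gray level
            "mean": float(gray.mean()),     # overall brightness
            "spread": float(gray.std()),    # how widely intensities are distributed
        }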
7.7 Scene segmentation
Scene segmentation refers to decomposing an image into its main components:
objects, background, captions, etc. It is a first step towards identifying and classifying
the main features of the scene and tracking them throughout the sequence. The
simplest segmentation method implemented is amplitude thresholding, which is quite
successful when the different regions have distinct amplitudes; it is a particularly
useful procedure for binarizing captions. Other methods are described below.
7.7.1 Region-based segmentation:
Region-based segmentation procedures find regions of an image that have similar
features. One such algorithm is the split-and-merge algorithm, which first divides the
image into atomic homogeneous regions and then merges similar adjacent regions
until the remaining adjacent regions are sufficiently different. Two distinct metrics are
needed: one for measuring the homogeneity of the initial regions (the variance, or any
other difference measure), and another for quantifying the similarity of adjacent
regions (the average, median, mode, etc.).
7.7.2 Motion-based segmentation:
The main idea of motion-based segmentation techniques is to identify image regions
with similar motion behaviour. These properties are determined by analysing the
temporal evolution of the pixels, carried out on a frequency image produced over the
whole image sequence. When the most constant pixels are selected, for example, the
resulting image is the background, with the motion removed. Once the background is
extracted, the same principle can be used to extract and track motion or objects.
7.7.3 Scene and object detection:
The process of detecting scenes or scene regions (objects) is, in a certain way, the
opposite of transition detection: we want to find image regions whose differences are
below a certain threshold. As a consequence, this procedure also uses difference
quantification metrics. These functions can be computed for the entire image, or a
hierarchical, growing-resolution calculation can be performed to accelerate the
process. Another tested algorithm, also hierarchical, is based on the Hausdorff
distance: it retrieves all the possible transformations (translation, rotation, etc.)
between the edges of two images. Another way of extracting objects is by
representing their contours. The toolkit uses a polygonal-line approach to represent
contours as a set of connected segments; the end of a segment is detected when the
ratio between the current segment's polygonal area and its length exceeds a certain
threshold.
7.7.4 Caption extraction:
Based on an existing caption extraction method, a new and more effective procedure
was implemented. As captions are usually artificially added to images, the first step
of this procedure is to extract high-contrast regions. This task is performed by
segmenting the edge image, whose contours have previously been dilated by a certain
radius. These regions are then subjected to caption-characteristic size constraints,
based on the properties of the x-rays (projections of the edge images); only the
horizontal cluster remains. The resulting image is segmented, and two different
images are produced: one with a black background for lighter text, and another with
a white background for darker text. The process is completed by binarizing both
images and applying further dimensional region constraints.
7.8 Edge detection
Two distinct procedures for edge detection were implemented: (1) gradient-magnitude
thresholding, where the image gradient vectors are obtained using the Sobel operator;
and (2) the Canny filter, often considered the optimal detector, which analyses the
representativeness of the gradient-magnitude maxima and thus produces thinner
contours. As differential operators amplify high-frequency zones, it is common
practice to pre-process the images with noise filters, a functionality also supported by
the toolkit in the form of several smoothing operators: a median filter, an average
filter, and a Gaussian filter.
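A minimal sketch of the first procedure, assuming NumPy and SciPy: smooth the
image, compute Sobel gradients, and threshold the gradient magnitude. The threshold
and smoothing parameters are arbitrary illustrative values.

    import numpy as np
    from scipy import ndimage

    def sobel_edges(gray, threshold=100.0):
        # Pre-filter noise, compute horizontal/vertical Sobel gradients,
        # then threshold the gradient magnitude to get a binary edge map.
        smoothed = ndimage.gaussian_filter(gray.astype(float), sigma=1.0)
        gx = ndimage.sobel(smoothed, axis=1)   # horizontal gradient
        gy = ndimage.sobel(smoothed, axis=0)   # vertical gradient
        magnitude = np.hypot(gx, gy)
        return magnitude > threshold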
8. APPLICATIONS
8.1 videoCEL applications:
Video browser:
This application is used to visualize video streams. The browser can load a stream
and split it into its shot segments using the cut detection algorithms. Each shot is then
represented in the browser's main window by an icon that is a reduced form of its
first frame, and the shots can be played using several view objects.
Weather Digest:
The Weather Digest application generates HTML documents from TV weather
forecasts. The temporal sequence of maps, presented on the TV, is mapped to a
sequence of images in the HTML page. This application illustrates the importance of
information models.
News analysis:
A set of applications was developed to be used by social scientists for content
analysis of TV news. The analysis consisted of filling in forms with news item
durations, subjects, etc., which these tools attempt to automate. The system generates
HTML pages with the images and CSV (Comma Separated Values) tables suitable
for use in spreadsheets such as Excel. Additionally, these HTML pages can also be
used for news browsing, and there is also a Java-based tool for accessing this
information.
8.2 COBRA Model
In order to explore video content and provide a framework for the automatic
extraction of semantic content from raw video data, we propose the Content-Based
Retrieval (COBRA) video data model. The model is independent of feature/semantic
extractors, providing the flexibility to use different video processing and pattern
recognition techniques for that purpose. A feature grammar is used to describe the
low-level persistent meta-data; the grammar also describes the dependencies between
the extractors.
At the same time it is in line with the latest development in MPEG-7,
distinguishing four distinct layers within video content: the raw data, the feature, the
object, and the event layer. The object and event layers are concept layers consisting
of entities characterized by prominent spatial and temporal dimensions respectively.
To provide automatic extraction of concepts (objects and events) from visual
features (which are extracted using existing video/image processing techniques), the
COBRA video model is extended with object and event grammars. These grammars
are aimed at formalizing the descriptions of these high-level concepts, as well as
facilitating their extraction based on features and spatial-temporal reasoning.
This rule-based approach results in the automatic mapping from features to
high-level concepts. However, we still have the problem of creating object and event
rules manually. This might be very difficult, especially in the case of certain object
rules which require extensive user familiarity with features and object extraction
techniques.
As the model also provides a framework for stochastic modeling of events, we
have chosen to exploit the learning capability of Hidden Markov Models (HMMs) to
recognize events in video data automatically.
Figure 1 - Video sequences
Figure 2 - Video shots
Figure 3 - Principal color detection
Figure 4 - Detected player
    SELECT vi.frame_seq
    FROM player pl, video vi
    WHERE Event:vi.frame.e=((Appr_net_BSL:
        ((e1:Player_near_the_net, e2:Backhand_slice),(),(),
        (),(before(e2,e1,n),n<75)),e1.o1.name=e2.o1.name="Sampras");

Query 1 - Video query
The WHERE clause of the query shown above constrains player profiles to only those
documents that contain videos with the event 'Approaching the net with backhand
slice stroke'. This new event description, defined inside the query, demonstrates how
complex events can be defined dynamically based on previously extracted events and
spatial-temporal relations. The first of the two predefined events is the
Player_near_the_net event, which is defined using spatial-temporal rules,
whereas the second, the Backhand_slice event, is defined using the HMM
approach. The temporal relation requires e1 to start at least 75 frames before event
e2. The event descriptions are evaluated by the query processor, which rewrites the
event from its conceptual definition into a standard object algebra extended with the
COBRA video model and spatial-temporal operations. A user is therefore able to
explore video content by specifying very detailed, complex queries that combine
features, objects, and events, as well as spatial-temporal relations among them.
CONCLUSION
Visual information has always been an important source of knowledge. With the
advances in information computing and communication technology, this information,
in the form of digital images and digital video, is also widely available through the
computer. To cope with this explosion of visual information, an organization of the
material that allows fast search and retrieval is required. This calls for systems that
can in some way provide content-based handling of visual information. In this
seminar I have tried to present the basic image processing techniques, the status of
content-based access to image and video databases, and some applications related to
video content extraction.
An image retrieval system is necessary for users that have large collections of
images, such as a digital library. During the last few years, some content-based
techniques for image retrieval have become commercially available. These systems
offer retrieval by color, texture, or shape, and smart combinations of these help users
find the image they are looking for. A video retrieval system is useful for video
archiving, video editing, production, etc.
BIBLIOGRAPHY
[1] Basis of Video and Image Processing, by Jian Wang
[2] A Survey on Content-Based Access to Image and Video Databases, by Kjersti Aas
and Line Eikvil
[3] Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art
Review, by Philippe Aigrain, HongJiang Zhang, and Dragutin Petkovic
[4] Advanced Content-Based Semantic Scene Analysis and Information Retrieval:
The SCHEMA Project, by E. Izquierdo, J. R. Casas, R. Leonardi, P. Migliorati,
Noel E. O'Connor, I. Kompatsiaris, and M. G. Strintzis
[5] Relevance Feedback Techniques in Content-Based Image Retrieval, by Yong Rui,
Thomas S. Huang, and Sharad Mehrotra
[6] Video Extraction in Compressed Domain, IEEE Conference on Advanced Video
and Signal Based Surveillance
[7] IEEE paper on image segmentation, block-based feature extraction, and large
video/image databases
[8] IEEE paper on face location in wavelet-based video compression for high
perceptual quality feature extraction