Scene Summarization


Transcript of Scene Summarization

Page 1: Scene Summarization

SCENE SUMMARIZATION

Written by: Alex Rav-Acha // Yael Pritch // Shmuel Peleg (2006)

PRESENTED BY: NIHAD AWIDAT

Page 2: Scene Summarization

Dynamic Video Synopsis

Making a Long Video Short

The power of video over still images…

Dynamic video synopsis provides a compact video representation, while preserving the essential activities of the original video.

Page 3: Scene Summarization

Related Work on Video Abstraction:

There are two main approaches to video abstraction:

1. A set of salient still images (key frames) is selected from the original video sequence.

2. A collection of short video sequences is selected.

Page 4: Scene Summarization

Related Work on Video Abstraction:

In both approaches, entire frames are used as the fundamental building blocks.

A different methodology is based on mosaic images.

Page 5: Scene Summarization

Dynamic video synopsis

Dynamic video synopsis:

1. The video synopsis is itself a video, expressing the dynamics of the scene.

2. Most of the activity in the video is condensed by simultaneously showing several actions.

Page 6: Scene Summarization

Dynamic Video Synopsis

A schematic video clip represented as a space-time volume.

Page 7: Scene Summarization

Dynamic Video Synopsis

Events are shifted from their original time interval to another time interval, to get an optimal use of image regions.

Video examples: http://www.vision.huji.ac.il/synopsis

Page 8: Scene Summarization

Dynamic Video Synopsis - notation

The N frames of an input video sequence are represented as a 3D space-time volume I(x, y, t), where (x, y) are the spatial coordinates of a pixel and 1 ≤ t ≤ N is the frame number. The synopsis video is denoted S(x, y, t).

Page 9: Scene Summarization

Activity Detection

Every input pixel is labeled with its level of "activity". An input pixel is "active" if its color difference from the temporal median at location (x, y) is larger than a given threshold.

To clean the activity indicator χ from noise, a median filter is applied to χ before continuing with the synopsis process.
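A minimal sketch of this activity-detection step, assuming the video is already loaded as a numpy array; the function name, threshold value, and filter size are illustrative choices rather than values from the paper.

import numpy as np
from scipy.ndimage import median_filter

def activity_indicator(video, threshold=30.0):
    """video: float array of shape (T, H, W, 3); returns chi of shape (T, H, W)."""
    # Temporal median per pixel approximates the static background.
    background = np.median(video, axis=0)               # (H, W, 3)
    # A pixel is "active" when its color differs from the background
    # by more than the threshold.
    diff = np.linalg.norm(video - background, axis=-1)  # (T, H, W)
    chi = (diff > threshold).astype(np.uint8)
    # Median filtering removes isolated noisy detections before
    # the synopsis process continues.
    chi = median_filter(chi, size=(3, 3, 3))
    return chi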

Page 10: Scene Summarization

Generating a Video Synopsis

We would like to generate a synopsis video S(x, y, t) having the following properties:

The video synopsis S should be substantially shorter than the original video I.

Maximum "activity" from the original video should appear in the synopsis video.

The motion of objects in the video synopsis should be similar to their motion in the original video.

The video synopsis should look good, and visible seams or fragmented objects should be avoided.

Page 11: Scene Summarization

Methods to Generate a Video synopsis

Two approaches to generating a video synopsis:

1. A low-level method using optimization on Markov Random Fields.

2. An object-based approach, in which objects are extracted from the input video.

Page 12: Scene Summarization

Method I: Video Synopsis by Energy Minimization

The synopsis video S(x, y, t) is generated from the input video I(x, y, t) with a mapping M, assigning to every coordinate (x, y, t) in the synopsis video S the coordinates of a source pixel from I.

Only the time of pixels is shifted, keeping the spatial locations fixed, so any synopsis pixel S(x, y, t) comes from an input pixel I(x, y, M(x, y, t)).

Page 13: Scene Summarization

Video Synopsis by Energy Minimization

The shorter video synopsis S is generated from the input video I by including the most active pixels ("activity strips" in the figure).

Page 14: Scene Summarization

Video Synopsis by Energy Minimization

The time shift M is obtained by solving an energy minimization problem, where the cost function is given by:

E(M) = Ea(M) + α · Ed(M)

where Ea(M) is the loss in activity, Ed(M) is the discontinuity cost across seams, and α is a relative weight. We want to minimize E(M).

Page 15: Scene Summarization

Loss in activity

We want to minimize the number of active pixels in the input video I that do not appear in the synopsis video S.

Loss in activity (the sum of active pixels in I minus the sum of active pixels that appear in the synopsis S):

Ea(M) = Σ over (x, y, t) in I of χ(x, y, t) − Σ over (x, y, t) in S of χ(x, y, M(x, y, t))

Page 16: Scene Summarization

Discontinuity cost across seams

We want to minimize the sum of color differences across seams between spatiotemporal neighbors in the synopsis video S and the corresponding neighbors in the input video I.

Discontinuity cost across seams:

Ed(M) = Σ over (x, y, t) in S of Σ over i of || S((x, y, t) + ei) − I((x, y, M(x, y, t)) + ei) ||²

where the ei are the six unit vectors pointing to the six spatio-temporal neighbors.

Page 17: Scene Summarization

Markov random field (MRF)

the cost function E(M) corresponds to a 3D Markov random field (MRF).

MRF is a complicated topic, some simplifications in order not to lose the whole picture.

Page 18: Scene Summarization

Markov chain (1D)

A Markov chain is a sequence of random variables X1, X2, X3, … with the Markov property: given the present state, the future and past states are conditionally independent.

The joint probability of the sequence is given by:

P(X1, X2, …, Xn) = P(X1) · P(X2 | X1) · P(X3 | X2) · … · P(Xn | Xn−1)

Page 19: Scene Summarization

MRF – (2D Markov chain)

An MRF generalizes the Markov chain from 1D to 2D: the uni-lateral dependence becomes bi-lateral, and the time domain becomes the space domain (there is no natural ordering on image pixels!). It is now a set of random variables having a Markov property described by an undirected graph.

Page 20: Scene Summarization

MRF (2D)

The blue nodes are the observed variables, which in our case are the pixels in the 3D volume of the output movie (2D space + 1D time; each can be assigned any time value corresponding to an input frame). The pink nodes are the hidden variables, which represent the activity cost. The edges between nodes (dependencies) are determined according to the discontinuity cost.

Image: http://nghiaho.com/?page_id=1366

Page 21: Scene Summarization

MRF – in our case

The cost function E(M) can therefore be minimized by algorithms such as iterative graph cuts.

Page 22: Scene Summarization

Restricted Solution Using a 2D Graph

The optimization of the cost function E(M), allowing each pixel in the video synopsis to come from any time, is a large-scale problem.

For example, an input video of 3 minutes which is summarized into a video synopsis of 5 seconds results in a graph with approximately 2^25 nodes, each having 5400 labels.

We therefore use the constraint that consecutive pixels in the synopsis video S are restricted to come from consecutive pixels in the input video I.

Page 23: Scene Summarization

Restricted Solution Using a 2D Graph

Under this restriction the 3D graph is reduced to a 2D graph where each node corresponds to a spatial location in the synopsis movie.

The label of each node M(x, y) determines the frame number t in I shown in the first frame of S.

A seam exists between two neighboring locations (x1, y1) and (x2, y2) in S if M(x1, y1) ≠ M(x2, y2).

Page 24: Scene Summarization

Restricted Solution Using a 2D Graph

The discontinuity cost Ed(M) along a seam is the sum of the color differences at this spatial location over all frames in S. The ei are now four unit vectors describing the four spatial neighbors.

The number of labels for each node is N − K, where N and K are the number of frames in the input and output videos respectively.

The activity loss for each pixel is the activity that falls outside its selected time window:

Ea(x, y) = Σ over t = 1..N of χ(x, y, t) − Σ over t = M(x, y)..M(x, y) + K of χ(x, y, t)
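A minimal sketch of how the two cost terms of this restricted formulation could be evaluated for a candidate labeling M(x, y); the array shapes, 0-based indexing, and single counting of neighbor pairs are illustrative assumptions, not the paper's implementation (which minimizes the cost with graph cuts rather than evaluating it directly).

import numpy as np

def restricted_synopsis_costs(video, chi, M, K):
    """
    video: (N, H, W, 3) float array of input frames.
    chi:   (N, H, W) activity indicator.
    M:     (H, W) integer labels; M[y, x] is the (0-based) input frame shown
           in the first synopsis frame at pixel (x, y), with 0 <= M < N - K.
    K:     number of synopsis frames.
    Returns (activity_loss, discontinuity_cost) for this labeling.
    """
    N, H, W, _ = video.shape
    t = np.arange(K)[:, None, None]                    # (K, 1, 1)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Synopsis pixels under the restriction S(x, y, t) = I(x, y, M(x, y) + t).
    idx = M[None] + t                                  # (K, H, W)
    chi_shown = chi[idx, ys[None], xs[None]]           # activity that made it into S

    # Activity loss: active pixels of I that fall outside the selected windows.
    activity_loss = float(chi.sum() - chi_shown.sum())

    # Discontinuity cost: for each spatial neighbor pair, compare the neighbor
    # as it appears in the synopsis with the neighbor taken at the center
    # pixel's time; the difference is nonzero only across seams (where M differs).
    discontinuity = 0.0
    for dy, dx in ((0, 1), (1, 0)):
        Mp = M[: H - dy, : W - dx]                     # center pixels
        Mq = M[dy:, dx:]                               # their right / lower neighbors
        yq, xq = ys[dy:, dx:][None], xs[dy:, dx:][None]
        s_q = video[Mq[None] + t, yq, xq]              # S at the neighbor
        i_q = video[Mp[None] + t, yq, xq]              # I at the neighbor, center's time
        discontinuity += float(np.sum((s_q - i_q) ** 2))
    return activity_loss, discontinuity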

Page 25: Scene Summarization

Video examples (the low-level method using optimization on Markov Random Fields):

http://www.vision.huji.ac.il/synopsis

Page 26: Scene Summarization

Method II: Object-Based Synopsis

The low-level approach for dynamic video synopsis is limited to satisfying local properties, such as avoiding visible seams. Avoiding the stroboscopic effect, for example, requires the detection and tracking of each object in the volume.

The object-based approach for dynamic video synopsis shifts objects in time and creates new synopsis frames that never appeared in the input sequence, in order to make better use of space and time.

Page 27: Scene Summarization

Object-Based Synopsis

Moving objects are detected as described earlier, by comparing each pixel to the temporal median and thresholding this difference, followed by noise cleaning with a spatial median filter and grouping into spatiotemporal connected components. This process results in a set of objects, where each object b is represented by its characteristic function χb(x, y, t) (the activity indicator restricted to the object's space-time support).

From each object, segments are created by selecting subsets of frames in which the object appears.

Such segments can represent different time intervals, optionally taken at different sampling rates.
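A minimal sketch of this object-extraction step, assuming scipy is available for 3D connected-component labeling; the minimum-size filter and the per-object representation are illustrative choices.

import numpy as np
from scipy.ndimage import label, find_objects

def extract_objects(chi, min_voxels=50):
    """chi: (T, H, W) activity indicator. Returns a list of (bbox_slices, mask) pairs."""
    # Group active voxels into spatiotemporal (3D) connected components.
    labeled, num = label(chi)
    objects = []
    for i, bbox in enumerate(find_objects(labeled), start=1):
        mask = labeled[bbox] == i          # characteristic function inside the bounding box
        if mask.sum() >= min_voxels:       # drop tiny components, likely residual noise
            objects.append((bbox, mask))
    return objects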

Page 28: Scene Summarization

Object-Based Synopsis

The video synopsis S will be constructed from the input video I using the following steps:

1. Objects b1 . . . br are extracted from the input video I.

2. A set of non-overlapping segments B is selected from the original objects.

3. A temporal shift M is applied to each selected segment, creating a shorter video synopsis while avoiding occlusions between objects and enabling seamless stitching.

Page 29: Scene Summarization

Object-Based Synopsis

Examples of a schematic temporal rearrangement of objects:

Two objects recorded at different times are shifted to the same time interval in the video synopsis. A single object moving during a long period is broken into segments with shorter time intervals, and those are played simultaneously, creating a dynamic stroboscopic effect. Intersection of objects does not disturb the synopsis when object volumes are broken into segments.

Page 30: Scene Summarization

Object-Based Synopsis

A pixel in the resulting synopsis may have multiple sources (coming from different objects), so a post-processing step stitches all the objects together.

The background image is generated by taking a pixel’s median value over all the frames of the sequence.

The selected objects can then be blended in.

Page 31: Scene Summarization

Object-Based Synopsis

We define the set of all pixels which are mapped to a single synopsis pixel (x, y, t) ∈ S as src(x, y, t). We also define a notation for the number of (active) pixels in an object (or a segment) b.

We then define an energy function which measures the cost of a subset selection of segments B and of a temporal shift M. It combines:

an activity loss,

a penalty for occlusions between objects (this enables flexibility in the temporal arrangement of the objects when the segmentation of moving objects is not perfect),

a penalty for a long synopsis.

Page 32: Scene Summarization

Object-Based Synopsis

(Equation slide: the energy sums the activity loss, the penalty for occlusions between objects, and the penalty for a long synopsis.)

Page 33: Scene Summarization

Object-Based Synopsis

Minimizing the energy over all possible segment selections B and temporal shifts M is impractical because of the very large number of possibilities.

However, the problem can be scaled down significantly by restricting the solutions.

Two restricted schemes are described: (i) video synopsis with a pre-determined length, and (ii) lossless video synopsis.

Page 34: Scene Summarization

i) Video-Synopsis with a Pre-determined Length

Each object is partitioned into overlapping and consecutive segments of length K.

All the segments are time-shifted to begin at time t = 1, and we are left with deciding which segments to include in the synopsis video.

Some objects may not appear in the synopsis video.

Page 35: Scene Summarization

Video-Synopsis with a Pre-determined Length

An occlusion cost is defined between all pairs of segments.

Let bi and bj be two segments with appearance times ti and tj, and let the support of each segment be represented by its characteristic function χ.

The cost between two segments is defined as the sum of color differences between the two segments, after both are shifted to time t = 1.
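A minimal sketch of this pairwise cost, assuming each segment has already been cut to K frames and stores its characteristic mask and colors aligned to t = 1; the dictionary layout is an illustrative assumption.

import numpy as np

def occlusion_cost(seg_i, seg_j):
    """
    Each segment is a dict with, after shifting to t = 1:
      'mask':  (K, H, W) characteristic function chi of the segment,
      'color': (K, H, W, 3) colors of the input video inside those frames.
    Returns the sum of color differences where both segments are active.
    """
    overlap = (seg_i['mask'] > 0) & (seg_j['mask'] > 0)                 # (K, H, W)
    diff = np.linalg.norm(seg_i['color'] - seg_j['color'], axis=-1)     # (K, H, W)
    return float(diff[overlap].sum())

Entries for segment pairs that intersect in the original movie (or that come from the same object, if the stroboscopic effect is undesirable) would simply be overwritten with infinity, as described on the next slide.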

Page 36: Scene Summarization

Video-Synopsis with a Pre-determined Length

For the synopsis video we select a partial set of segments B which minimizes the cost combining the activity loss of the excluded segments with the occlusion cost between the included segments (the synopsis length is the constant K, so no length penalty is needed here):

E(B) = Σ over b not in B of Ea(b) + Σ over pairs bi, bj in B of v(bi, bj)

To avoid showing the same spatio-temporal pixel twice, we set v(bi, bj) = ∞ for segments bi and bj that intersect in the original movie.

If the stroboscopic effect is undesirable, it can be avoided by setting v(bi, bj) = ∞ for all bi and bj that were sampled from the same object.

Page 37: Scene Summarization

Video-Synopsis with a Pre-determined Length

Video examples: http://www.vision.huji.ac.il/synopsis

Page 38: Scene Summarization

Video-Synopsis with a Pre-determined Length

To minimize the energy function we use simulated annealing.

After segment selection, a synopsis movie of length K is constructed by pasting together all the shifted segments.

Page 39: Scene Summarization

Simulated Annealing

Simulated annealing is named after annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects.

The algorithm slowly decreases the probability of accepting worse solutions as it explores the solution space.

Accepting worse solutions is a fundamental property, because it allows a more extensive search for the optimal solution.

https://en.wikipedia.org/wiki/Simulated_annealing

Page 40: Scene Summarization

Simulated Annealing

Overview: the goal is to bring the system from an arbitrary initial state to a state with the minimum possible energy.

The basic iteration: at each step, the SA heuristic considers some neighbouring state s' of the current state s, and probabilistically decides between moving the system to state s' or staying in state s. These probabilities ultimately lead the system to move to states of lower energy.

Typically this step is repeated until the system reaches a state that is good enough for the application, or until a given computation budget has been exhausted.

https://en.wikipedia.org/wiki/Simulated_annealing

Page 41: Scene Summarization

Simulated Annealing

In our case:

Each state describes the subset of segments that are included in the synopsis. Neighboring states are taken to be sets in which a segment is removed, added, or replaced with another segment. After segment selection, a synopsis movie of length K is constructed by pasting together all the shifted segments. A sketch of this search is given below.
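A minimal sketch of this search, assuming an energy(selection) function that combines the activity loss of the excluded segments with the occlusion cost of the included ones; the cooling schedule, step count, and move probabilities are illustrative assumptions, not the authors' settings.

import math
import random

def anneal_segments(segments, energy, steps=10000, t0=1.0, cooling=0.999):
    """
    segments: list of candidate segments.
    energy:   function mapping a set of selected indices to a scalar cost.
    Returns the best subset of segment indices found.
    """
    current = set()                               # start from an empty selection
    best, best_e = set(current), energy(current)
    cur_e, temp = best_e, t0
    for _ in range(steps):
        # Neighboring state: add, remove, or replace a single segment.
        cand = set(current)
        i = random.randrange(len(segments))
        if i in cand and random.random() < 0.5:
            cand.remove(i)
        else:
            cand.add(i)
            if random.random() < 0.3 and len(cand) > 1:
                cand.remove(random.choice([j for j in cand if j != i]))  # replace
        cand_e = energy(cand)
        # Accept improvements always; accept worse states with a probability
        # that shrinks as the temperature decreases.
        if cand_e < cur_e or random.random() < math.exp((cur_e - cand_e) / max(temp, 1e-9)):
            current, cur_e = cand, cand_e
            if cur_e < best_e:
                best, best_e = set(current), cur_e
        temp *= cooling
    return best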

Page 42: Scene Summarization

ii) Lossless Video Synopsis

A longer synopsis can be generated for video surveillance, in which all activities are guaranteed to appear. The goal is then to find a compact temporal rearrangement of the object segments.

We again use simulated annealing to minimize the energy. In this case, a state corresponds to a set of time shifts for all segments, and two states are defined as neighbors if their time shifts differ for only a single segment.

Page 43: Scene Summarization

ii) Lossless Video Synopsis

Page 44: Scene Summarization

Panoramic Video Synopsis

When a video camera is scanning a scene, much redundancy can be eliminated by using a panoramic mosaic.

Limited dynamics can be represented by a stroboscopic image. Constructing the panoramic video synopsis is done in a similar manner to the regular video synopsis, with a preliminary stage of aligning all the frames to some reference frame.

Page 45: Scene Summarization

Surveillance Examples

Video synopsis from street surveillance. (a) A typical frame from the original video (22 seconds). (b) A frame from a video synopsis movie (2 seconds) showing condensed activity. (c) A frame from a shorter video synopsis (0.7 seconds), showing an even more condensed activity.

Page 46: Scene Summarization

Video Indexing Through Video Synopsis

Video synopsis can be used for video indexing, providing the user with efficient and intuitive links for accessing actions in videos.

The information of the video is projected into the ”space of activities”, in which only activities matter.

Page 47: Scene Summarization

Webcam Synopsis: Peeking Around the World

The goal is to generate a short video that is a synopsis of an endless video stream, generated by webcams or surveillance cameras, and to address queries like "I would like to watch in one minute the highlights of this camera broadcast during the past day".

Page 48: Scene Summarization

Webcam Synopsis: Peeking Around the World

The process includes two major phases:

1. An online conversion of the video stream into a database of objects and activities (rather than frames).

2. A response phase, generating the video synopsis as a response to the user's query.

Page 49: Scene Summarization

Webcam Synopsis: Peeking Around the World

We analyze the video for interesting events and record an object-based description of the video, listing for each webcam the interesting objects, their duration, location, and appearance.

In a 3D space-time description of the video, each object is a "tube".

Moving objects are interesting, and so are phase transitions, when a moving object turns into background and vice versa.

Page 50: Scene Summarization

Webcam Synopsis: Peeking Around the World

The number of objects is unbounded. To keep a finite object queue, a procedure is needed for removing objects from the queue when space is exhausted.

Page 51: Scene Summarization

Generating a webcam synopsis

A two-phase process for webcam synopsis.

Online phase, during video capture (done in real time):

1. Object (tube) detection in space-time.

2. Inserting detected tubes into the object queue.

3. Removing tubes from the object queue when reaching a space limit.

Page 52: Scene Summarization

Generating a webcam synopsis

Response phase, building a synopsis according to a user query:

1. Construction of a time-lapse video of the changing background.

2. Selection of the tubes that will appear in the synopsis and their corresponding times.

3. Stitching the tubes and the background into a coherent video.

Page 53: Scene Summarization

Generating a webcam synopsis

Page 54: Scene Summarization

Computing Activity Tubes

Background construction: the appearance of the background changes in time due to changes in lighting, changes of background objects, etc. To compute the background image for each time, we use a temporal median over a few minutes before and after each frame.
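A minimal sketch of this per-frame background computation, assuming the relevant frames fit in memory; the window parameter (given in frames) is an illustrative stand-in for "a few minutes".

import numpy as np

def background_for_frame(video, t, window):
    """
    video:  (T, H, W, 3) array of frames.
    t:      index of the frame whose background is wanted.
    window: number of frames to use on each side of t.
    """
    lo, hi = max(0, t - window), min(video.shape[0], t + window + 1)
    # The per-pixel temporal median over the surrounding window is robust to
    # moving objects and slowly tracks lighting changes.
    return np.median(video[lo:hi], axis=0)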

Page 55: Scene Summarization

Computing Activity Tubes

Moving-object extraction using min-cut: background subtraction is used together with min-cut to get a smooth segmentation of foreground objects.

Page 56: Scene Summarization

Computing Activity Tubes

Foreground-background phase transitions: tubes that abruptly begin or end in the middle of a frame represent phase transitions (examples are cars being parked or leaving a parking spot).

In most cases phase transitions are significant events. They are detected by looking for background changes that correspond to the beginning and ending of tubes.

Page 57: Scene Summarization

Computing Activity Tubes

Energy between tubes: we define the energy of interaction between tubes. The activity tubes are stored in the object queue B. Each tube b is defined over a finite time segment in the original video stream. The synopsis video is generated based on a temporal mapping M, shifting each object b in time from its original time segment into a time segment in the video synopsis.

Page 58: Scene Summarization

Computing Activity Tubes

The optimal synopsis video minimizes an energy function that combines an activity cost, a collision cost, and a temporal consistency cost:

E(M) = Σ over tubes b of Ea(b̂) + Σ over pairs of tubes b, b' of (α · Ec(b̂, b̂') + β · Et(b̂, b̂'))

where M(b) = b̂ indicates the time shift of tube b into the synopsis, Ea is the activity cost, Ec the collision cost, Et the temporal consistency cost, and α, β are relative weights.

The activity cost favors synopsis movies with maximum activity. It penalizes objects that are not mapped to a valid time in the synopsis.

For every two “shifted” tubes and every relative time shift between them, we define the collision cost as the volume of their space-time overlap weighted by their activity measures.

The temporal consistency cost adds a bias towards preserving the chronological order of events.
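A minimal sketch of the collision cost between two shifted tubes, assuming each tube stores its (shifted) start time in the synopsis and a per-frame activity measure on its pixels; the data layout is an illustrative assumption.

import numpy as np

def collision_cost(tube_a, tube_b):
    """
    Each tube is a dict with:
      'start':    start frame of the tube in the synopsis (after the time shift),
      'activity': (L, H, W) activity measure of the tube over its L frames.
    Returns the activity-weighted volume of the space-time overlap.
    """
    a0, b0 = tube_a['start'], tube_b['start']
    a1 = a0 + tube_a['activity'].shape[0]
    b1 = b0 + tube_b['activity'].shape[0]
    lo, hi = max(a0, b0), min(a1, b1)
    if lo >= hi:                       # the shifted tubes do not overlap in time
        return 0.0
    act_a = tube_a['activity'][lo - a0:hi - a0]
    act_b = tube_b['activity'][lo - b0:hi - b0]
    # Overlap weighted by the activity of both tubes.
    return float(np.sum(act_a * act_b))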

Page 59: Scene Summarization

The Object Queue

When an object is inserted into the queue, its activity cost is computed to accelerate the future construction of synopsis videos.

When removing objects (tubes) from the queue, we prefer to remove objects that are least likely to be included in a final synopsis according to three simple criteria that can be computed efficiently: “importance” (activity), “collision potential”, and “age”.

Page 60: Scene Summarization

Synopsis Generation

Given the desired period from the input video and the desired length of the synopsis, the synopsis video is generated using four steps:

1. Generating a background video.

2. Once the background video is defined, a consistency cost is computed for each object and for each possible time in the synopsis.

3. An energy minimization step determines which tubes (space-time objects) appear in the synopsis and at what time.

4. The selected tubes are combined with the background time-lapse to get the final synopsis.

Page 61: Scene Summarization

Synopsis Generation

Time-lapse background video: generated before adding activity tubes into the synopsis.

It should represent the background changes over time, and it should represent the background for the activity tubes. These two goals are conflicting, as representing the background of activity tubes is done best when the background video covers only active periods, ignoring, for example, most night hours.

We address this trade-off by constructing two temporal histograms: a temporal activity histogram Ha of the video stream, and a uniform temporal histogram Ht. We compute a third histogram by interpolating the two histograms, λ · Ha + (1 − λ) · Ht, where λ is a weight given by the user (see the sketch below).
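A minimal sketch of how background frames for the time-lapse could be drawn from the interpolated histogram λ · Ha + (1 − λ) · Ht; the sampling-without-replacement choice and the function name are illustrative assumptions.

import numpy as np

def background_frame_times(activity_hist, lam, num_frames, rng=None):
    """
    activity_hist: (T,) temporal activity histogram Ha of the input stream.
    lam:           user weight lambda between activity-driven and uniform sampling.
    num_frames:    number of background frames wanted for the time-lapse.
    Returns sorted frame indices sampled from lam*Ha + (1-lam)*Ht.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(activity_hist)
    Ha = activity_hist / activity_hist.sum()   # normalized activity histogram
    Ht = np.full(T, 1.0 / T)                   # uniform temporal histogram
    H = lam * Ha + (1.0 - lam) * Ht            # interpolated histogram
    times = rng.choice(T, size=num_frames, replace=False, p=H)
    return np.sort(times)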

Page 62: Scene Summarization

Synopsis Generation

Consistency with background: since we do not assume accurate segmentation of moving objects, we prefer to stitch tubes to background images having a similar appearance.

This tube-background consistency can be taken into account by adding a new energy term Eb(M), which measures the cost of stitching the object to the time-lapsed background. In this term, the color values of the mapped tube b̂ are compared to the color values of the time-lapsed background over σ(b̂), the set of pixels on the border of the mapped activity tube b̂, and t_out denotes the duration of the output synopsis.

Page 63: Scene Summarization

Synopsis Generation

Energy minimization: to create the final synopsis video we look for a temporal mapping M that minimizes the energy of the activity tubes together with the background consistency term.

It can be minimized by the simulated annealing method. The simulated annealing works in the space of all possible temporal mappings M, including the special case in which a tube is not used at all in the synopsis video.

Page 64: Scene Summarization

Synopsis Generation

Energy minimization by simulated annealing:

Each state describes the subset of tubes that are included in the synopsis. Neighboring states are defined as states in which a single activity tube is removed or changes its mapping into the synopsis. As the initial state we use the state in which all tubes are shifted to the beginning of the synopsis movie.

Page 65: Scene Summarization

Synopsis Generation

Stitching the synopsis video: each tube is stitched independently to the time-lapse background, using the Poisson blending method.

Page 66: Scene Summarization

Examples

Page 67: Scene Summarization

Examples

Online demos

http://www.vision.huji.ac.il/synopsis

Page 68: Scene Summarization

Thank you

Page 69: Scene Summarization

REFERENCES

Making a Long Video Short: Dynamic Video Synopsis. A. Rav-Acha, Y. Pritch, and S. Peleg, CVPR'06, New York, June 2006, pp. 435-441. http://www.cs.huji.ac.il/~peleg/papers/cvpr06-synopsis.pdf

Webcam Synopsis: Peeking Around the World. Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg, ICCV 2007, Rio de Janeiro, Oct. 2007. http://www.cs.huji.ac.il/~peleg/papers/iccv07-webcam.pdf

Non-Chronological Video Synopsis and Indexing. Y. Pritch, A. Rav-Acha, and S. Peleg, IEEE Trans. PAMI, to appear Nov. 2008. http://www.cs.huji.ac.il/~peleg/papers/pami08-synopsis.pdf

Markov Random Field (MRF): https://en.wikipedia.org/wiki/Markov_random_field

Simulated Annealing: https://en.wikipedia.org/wiki/Simulated_annealing