Content Based Multimedia Signal Processing
Yu Hen Hu
University of Wisconsin – Madison
Outline
• Multimedia content description Interface (MPEG-7)
• Video content features
• Spoken content features
• Multimedia indexing and retrieval
• Multimedia summary, filtering
• Other applications
MPEG-7 Overview
• Large amounts of digital content are available
• It is easy to create, digitize, and distribute audio-visual content
• Family album syndrome – need to organize, index, and retrieve
• Information overload – need filtering
• MPEG-7 objective: provide interoperability among systems and applications used in the generation, management, distribution, and consumption of audio-visual content descriptions; help users identify, retrieve, or filter audio-visual information.
Potential Application of MPEG-7
• Summary
– Generation of multimedia program guides or content summaries
– Generation of content descriptions of A/V archives to allow seamless exchange among content creators, aggregators, and consumers
• Filtering
– Filter and transform multimedia streams in resource-limited environments by matching user preferences, available resources, and content descriptions
• Retrieval
– Recall music using samples of tunes
– Recall pictures using sketches of shape, color, or movement, or a description of a scenario
• Recommendation
– Recommend program materials by matching user preferences (profiles) to program content
• Indexing
– Create a family photo or video library index
Content descriptions
• Descriptors (Ds)
– MPEG-7 contains standardized descriptors for audio, visual, and generic content
– It standardizes how these content features are characterized, but not how they are extracted
– Different levels of syntactic and semantic description are available
• Description Schemes (DSs)
– Specify the structure and relations among different A/V descriptors
• Description Definition Language (DDL)
– A standardized language based on XML (eXtensible Markup Language) for defining new Ds and DSs, and for extending or modifying existing Ds and DSs
Visual Color Descriptors
• Color space: HSV (hue-saturation-value)
• Scalable color descriptor (SCD): a color histogram (uniform 256 bins) of an image in HSV space, encoded by a Haar transform
• Color layout descriptor: spatial distribution of color in an arbitrarily shaped region
• Dominant color descriptor (DCD): colors are clustered first
• Color structure descriptor (CSD): scan the image with an 8x8 sliding window and count the occurrences of each particular color within the window
• Group of Frames/Group of Pictures (GoF/GoP) color descriptor
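As a rough sketch of the histogram underlying the SCD, the snippet below quantizes HSV pixels into the 16x4x4 = 256-bin layout; the function name and bin ordering are illustrative assumptions, and the Haar-transform encoding step is omitted.

```python
def hsv_histogram(pixels, h_bins=16, s_bins=4, v_bins=4):
    """Uniform HSV histogram in a 16x4x4 = 256-bin layout (sketch of the
    histogram feeding the MPEG-7 scalable color descriptor, before Haar
    encoding). `pixels` is a list of (h, s, v) tuples with h in [0, 360)
    and s, v in [0, 1]."""
    hist = [0] * (h_bins * s_bins * v_bins)
    for h, s, v in pixels:
        hi = min(int(h / 360.0 * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist
```

A pure black pixel lands in bin 0 and a bright, saturated red-adjacent pixel in the last bin; the histogram sums to the number of pixels.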
Visual Texture Descriptor
• Texture browsing D.
– Regularity: 0 (irregular) to 3 (periodic)
– Directionality: up to 2 dominant directions, quantized 1–6 in 30° increments
– Coarseness: 0 (fine) to 3 (coarse)
• Edge histogram D.
– 16 sub-images
– 5 edge-direction bins per sub-image
• Homogeneous texture D. (HTD)
– Divide frequency space into 30 bins (5 radial, 6 angular)
– A 2D Gabor filter bank is applied to each bin
– Energy and energy deviation in each bin are computed to form the descriptor
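A minimal sketch of the block-classification step behind an edge histogram descriptor: each 2x2 block is labeled by its strongest response to five small edge filters (vertical, horizontal, two diagonals, non-directional). The filter coefficients and threshold here are illustrative choices, not normative values.

```python
import math

# Five 2x2 edge filters (row-major coefficients), one per edge-direction
# bin: vertical, horizontal, 45-degree, 135-degree, non-directional.
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag_45":         (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diag_135":        (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}

def edge_type(block, threshold=10.0):
    """Classify a 2x2 block (a0, a1, a2, a3, row-major pixel values) by
    its strongest filter response; return None when no response exceeds
    the threshold (i.e. the block is treated as edge-free)."""
    best, best_mag = None, threshold
    for name, (f0, f1, f2, f3) in FILTERS.items():
        mag = abs(block[0]*f0 + block[1]*f1 + block[2]*f2 + block[3]*f3)
        if mag > best_mag:
            best, best_mag = name, mag
    return best
```

Counting these labels over the blocks of each of the 16 sub-images yields the 16 x 5 = 80-bin histogram.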
Visual Shape Descriptor
• 3D shape D. – shape spectrum
– Histogram (100 bins, 12 bits/bin) of a shape index, computed over a 3D surface
– Each shape index measures local convexity
• Region-based shape D.: ART
– Angular radial transform: shape analysis based on moments
– ART basis functions: Vnm(ρ, θ) = exp(jmθ) Rn(ρ), where Rn(ρ) = 2 cos(πnρ) for n ≠ 0 and Rn(ρ) = 1 for n = 0
• Contour-based shape D.: curvature scale space (CSS)
– N points/curve, successively smoothed by the kernel [0.25 0.5 0.25] until the curve becomes convex
– The curvature at each point forms a curvature function at that scale
– Peaks across scales are used as features
• 2D/3D D.
– Use multiple 2D descriptors to describe a 3D shape
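The CSS smoothing step can be sketched as repeated convolution of a closed contour with the [0.25 0.5 0.25] kernel; the function below is a minimal illustration (the curvature computation and the convexity stopping test are omitted).

```python
def smooth_contour(points, iterations=1):
    """Apply the [0.25, 0.5, 0.25] kernel along a closed 2D contour, the
    smoothing used when building a curvature scale space representation.
    `points` is a list of (x, y) tuples; the contour wraps around."""
    n = len(points)
    for _ in range(iterations):
        points = [
            tuple(0.25 * points[(i - 1) % n][k]
                  + 0.5 * points[i][k]
                  + 0.25 * points[(i + 1) % n][k] for k in range(2))
            for i in range(n)
        ]
    return points
```

One pass pulls the corners of a unit square inward toward the centroid; because the kernel weights sum to 1, the contour's centroid is preserved while concavities are progressively smoothed away.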
Visual Motion Descriptor
• Motion activity D.
– Intensity
– Direction of activity
– Spatial distribution of activity
– Temporal distribution of activity
• Camera motion
– Panning
– Booming (lifting up)
– Tracking
– Tilting
– Zooming
– Rolling (around the image center)
– Dollying (backward)
• Warping (w.r.t. mosaic)
• Motion trajectory
[Figure: MPEG-7 motion descriptors – a video segment carries camera motion, motion activity, and a mosaic with warping parameters; a moving region carries a motion trajectory and parametric motion.]
MPEG-7 Audio Content Descriptors
• 4 classes of audio signals
– Pure music
– Pure speech
– Pure sound effects
– Arbitrary sound track
• Audio descriptors
– Silence Ds: silence type
– Sound effect Ds: audio spectrum, sound effect features
– Spoken content Ds: speaker type, link type, extraction info type, confusion info type
– Timbre Ds: instrument, harmonic instrument, percussive instrument
– Melody contour Ds: contour, meter, beat
Spoken content description
Goal: to support robust retrieval from the potentially erroneous transcriptions produced by an automatic speech recognition (ASR) system.
• Spoken content header
– Word lexicon (vocabulary)
– Phone lexicon: IPA (International Phonetic Association alphabet), SAMPA (Speech Assessment Methods Phonetic Alphabet)
– Phone confusion statistics
– Speaker information
• Spoken content lattice (word or phone)
– Lattice nodes
– Word and phone links
[Figure: the speech waveform passes through audio processing and ASR to produce a lattice; an MPEG-7 encoder turns it into a header plus lattice description. Links in the example lattice carry probabilities (e.g. 0.6, 0.7, 0.3).]
Use of Content Features
• Multimedia information retrieval
– Create searchable archives of A/V materials, e.g. albums, digital libraries
– Real-world examples: call routing, technical support, on-line manuals, shopping, multimedia on demand
• Filtering
– Automated email sorters
– Personalized information portals
• Enhance low-level signal processing
– Coding and transcoding
– Post-processing
Content-based Retrieval
[Figure: content-based retrieval architecture – an input module extracts features from multimedia data into an image database and a feature database; a query module supports interactive query formation by the user and extracts query features; a retrieval module performs feature comparison, produces output, and supports browsing and feedback.]
Multimedia CBR System Design Issues
• Requirement analysis
– How are the multimedia materials to be used?
– Determines what set of features is needed
• Archiving
– How should individual objects be stored? At what granularity?
• Indexing (query) and retrieving
– With multi-dimensional indices, what is an effective and efficient retrieval method?
– What is a suitable, perceptually consistent similarity measure?
• User interface
– Modality? Text, spoken language, or others?
– Interactive or batch? Will dialogue be available?
Multimedia Archiving
• Facts
– Multimedia data are often in compressed formats and need large storage space
– The content index also occupies storage space
• Issues
– Granularity must match the underlying file system
– Logical versus physical segmentation
– File allocation on the file system must support multiple stream access with low latency
Indexing and Retrieving
• Index
– A very high dimensional binary vector
– An encoding of content features
– Text-based content can be represented with term vectors
– A/V content features can be either Boolean vectors or term vectors
• Retrieval
– Retrieval is a pattern classification problem
– Use the index vector as the feature vector
– Classify each object as relevant or irrelevant to a query vector (template)
– A perceptually consistent similarity measure is essential
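Treating retrieval as classification against a query vector can be sketched with a similarity threshold. Cosine similarity is used here as one example of a similarity measure; the names `cosine` and `retrieve` and the threshold value are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, threshold=0.5):
    """Classify each indexed object as relevant (similarity at or above
    the threshold) or irrelevant to the query vector, and return the
    relevant ones ranked by similarity."""
    hits = []
    for i, v in enumerate(index):
        s = cosine(query, v)
        if s >= threshold:
            hits.append((i, s))
    return sorted(hits, key=lambda t: -t[1])
```

For a query [1, 0] against the index [[1, 0], [0, 1], [1, 1]], objects 0 and 2 are returned as relevant while the orthogonal object 1 is filtered out.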
Term Vector Query
• Each document is represented by a specific term vector
• A term is a keyword or a phrase
• A term vector is a vector of terms; each dimension of the vector corresponds to a term
• Dimension of a term vector = total number of distinct terms
• Example:
Set of terms = [tree, cake, happy, cry, mother, father, big, small]
Documents = "Father gives me a big cake. I am so happy", "Mother planted a small tree"
Term vectors: [0, 1, 1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 1, 0, 0, 1]
• A probabilistic term vector representation:
– Relative term frequency (within a document): tf(t,d) = count of term t / # of terms in document d
– Inverse document frequency: df(t) = total # of documents / # of documents containing t
– Weighted term frequency: dt = tf(t,d) · log[df(t)]
– Inverse document frequency term vector: D = [d1, d2, …]
Inverse Term Frequency Vector
ITF Vector Example
Document 1: The weather is great these days.
Document 2: These are great ideas
Document 3: You look great
Eliminate: The, is, these, are, you
Term      tf(t,1)  tf(t,2)  tf(t,3)  df(t)   D1     D2     D3
weather   1/6      0        0        3       0.08   0.00   0.00
great     1/6      1/4      1/3      1       0.00   0.00   0.00
day       1/6      0        0        3       0.08   0.00   0.00
idea      0        1/4      0        3       0.00   0.12   0.00
look      0        0        1/3      3       0.00   0.00   0.16
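A small sketch reproducing the example above with the slide's definitions (tf = term count / total words in the document, df = total documents / documents containing the term, weight = tf · log10(df)). The stop-word list and the tiny stemming map are hard-coded for this example only.

```python
import math

docs = [
    "the weather is great these days",
    "these are great ideas",
    "you look great",
]
stop = {"the", "is", "these", "are", "you"}
stem = {"days": "day", "ideas": "idea"}  # crude stemming for this example

tokenized = [[stem.get(w, w) for w in d.split()] for d in docs]
terms = sorted({w for d in tokenized for w in d if w not in stop})

def tf(t, d):
    # relative term frequency: count of t / total words in the document
    return d.count(t) / len(d)

def df(t):
    # ratio as defined on the slide: total documents / documents containing t
    n = sum(1 for d in tokenized if t in d)
    return len(tokenized) / n

# one weighted term vector D per document
vectors = [[tf(t, d) * math.log10(df(t)) for t in terms] for d in tokenized]
```

With base-10 logarithms the computed weights match the table: "weather" in document 1 rounds to 0.08, "idea" in document 2 to 0.12, "look" in document 3 to 0.16, and "great" gets weight 0 everywhere because it appears in every document.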
Human Computer Interface
• Command: voice, gesture, push button/key, expression, eye
• Data/sensation: visual, audio, pressure, smell (virtual environments)
HCI is a match-maker: matching the needs of humans and computers.
Basic HCI Design Principles
• Consistency: Same command means the same thing
• Intuition: Metaphor that is familiar to the user
• Adaptability: Adapt to user’s skill, style
• Economy: Use minimum efforts to achieve a goal
• Non-intrusive: Do not decide for user without asking
• Structure: Present only relevant information to user in a simple manner.
User Models
• User Profiles:– Categorize users using features relevant to tasks– Static features: age, sex, etc.– Dynamic features: activity logs, etc. – Derived features: skill levels, preferences, etc.
• Use of Profiles for HCI– Adaptation: Customize HCI for different category
of users– Better understanding of user’s needs
Principles of Dialogue Design
• Feedback: Always acknowledge user’s input
• Status: Always inform users where they are in the system
• Escape: Provide a graceful way to exit half way.
• Minimal Work: Minimize amount of input user must provide
• Default: Provide default values to minimize work
• Help: Context sensitive help
• Undo: Allow users to make unintentional mistakes and correct them
• Consistency: Same command means the same thing throughout the dialogue
• The document retrieval problem is a hypothesis testing problem:
H0: di is relevant to q (r = 1)
H1: di is irrelevant to q (r = 0)
• Type I error (Pe1 = P{r=0|H0}): relevant but not retrieved
• Type II error (Pe2 = P{r=1|H1}): irrelevant but retrieved
Contingency table for evaluating retrieval
Performance Evaluation
• Precision-recall curve
– P(recision) = w/(w+y) is a measure of the specificity of the result
– R(ecall) = w/(w+x) is an indicator of the completeness of the result
• Operating curve
– Pe1 = x/(w+x) = 1 - R
– Pe2 = y/(y+z) = F(allout)
• Expected search length = average # of documents that need to be examined to retrieve a given number of relevant documents
• Subjective criteria

              Retrieved   Not retrieved
Relevant         w             x
Irrelevant       y             z
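The quantities above follow directly from the contingency table; a minimal helper (the function name is illustrative) computes them:

```python
def retrieval_metrics(w, x, y, z):
    """Retrieval metrics from the 2x2 contingency table:
    w = relevant & retrieved,   x = relevant & not retrieved,
    y = irrelevant & retrieved, z = irrelevant & not retrieved."""
    precision = w / (w + y)  # specificity of the result
    recall = w / (w + x)     # completeness of the result
    fallout = y / (y + z)    # Type II error rate Pe2
    miss = x / (w + x)       # Type I error rate Pe1 = 1 - recall
    return precision, recall, fallout, miss
```

For example, with w=8, x=2, y=4, z=86: precision = 8/12, recall = 0.8, fallout = 4/90, and the miss rate Pe1 = 0.2 = 1 - recall.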
Example: MetaSEEk
• MetaSEEk – a meta-search engine
– Purpose: retrieving images
– Method: select and interface with multiple on-line image search engines
– Search principle: track the performance of different search engines and their search options for different classes of queries
A. B. Benitez, M. Beigi, and S.-F. Chang, Using Relevance Feedback in Content-Based Image Metasearch, IEEE Internet Computing, Vol. 2, No. 4, pp. 59-69, July/August 1998
Basic idea of MetaSEEk
• Classify the user queries into different clusters by their visual content.
• Rank the different search engines according to their performance for the different classes of user queries
• Select the search engines and search options according to their rank for the specific query cluster
• Display the search results to the user
• Update these performance scores according to user feedback
Overview-Basic components of a meta-search engine
Content-Based Visual Query (1)
• Advantage
– Ease of creating, capturing, and collecting digital imagery
• Approaches
– Extract significant features (color, texture, shape, structure)
– Organize feature vectors
– Compute the closeness of the feature vectors
– Retrieve matched or most similar images
Content-Based Visual Query (2): Improving Efficiency
• Keyword-based search
– Match images with particular subjects and narrow down the search scope
• Clustering
– Classify images into various categories based on their contents
• Indexing
– Applied to the image feature vectors to support efficient access to the database
Cluster the visual data
• K-means algorithm
– Simplicity
– Reduced computation
• Tamura features (for texture)
• For color, feature vectors are calculated from the color histogram
• Euclidean distance is used to compare feature vectors
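A bare-bones k-means with Euclidean distance, as one might cluster color-histogram feature vectors; the random initialization and fixed iteration count are simplistic, illustrative choices.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means with Euclidean distance over feature vectors
    (e.g. color histograms). Returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # assign each vector to its nearest center
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
            clusters[nearest].append(v)
        # recompute each center as the mean of its cluster
        centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters
```

On two well-separated groups of points the algorithm converges to a 3/3 split regardless of which pair of points seeds the centers.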
Conceptual structure of the meta-search database.
Multimedia summary and filtering
• Summary
– Text: email reading
– Image: caption generation
– Video: highlights, storyboard
• Issues
– Segmentation
– Clustering of segments
– Labeling clusters
– Associating segments with syntactic and semantic labels
• Filtering
– Same as retrieval: filter out irrelevant objects based on a given criterion (query)
– Often needs to be performed based on content features
– E.g. filtering traffic accidents or law violations from traffic monitoring videos
Content based Coding and Post-processing
• Different coding decisions based on low-level content features
– Coding mode (inter/intra selection)
– Motion estimation
• Object-based coding
– Encoding different regions (VOPs) separately
– Using different coders for different types of regions
• Multiple abstraction layer coding
– An analysis/synthesis approach
– Synthesize low-level content from a higher-level abstraction, e.g. texture synthesis
• Content-based post-processing
– Identify content types and then synthesize low-level content