Content Based Multimedia Signal Processing
Yu Hen Hu
University of Wisconsin – Madison
Outline
• Multimedia content description Interface (MPEG-7)
• Video content features
• Spoken content features
• Multimedia indexing and retrieval
• Multimedia summary, filtering
• Other applications
MPEG-7 Overview
• Large amounts of digital content are available
• It is easy to create, digitize, and distribute audio-visual content
• Family album syndrome – need to organize, index, and retrieve
• Information overload – need filtering
• MPEG-7 objective: provide interoperability among systems and applications used in the generation, management, distribution, and consumption of audio-visual content descriptions; help users identify, retrieve, or filter audio-visual information.
Potential Application of MPEG-7
• Summary
– Generation of multimedia program guides or content summaries
– Generation of content descriptions of A/V archives to allow seamless exchange among content creators, aggregators, and consumers
• Filtering
– Filter and transform multimedia streams in resource-limited environments by matching user preferences, available resources, and content descriptions
• Retrieval
– Recall music using samples of tunes
– Recall pictures using sketches of shape, color, or movement, or a description of a scenario
• Recommendation
– Recommend program materials by matching user preferences (profiles) to program content
• Indexing
– Create a family photo or video library index
Content descriptions
• Descriptors (Ds)
– MPEG-7 contains standardized descriptors for audio, visual, and generic content
– It standardizes how these content features are characterized, but not how they are extracted
– Different levels of syntactic and semantic description are available
• Description Schemes (DSs)
– Specify the structure and relations among different A/V descriptors
• Description Definition Language (DDL)
– A standardized language based on XML (eXtensible Markup Language) for defining new Ds and DSs, and for extending or modifying existing Ds and DSs
Visual Color Descriptors
• Color space: HSV (hue-saturation-value)
• Scalable color descriptor (SCD): a color histogram (uniform 256 bins) of an image in HSV space, encoded by a Haar transform
• Color layout descriptor: spatial distribution of color in an arbitrarily shaped region
• Dominant color descriptor (DCD): colors are clustered first
• Color structure descriptor (CSD): scan the image with an 8x8 sliding window and count the occurrences of each particular color within the window
• Group of Frames/Group of Pictures (GoF/GoP) color descriptor
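As a rough sketch of the histogram underlying the SCD, the snippet below quantizes HSV pixels into the 16x4x4 = 256-bin layout; the function name and bin ordering are illustrative assumptions, and the Haar-transform encoding step is omitted.

```python
def hsv_histogram(pixels, h_bins=16, s_bins=4, v_bins=4):
    """Uniform HSV histogram in a 16x4x4 = 256-bin layout (sketch of the
    histogram feeding the MPEG-7 scalable color descriptor, before Haar
    encoding). `pixels` is a list of (h, s, v) tuples with h in [0, 360)
    and s, v in [0, 1]."""
    hist = [0] * (h_bins * s_bins * v_bins)
    for h, s, v in pixels:
        hi = min(int(h / 360.0 * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist
```

A pure black pixel lands in bin 0 and a bright, saturated red-adjacent pixel in the last bin; the histogram sums to the number of pixels.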
Visual Texture Descriptor
• Texture browsing D.
– Regularity: 0 (irregular) to 3 (periodic)
– Directionality: up to 2 dominant directions, quantized 1–6 in 30° increments
– Coarseness: 0 (fine) to 3 (coarse)
• Edge histogram D.
– 16 sub-images
– 5 edge-direction bins per sub-image
• Homogeneous texture D. (HTD)
– Divide frequency space into 30 bins (5 radial, 6 angular)
– A 2D Gabor filter bank is applied to each bin
– Energy and energy deviation in each bin are computed to form the descriptor
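A minimal sketch of the block-classification step behind an edge histogram descriptor: each 2x2 block is labeled by its strongest response to five small edge filters (vertical, horizontal, two diagonals, non-directional). The filter coefficients and threshold here are illustrative choices, not normative values.

```python
import math

# Five 2x2 edge filters (row-major coefficients), one per edge-direction
# bin: vertical, horizontal, 45-degree, 135-degree, non-directional.
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag_45":         (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diag_135":        (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}

def edge_type(block, threshold=10.0):
    """Classify a 2x2 block (a0, a1, a2, a3, row-major pixel values) by
    its strongest filter response; return None when no response exceeds
    the threshold (i.e. the block is treated as edge-free)."""
    best, best_mag = None, threshold
    for name, (f0, f1, f2, f3) in FILTERS.items():
        mag = abs(block[0]*f0 + block[1]*f1 + block[2]*f2 + block[3]*f3)
        if mag > best_mag:
            best, best_mag = name, mag
    return best
```

Counting these labels over the blocks of each of the 16 sub-images yields the 16 x 5 = 80-bin histogram.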
Visual Shape Descriptor
• 3D shape D. – shape spectrum
– Histogram (100 bins, 12 bits/bin) of a shape index, computed over a 3D surface
– Each shape index measures local convexity
• Region-based shape D.: ART
– Angular radial transform: shape analysis based on moments
– ART basis functions: Vnm(ρ, θ) = exp(jmθ) Rn(ρ), where Rn(ρ) = 2 cos(πnρ) for n ≠ 0 and Rn(ρ) = 1 for n = 0
• Contour-based shape D.: curvature scale space (CSS)
– N points/curve, successively smoothed by the kernel [0.25 0.5 0.25] until the curve becomes convex
– The curvature at each point forms a curvature function at that scale
– Peaks across scales are used as features
• 2D/3D D.
– Use multiple 2D descriptors to describe a 3D shape
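The CSS smoothing step can be sketched as repeated convolution of a closed contour with the [0.25 0.5 0.25] kernel; the function below is a minimal illustration (the curvature computation and the convexity stopping test are omitted).

```python
def smooth_contour(points, iterations=1):
    """Apply the [0.25, 0.5, 0.25] kernel along a closed 2D contour, the
    smoothing used when building a curvature scale space representation.
    `points` is a list of (x, y) tuples; the contour wraps around."""
    n = len(points)
    for _ in range(iterations):
        points = [
            tuple(0.25 * points[(i - 1) % n][k]
                  + 0.5 * points[i][k]
                  + 0.25 * points[(i + 1) % n][k] for k in range(2))
            for i in range(n)
        ]
    return points
```

One pass pulls the corners of a unit square inward toward the centroid; because the kernel weights sum to 1, the contour's centroid is preserved while concavities are progressively smoothed away.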
Visual Motion Descriptor
• Motion activity D.
– Intensity
– Direction of activity
– Spatial distribution of activity
– Temporal distribution of activity
• Camera motion
– Panning
– Booming (lifting up)
– Tracking
– Tilting
– Zooming
– Rolling (around the image center)
– Dollying (backward)
• Warping (w.r.t. mosaic)
• Motion trajectory
[Figure: MPEG-7 motion descriptors – a video segment carries camera motion, motion activity, and a mosaic with warping parameters; a moving region carries a motion trajectory and parametric motion.]
MPEG-7 Audio Content Descriptors
• 4 classes of audio signals
– Pure music
– Pure speech
– Pure sound effects
– Arbitrary sound track
• Audio descriptors
– Silence Ds: silence type
– Sound effect Ds: audio spectrum, sound effect features
– Spoken content Ds: speaker type, link type, extraction info type, confusion info type
– Timbre Ds: instrument, harmonic instrument, percussive instrument
– Melody contour Ds: contour, meter, beat
Spoken content description
Goal: to support robust retrieval from the potentially erroneous transcriptions produced by an automatic speech recognition (ASR) system.
• Spoken content header
– Word lexicon (vocabulary)
– Phone lexicon: IPA (International Phonetic Association alphabet), SAMPA (Speech Assessment Methods Phonetic Alphabet)
– Phone confusion statistics
– Speaker information
• Spoken content lattice (word or phone)
– Lattice nodes
– Word and phone links
[Figure: the speech waveform passes through audio processing and ASR to produce a lattice; an MPEG-7 encoder turns it into a header plus lattice description. Links in the example lattice carry probabilities (e.g. 0.6, 0.7, 0.3).]
Use of Content Features
• Multimedia information retrieval
– Create searchable archives of A/V materials, e.g. albums, digital libraries
– Real-world examples: call routing, technical support, on-line manuals, shopping, multimedia on demand
• Filtering
– Automated email sorters
– Personalized information portals
• Enhance low-level signal processing
– Coding and transcoding
– Post-processing
Content-based Retrieval
[Figure: content-based retrieval architecture – an input module extracts features from multimedia data into an image database and a feature database; a query module supports interactive query formation by the user and extracts query features; a retrieval module performs feature comparison, produces output, and supports browsing and feedback.]
Multimedia CBR System Design Issues
• Requirement analysis
– How are the multimedia materials to be used?
– Determines what set of features is needed
• Archiving
– How should individual objects be stored? At what granularity?
• Indexing (query) and retrieving
– With multi-dimensional indices, what is an effective and efficient retrieval method?
– What is a suitable, perceptually consistent similarity measure?
• User interface
– Modality? Text, spoken language, or others?
– Interactive or batch? Will dialogue be available?
Multimedia Archiving
• Facts
– Multimedia data are often in compressed formats and need large storage space
– The content index also occupies storage space
• Issues
– Granularity must match the underlying file system
– Logical versus physical segmentation
– File allocation on the file system must support multiple stream access with low latency
Indexing and Retrieving
• Index
– A very high dimensional binary vector
– An encoding of content features
– Text-based content can be represented with term vectors
– A/V content features can be either Boolean vectors or term vectors
• Retrieval
– Retrieval is a pattern classification problem
– Use the index vector as the feature vector
– Classify each object as relevant or irrelevant to a query vector (template)
– A perceptually consistent similarity measure is essential
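Treating retrieval as classification against a query vector can be sketched with a similarity threshold. Cosine similarity is used here as one example of a similarity measure; the names `cosine` and `retrieve` and the threshold value are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, threshold=0.5):
    """Classify each indexed object as relevant (similarity at or above
    the threshold) or irrelevant to the query vector, and return the
    relevant ones ranked by similarity."""
    hits = []
    for i, v in enumerate(index):
        s = cosine(query, v)
        if s >= threshold:
            hits.append((i, s))
    return sorted(hits, key=lambda t: -t[1])
```

For a query [1, 0] against the index [[1, 0], [0, 1], [1, 1]], objects 0 and 2 are returned as relevant while the orthogonal object 1 is filtered out.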
Term Vector Query
• Each document is represented by a specific term vector
• A term is a keyword or a phrase
• A term vector is a vector of terms; each dimension of the vector corresponds to a term
• Dimension of a term vector = total number of distinct terms
• Example:
Set of terms = [tree, cake, happy, cry, mother, father, big, small]
Documents = "Father gives me a big cake. I am so happy", "Mother planted a small tree"
Term vectors: [0, 1, 1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 1, 0, 0, 1]
• A probabilistic term vector representation:
– Relative term frequency (within a document): tf(t,d) = count of term t / # of terms in document d
– Inverse document frequency: df(t) = total # of documents / # of documents containing t
– Weighted term frequency: dt = tf(t,d) · log[df(t)]
– Inverse document frequency term vector: D = [d1, d2, …]
Inverse Term Frequency Vector
ITF Vector Example
Document 1: The weather is great these days.
Document 2: These are great ideas
Document 3: You look great
Eliminate: The, is, these, are, you
Term      tf(t,1)  tf(t,2)  tf(t,3)  df(t)   D1     D2     D3
weather   1/6      0        0        3       0.08   0.00   0.00
great     1/6      1/4      1/3      1       0.00   0.00   0.00
day       1/6      0        0        3       0.08   0.00   0.00
idea      0        1/4      0        3       0.00   0.12   0.00
look      0        0        1/3      3       0.00   0.00   0.16
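A small sketch reproducing the example above with the slide's definitions (tf = term count / total words in the document, df = total documents / documents containing the term, weight = tf · log10(df)). The stop-word list and the tiny stemming map are hard-coded for this example only.

```python
import math

docs = [
    "the weather is great these days",
    "these are great ideas",
    "you look great",
]
stop = {"the", "is", "these", "are", "you"}
stem = {"days": "day", "ideas": "idea"}  # crude stemming for this example

tokenized = [[stem.get(w, w) for w in d.split()] for d in docs]
terms = sorted({w for d in tokenized for w in d if w not in stop})

def tf(t, d):
    # relative term frequency: count of t / total words in the document
    return d.count(t) / len(d)

def df(t):
    # ratio as defined on the slide: total documents / documents containing t
    n = sum(1 for d in tokenized if t in d)
    return len(tokenized) / n

# one weighted term vector D per document
vectors = [[tf(t, d) * math.log10(df(t)) for t in terms] for d in tokenized]
```

With base-10 logarithms the computed weights match the table: "weather" in document 1 rounds to 0.08, "idea" in document 2 to 0.12, "look" in document 3 to 0.16, and "great" gets weight 0 everywhere because it appears in every document.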
Human Computer Interface
• Command: voice, gesture, push button/key, expression, eye
• Data/sensation: visual, audio, pressure, smell (virtual environments)
HCI is a match-maker: matching the needs of humans and computers.
Basic HCI Design Principles
• Consistency: Same command means the same thing
• Intuition: Metaphor that is familiar to the user
• Adaptability: Adapt to user’s skill, style
• Economy: Use minimum efforts to achieve a goal
• Non-intrusive: Do not decide for user without asking
• Structure: Present only relevant information to user in a simple manner.
User Models
• User Profiles:– Categorize users using features relevant to tasks– Static features: age, sex, etc.– Dynamic features: activity logs, etc. – Derived features: skill levels, preferences, etc.
• Use of Profiles for HCI– Adaptation: Customize HCI for different category
of users– Better understanding of user’s needs
Principles of Dialogue Design
• Feedback: Always acknowledge user’s input
• Status: Always inform users where they are in the system
• Escape: Provide a graceful way to exit half way.
• Minimal Work: Minimize amount of input user must provide
• Default: Provide default values to minimize work
• Help: Context sensitive help
• Undo: Allow users to make unintentional mistakes and correct them
• Consistency: Same command means the same thing throughout the dialogue
• The document retrieval problem is a hypothesis testing problem:
H0: di is relevant to q (r = 1)
H1: di is irrelevant to q (r = 0)
• Type I error (Pe1 = P{r=0|H0}): relevant but not retrieved
• Type II error (Pe2 = P{r=1|H1}): irrelevant but retrieved
Contingency table for evaluating retrieval
Performance Evaluation
• Precision-recall curve
– P(recision) = w/(w+y) is a measure of the specificity of the result
– R(ecall) = w/(w+x) is an indicator of the completeness of the result
• Operating curve
– Pe1 = x/(w+x) = 1 - R
– Pe2 = y/(y+z) = F(allout)
• Expected search length = average # of documents that need to be examined to retrieve a given number of relevant documents
• Subjective criteria

              Retrieved   Not retrieved
Relevant         w             x
Irrelevant       y             z
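The quantities above follow directly from the contingency table; a minimal helper (the function name is illustrative) computes them:

```python
def retrieval_metrics(w, x, y, z):
    """Retrieval metrics from the 2x2 contingency table:
    w = relevant & retrieved,   x = relevant & not retrieved,
    y = irrelevant & retrieved, z = irrelevant & not retrieved."""
    precision = w / (w + y)  # specificity of the result
    recall = w / (w + x)     # completeness of the result
    fallout = y / (y + z)    # Type II error rate Pe2
    miss = x / (w + x)       # Type I error rate Pe1 = 1 - recall
    return precision, recall, fallout, miss
```

For example, with w=8, x=2, y=4, z=86: precision = 8/12, recall = 0.8, fallout = 4/90, and the miss rate Pe1 = 0.2 = 1 - recall.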
Example: MetaSEEk
• MetaSEEk – a meta-search engine
– Purpose: retrieving images
– Method: select and interface with multiple on-line image search engines
– Search principle: track the performance of different search engines and their search options for different classes of queries
A. B. Benitez, M. Beigi, and S.-F. Chang, Using Relevance Feedback in Content-Based Image Metasearch, IEEE Internet Computing, Vol. 2, No. 4, pp. 59-69, July/August 1998
Basic idea of MetaSEEk
• Classify the user queries into different clusters by their visual content.
• Rank the different search engines according to their performance for the different classes of user queries
• Select the search engines and search options according to their rank for the specific query cluster
• Display the search results to the user
• Update these performance scores according to user feedback
Overview-Basic components of a meta-search engine
Content-Based Visual Query (1)
• Advantage
– Ease of creating, capturing, and collecting digital imagery
• Approaches
– Extract significant features (color, texture, shape, structure)
– Organize feature vectors
– Compute the closeness of the feature vectors
– Retrieve matched or most similar images
Content-Based Visual Query (2): Improving Efficiency
• Keyword-based search
– Match images with particular subjects and narrow down the search scope
• Clustering
– Classify images into various categories based on their contents
• Indexing
– Applied to the image feature vectors to support efficient access to the database
Cluster the visual data
• K-means algorithm
– Simplicity
– Reduced computation
• Tamura features (for texture)
• For color, feature vectors are calculated from the color histogram
• Euclidean distance is used to compare feature vectors
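A bare-bones k-means with Euclidean distance, as one might cluster color-histogram feature vectors; the random initialization and fixed iteration count are simplistic, illustrative choices.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means with Euclidean distance over feature vectors
    (e.g. color histograms). Returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # assign each vector to its nearest center
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
            clusters[nearest].append(v)
        # recompute each center as the mean of its cluster
        centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters
```

On two well-separated groups of points the algorithm converges to a 3/3 split regardless of which pair of points seeds the centers.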
Conceptual structure of the meta-search database.
Multimedia summary and filtering
• Summary
– Text: email reading
– Image: caption generation
– Video: highlights, storyboard
• Issues
– Segmentation
– Clustering of segments
– Labeling clusters
– Associating segments with syntactic and semantic labels
• Filtering
– Same as retrieval: filter out irrelevant objects based on a given criterion (query)
– Often needs to be performed based on content features
– E.g. filtering traffic accidents or law violations from traffic monitoring videos
Content based Coding and Post-processing
• Different coding decisions based on low-level content features
– Coding mode (inter/intra selection)
– Motion estimation
• Object-based coding
– Encoding different regions (VOPs) separately
– Using different coders for different types of regions
• Multiple abstraction layer coding
– An analysis/synthesis approach
– Synthesize low-level content from a higher-level abstraction, e.g. texture synthesis
• Content-based post-processing
– Identify content types and then synthesize low-level content