Download - 1 Image Video & Multimedia Systems Laboratory Multimedia Knowledge Laboratory Informatics and Telematics Institute Exploitation of knowledge in video recordings.

1Image Video & Multimedia Systems LaboratoryMultimedia Knowledge LaboratoryInformatics and Telematics Institute

Exploitation of knowledge in video recordings

Dr. Alexia Briassouli, Dr. Yiannis Kompatsiaris

Multimedia Knowledge LaboratoryCERTH-ITI

October 24, 2008Thessaloniki, Greece

22

Image Video & Multimedia Systems LaboratoryMultimedia Knowledge LaboratoryInformatics and Telematics Institute

Evolution of Content• 1-2 exabytes (millions of

terabytes) of new information produced world-wide annually

• 80 billion of digital images are captured each year

• Over 1 billion images related to commercial transactions are available through the Internet

• This number is estimated to increase by ten times in the next two years.

• 4 000 new films are produced each year

• 300 000 world-wide available films

• 33 000 television stations and 43 000 radio stations

• 100 billions of hours of audiovisual content

Personal Content

Sport - News

Movies

Web Mobile

33


DIRECTOR

SCENE

TAKE

TITLE

Multimedia Content

Networks

Storage & Devices

Segmentation KA

AnalysisLabeling

Cross-media analysis

Context

Reasoning

Metadata Generation &

Representation

Content adaptation and distribution -

Multiple Terminal & Networks

Hybrid / Content-based retrieval

recommendations and personalization

Semantic technology in

MarketsWeb 2.0 photo

- video applications

44


Need for annotation + medatata

“The value of information depends on how easily it can be

found, retrieved, accessed, filtered or managed in an active, personalized way”


Video Analysis

Video analysis that exploits knowledge provides significant advantages:Improved accuracy of semantics from videoHigher level concepts inferred through

exploitation of knowledge combined with video processing:Knowledge about behavior, event detection

More efficient storage, access, retrieval, dissemination of multimodal data because of the (automatically generated) annotations


Video Analysis in JUMAS

77


Text-based indexing

• Manual annotation • + Straightforward• + High/Semantic level• + Efficient during

content creation• Most commonly used• Necessary in a number

of applications• - Time consuming• - Operator-application

dependent• - Text related problems

(synonyms etc)

• Annotation using captions and related

text• Web, Video, Documents etc

• + Straightforward• + High/Semantic

level• + Multimodal

approach• - Text processing

restrictions and limitations

• - Captions must exist

88


Addressing the Semantic GapSemantic Gap

• Semantic Gap for multimedia: To map automatically generated numerical low level-features to higher level human-understandable semantic concepts<?xml version='1.0' encoding='ISO-8859-1' ?>

<Mpeg7 xmlns…> <DescriptionUnit xsi:type = "DescriptorCollectionType">

<Descriptor xsi:type = "DominantColorType"> <SpatialCoherency>31</SpatialCoherency>

<Value> <Percentage>31</Percentage>

<Index>19 23 29 </Index> <ColorVariance>0 0 0 </ColorVariance>

</Value> </Descriptor>

</DescriptionUnit></Mpeg7>

<?xml version='1.0' encoding='ISO-8859-1' ?><Mpeg7 xmlns…>

<DescriptionUnit xsi:type = "DescriptorCollectionType"> <Descriptor xsi:type = "DominantColorType"> <SpatialCoherency>31</SpatialCoherency>

<Value> <Percentage>31</Percentage>

<Index>19 23 29 </Index> <ColorVariance>0 0 0 </ColorVariance>

</Value> </Descriptor>

</DescriptionUnit></Mpeg7>

Dominant Color Descriptor of a sky region

This image contains a sky region and is a holiday image

99


Problem definition

• Semantic image analysis: how to translate the automatically extracted visual descriptions into human like conceptual ones

• Low-level features provide cues for strengthen/weaken evidence based on visual similarity

• Prior knowledge is needed to support semantics disambiguation

1010


Additional Analysis

Information

Knowledge Infrastructure (Multimedia Ontology)

Manual Annotation

- Models

Semantic Analysis

Single Modality Analysis

Knowledge ExtractionA common view

Feature extractionText, Image

analysisSegmentation,

SVMsEvidence

generation“Vehicle”, “Building”

Classifiers fusionGlobal vs. Local

Modalities fusionContext

“Ambulance”

ReasoningFusion of

annotationsConsistency

checkingHigher-level

concepts/events“Emergency scene”

Multimedia content

annotation toolsTraining

(Statistical) Modeling

Domain Multimedia content

AnnotationsAlgorithms -

FeaturesContext


Knowledge from Video analysis

Semantics from videoImplicitly derived via machine learning

methods i.e. based on training: SVM, HMM, Neural Networks, Bayesian

Networks

Training uses appropriate data, relevant to the semantics that interest us

Training finds models that connect low level features (e.g. motion trajectories) with high-level annotations

These models are then applied to test data

1212


Natural-Person: 0.456798Sailing-Boat: 0.463645

Sand: 0.476777Building: 0.415358

Pavement: 0.454740Road: 0.503242

Body-Of-Water: 0.489957Cliff: 0.472907

Cloud: 0.757926Mountain: 0.512597

Sea: 0.455338Sky: 0.658825

Stone: 0.471733Waterfall: 0.500000

Wave: 0.476669Dried-Plant: 0.494825

Dried-Plant-Snowed: 0.476524Foliage: 0.497562Grass: 0.491781Tree: 0.447355

Trunk: 0.493255Snow: 0.467218

Sunset: 0.503164Car: 0.456347

Ground: 0.454769Lamp-Post: 0.499387

Statue: 0.501076

Classification ResultsaceMedia

Segment’s hypothesis

set

1313


Frame Region – Concept AssociationFrame Region – Concept Association• Region feature vectorRegion feature vector formed from local descriptors formed from local descriptors• Individual SVM introduced for every defined local Individual SVM introduced for every defined local

concept, receiving as input the concept, receiving as input the region feature vectorregion feature vector• Training identical to global concept training caseTraining identical to global concept training case• Every region evaluated by all trained SVMs, segment’s Every region evaluated by all trained SVMs, segment’s

local concept hypothesis set created ( )local concept hypothesis set created ( )hijL

Ground: 0.89 Grass: 0.44Mountain: 0.21 Boat: 0.07Smoke: 0.41 Dirty-Water: 0.18Trunk: 0.12 Foam: 0.19Debris: 0.34 Mud: 0.31Water: 0.42 Sky: 0.22Ashes: 0.11 Subtitles: 0.24Flames: 0.13 Vehicle: 0.12Building: 0. 25 Foliage: 0.84Person: 0.32 Road: 0.39

Segment’s hypothesis set

1414


Initial Region-Concept Association• Region feature vector formed from local descriptors

• Individual SVM introduced for every defined concept, receiving as input the region feature vector

• Training identical to global training case• Every region evaluated by all trained SVMs,

segment’s concept hypothesis set created ( )

Building: 0.89 Roof: 0.29Grass: 0.21 Tree: 0.07Stone: 0.41 Ground: 0.15Dried-plant: 0.12 Sky: 0.19Person: 0.34 Trunk: 0.31Vegetation: 0.42 Rock: 0.22Boat: 0.11 Sand: 0.44Sea: 0.13 Wave: 0.12

Segment’s hypothesis set

hijC


Knowledge for Video analysisExplicit Semantics from video

Based on previously known models Explicitly defined models, rules, facts Rules from preliminary scripts and

standards from similar cases

Explicit and implicit knowledge can be combined with results from low-level video processing to extract meaningful high-level knowledge


System Overview

Multimedia Content

Video Analysis (face recognition, motion segmentation etc)

Knowledge

Infrastructure (explicit or

implicit)

Semantic Multimedia Description


Video analysis

Motion AnalysisMotion detectionTrackingDetection of when motion occurs

Motion SegmentationObject segmentation based on motion

characteristicsGeneration of ‘active regions’


Activity Areas from motion analysis


Sub-activity Areas After statistical processing for temporal

localization of motion and events

People walking towards each other People meet

People leave together

2020


Fight Sequence


Video Processing (1) Pre-processing

Separate video from audioSplit video into framesNoise removal via spatiotemporal filtering

Scene/shot detectionShot = frames taken by single cameraDetect transition between framesUses only low-level informationScene = story-telling unitUses higher-level knowledge, semantics


Video Processing (2)Spatial segmentation:

Spatial segmentation in images, video framesExtracts object(s) based on color, texture

features

Motion segmentation:Groups pixels with similar motion

Spatiotemporal segmentation:Finds objects over several frames through

combination of motion, appearance featuresMerges spatial and motion segmentation

results


Knowledge in Video Analysis (1)

Low level features can be combined with knowledge/rules for higher-level results Spatiotemporally segmented objects can

be used for object recognition Face/gesture recognition after training with

faces/gestures of significance Motion in specific parts of a video (e.g.

near court entrance, near prisoner’s seat) has additional significance:Needs prior knowledge of which parts of the video

frames are important and why


Knowledge in Video Analysis (2)

Knowledge structures can provide additional information about the relations between different low-level features Interactions e.g. two motions in opposite

directions, relation of extracted gestures, may mean something: people meeting, fighting, pointing, gesticulating

Face recognition combined with prior knowledge can show who is present when an event occurs


Conclusions

Combined use of video processing with knowledge can lead to richer and more accurate high-level descriptions of multimedia data

Can be used in many more applications than currently, because the knowledge introduces flexibility and adaptability to the system: The same algorithms and low-level

features can provide much more information when used in combination with explicit and implicit knowledge


26

Thank you!CERTH-ITI / Multimedia Knowledge

Laboratoryhttp://mklab.iti.gr


Video Analysis State of the Art

Spatiotemporal segmentation:Find spatiotemporally homogeneous objects

i.e. similar appearance and motionApply spatial segmentation on each frame Match segmented objects in successive

frames using low-level features (e.g. similar color, texture, continuous motion)

Use motion information – project position of object in current/next frames



Spatial segmentation: Spatial segmentation in images, video frames: Region Based: Most methods are based on grouping

similar features like color, texture, location – based on homogeneity of intensity, texture, position

Gradient/edge based: detecting changes in spatial distribution of features e.g. pixel illumination

Some methods combine region/edge information