Image Video & Multimedia Systems Laboratory
Multimedia Knowledge Laboratory
Informatics and Telematics Institute
Exploitation of knowledge in video recordings
Dr. Alexia Briassouli, Dr. Yiannis Kompatsiaris
Multimedia Knowledge Laboratory, CERTH-ITI
October 24, 2008, Thessaloniki, Greece
Evolution of Content
• 1-2 exabytes (millions of terabytes) of new information produced worldwide annually
• 80 billion digital images are captured each year
• Over 1 billion images related to commercial transactions are available through the Internet
• This number is estimated to increase tenfold in the next two years
• 4,000 new films are produced each year
• 300,000 films available worldwide
• 33,000 television stations and 43,000 radio stations
• 100 billion hours of audiovisual content
Personal Content
Sport - News
Movies
Web Mobile
DIRECTOR
SCENE
TAKE
TITLE
Multimedia Content
Networks
Storage & Devices
Segmentation
KA Analysis
Labeling
Cross-media analysis
Context
Reasoning
Metadata Generation & Representation
Content adaptation and distribution - Multiple Terminals & Networks
Hybrid / Content-based retrieval
Recommendations and personalization
Semantic technology in Markets
Web 2.0 photo - video applications
Need for annotation + metadata
“The value of information depends on how easily it can be
found, retrieved, accessed, filtered or managed in an active, personalized way”
Video Analysis
Video analysis that exploits knowledge provides significant advantages:
• Improved accuracy of semantics from video
• Higher-level concepts inferred through exploitation of knowledge combined with video processing:
  • Knowledge about behavior, event detection
• More efficient storage, access, retrieval, dissemination of multimodal data because of the (automatically generated) annotations
Video Analysis in JUMAS
Text-based indexing
• Manual annotation
  • + Straightforward
  • + High/Semantic level
  • + Efficient during content creation
  • Most commonly used
  • Necessary in a number of applications
  • - Time consuming
  • - Operator-application dependent
  • - Text related problems (synonyms etc)
• Annotation using captions and related text
  • Web, Video, Documents etc
  • + Straightforward
  • + High/Semantic level
  • + Multimodal approach
  • - Text processing restrictions and limitations
  • - Captions must exist
Addressing the Semantic Gap
• Semantic Gap for multimedia: To map automatically generated numerical low-level features to higher-level human-understandable semantic concepts

<?xml version='1.0' encoding='ISO-8859-1' ?>
<Mpeg7 xmlns…>
  <DescriptionUnit xsi:type = "DescriptorCollectionType">
    <Descriptor xsi:type = "DominantColorType">
      <SpatialCoherency>31</SpatialCoherency>
      <Value>
        <Percentage>31</Percentage>
        <Index>19 23 29 </Index>
        <ColorVariance>0 0 0 </ColorVariance>
      </Value>
    </Descriptor>
  </DescriptionUnit>
</Mpeg7>
Dominant Color Descriptor of a sky region
This image contains a sky region and is a holiday image
Problem definition
• Semantic image analysis: how to translate the automatically extracted visual descriptions into human-like conceptual ones
• Low-level features provide cues that strengthen or weaken evidence based on visual similarity
• Prior knowledge is needed to support disambiguation of semantics
Knowledge Extraction: A common view
[Diagram labels]
Additional Analysis Information
Knowledge Infrastructure (Multimedia Ontology)
Manual Annotation - Models
Semantic Analysis
Single Modality Analysis
Feature extraction
Text, Image analysis
Segmentation, SVMs
Evidence generation
“Vehicle”, “Building”
Classifiers fusion
Global vs. Local
Modalities fusion
Context
“Ambulance”
Reasoning
Fusion of annotations
Consistency checking
Higher-level concepts/events
“Emergency scene”
Multimedia content annotation tools
Training
(Statistical) Modeling
Domain
Multimedia content
Annotations
Algorithms - Features
Context
Knowledge from Video analysis
Semantics from video
• Implicitly derived via machine learning methods, i.e. based on training: SVM, HMM, Neural Networks, Bayesian Networks
• Training uses appropriate data, relevant to the semantics that interest us
• Training finds models that connect low-level features (e.g. motion trajectories) with high-level annotations
• These models are then applied to test data
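
The train-then-apply cycle can be sketched in a few lines. This is only an illustration: a nearest-centroid classifier stands in for the SVM/HMM/network models named on the slide, and the two features (mean speed, direction variance) and the labels "standing"/"walking" are hypothetical.

```python
from math import dist

def train_centroids(samples):
    """Training step: average the feature vectors of each annotation label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, features):
    """Test step: return the label whose centroid is closest to the features."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical training data: (mean speed, direction variance) -> annotation
training_data = [
    ((0.2, 0.1), "standing"),
    ((0.3, 0.2), "standing"),
    ((1.4, 0.2), "walking"),
    ((1.6, 0.3), "walking"),
]
model = train_centroids(training_data)
test_label = predict(model, (1.5, 0.25))
print(test_label)
```

Any trained model with the same two steps (fit on annotated features, then score unseen features) fits this pattern.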
Classification Results
aceMedia
Segment’s hypothesis set:
Natural-Person: 0.456798
Sailing-Boat: 0.463645
Sand: 0.476777
Building: 0.415358
Pavement: 0.454740
Road: 0.503242
Body-Of-Water: 0.489957
Cliff: 0.472907
Cloud: 0.757926
Mountain: 0.512597
Sea: 0.455338
Sky: 0.658825
Stone: 0.471733
Waterfall: 0.500000
Wave: 0.476669
Dried-Plant: 0.494825
Dried-Plant-Snowed: 0.476524
Foliage: 0.497562
Grass: 0.491781
Tree: 0.447355
Trunk: 0.493255
Snow: 0.467218
Sunset: 0.503164
Car: 0.456347
Ground: 0.454769
Lamp-Post: 0.499387
Statue: 0.501076
Frame Region – Concept Association
• Region feature vector formed from local descriptors
• Individual SVM introduced for every defined local concept, receiving as input the region feature vector
• Training identical to global concept training case
• Every region evaluated by all trained SVMs, segment’s local concept hypothesis set created (h_ij^L)

Segment’s hypothesis set:
Ground: 0.89
Grass: 0.44
Mountain: 0.21
Boat: 0.07
Smoke: 0.41
Dirty-Water: 0.18
Trunk: 0.12
Foam: 0.19
Debris: 0.34
Mud: 0.31
Water: 0.42
Sky: 0.22
Ashes: 0.11
Subtitles: 0.24
Flames: 0.13
Vehicle: 0.12
Building: 0.25
Foliage: 0.84
Person: 0.32
Road: 0.39
Initial Region-Concept Association
• Region feature vector formed from local descriptors
• Individual SVM introduced for every defined concept, receiving as input the region feature vector
• Training identical to global training case
• Every region evaluated by all trained SVMs, segment’s concept hypothesis set created (h_ij^C)

Segment’s hypothesis set:
Building: 0.89
Roof: 0.29
Grass: 0.21
Tree: 0.07
Stone: 0.41
Ground: 0.15
Dried-plant: 0.12
Sky: 0.19
Person: 0.34
Trunk: 0.31
Vegetation: 0.42
Rock: 0.22
Boat: 0.11
Sand: 0.44
Sea: 0.13
Wave: 0.12
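
A hypothesis set of this kind could be assembled roughly as follows. The sketch assumes one scoring function per concept and evaluates every region with all of them. The scorers here are hand-set linear functions with a logistic squashing, used in place of trained SVMs, and the three features (blueness, greenness, vertical position) and all weights are hypothetical.

```python
from math import exp

def make_scorer(weights, bias):
    """Build a per-concept scorer mapping a feature vector to a confidence in (0, 1)."""
    def score(features):
        margin = sum(w * x for w, x in zip(weights, features)) + bias
        return 1.0 / (1.0 + exp(-margin))
    return score

# One scorer per concept (hypothetical weights, standing in for trained SVMs).
# Features: (blueness, greenness, vertical position with 0 = top, 1 = bottom)
concept_scorers = {
    "Sky": make_scorer((4.0, -2.0, -3.0), 0.0),
    "Sea": make_scorer((4.0, -2.0, 3.0), -3.0),
    "Grass": make_scorer((-2.0, 4.0, 1.0), -1.0),
}

def hypothesis_set(region_features):
    """Evaluate the region with every concept scorer."""
    return {concept: scorer(region_features)
            for concept, scorer in concept_scorers.items()}

region = (0.9, 0.1, 0.1)  # a blue region near the top of the frame
hypotheses = hypothesis_set(region)
best = max(hypotheses, key=hypotheses.get)
print(best, round(hypotheses[best], 2))
```

Keeping the full set, not just the top-scoring concept, lets later steps use knowledge to re-rank ambiguous regions.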
Knowledge for Video analysis
Explicit semantics from video
• Based on previously known models
• Explicitly defined models, rules, facts
• Rules from preliminary scripts and standards from similar cases
Explicit and implicit knowledge can be combined with results from low-level video processing to extract meaningful high-level knowledge
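
One way to picture this combination: classifier confidences (implicit knowledge) are thresholded into facts, and explicit rules then derive higher-level concepts from them. In the sketch below the rules, the "Siren" cue, the confidence values and the 0.6 threshold are all hypothetical. Only the labels "Vehicle", "Building", "Ambulance" and "Emergency scene" come from the knowledge extraction diagram.

```python
# Hypothetical explicit rules: (required labels, concluded higher-level concept)
RULES = [
    ({"Vehicle", "Siren"}, "Ambulance"),
    ({"Ambulance", "Building"}, "Emergency scene"),
]

def confident_labels(hypotheses, min_confidence=0.6):
    """Keep only labels whose classifier confidence reaches the threshold."""
    return {label for label, conf in hypotheses.items() if conf >= min_confidence}

def infer(detected_labels, rules=RULES):
    """Forward chaining: apply the rules until no new concept can be added."""
    facts = set(detected_labels)
    changed = True
    while changed:
        changed = False
        for required, conclusion in rules:
            if required <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Hypothetical output of low-level analysis plus trained classifiers
classifier_output = {"Vehicle": 0.82, "Building": 0.71, "Siren": 0.65, "Tree": 0.20}
facts = infer(confident_labels(classifier_output))
print(sorted(facts))
```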
System Overview
Multimedia Content
Video Analysis (face recognition, motion segmentation etc)
Knowledge Infrastructure (explicit or implicit)
Semantic Multimedia Description
Video analysis
• Motion Analysis
  • Motion detection
  • Tracking
  • Detection of when motion occurs
• Motion Segmentation
  • Object segmentation based on motion characteristics
  • Generation of ‘active regions’
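
Motion detection and an 'active region' can be illustrated with plain frame differencing. This is a simple stand-in, not necessarily the method used in the presented work. Frames are assumed to be small grayscale grids, and the threshold of 20 is arbitrary.

```python
def motion_mask(prev_frame, next_frame, threshold=20):
    """Mark pixels whose intensity changed by more than the threshold."""
    return [[abs(b - a) > threshold for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev_frame, next_frame)]

def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) of the moving pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, moving in enumerate(row) if moving]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))

# Two 4x5 toy frames: a static background and two changed pixels
frame_a = [[10] * 5 for _ in range(4)]
frame_b = [row[:] for row in frame_a]
frame_b[1][2] = 200
frame_b[2][3] = 180

mask = motion_mask(frame_a, frame_b)
active_region = bounding_box(mask)
print(active_region)  # (1, 2, 2, 3)
```

A frame pair with no motion yields `None`, which also answers the question of when motion occurs.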
Activity Areas from motion analysis
Sub-activity Areas
After statistical processing for temporal localization of motion and events:
• People walking towards each other
• People meet
• People leave together
Fight Sequence
Video Processing (1)
• Pre-processing
  • Separate video from audio
  • Split video into frames
  • Noise removal via spatiotemporal filtering
• Scene/shot detection
  • Shot = frames taken by single camera
    • Detect transition between frames
    • Uses only low-level information
  • Scene = story-telling unit
    • Uses higher-level knowledge, semantics
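
Shot detection from low-level information alone can be sketched by comparing intensity histograms of consecutive frames. Histogram differencing is one common choice, assumed here for illustration. The bin count, the 0.5 threshold and the toy frames are hypothetical.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a grayscale frame."""
    counts = [0] * bins
    for row in frame:
        for pixel in row:
            counts[pixel * bins // max_value] += 1
    total = sum(counts)
    return [c / total for c in counts]

def shot_boundaries(frames, threshold=0.5):
    """Indices i where frame i starts a new shot (large histogram change)."""
    hists = [histogram(f) for f in frames]
    boundaries = []
    for i in range(1, len(hists)):
        # Half the L1 distance between histograms lies in [0, 1]
        diff = sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i])) / 2
        if diff > threshold:
            boundaries.append(i)
    return boundaries

dark_1 = [[20, 25], [30, 22]]
dark_2 = [[22, 24], [28, 21]]
bright = [[220, 230], [240, 225]]
cuts = shot_boundaries([dark_1, dark_2, bright, bright])
print(cuts)  # [2]
```

Grouping shots into scenes is not covered by this sketch, since that step needs higher-level knowledge.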
Video Processing (2)
• Spatial segmentation:
  • Spatial segmentation in images, video frames
  • Extracts object(s) based on color, texture features
• Motion segmentation:
  • Groups pixels with similar motion
• Spatiotemporal segmentation:
  • Finds objects over several frames through combination of motion, appearance features
  • Merges spatial and motion segmentation results
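
Grouping pixels with similar motion can be sketched as a flood fill over a grid of motion vectors. The sketch assumes a per-pixel flow field is already available, and the tolerance value and toy field are hypothetical.

```python
def segment_by_motion(flow, tolerance=0.5):
    """Label 4-connected pixels whose motion vectors differ by at most `tolerance`.

    flow: 2D grid of (dx, dy) motion vectors. Returns a grid of integer labels.
    """
    height, width = len(flow), len(flow[0])
    labels = [[None] * width for _ in range(height)]
    next_label = 0
    for r in range(height):
        for c in range(width):
            if labels[r][c] is not None:
                continue
            labels[r][c] = next_label
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < height and 0 <= nx < width and labels[ny][nx] is None:
                        dx = flow[ny][nx][0] - flow[y][x][0]
                        dy = flow[ny][nx][1] - flow[y][x][1]
                        if (dx * dx + dy * dy) ** 0.5 <= tolerance:
                            labels[ny][nx] = next_label
                            stack.append((ny, nx))
            next_label += 1
    return labels

still, right = (0.0, 0.0), (2.0, 0.0)
flow = [
    [still, still, right, right],
    [still, still, right, right],
]
labels = segment_by_motion(flow)
print(labels)  # [[0, 0, 1, 1], [0, 0, 1, 1]]
```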
Knowledge in Video Analysis (1)
• Low-level features can be combined with knowledge/rules for higher-level results
  • Spatiotemporally segmented objects can be used for object recognition
  • Face/gesture recognition after training with faces/gestures of significance
  • Motion in specific parts of a video (e.g. near court entrance, near prisoner’s seat) has additional significance:
    • Needs prior knowledge of which parts of the video frames are important and why
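
Prior knowledge about important frame areas can be encoded as named zones and checked against the bounding box of detected motion. The zone names follow the courtroom example above, while the pixel coordinates are hypothetical.

```python
# Hypothetical zones, each as (x_min, y_min, x_max, y_max) in pixels
ZONES = {
    "court entrance": (0, 0, 100, 200),
    "prisoner's seat": (300, 150, 400, 300),
}

def overlaps(a, b):
    """True if two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def significant_zones(motion_box, zones=ZONES):
    """Names of the known zones that the detected motion touches."""
    return [name for name, box in zones.items() if overlaps(motion_box, box)]

hits = significant_zones((80, 50, 140, 120))
print(hits)  # ['court entrance']
```

The same motion detector output gains meaning only once the zones and their significance are supplied as knowledge.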
Knowledge in Video Analysis (2)
• Knowledge structures can provide additional information about the relations between different low-level features
  • Interactions e.g. two motions in opposite directions, relation of extracted gestures, may mean something: people meeting, fighting, pointing, gesticulating
  • Face recognition combined with prior knowledge can show who is present when an event occurs
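
The interaction case (two motions in opposite directions) can be expressed as a small rule over two tracked trajectories. The rule below is an assumed, simplified formulation: the tracks move in opposing directions and end closer together than they started. The toy tracks are hypothetical.

```python
from math import dist

def displacement(track):
    """Overall movement of a track, from its first to its last position."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    return (x1 - x0, y1 - y0)

def approaching(track_a, track_b):
    """True if the tracks move in opposing directions and get closer."""
    da, db = displacement(track_a), displacement(track_b)
    opposite = da[0] * db[0] + da[1] * db[1] < 0  # negative dot product
    closing = dist(track_a[-1], track_b[-1]) < dist(track_a[0], track_b[0])
    return opposite and closing

person_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
person_b = [(10, 0), (9, 0), (8, 0), (7, 0)]
event = ("people walking towards each other"
         if approaching(person_a, person_b) else "no interaction")
print(event)
```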
Conclusions
• Combined use of video processing with knowledge can lead to richer and more accurate high-level descriptions of multimedia data
• Can be used in many more applications than currently, because the knowledge introduces flexibility and adaptability to the system:
  • The same algorithms and low-level features can provide much more information when used in combination with explicit and implicit knowledge
Thank you!
CERTH-ITI / Multimedia Knowledge Laboratory
http://mklab.iti.gr
Video Analysis State of the Art
Spatiotemporal segmentation:
• Find spatiotemporally homogeneous objects, i.e. similar appearance and motion
• Apply spatial segmentation on each frame
• Match segmented objects in successive frames using low-level features (e.g. similar color, texture, continuous motion)
• Use motion information – project position of object in current/next frames
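
The matching step can be sketched as a greedy assignment: each object from the previous frame is projected forward with its velocity, then paired with the segmented object in the next frame that is closest in position and color. The cost function, its weights, the object dictionaries and the greedy strategy are assumptions made for illustration.

```python
from math import dist

def match_objects(previous, current, max_cost=50.0, color_weight=0.5):
    """Greedy matching of tracked objects to new segments.

    previous: dicts with 'id', 'position', 'velocity', 'color'
    current:  dicts with 'position', 'color'
    Returns {index into current: id of the matched previous object}.
    """
    candidates = []
    for obj in previous:
        # Project the object's position into the next frame
        predicted = (obj["position"][0] + obj["velocity"][0],
                     obj["position"][1] + obj["velocity"][1])
        for idx, seg in enumerate(current):
            cost = (dist(predicted, seg["position"])
                    + color_weight * dist(obj["color"], seg["color"]))
            if cost <= max_cost:
                candidates.append((cost, obj["id"], idx))
    matches, used_ids = {}, set()
    for cost, obj_id, idx in sorted(candidates):  # cheapest pairs first
        if obj_id not in used_ids and idx not in matches:
            matches[idx] = obj_id
            used_ids.add(obj_id)
    return matches

previous = [
    {"id": "A", "position": (10, 10), "velocity": (5, 0), "color": (200, 30, 30)},
    {"id": "B", "position": (60, 40), "velocity": (-5, 0), "color": (30, 30, 200)},
]
current = [
    {"position": (56, 40), "color": (32, 28, 198)},
    {"position": (15, 11), "color": (198, 33, 29)},
]
matches = match_objects(previous, current)
print(matches)  # {0: 'B', 1: 'A'}
```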
Video Analysis State of the Art
Spatial segmentation:
• Spatial segmentation in images, video frames:
  • Region based: most methods are based on grouping similar features like color, texture, location – based on homogeneity of intensity, texture, position
  • Gradient/edge based: detecting changes in spatial distribution of features, e.g. pixel illumination
  • Some methods combine region/edge information
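
The gradient/edge-based idea can be shown with a minimal forward-difference edge detector on a grayscale grid. This is only a sketch: practical systems use more robust operators, and the threshold of 50 and the toy image are hypothetical.

```python
def edge_map(image, threshold=50):
    """Mark pixels where intensity changes sharply towards the right or downwards."""
    height, width = len(image), len(image[0])
    edges = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Forward differences, clamped at the image border
            gx = image[r][min(c + 1, width - 1)] - image[r][c]
            gy = image[min(r + 1, height - 1)][c] - image[r][c]
            edges[r][c] = abs(gx) + abs(gy) > threshold
    return edges

# Dark region on the left, bright region on the right
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
edges = edge_map(image)
print(edges)
```

The detected edge column separates the two homogeneous regions, which is where a region-based method would also place its boundary.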