A framework for content based semantic information extraction from multimedia contents
8/9/2019 A framework for content based semantic information extraction from multimedia contents
COMPUTER SCIENCE FACULTY. COMPUTER SCIENCE AND ARTIFICIAL
INTELLIGENCE DEPARTMENT
A FRAMEWORK FOR CONTENT BASED
SEMANTIC INFORMATION EXTRACTION
FROM MULTIMEDIA CONTENTS
A thesis submitted in fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY by:
Igor García Olaizola
Supervised by Prof. Basilio Sierra Araujo
&
Dr. Julián Flórez Esnal
The doctoral candidate        The supervisor        The supervisor
Donostia-San Sebastián, Wednesday 11th September, 2013
A Framework for Content Based Semantic Information Extraction from Multimedia
Contents
Author: Igor García Olaizola
Advisor: Basilio Sierra Araujo
Advisor: Julián Flórez Esnal
SVN Version Control Data:
Date: 2013-09-11 01:13:04 +0200 (Wed, 11 Sep 2013)
Author: iolaizola
Revision: 80M
DRAFT 1.7
The following web-page address contains up-to-date information about this dissertation and related topics:
http://www.vicomtech.org/
Text printed in Donostia-San Sebastián
First edition, September 2013
For you, dad.
Abstract
One of the main characteristics of the new digital era is the media
big bang, in which images (still images or moving pictures) are one of
the main types of data. Moreover, this trend keeps growing, pushed
mainly by the ease of capture offered by new mobile devices that
include one or more cameras.
From a professional perspective, most content-related sectors face
two main problems when operating efficient content management
systems: a) the need for new technologies to store, process and
retrieve huge and continuously growing datasets, and b) the lack of
effective methods for the automatic analysis and characterization of
unannotated media.
More specifically, the audiovisual and broadcasting sector, which is
experiencing a radical transformation towards a fully Internet-convergent
ecosystem, requires content based search and retrieval systems to
browse huge distributed datasets and to include content from
different and heterogeneous sources.
On the other hand, Earth observation technologies are improving the
quantity and quality of the sensors installed in new satellites. This
implies a much higher input data flow that must be stored and
processed.
In general terms, the aforementioned sectors and many other
media-related activities are good examples of the Big Data phenomenon,
where one of the main problems is the semantic gap: the inability
to transform the mathematical descriptors obtained by image
processing algorithms into concepts that humans can naturally
understand.
This dissertation presents an overview of applied research activity
carried out along different R&D projects related to computer vision
and multimedia content management. One of the main outcomes of this
research activity is the Mandragora model, an architecture designed
to minimize the semantic gap and to create automatic annotations
based on a previously defined ontology.
Resumen
One of the main characteristics of the new digital era is the great
explosion produced around multimedia content, where images (both
still and moving) are the main type of data. Moreover, this trend
keeps growing, mainly due to the ease of capture offered by mobile
devices that include one or more cameras.
Regarding the different professional sectors related to digital
media, this enormous data growth is causing two main problems. On
the one hand, new technologies are required that allow the storage,
processing and retrieval of content in an effective way over huge
and growing datasets. On the other hand, automatic methods for the
analysis and characterization of previously unannotated content are
also needed.
More specifically, we can highlight the audiovisual sector, which is
undergoing a deep transformation driven mainly by the process of
convergence with the Internet. In this situation, content search and
retrieval systems that allow browsing massive datasets, increasingly
distributed and of increasingly heterogeneous origin, become more
and more necessary.
In the case of Earth observation, data acquisition systems are
increasingly numerous and precise, a fact that generates ever larger
content flows that must be continuously processed and stored.
In general, the previously mentioned sectors and other activities
related to multimedia content are clear examples of the Big Data
phenomenon, where one of the main problems consists in eliminating
the semantic gap. We can define the semantic gap as the so-far
unbridged difference between the mathematical concepts derived from
the different image processing techniques and the concepts that
humans use to describe the same content.
This thesis presents a review of the applied research activity
carried out through several projects related to computer vision and
multimedia content management. One of the main results of this
research activity has been the Mandragora model, an architecture
design aimed at minimizing the semantic gap and creating automatic
annotations based on a previously defined ontology.
Since one of the main problems faced by the implementation of
Mandragora is the fact that the lack of prior knowledge about the
content limits the initial analysis, we have proposed a new method
(DITEC) for the semantic characterization of images. The good
results obtained in the experimental tests have led to an adaptation
of the original method, based on a global descriptor, so that a
variant of that global descriptor becomes effective as a local
descriptor. This document also describes the local DITEC variant,
whose experimental results (even with an implementation still under
development) have shown highly competitive behavior when compared
with the most popular local descriptors in the scientific
literature.
Laburpena
One of the main characteristics of the new digital era is the big
bang, or unrestrained proliferation, of media content, with images
(both photographs and videos) as the main content type. This trend,
moreover, keeps growing, helped above all by the ease with which
mobile devices carrying one or two cameras can capture content.
From a professional point of view, content-related sectors face two
main problems when it comes to efficient content management. On the
one hand, new technologies are needed to store, process and retrieve
these enormous amounts of data. On the other hand, effective methods
for the automatic analysis and characterization of previously
unannotated content are yet to be developed.
More specifically, the audiovisual and broadcasting sector, in the
midst of the deep transformation driven by its ongoing convergence
with the Internet, is managing larger amounts of content than ever
before. Furthermore, content shows an increasingly diverse origin
and heterogeneous nature, severely reducing the effectiveness of
systems that worked with the centralized and rigid models of the
past. For this reason, the development of flexible systems capable
of analyzing the content itself in order to perform effective
searches over these data is absolutely necessary.
On the other hand, Earth observation technologies use ever more
numerous and precise instrumentation in the new generations of
satellites. As a result, observation systems emit ever larger data
flows, and the need to store and process all that content is
increasingly hard to meet.
In general, the sectors mentioned above and many other
multimedia-related activities are clear examples of the Big Data
phenomenon, in which the semantic gap, that is, the inability to
turn the mathematical features extracted by image processing
algorithms into concepts that we humans can understand, has become
one of the main problems.
In this thesis document we present the overall results obtained in
several applied research projects related to computer vision. One of
the main results is the Mandragora architecture. The main goal of
Mandragora is to create an automatic image annotation system based
on an ontology in order to reduce the semantic gap.
Since one of Mandragora's main problems is having to perform the
first processing blindly, without any initial knowledge, we present
a new method for characterizing the semantic domain of images,
called DITEC. In view of the good results obtained in the
experimental tests, we have worked on adapting the global descriptor
at the core of the DITEC method for local use. The local DITEC
feature is therefore also described in this document. Although the
implementation of the method is still under development, the
experimental results obtained have been very good compared with the
best-known local descriptors in the scientific literature.
Acknowledgements
This is not the story of a self-made man. Instead, all the achievements
presented in this work have a long chain behind them, a chain made of
people who have supported my entire professional career, something
that cannot be separated from personal experiences. At this point, it
is worth acknowledging all these people.
In this sense, both of my supervisors, Basilio Sierra and Julián
Flórez, have been an essential part of this work, with an unconditional
commitment and highly valuable scientific guidance. Without a doubt,
Julián, it can be said that I took the first steps of my professional
path with you. The trust and support I felt from you from the
beginning have not diminished over a whole decade, and that is saying
something. What I have learned from you in all this time is one of
the main foundations of my daily work, and it is reflected in this
work as well. On the other hand, when it came to choosing the right
teacher at university, I was enormously lucky to meet Basi. From the
start he has known how to adapt to the interruptions and lack of
continuity caused by my professional situation. I would say he has
provided superb advice and scientific direction, and in a pleasant
and enthusiastic manner, turning what could have been a hard and
heavy process into work done with pleasure.
Within Vicomtech, the environment in which most of my professional
activity has taken place and in which this thesis is framed, I have
counted on countless supporters. I will surely leave someone
unmentioned (my apologies from here), but I still want to cite some
of them, such as Jorge Posada, Deputy Director, who has supported me
at every moment with encouragement and practical advice that comes
in very handy when one focuses too much on one's own problem. Amalia
and David, companions in hardship who showed me that it is indeed
possible to complete a doctoral thesis while working at a technology
centre. Shabs, this interesting man who always shows me that things
might have a non-obvious point of view that is worth observing.
Edurne, a colleague and friend who makes difficult things easy and
smooths the way for me time and again. Of course, the Digital TV and
Multimedia Services department, of which I am part and from which I
have felt enormous support throughout the whole process, deserves a
special mention. I truly hope to be able to give back in the same
proportion.
This work has been carried out in strong collaboration with some
colleagues who deserve a special mention. Marco Quartulli, a
Renaissance scientist with whom I play and learn every day. Naiara
Aginako, a companion on this road since the day we started working
with images, as rigorous at work as she is kind in person; I am
looking forward to your turn. Gorka, whose thesis marked the
starting point of this work. How many interesting discussions that
led to good ideas or... to more discussions :-) I hope to keep
enjoying your counterpoint. And of course Iñigo Barandiaran, a
meticulous, creative researcher with a taste for work well done who
remains an example for me. Of the countless hours we have spent
together in this process, there has not been a minute I have not
enjoyed. I am excited to know that our work still requires
improvements and more research, because that will be the way for us
to keep collaborating. In the end it looks like we are going to have
that beer!
Looking back a little, I cannot forget other old friends who,
although in a more distant way, have been fundamental for this work
to be done. Haritz, mein Bruder, so many hours together since we
started our degree, so many projects and discussions, always ready
to help each other. Then a year-long adventure together around
Wichernstrasse. Whatever kind of engineer I am, you certainly had a
say in it. Jaizki, an old friend from university days and the other
vertex of the Zarautz triangle in our German adventure. As before,
you have been no small help now as well. Said and done: no sooner
had I asked for corrections to the article than they arrived, and
with great precision too. Offering help with words is easy; you have
shown it with deeds.
I am one of those who believe that, in order to work wholeheartedly
as a professional, achieving personal balance is absolutely
necessary, and for that, sharing life with you and finding it a
wonderful thing to start each morning is something I owe to you,
Myriam, even if sometimes you need a crane for me.
Because they were, we are, and because we are, they will be. Naroa
and Maddi, you are my real project; you may be small, but next to
you everything else looks small to me.
As I said, because they were, we are, and if I am what I am, I owe
it to my family and above all to my parents (at least the good in
me). Uncle Joxe, a model of the basic principles in my life; after
so many years they have not changed. Mom, you taught me the value of
effort, of taking pleasure in work, of doing things properly. You
still teach me so every day. Dad, it is to you that I want to
dedicate the most important mention of this thesis work. Masterfully
combining ingenuity, imagination and knowledge, it was you who
steered me towards engineering. I try to follow your example: at
work and outside it, upright and sensible, trying to help those
around me and holding on to the right decisions with full coherence,
however hard they may be. Every day I try to follow with dignity the
path you showed so clearly in word and deed.
Thank you all
Igor García Olaizola
September 2013
Contents
List of Figures
List of Tables

I Work Description

1 Introduction
1.1 Context of this research activity
1.1.1 Vicomtech-IK4
1.1.1.1 Relation with other Vicomtech-IK4 PhD processes
1.1.2 Computer Science and Artificial Intelligence Department of the Computer Engineering Faculty
1.2 R&D Projects
1.2.1 Begira
1.2.1.1 Summary
1.2.1.2 Conclusions
1.2.2 Skeye
1.2.2.1 Summary
1.2.2.2 Conclusions
1.2.3 SiRA
1.2.3.1 Summary
1.2.3.2 Conclusions
1.2.4 SIAM
1.2.4.1 Summary
1.2.4.2 Conclusions
1.2.5 Cantata
1.2.5.1 Summary
1.2.5.2 Conclusions
1.2.6 RUSHES
1.2.6.1 Summary
1.2.6.2 Conclusions
1.2.7 Grafema
1.2.7.1 Summary
1.2.7.2 Conclusions
1.2.8 IQCBM
1.2.8.1 Summary
1.2.8.2 Conclusions
1.2.9 Relationship of projects and scientific activity

2 Computer Vision from a Cognitive Point of View
2.1 Mandragora Framework
2.2 Image Processing and AI Approach

3 Domain Identification
3.1 Domain characterization for CBIR
3.1.1 Broadcasting
3.1.1.1 Alternative methods for massive content annotation
3.1.2 Earth Observation, Meteorology
3.1.2.1 Meteorology
3.2 Local features vs. global features in domain identification

4 Proposed Method: DITEC
4.1 Introduction
4.2 General description of the DITEC method
4.2.1 Sensor modeling
4.2.2 Data transformation
4.2.2.1 Functionals
4.2.2.2 Geometrical constraints
4.2.2.3 Quantization effects
4.2.3 Feature extraction
4.2.3.1 Statistical descriptors
4.2.3.2 Cauchy Distribution
4.2.4 Classification
4.2.4.1 Feature Subset Selection in Machine Learning
4.2.4.2 Attribute contribution analysis
4.3 Experimental results
4.3.1 Case study 1: Corel 1000 dataset
4.3.2 Case study 2: Geoeye satellite imagery
4.4 Computational complexity
4.4.1 Computational complexity of the trace transform
4.4.2 Computational complexity of attribute selection and classification
4.4.3 Scalability
4.5 Conclusion of the presented method
4.6 Modified DITEC as local descriptor
4.7 Implementation of DITEC as local descriptor
4.7.1 Trace Transformation
4.7.2 Feature Extraction
4.7.3 DITEC parameters
4.7.3.1 Angular and radial sampling
4.7.3.2 Effects of sampling in the computational cost
4.7.4 Experimental results
4.7.4.1 Geometric Transformations
4.7.4.2 Photometric Transformations
4.7.5 Current status of the local DITEC algorithm design

5 Main Contributions
5.1 Mandragora framework
5.2 DITEC method as global descriptor
5.3 DITEC feature space analysis
5.4 DITEC method as local descriptor
5.5 Other contributions

6 Conclusions and Future Work
6.1 Future work
6.1.1 Collaborative filtering, Big Data and Visual Analytics

II Patents & Publications

7 Publications
7.1 Weather analysis system based on sky images taken from the earth
7.2 A review on EO mining
7.3 Acc. Obj. Tracking and 3D Visualization for Sports Events TV Broadcast
7.4 DITEC: Experimental analysis of an image characterization method based on the trace transform
7.5 Image Analysis platform for data management in the meteorological domain
7.6 Architecture for semi-automatic multimedia analysis by hypothesis reinforcement
7.7 Trace transform based method for color image domain identification
7.8 On the Image Content of the ESA-EUSC-JRC Workshop on Image Information Mining
7.9 Author's other publications

8 Selected Patents
8.1 Method for detecting the point of impact of a ball in sports events
8.2 Author's Other Related Patents

III Appendix and Bibliography

A Consideration on the Implementation Aspects of the trace transform
A.1 Development platforms
A.2 Sampling

B Calculation of the clipping points in a circular region

Bibliography
List of Figures
1.1 Begira scene definition
1.2 Example of the cloud segmentation process
1.3 Rushes content analysis workflow
1.4 Grafema Assets
1.5 Grafema System Workflow
1.6 Grafema System Architecture
1.7 IQCBM System Architecture
1.8 Screenshots of the IQCBM user interface
1.9 Relationship between R&D projects and scientific activity in multimedia content analysis
2.1 Information Retrieval Reference Model
2.2 Mandragora Architecture
2.3 DIKW Pyramid
3.1 Watson DeepQA High-Level Architecture
3.2 Idealized query process decomposition on EO image mining
3.3 Envisat instruments
3.4 General architecture of the meteorological information management system
4.1 DITEC System workflow
4.2 Trace transform, geometrical representation
4.3 Trace transform contribution mask at very high resolution parameters (image resolution: 100x100 px; n = 1000, n = 1000, n = 5000)
4.4 Pixel relevance in the trace transform scanning process with different parameters (n, n, n); original image resolution 384x256
4.5 Trace Transform and subsequent Discrete Cosine Transform of Lenna (Y channel of YCbCr color space)
4.6 Conceptual scheme: DCT matrix transformation into (, k) pair vector
4.7 Statistical properties of all Kurtosis measurements made on the distributions obtained by processing the Corel 1000 dataset
4.8 Examples of probability density distributions and histograms obtained from the samples
4.9 Samples of the Corel 1000 dataset; the dataset includes 256x384 or 384x256 images
4.10 Distance among classes in the Corel 1000 dataset according to misclassified instances
4.11 Distance among the most inter-related classes in the Corel 1000 dataset according to misclassified instances
4.12 Corel 1000 picture corresponding to class Architecture and classified as Mountain
4.13 Corel 1000 precision results with different feature extraction algorithms
4.14 Samples of the satellite footage dataset; 256x256 px patches at different scales
4.15 Distance among classes in the Geoeye dataset according to misclassified instances
4.16 Time performance behavior
4.17 System workflow for DITEC as local feature
4.18 Matching accuracy depending on the number of angular samples
4.19 Matching accuracy depending on the number of radial samples
4.20 Matching accuracy depending on the simultaneous increase of angular and radial sampling
4.21 Computation time depending on the simultaneous increase of angular and radial sampling
4.22 In-plane Rotation Transformation matching results
4.23 Scale Transformation matching results
4.24 Projective Transformation matching results
4.25 Exposure change photometric Transformation matching results
4.26 Trace transform row and column analysis
A.1 DITEC development platform
A.2 Circular patch image
A.3 Result of (θ, ρ) space exploration with Bresenham
A.4 First half of the source image is sampled (blue regions) while areas around the vertical and horizontal axes are not considered
A.5 Second half of the source image is sampled (red and green); these regions are moved to the [π/4, 3π/4] and [5π/4, 7π/4] areas in order to be sampled with the Bresenham algorithm
A.6 Result of (θ, ρ) sampling with the Bresenham algorithm and a single image rotation
A.7 Result of (θ, ρ) pixelwise sampling with image rotation for each angular iteration
A.8 Result of different sampling strategies of the (θ, ρ) space
B.1 Scanline defined in terms of θ and ρ
List of Tables
4.1 List of Trace Transform functionals proposed in [KP01]
4.2 Quantization effects of the trace transform
4.3 Corel 1000 dataset confusion matrix
4.4 Geoeye dataset confusion matrix
Part I
Work Description
If our brains were simple enough
for us to understand them, we'd be
so simple that we couldn't.
Ian Stewart
CHAPTER 1
Introduction
Artificial Intelligence (AI) is probably one of the most exciting knowledge fields,
where even the definition of the term becomes controversial due to the manifold
understandings of intelligence, which remains a hard epistemological problem.
Learning, reasoning, understanding, abstract thought, planning, problem solving
and other related topics are all different aspects that imply intelligence.
The emergence of programmable digital computers in the late 1940s offered a
revolutionary way to experimentally explore new methods for formal reasoning
and logic. However, the initial great expectations of AI did not become reality,
and the prediction made by Herbert A. Simon1 that "machines will be capable,
within twenty years, of doing any work a man can do" still remains a science
fiction topic.
The fashions of AI over the years have moved from automated theorem proving
to expert systems, which were later substituted by behaviour-based robotics
and now seem to find the solution in learning from big data [Lev13]. All these
trends have not been able to meet the expectations that the founders of AI put
on the field [Wan11]. Patrick Winston (director of the MIT Artificial Intelligence
Laboratory from 1972 to 1997) cited the problem of mechanistic balkanization,
with research focusing on ever-narrower specialties such as neural networks or
genetic algorithms. When you dedicate your conferences to mechanisms, there's a
1 Herbert Alexander Simon (June 15, 1916 - February 9, 2001), awarded the ACM Turing Award
(1975) for making basic contributions to artificial intelligence, the psychology of human cognition,
and list processing, and considered one of the founders of AI.
1. INTRODUCTION
tendency to not work on fundamental problems, but rather [just] those problems
that the mechanisms can deal with [Cas11].
However, there has been great scientific and technological advance in many
AI-related domains (formal logic, reasoning, statistics and data mining, genetic
programming, knowledge representation, etc.) that, without satisfying the founda-
tions proposed by Winston or Chomsky, has enabled the creation of technological
solutions for different application fields such as natural language processing,
computer vision, drug design, medical diagnosis, genetics, finance and economy,
user recommendation systems and many others.
1.1 Context of this research activity
The research activity described in this dissertation has mainly been carried out
within the applied research perspective given by both basic and applied research
projects developed at Vicomtech-IK41. Vicomtech-IK4 is an applied research
center located in San Sebastian (Basque Country, Spain) that combines excellence
in pure basic research with its application and transfer to industry. In this
sense, some of these projects have been transferred to industry and their
intellectual property has been protected through patents. In other cases, the
scientific progress achieved within the projects has been published in journals
or conferences.
Knowledge of market needs, the technological state of the art and real
integration (which introduces many constraints coming from the real world) is
combined with the scientific method and basic research activities in collabora-
tion with universities. In this case, the collaboration with the University of the
Basque Country2, and more specifically with the Computer Science and Artificial
Intelligence Department of the Computer Engineering Faculty, has been a key
element for the balanced applied/scientific progress of the research work.
1.1.1 Vicomtech-IK4
Vicomtech-IK4, as an applied research centre, is focused on all aspects related
to multimedia and visual communication technologies along the entire content
production pipeline: from generation, through processing and transmission, to
1 http://www.vicomtech.org
2 http://www.ehu.es
rendering, interaction and reproduction. Vicomtech-IK4 is structured in six de-
partments that offer different views and specific technological solutions around
the aforementioned research activity. These departments are the following:
Digital Television and Multimedia Services
Speech and Natural Language Technologies
eTourism and Cultural Heritage
Intelligent Transport Systems and Engineering
3D Animation and Interactive Virtual Environments
eHealth and Biomedical Applications
The research described in this dissertation has been carried out within the
Digital TV & Multimedia Services department, in strong collaboration with the
department of eHealth and Biomedical Applications. In fact, the problem of
computer vision and multimedia content understanding is one of the main re-
search lines of Vicomtech-IK4. In this case, one of the main problems addressed
in this dissertation is aligned with one of the main difficulties of current multime-
dia management systems in diverse sectors such as broadcasting, remote sensing,
medical imaging, etc.: the huge amount of unannotated data and extremely
broad domains that cannot be explicitly defined.
1.1.1.1 Relation with other Vicomtech-IK4 PhD processes
The research and technological activity performed in this work has been carried
out in strong collaboration with two other PhD processes at Vicomtech-IK4.
These two works analyze and develop other aspects related to the analysis
and understanding of multimedia. More specifically, Marcos [Mar11] studied the
multimedia retrieval problem from a semantic point of view, creating a semantic-
middleware-based approach as an intermediary layer between users' high-level
queries and the system's low-level annotations. Some of the final considerations of
the work carried out by Marcos, and the requirements identified as future work to
feed this semantic middleware from a bottom-up approach, are the basis of the
initial context of this dissertation.
On the other hand, Barandiaran [Bar13] focused his work on the analysis of
local descriptors. The collaboration with Barandiaran has resulted in a local
adaptation of the global descriptor, one of the main contributions proposed
in this dissertation (see Section 4.6). This novel local feature has demonstrated
highly robust characteristics as a local descriptor.
1.1.2 Computer Science and Artificial Intelligence Department
of the Computer Engineering Faculty
The Robotics and Autonomous Systems Group, which belongs to the Computer
Science and Artificial Intelligence Department of the Computer Engineering
Faculty, is very active in two main areas:
Mobile Robotics
Behavior-based control architectures for mobile robots.
Bio-inspired robot navigation.
Use of visual information for navigation.
Machine Learning
Dynamic learning mechanisms
Classifier combination
New paradigms for supervised classification
Optimization problems
The deep knowledge of this group in machine learning science and tech-
nologies has provided the scientific foundations for the more technological work
developed at Vicomtech-IK4. This combination provides a high-potential context
for scientific research.
1.2 R&D Projects
This dissertation work has been carried out based on R&D projects with common
underlying scientific needs and customer-specific requirements. The knowledge
and experience acquired during these projects has driven the general framework
presented as one of the main contributions of this work.
1.2.1 Begira
Title: Diseño y Desarrollo de un Sistema de Seguimiento Preciso de Objetos
en Transmisiones Deportivas (Design and development of a high-accuracy
object tracking system for sports broadcasting).
Project typology: Industrial project partially supported by the Gaitek pro-
gramme.
Company name: G93.
Period: 2005-2009.
1.2.1.1 Summary
Augmented reality projects require a deep knowledge of the scene, which has to
be extracted/updated in real time. In order to ensure the accuracy and real-time
performance of the system, the knowledge must be explicitly defined.
The goal of the Begira project was to develop a single-camera system to track
the ball trajectory and locate the bouncing point for Basque Pelota live TV
transmissions. The main constraints of the system were:
Single camera.
Broadcasting camera (720p@50).
Tracking, positioning and virtual reconstruction under 20 seconds.
Single standard computer for processing purposes.
From an Artificial Intelligence perspective, we can consider it a system
where the knowledge domain is reduced to a single scene (the Basque Pelota
court) and can thus be explicitly defined. The main elements that define this
domain are:
3D environment: A court composed of three planar surfaces (front wall, side wall
and ground).
The relative position of the camera to the court is obtained during a
calibration process by putting a checkerboard on the ground.
Once the camera is calibrated, its position is fixed during the entire
match.
Dynamic objects: There are only 2 types of dynamic objects in the scene:
Players: There can be two or four players. They are much bigger than the
ball and most of the time their lowest part is touching the ground.
Ball: It is white, round and much smaller than the players. Sudden trajec-
tory changes are due to a hit by the players or a bounce. The ball is
so rigid that the bounce can be considered elastic.
According to the domain defined with the aforementioned concepts, a homog-
raphy matrix H is calculated to obtain the camera's extrinsic parameters. Then the
[Two-panel figure, (a) and (b), showing the camera origin, the image point
(xi, yi), R and the Z = 0 plane origin.]
Figure 1.1: Scene definition: ball trajectory samples used to estimate the paramet-
ric curves, and the calculation of the bouncing point on the ground once the center
position of the ball is obtained (crossing point of the two curves).
ball is initially detected and the tracking system follows its trajectory. Abrupt tra-
jectory changes define the limit between the instants before and after the bounce.
Once the two parametric curves are estimated, their crossing point is calculated
on the image. This two-dimensional position (in pixels) is then converted to 3D
space using the inverse of the homography matrix (H^-1). To solve the uncer-
tainty of the 3D position obtained from the 2D projection, the condition Z = 0 is
established for the bouncing point. More details of the project can be found in
Section 7.3.
1.2.1.2 Conclusions
The Begira project is a good example of expert systems applied to image process-
ing and computer vision. The technical goals were successfully achieved and the
results of the project were exploited by the Basque public broadcaster ETB and
the TV content producer G93. However, the knowledge acquired by the system
was so hardcoded that it is very difficult to extend it or integrate it into other,
more general solutions. The good performance and accuracy results rely on its
reduced domain definition and rigid nature.
1.2.2 Skeye
Title: Sistema de análisis meteorológico basado en imágenes del cielo
tomadas desde tierra (Meteorological analysis system based on sky images
taken from the ground).
Project typology: Industrial project supported by the Gaitek programme.
Company name: Dominion.
Period: 2007-2008.
1.2.2.1 Summary
Meteorological stations provide multiple sensor data as well as some more sub-
jective information such as the cloudiness. The goal of Skeye was to provide
an automatic system to accurately estimate the cloudiness factor, avoiding any
human intervention.
As mentioned for Begira, the semantic domain was small and quite straight-
forward to model. The four classes that compose the domain are: cloud, sun, blue
sky and earth. The project was carried out by analyzing the features that were
characteristic of each class, and the scene was defined in terms of a dome with
normalized illumination conditions.
Figure 1.2: Example of the cloud segmentation process (panels (a) and (b)).
1.2.2.2 Conclusions
Similarly to Begira, in this case the feature extraction process provided all the
information needed for a further class assignment by applying specific thresholds.
However, the further integration of the developed system in other domains or
scenes would be a difficult task, since all the development and the selected features
totally depend on the domain definition and scene conditions. More information
about this work can be found in Section 7.1.
1.2.3 SiRA
Title: Diseño y Desarrollo de un Sistema de Reconocimiento de Marcas
Comerciales en Emisiones Televisivas (Design and development of a system
for commercial brand recognition in TV broadcasts).
Project typology: Industrial project supported by the Gaitek programme.
Company name: Vilau.
Period: 2007-2008.
1.2.3.1 Summary
This project is another example of a system based on a reduced semantic domain,
but in this case the approach was more general and some higher-abstraction-level
elements were introduced. The goal of SiRA was to detect logos in TV content in
order to automate advertisement monitoring tasks. This project was also sup-
ported by the Basque Government, and its industrial application was envisioned
by Vilau, a media communication company.
In this case, the constraints in terms of real-time behavior and equipment
were lighter than in Begira. However, the domain was broader: any type of logo
embedded in any type of content, taken from different perspectives.
The approach followed in this case was to first detect a logo candidate, as-
suming that a logo would typically be surrounded by a regular shape (square,
circle, triangle, etc.) and composed of very few colors. Once the logo was detected,
different feature extraction algorithms could be applied in order to compare the
results with the features corresponding to the target logo dataset. Depending
on the extracted features, different distance metrics were applied.
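The "very few colors" heuristic for candidate detection can be sketched as follows; the quantization depth and color-count threshold are hypothetical illustration values, and the function name is this sketch's own rather than SiRA's:

```python
import numpy as np

def few_color_candidate(patch, max_colors=8, bits=3):
    """Flag a region as a potential logo if, after coarse color
    quantization, it contains very few distinct colors.

    `patch` is an HxWx3 uint8 RGB array; `bits` kept per channel and
    `max_colors` are illustrative values, not SiRA's actual parameters.
    """
    q = (patch >> (8 - bits)).reshape(-1, 3)        # drop the low bits
    n_colors = len({tuple(c) for c in q})           # count distinct colors
    return n_colors <= max_colors
```

A flat single-color patch passes the test, while a noisy natural-image patch, which spreads over many quantized colors, does not.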
1.2.3.2 Conclusions
The results of SiRA can be integrated as a new feature in other content analysis
systems. In this case, SiRA would provide information about potential logos exist-
ing in a specific video or still image. Moreover, even if the process itself is carried
out by using low-level operators, the result of SiRA can be considered a set of
high-level features with valuable semantic content, as in general terms the
presence of a logo means that there is a product or an advertisement related to it.
1.2.4 SIAM
Title: Diseño y Desarrollo de un Sistema de Análisis Multimedia de Con-
tenido Audiovisual en Plataformas Web Colaborativas (Design and devel-
opment of a system for multimedia analysis of audiovisual content in
collaborative web platforms).
Project typology: Industrial project supported by the Gaitek programme.
Company name: Hispavista (http://hispavista.com/).
Period: 2009-2010.
1.2.4.1 Summary
The first ideas of this work related to the semantic analysis of multimedia content
were developed in SIAM. The goal of this project was to create content analy-
sis tools to improve the exploitation of large amounts of user-generated content.
The context of the project was www.tu.tv, a YouTube-like video sharing platform
owned by Hispavista. According to this approach, semantic labels can be ob-
tained from unstructured user comments. Then, by finding similar contents, new
untagged content can be assigned to a previous label.
As the content analyzed in SIAM could be any kind of video, the semantic
domain was too broad and complex to be defined, and one of the main prob-
lems was the definition of a semantic unit in a video. The assumption of a video
as a semantic unit is too inconsistent in many cases, as the elements in it can
change over time. Therefore, each video was decomposed into shots, and
each shot was analyzed and labeled. Finally, the entire video would be labeled as
the composition of each shot label.
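The shot decomposition step can be sketched with a standard histogram-difference detector; the bin count and threshold below are illustrative choices of this sketch, not the parameters of the detector actually developed in SIAM:

```python
import numpy as np

def shot_boundaries(frames, threshold=0.5):
    """Return the frame indices where a new shot is assumed to start.

    `frames` is a sequence of 2-D grayscale arrays with values in
    [0, 255]. A boundary is flagged when the total variation distance
    between consecutive frame histograms exceeds `threshold`.
    """
    boundaries, prev = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=64, range=(0, 256))
        hist = hist / hist.sum()                  # normalize to a distribution
        if prev is not None and np.abs(hist - prev).sum() / 2 > threshold:
            boundaries.append(i)                  # abrupt change: new shot
        prev = hist
    return boundaries
```

For instance, a sequence of three dark frames followed by three bright frames yields a single boundary at the fourth frame.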
1.2.4.2 Conclusions
The main outcome of SIAM was the shot-based content analysis model and a shot
boundary detector that has later been used for semantic analysis purposes. More-
over, the potential of user-generated metadata was addressed in this project. We
identified the potential of this amount of unstructured data, which could be com-
plementary to the perfectly organized but expensive-to-populate professional
taxonomies.
1.2.5 Cantata
Title: Content Aware Networked systems Towards Advanced and Tailored
Assistance.
Project typology: ITEA.
Period: 2007-2009.
Consortium: Bosch Security Systems, Philips Electronics Netherlands, Philips
Medical Systems, Philips Consumer Electronics, TU/e, TU Delft, Multitel,
ACIC, Barco, Traficon, VTT, Solid, Hantro, Capacity Networks, I&IMS, Tele-
fonica, Vicomtech-IK4, University Pompeu Fabra, CRP Henri Tudor, Co-
dasystem, Kingston University, University of York, INRIA.
1.2.5.1 Summary
The goal of Cantata was to create a distributed service for content analysis. The ap-
plication field included medical imaging, entertainment and security. Our activity
was focused on the entertainment sector, where the content analysis modules were
connected to user profiles in order to create content recommendation systems.
In this case, the logo detection system was used to provide content informa-
tion to the main content analysis and recommendation system.
1.2.5.2 Conclusions
The recommendation system was intended to combine user activity information,
content metadata and low-level-feature-based information. However, the broad
domain definition required an unaffordable amount of low-level descriptors, and
even the combination of all these descriptors would be a very complex issue. Due
to this complexity, most recommendation systems rely basically on metadata.
1.2.6 RUSHES
Title: Retrieval of mUltimedia Semantic units for enHanced rEuSability.
Project typology: FP6-2005-IST-6.
Period: 2007-2009.
Consortium: Heinrich-Hertz-Institut (DE), University of Surrey (UK),
Athens Technology Centre (GR), Vicomtech (ES), Queen Mary University of
London (UK), Telefonica I+D (ES), FAST Search & Transfer (NO), University
of Brescia (IT), ETB (ES).
1.2.6.1 Summary
The overall aim of RUSHES was to design, implement, validate and trial a system
for both delivery of and access to raw media material (rushes), and the reuse of that
content in the production of new multimedia assets, following a multimodal ap-
proach that combines knowledge extracted from the original content, metadata,
visualization and highly interactive browsing functionality.
The core research issues of RUSHES were:
Knowledge extraction for semantic inference
Automatic semantic-based content annotation
Scalable multimedia cataloging
Interactive navigation over distributed databases
Non-linear querying and retrieval techniques using hierarchic descriptors.
Figure 1.3: Rushes content analysis workflow
1.2.6.2 Conclusions
The RUSHES consortium tried to address the semantic gap by creating a powerful
architecture composed of low-level operators. The workflow designed in Rushes
(Figure 1.3) was able to combine multiple low-level features and multiple types of sources
(video, audio, text). Moreover, the shot was considered the semantic unit of a
video. Due to the fact that different shot boundary operators provide different
shots, extra complexity was added to the metadata model, where each feature
could define its own temporal boundaries.
All the low-level operators were applied to every content item in the database. This
fact introduced a strong limitation on the scalability of the domain. In order to
identify new concepts, more low-level operators might be needed, and as the
dimensionality of the feature space increased, the system became both com-
putationally too demanding and unaffordable for the data mining and ontology
management processes. We presented a potential solution to this problem in
[OMK+09] by splitting the domain into sub-domains that only apply those low-
level feature extraction operators suggested by the domain definition (ontology).
However, this requires prior knowledge of the content, which should itself be
obtained by applying low-level operators. This chicken-and-egg problem will be
one of the key topics of this research work.
1.2.7 Grafema
Title: Grafema: Multimodal content search platform.
Project typology: Basic research project.
Period: 2012.
1.2.7.1 Summary
The goal of the Grafema project was to create a base platform to store, annotate
and retrieve multimedia content of diverse nature. Rather than focusing on the
algorithms to obtain content descriptors or methods for automatic content an-
notation, Grafema was focused on the architectural aspects and the design of a
generic solution to deal with different types of content. In this sense, an asset
could be text, image, audio, video, 3D, or even a combination of these elementary
units. According to this generic description of a digital asset, similarity metrics
must also adapt to each case or combination. As can be observed in Figure 1.4,
assets containing the label tiger can be considered similar if they include this
information in the metadata or if this label is found in any of the elementary
units that compose the content.
The workflow designed for Grafema (Figure 1.5) is based on low-level oper-
ators that are processed independently. The information obtained from these
Figure 1.4: Grafema Assets
operators is then ported to a higher level of abstraction by using data mining
techniques. The obtained information is then introduced into a semantic model
and stored in a database. The similarity of two assets can then be computed ac-
cording to this semantic model, but it is not limited to this metric. An iterative
process starts and enables the calculation of similarity metrics between assets
of the same type that belong to different instances. This iterative process is the
basis of the Grafema architecture (Figure 1.6) and provides a new paradigm of
content search and retrieval, based more on a browsing process than on a pure
text-based search.
Figure 1.5: Grafema System Workflow
1.2.7.2 Conclusions
The results of Grafema have shown the big potential of iterative processes for mul-
timedia searching. Even if the tests have been carried out with datasets limited in
terms of size and domain complexity, the results show that text-based search can
be dramatically improved when datasets include high volumes of multimedia content.
Figure 1.6: Grafema System Architecture
Regarding the state of the art, the annotation and individual metrics, as well
as the unsuitability of most common database solutions for multimedia data, are
still the main drawbacks that limit the potential of these kinds of systems.
1.2.8 IQCBM
Title: Image Query by Compression Based Methods.
Project typology: Industrial project.
Period: 2011.
Consortium: DLR (German Aerospace Agency).
1.2.8.1 Summary
The goal of this project was to create low-level operators and define distance
metrics for satellite imagery that would be applied during the ingestion process of the
delivered streams. The main idea behind these operators was to gather prior char-
acteristic information that could be useful for later retrieval operations. The lack
of knowledge regarding the queries that might be applied during the retrieval pro-
cess made it difficult to define low-level features that were not focused on any
specific aspect.
The domain of Remote Sensing is not as broad as those related to the audio-
visual sector, but it is still too big and complex to be explicitly defined. Moreover,
new definitions and relationships could be dynamically introduced.
[Diagram text: pre-processing, analysis, indexing and query/ranking frameworks;
feature extractors (MPEG-7 VST, TIFF LZW, JPEG/MPEG DCT, JPEG 2000 wavelets);
distance measures (FCD, PRDC, random); MonetDB storage and execution with a
Django adapter and UI; codebook analysis of user classes, complexity and
nano-codebooks.]
Figure 1.7: IQCBM System Architecture
In order to address the lack of prior knowledge, global features were consid-
ered more adequate than local ones. The first algorithm implemented in this
project was based on the codewords provided by a Lempel-Ziv compressor, as sug-
gested by Watanabe et al. [WSS02]. The L0 distance (Equation 1.1) was used as
a metric for the codewords related to each element (in this case, an element is
represented by each of the patches obtained after a tiling process applied to the
multi-resolution satellite imagery).
d_L0 = Σ_{i=1}^{n} |x_i - y_i|^0, where 0^0 = 0 (1.1)
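In practice, Equation 1.1 simply counts the codeword positions at which two equal-length feature vectors differ, since |x_i - y_i|^0 is 1 for any nonzero difference and 0^0 is defined as 0. A minimal sketch (the function name is this sketch's own):

```python
def l0_distance(x, y):
    """L0 distance of Equation 1.1: the number of positions where the
    two equal-length codeword vectors differ (0^0 is taken as 0)."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)
```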
1.2.8.2 Conclusions
The developed system (Figure 1.8) was tested against the Corel 1000 dataset [Cor]
and a subset of the Geoeye imagery [Glo], and obtained good accuracy characteristics.
The length of the feature vector for each item was variable and was an attribute by
itself, as it provides a measure of the complexity of the image. However, in terms
of scalability, the average length (several thousands of codewords) obtained by
this algorithm might become a limitation.
Figure 1.8: Screenshots of the IQCBM user interface
As a result of this project, a deep study of the current trends of the community
was carried out [QGO13]. Moreover, a new global feature extraction algorithm
was developed based on the ideas of Kadyrov et al. [KP98, KP01, KP06].
1.2.9 Relationship of projects and scientific activity
Figure 1.9 shows a summary of the main scientific activity around the aforemen-
tioned projects. It can be observed that different projects share the same research
activities and scientific background. However, this activity sharing does not imply
the reusability of previous developments, as different semantic domains require
specific low-level features, metrics, mining techniques, etc.
In order to minimize these barriers and achieve higher practical reusability, a new
architecture is proposed in this work.
Figure 1.9: Relationship between R&D projects and scientific activity in multimedia
content analysis.
By three methods we may learn
wisdom: First, by reflection, which
is noblest; Second, by imitation,
which is easiest; and third, by ex-
perience, which is the bitterest.
Confucius
CHAPTER 2
Computer Vision from a Cognitive Point of View
The main approach for computer vision tasks has been based on the identifi-
cation of low-level features that can be used to segment, identify or determine
higher abstraction levels. Following the three learning methods stated by Confu-
cius in the previous quotation, we can say that this approach provides knowledge
to the system by imitation. By giving it explicit knowledge based on existing
datasets or contents, which have been used to allow researchers to understand the
relationship between the identification of real-world phenomena and specific
low-level features, the computer vision system just reproduces the experiments
with different datasets. This process offers high-performance results for narrow
domains, but it cannot be reused in other contexts and is not a scalable ap-
proach. The feature space created by these low-level operators easily gets too
complex for manual (and typically linear) thresholding techniques. For those
cases where the behavior of a set of low-level feature extractors is too complex
to model, data mining techniques are applied. In those cases, we can move to
the third level of Confucius' statement. The experience of dealing with this data
(training for supervised classification, and other specific metrics or criteria for
clustering) provides an adaptive behavior within the system to create the regions
or hyperplanes that best fit each specific problem.
The use of ontologies introduces a new way of adding formal explicit knowledge
to the system. This is typically carried out by establishing concepts and
2. COMPUTER VISION FROM A COGNITIVE POINT OF VIEW
Figure 2.1: Information Retrieval Reference Model [Mar11]
relationships among them, thereby defining a domain. One common use of
ontologies is to establish shared vocabularies and taxonomies among scientists
or professionals. However, from a cognitive system perspective, the most pow-
erful characteristic of ontologies is the capability of inference, which creates new
rules that were not explicitly defined. The main drawback of ontologies comes
from the fact that broad, complex domains, such as those related to common
vision understanding, cannot be specifically defined, mainly because of the size,
complexity and fuzziness of this kind of domain.
Content Based Image Retrieval (CBIR) systems can be considered one of the
branches of cognitive vision, since they require the four functionalities considered
the pillars of a cognitive vision system: detection, localization, recognition and
understanding [Ver06]. Marcos et al. propose a reference model that addresses the
use of ontologies for multimedia retrieval purposes [MIOF11]. This work presents
a reference model (Figure 2.1) based on a semantic middleware. The main goal
of this approach is to create a layer to deal with semantic functionalities (e.g.
knowledge extraction, semantic query expansion, etc.).
Marcos proposes in his PhD work [Mar11] the use of the semantic middleware
to automatically generate annotations of the multimedia assets. This approach,
initiated in the Rushes project by using a set of low-level features and applying
fuzzy reasoning to the information provided by those modules, offered good re-
sults for narrow domains, but the system was unable to deal with a big number
of different low-level features, and broad complex domains did not show good
performance. One of the main drawbacks of this architecture was the fact that all
low-level features were considered at the same level when no prior information
was given.
2.1 Mandragora Framework
In order to overcome this scalability drawback, we presented a novel architec-
ture called Mandragora [OMK+09]. This architecture enhances the metadata with
new labels that can be ported to the semantic layer by using a two-step itera-
tive approach. The implicit and explicit knowledge about a certain domain can
be introduced into the system with a combination of classifiers and the semantic
middleware. This combination allows the modeling of bigger and more complex
domains [SASK08] and reduces the semantic gap by connecting low-level features
with high-level hypotheses and reinforcement factors. The reinforcement factors
allow the dimensionality of the domain to be extended and provide the framework
for specific analysis methods.
The main idea behind this two-step approach is to break big domains into sub-domains that are more homogeneous, both semantically and in terms of low-level features. Then, specific feature extractors and semantic definitions can be used with much higher precision. One of the key aspects of this framework is the initial domain estimation: the hypothesis that will be considered by the next layer to launch domain-specific analyzers, which afterwards feed the semantic middleware. If the results of this second step confirm the characteristics of the estimated domain, the hypothesis is accepted and the elements identified in the content are considered descriptors of this specific asset. Otherwise, the process is restarted with a different hypothesis.
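The two-step process just described can be sketched as a hypothesis-and-verify loop. The sketch below is a minimal illustration only: the feature names, domains, weights and functions are invented assumptions and do not reproduce Mandragora's actual modules.

```python
# Illustrative hypothesis-and-verify loop for two-step domain estimation.
# All names, domains and weights here are hypothetical, not Mandragora's API.

def estimate_domain_hypotheses(global_features):
    """Step 1: rank candidate sub-domains from cheap global features."""
    scores = {
        "sports": global_features.get("motion", 0.0) * 0.7
                  + global_features.get("green_ratio", 0.0) * 0.3,
        "news": global_features.get("face_ratio", 0.0),
        "nature": global_features.get("green_ratio", 0.0),
    }
    # Best hypothesis first
    return sorted(scores, key=scores.get, reverse=True)

def run_specific_analyzers(domain, asset):
    """Step 2: domain-specific extractors return a confidence score."""
    analyzers = {
        "sports": lambda a: a.get("field_lines", 0.0),
        "news": lambda a: a.get("studio_layout", 0.0),
        "nature": lambda a: a.get("vegetation", 0.0),
    }
    return analyzers[domain](asset)

def annotate(asset, global_features, threshold=0.5):
    """Accept the first hypothesis confirmed by its own analyzers."""
    for domain in estimate_domain_hypotheses(global_features):
        confidence = run_specific_analyzers(domain, asset)
        if confidence >= threshold:
            return domain, confidence   # hypothesis accepted
    return None, 0.0                    # restart with no confirmed hypothesis

domain, conf = annotate(
    asset={"field_lines": 0.8, "vegetation": 0.2},
    global_features={"motion": 0.9, "green_ratio": 0.6},
)
```

The key point the sketch preserves is that the cheap global estimate only orders the hypotheses; acceptance depends on the domain-specific analyzers confirming them.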
2.2 Image Processing and AI Approach
From an Artificial Intelligence perspective, we can consider the climb up the DIKW Pyramid (Figure 2.3) as the process that our system has to follow from raw
Figure 2.2: Mandragora Architecture for automatic video annotation [OMK+09]
unannotated images to structured content with semantic information. The main issue is the semantic gap between the mathematical representation obtained by the developed operators and the high-abstraction-level concepts that are intended to be discovered by using such low-level features. Smeulders et al. define the semantic gap as: ". . . the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation". According to this definition, we can consider the semantic gap as the distance between data and wisdom in the DIKW pyramid.
The typical approaches are both top-down (ontologically driven approaches
that build domain definitions by creating relationships between high-level concepts) and bottom-up (automatic annotation or labeling approaches that try to discover correspondences between high-level annotations and automatically extracted features [HSL+06]). Both approaches can also be combined in the same process.
Figure 2.3: DIKW Pyramid
Most bottom-up approaches rely on data-mining techniques to move from the low-level mathematical representation to classes at a higher abstraction level. There is an enormous diversity of supervised and unsupervised classification or regression techniques, methods for feature space analysis, algorithms for attribute selection, etc. Thus, each specific problem requires the set of tools and algorithms that best suits its characteristics and requirements (type of attributes and classes, dimensionality, dataset size, computational cost, etc.).
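As a minimal illustration of such a bottom-up pipeline, the sketch below combines a crude attribute-selection step (keeping the attributes whose class means are furthest apart) with a 1-nearest-neighbour classifier. The data, feature names and selection criterion are assumptions chosen for brevity; real systems would use proper statistical criteria and stronger classifiers.

```python
# Toy bottom-up pipeline: select discriminative attributes, then classify.
# Data, feature names and the selection criterion are illustrative only.

def select_attributes(samples, labels, k=2):
    """Keep the k attributes whose per-class means differ the most."""
    n_attrs = len(samples[0])
    def separation(i):
        vals = {}
        for x, y in zip(samples, labels):
            vals.setdefault(y, []).append(x[i])
        means = [sum(v) / len(v) for v in vals.values()]
        return max(means) - min(means)
    ranked = sorted(range(n_attrs), key=separation, reverse=True)
    return sorted(ranked[:k])

def nn_classify(train, labels, attrs, query):
    """1-nearest-neighbour using only the selected attributes."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in attrs)
    best = min(range(len(train)), key=lambda j: dist(train[j], query))
    return labels[best]

# Low-level feature vectors: [edge density, mean hue, texture energy]
train = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.4],   # "urban" samples
         [0.1, 0.8, 0.5], [0.2, 0.9, 0.6]]   # "forest" samples
labels = ["urban", "urban", "forest", "forest"]

attrs = select_attributes(train, labels, k=2)   # texture energy is dropped
pred = nn_classify(train, labels, attrs, query=[0.15, 0.85, 0.5])
```

The attribute-selection step mirrors the point made above: the right subset of tools (here, attributes) depends on the characteristics of the specific problem.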
The difference between stupidity
and genius is that genius has its
limits.
Albert Einstein
CHAPTER 3
Domain Identification
As stated in the previous section, domain identification is one of the key issues of cognitive vision, as it allows the use of contextual information. The current best-performing systems are mainly those where the size and complexity of the domain are relatively low. Deng et al. [DBLFF10] perform a study of the effects of dealing with more than 10,000 categories. The results show that:
Computational issues become crucial in algorithm design.
Conventional wisdom on the relative performance of different classifiers, drawn from a couple of hundred image categories, does not necessarily hold when the number of categories increases.
There is a surprisingly strong relationship between the structure of WordNet and the difficulty of visual categorization.
Classification can be improved by exploiting the semantic hierarchy.
The process carried out by Deng et al. is based on state-of-the-art descriptors such as GIST [OT01] and SIFT [Low99]. The classification process uses Support Vector Machines, and the dataset includes more than 9 million assets.
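The last finding above — that classification improves when the semantic hierarchy is exploited — can be illustrated with a coarse-to-fine scheme: pick the best parent category first, then discriminate only among that parent's children. The hierarchy and scores below are invented for illustration and are not taken from [DBLFF10].

```python
# Coarse-to-fine classification over a toy semantic hierarchy.
# The hierarchy and the scores are illustrative assumptions.

HIERARCHY = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "truck", "bicycle"],
}

def classify_hierarchical(scores_coarse, scores_fine):
    """Pick the best parent, then the best child within that parent.

    Restricting the fine step to one parent's children shrinks the
    candidate set from all leaves to a handful, which is where the
    gain appears as the number of categories grows large.
    """
    parent = max(scores_coarse, key=scores_coarse.get)
    children = HIERARCHY[parent]
    leaf = max(children, key=lambda c: scores_fine.get(c, 0.0))
    return parent, leaf

parent, leaf = classify_hierarchical(
    scores_coarse={"animal": 0.7, "vehicle": 0.3},
    # "car" scores highest overall, but is ruled out by the coarse step
    scores_fine={"dog": 0.5, "cat": 0.4, "car": 0.9},
)
```

Note how the coarse decision overrides the globally highest-scoring leaf: the hierarchy acts as a structural prior on the fine-grained classifier.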
Popular AI development results such as Deep Blue against Kasparov [Dee], commonly considered a great step in AI in which machines were able to beat human minds, are clear cases where the domain and the rules that define it are rather simple, while the combinatorial space derived from them is huge. For those cases,
Figure 3.1: Watson DeepQA High-Level Architecture [FBCC+10]
brute-force algorithms can defeat human experience and heuristic capabilities. In the case of Deep Blue, its domain dependence was so high that even some hardware components were specifically designed for chess-playing purposes.
A step forward was taken by Watson [Wat], which in 2011 won the Jeopardy! prize against former winners. In this case, Watson was able to process natural language by identifying keywords and accessing 200 million pages of structured and unstructured content. As the IBM DeepQA Research Team (the developers of Watson) state when they refer to Watson: "This is no easy task for a computer, given the need to perform over an enormously broad domain, with consistently high precision and amazingly accurate confidence estimations in the correctness of its answers". However, even if the constraints of this task are much harder than for chess playing, apart from the natural language processing module, the task of playing Jeopardy! can be considered an advanced text search engine that does not require prior contextual knowledge, as can be observed in its architectural design (Figure 3.1).
The current state of the art is full of AI approaches that face the same limitation observed in these two examples. They obtain very good performance in a specific narrow domain but fail when the system scales up or when it is applied to a different problem. Current multimedia information retrieval systems are exactly in this situation: contents belonging to specific contexts can be successfully managed, but they have strong limitations in flexibility and scalability.
3.1 Domain characterization for CBIR
The importance of semantic context is very well known in Content Based Image Retrieval (CBIR) [SF91, TS01]. This is especially relevant for broad-domain, data-intensive multimedia retrieval activities such as TV production and marketing, or large-scale Earth observation archive navigation and exploitation. Most modeling approaches rely on local low-level features based on shape, texture, color, etc. The drawback of these methods is that the characterization of the context requires prior contextual information, introducing a chicken-and-egg problem [TMF10]. A possible approach to reduce this dependency involves the exploitation of global image context characterization for semantic domain inference. This prior information on scene context could represent a valuable asset in computer vision for purposes ranging from regularization to the pre-selection of local primitive feature extractors [SWS+00].
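A minimal sketch of this idea — a cheap global descriptor inferring a scene context before any expensive local extractors are committed — might look as follows. The contexts, thresholds and extractor names are assumptions made purely for illustration.

```python
# Use a global colour statistic to pre-select local feature extractors.
# Contexts, decision rules and extractor names are illustrative only.

def global_descriptor(pixels):
    """Mean R, G, B over the image: a crude global context signature."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def infer_context(mean_rgb):
    r, g, b = mean_rgb
    if b > r and b > g:
        return "maritime"        # blue dominates: sea / sky scenes
    if g > r and g > b:
        return "vegetation"
    return "urban"

# Local extractors worth running per inferred context (hypothetical names)
EXTRACTORS = {
    "maritime": ["wave_texture", "horizon_line"],
    "vegetation": ["leaf_texture", "green_index"],
    "urban": ["edge_density", "line_segments"],
}

pixels = [(20, 40, 200), (30, 60, 180), (10, 50, 220)]   # mostly blue image
context = infer_context(global_descriptor(pixels))
selected = EXTRACTORS[context]
```

The point is the ordering of costs: the global pass is O(pixels) and narrows the set of local extractors before the chicken-and-egg dependency on prior context would otherwise bite.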
3.1.1 Broadcasting
The broadcasting sector has experienced a deep transformation with the introduction of digital technologies. All internal workflows have been affected by the fact that content is represented digitally. Regarding Multimedia Asset Management (MAM) systems, before content was digital, all assets were centralized and managed by documentalists/librarians: professionals who, following a rigid taxonomy, were responsible for annotating, storing and retrieving the content. The workflow was therefore organized so that documentalists offered the content management service to editors. Since the digitalization of ingest and delivery processes, editors can directly and concurrently access the content they are looking for. This offers great advantages in terms of efficiency, allowing non-linear editing and minimizing access times. However, this new work style introduces many more inconsistencies in the metadata, since contents are concurrently annotated by users who do not strictly follow a given taxonomy. Moreover, in order to create direct search and retrieval services, content annotations must be richer and better, since editors do not have the knowledge of documentalists to browse among millions of assets. To obtain this improved metadata, manual annotation is too expensive in most cases, and automatic annotation systems are not able to characterize high-abstraction-level categories, especially due to the size and complexity of the broadcasting context.
From a technical point of view, there are many industrial solutions and standards for metadata (SMEF, BMF, Dublin Core, TV-Anytime, MPEG-7, SMPTE Descriptive Metadata, PBCore, MXF DMS-1, XMP, etc.) that offer good retrieval characteristics. However, all these technologies and specifications rely on a previously annotated dataset that, in most practical cases, cannot be populated at an affordable cost.
3.1.1.1 Alternative methods for massive content annotation
The explosion of prosumers and web video portals offers a new way of enriching content with metadata. Most of these platforms offer the possibility of leaving comments that can afterwards be used as annotations. However, these annotations are always unstructured and their confidence is much lower; therefore, they cannot be used directly as a source of metadata.
On the other hand, the speech processing tools that are nowadays being used to create subtitles offer another source of textual information that is very representative of the content. Using the audio channel to create metadata faces the same unstructured-text problem as user comments. However, it offers very rich and highly related text that fits very well with current text-based search engines.
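A minimal sketch of turning such transcript text into annotations usable by a text-based search engine is plain term-frequency keyword extraction. The stop-word list, the example transcript and the scoring below are deliberately naive assumptions, not a production pipeline.

```python
# Naive keyword extraction from an ASR transcript; illustrative only.
import re
from collections import Counter

# Tiny, deliberately incomplete stop-word list for the example
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
             "on", "at", "with", "as", "was"}

def keywords(transcript, top_n=3):
    """Return the top_n most frequent non-stop-words as annotations."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

tags = keywords(
    "The match at the stadium ended with a goal in the final minute, "
    "and the stadium erupted as the goal was replayed."
)
```

Even this crude extraction yields structured, indexable labels ("stadium", "goal", ...) from unstructured speech, which is exactly the fit with text-based search engines claimed above.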
3.1.2 Earth Observation, Meteorology
An extensive review of the state of the art of content-based retrieval in Earth Observation (EO) image archives is presented in Section 7.2. Compared with the broadcasting application field, EO archives deal with even bigger data volumes (approaching the zettabyte)1. The assets they contain are largely under-exploited: up to 95% of records have never been accessed, according to figures reported in conferences. The situation is exacerbated by the growing interest in, and availability of, metric and sub-metric resolution sensors, due to the ever-expanding data volumes and the extreme diversity of content in the imaged scenes at these scales. As in the broadcasting sector, interpreters who manually annotate archived content are expensive and tend to operate in applicative domains with stable,
1 The data volume for the EOC DIMS Archive in Oberpfaffenhofen is projected to be about 2 petabytes in 2013 (Christoph Reck, DLR-DFD, presentation during the ESA EOLib User Requirements workshop, ESRIN, November 17, 2011)
Figure 3.2: Idealized query process decomposition into processing modules and basic operations, based on an adaptation of Smeulders et al. [SWS+00].
well-formalized requirements rather than on the open-ended needs of the remote sensing community at large or of broad efforts like GEOSS [KYDN11].
Regarding the domain, the EO semantic space is much more focused than the one required for broadcasting content. In fact, domain-specific ontologies help to define concepts at a finer granularity. For specific uses, such as disaster management in coastal areas, ontologies for Landsat1 and MODIS2 imagery based on the Anderson classification system [And76] have been developed. However, the semantic gap with the huge amount of data still remains an issue for automatically populating these specific ontologies. A general decomposition of a theoretical query process is depicted in Figure 3.2.
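The finer granularity such land-cover ontologies provide can be sketched as a two-level class lookup. The class tree below is a simplified stand-in invented for illustration; it does not reproduce the actual Anderson scheme or its codes.

```python
# Simplified two-level land-cover hierarchy, standing in for
# Anderson-style classification systems; the classes are illustrative.

LAND_COVER = {
    "Water": ["Lakes", "Streams", "Bays"],
    "Forest Land": ["Deciduous", "Evergreen", "Mixed"],
    "Urban or Built-up": ["Residential", "Industrial", "Transportation"],
}

def refine(level1_class, spectral_hint):
    """Map a coarse Level I class plus a hint to a Level II label."""
    for child in LAND_COVER[level1_class]:
        if spectral_hint.lower() in child.lower():
            return f"{level1_class} / {child}"
    return level1_class          # no finer match: stay at Level I

label = refine("Forest Land", "evergreen")
```

The graceful fallback to the coarse class is the practically useful property: an automatic annotator can always emit a Level I label even when the finer semantics cannot be filled in, which is exactly the gap described above.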
1 http://landsat.gsfc.nasa.gov/
2 http://modis.gsfc.nasa.gov/
A particularity of the EO domain is the diversity of the types of data provided by the instruments installed on a satellite, most of which are affected by noise and distortions produced by the distance, the atmosphere, etc.
Envisat (Environmental Satellite), launched in 2002 and operated by ESA (the European Space Agency), includes the following instruments1 (Figure 3.3):
ASAR: Advanced Synthetic Aperture Radar, operating at C-band. ASAR ensures continuity with the image mode (SAR) and the wave mode of the ERS-1/2 AMI.
MERIS: a programmable, medium-spectral-resolution imaging spectrometer operating in the solar reflective spectral range. Fifteen spectral bands can be selected by ground command, each of which has a programmable width and a programmable location in the 390 nm to 1040 nm spectral range.
AATSR: Advanced Along Track Scanning Radiometer, providing continuity of the ATSR-1 and ATSR-2 data sets of precise sea surface temperature (SST), with levels of accuracy of 0.3 K or better.
RA-2: Radar Altimeter 2, an instrument for determining the two-way delay of the radar echo from the Earth's surface to a very high precision: less than a nanosecond. It also measures the power and the shape of the reflected radar pulses.
MWR: a microwave radiometer for the measurement of the integrated atmospheric water vapour column and cloud liquid water content, as correction terms for the radar altimeter signal. In addition, MWR measurement data are useful for the determination of surface emissivity and soil moisture over land, for surface energy budget investigations to support atmospheric studies, and for ice characterization.
GOMOS: measures atmospheric constituents by spectral analysis of the spec-
tral bands between 250 nm to 675 nm, 756 nm to 773 nm, a