A framework for content based semantic information extraction from multimedia contents
8/9/2019 A framework for content based semantic information extraction from multimedia contents
COMPUTER SCIENCE FACULTY. COMPUTER SCIENCE AND ARTIFICIAL
INTELLIGENCE DEPARTMENT
A FRAMEWORK FOR CONTENT BASED
SEMANTIC INFORMATION EXTRACTION
FROM MULTIMEDIA CONTENTS
A thesis submitted in fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY by:
Igor García Olaizola
Supervised by Prof. Basilio Sierra Araujo
&
Dr. Julián Flórez Esnal
The doctoral candidate        The supervisor        The supervisor
Donostia-San Sebastián, Wednesday 11th September, 2013
A Framework for Content Based Semantic Information Extraction from Multimedia
Contents
Author: Igor García Olaizola
Advisor: Basilio Sierra Araujo
Advisor: Julián Flórez Esnal
SVN Version Control Data:
Date: 2013-09-11 01:13:04 +0200 (Wed, 11 Sep 2013)
Author: iolaizola
Revision: 80M
DRAFT 1.7
The following web-page address contains up-to-date information about this dissertation and related topics:
http://www.vicomtech.org/
Text printed in Donostia-San Sebastián
First edition, September 2013
For you, dad.
Abstract
One of the main characteristics of the new digital era is the media
big bang, in which images (still images or moving pictures) are one of
the main types of data. Moreover, this trend keeps growing, pushed
mainly by the ease of capture offered by new mobile devices that
include one or more cameras.
From a professional perspective, most content-related sectors face
two main problems when operating efficient content management
systems: a) the need for new technologies to store, process and
retrieve huge and continuously growing datasets, and b) the lack of
effective methods for the automatic analysis and characterization of
unannotated media.
More specifically, the audiovisual and broadcasting sector, which is
experiencing a radical transformation towards a fully Internet-convergent
ecosystem, requires content based search and retrieval systems to
browse huge distributed datasets and to include content from
different and heterogeneous sources.
On the other hand, Earth observation technologies are improving the
quantity and quality of the sensors installed in new satellites. This
implies a much higher input data flow that must be stored and
processed.
In general terms, the aforementioned sectors and many other
media-related activities are good examples of the Big Data phenomenon,
where one of the main problems is the semantic gap: the inability
to transform the mathematical descriptors obtained by image
processing algorithms into concepts that humans can naturally
understand.
This dissertation presents an overview of applied research activity
carried out along different R&D projects related to computer vision
and multimedia content management. One of the main outcomes of this
research activity is the Mandragora model, an architecture designed
to minimize the semantic gap and to create automatic annotations
based on a previously defined ontology.
Resumen
One of the main characteristics of the new digital era is the great
explosion produced around multimedia content, where images (both
still and moving) are the main type of data. Moreover, this trend
keeps growing, mainly due to the ease of capture offered by mobile
devices that include one or more cameras.
Regarding the different professional sectors related to digital
media, this enormous data growth is causing two main problems. On
the one hand, new technologies are required that allow the storage,
processing and retrieval of content in an effective way over huge
and growing datasets. On the other hand, automatic methods for the
analysis and characterization of previously unannotated content are
also needed.
More specifically, we can highlight the audiovisual sector, which is
undergoing a deep transformation driven mainly by the process of
convergence with the Internet. In this situation, content search and
retrieval systems that allow browsing massive datasets, increasingly
distributed and of increasingly heterogeneous origin, become more
and more necessary.
In the case of Earth observation, data acquisition systems are
increasingly numerous and precise, a fact that generates ever larger
content flows that must be continuously processed and stored.
In general, the previously mentioned sectors and other activities
related to multimedia content are clear examples of the Big Data
phenomenon, where one of the main problems consists in eliminating
the semantic gap. We can define the semantic gap as the so-far
unbridged difference between the mathematical concepts derived from
the different image processing techniques and the concepts that
humans use to describe the same content.
This thesis presents a review of the applied research activity
carried out through several projects related to computer vision and
multimedia content management. One of the main results of this
research activity has been the Mandragora model, an architecture
design aimed at minimizing the semantic gap and creating automatic
annotations based on a previously defined ontology.
Since one of the main problems faced by the implementation of
Mandragora is the fact that the lack of prior knowledge about the
content limits the initial analysis, we have proposed a new method
(DITEC) for the semantic characterization of images. The good
results obtained in the experimental tests have led to an adaptation
of the original method, based on a global descriptor, so that a
variant of that global descriptor becomes effective as a local
descriptor. This document also describes the local DITEC variant,
whose experimental results (even with an implementation still under
development) have shown highly competitive behavior when compared
with the most popular local descriptors in the scientific
literature.
Laburpena
One of the main characteristics of the new digital era is the big
bang, or unrestrained proliferation, of media content, with images
(both photographs and videos) as the main content type. This trend,
moreover, keeps growing, helped above all by the ease with which
mobile devices carrying one or two cameras can capture content.
From a professional point of view, content-related sectors face two
main problems when it comes to efficient content management. On the
one hand, new technologies are needed to store, process and retrieve
these enormous amounts of data. On the other hand, effective methods
for the automatic analysis and characterization of previously
unannotated content are yet to be developed.
More specifically, the audiovisual and broadcasting sector, in the
midst of the deep transformation driven by its ongoing convergence
with the Internet, is managing larger amounts of content than ever
before. Furthermore, content shows an increasingly diverse origin
and heterogeneous nature, severely reducing the effectiveness of
systems that worked with the centralized and rigid models of the
past. For this reason, the development of flexible systems capable
of analyzing the content itself in order to perform effective
searches over these data is absolutely necessary.
On the other hand, Earth observation technologies use ever more
numerous and precise instrumentation in the new generations of
satellites. As a result, observation systems emit ever larger data
flows, and the need to store and process all that content is
increasingly hard to meet.
In general, the sectors mentioned above and many other
multimedia-related activities are clear examples of the Big Data
phenomenon, in which the semantic gap, that is, the inability to
turn the mathematical features extracted by image processing
algorithms into concepts that we humans can understand, has become
one of the main problems.
In this thesis document we present the overall results obtained in
several applied research projects related to computer vision. One of
the main results is the Mandragora architecture. The main goal of
Mandragora is to create an automatic image annotation system based
on an ontology in order to reduce the semantic gap.
Since one of Mandragora's main problems is having to perform the
first processing blindly, without any initial knowledge, we present
a new method for characterizing the semantic domain of images,
called DITEC. In view of the good results obtained in the
experimental tests, we have worked on adapting the global descriptor
at the core of the DITEC method for local use. The local DITEC
feature is therefore also described in this document. Although the
implementation of the method is still under development, the
experimental results obtained have been very good compared with the
best-known local descriptors in the scientific literature.
Acknowledgements
This is not the story of a self-made man. Instead, all the achievements
presented in this work have a long chain behind them, a chain made of
people who have supported my entire professional career, something
that cannot be separated from personal experiences. At this point, it
is worth acknowledging all these people.
In this sense, both of my supervisors, Basilio Sierra and Julián
Flórez, have been an essential part of this work, with an unconditional
commitment and highly valuable scientific guidance. Without a doubt,
Julián, it can be said that I took the first steps of my professional
path with you. The trust and support I felt from you from the
beginning have not diminished over a whole decade, and that is saying
something. What I have learned from you in all this time is one of
the main foundations of my daily work, and it is reflected in this
work as well. On the other hand, when it came to choosing the right
teacher at university, I was enormously lucky to meet Basi. From the
start he has known how to adapt to the interruptions and lack of
continuity caused by my professional situation. I would say he has
provided superb advice and scientific direction, and in a pleasant
and enthusiastic manner, turning what could have been a hard and
heavy process into work done with pleasure.
Within Vicomtech, the environment in which most of my professional
activity has taken place and in which this thesis is framed, I have
counted on countless supporters. I will surely leave someone
unmentioned (my apologies from here), but I still want to cite some
of them, such as Jorge Posada, Deputy Director, who has supported me
at every moment with encouragement and practical advice that comes
in very handy when one focuses too much on one's own problem. Amalia
and David, companions in hardship who showed me that it is indeed
possible to complete a doctoral thesis while working at a technology
centre. Shabs, this interesting man who always shows me that things
might have a non-obvious point of view that is worth observing.
Edurne, a colleague and friend who makes difficult things easy and
smooths the way for me time and again. Of course, the Digital TV and
Multimedia Services department, of which I am part and from which I
have felt enormous support throughout the whole process, deserves a
special mention. I truly hope to be able to give back in the same
proportion.
This work has been carried out in strong collaboration with some
colleagues who deserve a special mention. Marco Quartulli, a
Renaissance scientist with whom I play and learn every day. Naiara
Aginako, a companion on this road since the day we started working
with images, as rigorous at work as she is kind in person; I am
looking forward to your turn. Gorka, whose thesis marked the
starting point of this work. How many interesting discussions that
led to good ideas or... to more discussions :-) I hope to keep
enjoying your counterpoint. And of course Iñigo Barandiaran, a
meticulous, creative researcher with a taste for work well done who
remains an example for me. Of the countless hours we have spent
together in this process, there has not been a minute I have not
enjoyed. I am excited to know that our work still requires
improvements and more research, because that will be the way for us
to keep collaborating. In the end it looks like we are going to have
that beer!
Looking back a little, I cannot forget other old friends who,
although in a more distant way, have been fundamental for this work
to be done. Haritz, mein Bruder, so many hours together since we
started our degree, so many projects and discussions, always ready
to help each other. Then a year-long adventure together around
Wichernstrasse. Whatever kind of engineer I am, you certainly had a
say in it. Jaizki, an old friend from university days and the other
vertex of the Zarautz triangle in our German adventure. As before,
you have been no small help now as well. Said and done: no sooner
had I asked for corrections to the article than they arrived, and
with great precision too. Offering help with words is easy; you have
shown it with deeds.
I am one of those who believe that, in order to work wholeheartedly
as a professional, achieving personal balance is absolutely
necessary, and for that, sharing life with you and finding it a
wonderful thing to start each morning is something I owe to you,
Myriam, even if sometimes you need a crane for me.
Because they were, we are, and because we are, they will be. Naroa
and Maddi, you are my real project; you may be small, but next to
you everything else looks small to me.
As I said, because they were, we are, and if I am what I am, I owe
it to my family and above all to my parents (at least the good in
me). Uncle Joxe, a model of the basic principles in my life; after
so many years they have not changed. Mom, you taught me the value of
effort, of taking pleasure in work, of doing things properly. You
still teach me so every day. Dad, it is to you that I want to
dedicate the most important mention of this thesis work. Masterfully
combining ingenuity, imagination and knowledge, it was you who
steered me towards engineering. I try to follow your example: at
work and outside it, upright and sensible, trying to help those
around me and holding on to the right decisions with full coherence,
however hard they may be. Every day I try to follow with dignity the
path you showed so clearly in word and deed.
Thank you all
Igor García Olaizola
September 2013
Contents
List of Figures
List of Tables

I Work Description

1 Introduction
1.1 Context of this research activity
1.1.1 Vicomtech-IK4
1.1.1.1 Relation with other Vicomtech-IK4 PhD processes
1.1.2 Computer Science and Artificial Intelligence Department of the Computer Engineering Faculty
1.2 R&D Projects
1.2.1 Begira
1.2.1.1 Summary
1.2.1.2 Conclusions
1.2.2 Skeye
1.2.2.1 Summary
1.2.2.2 Conclusions
1.2.3 SiRA
1.2.3.1 Summary
1.2.3.2 Conclusions
1.2.4 SIAM
1.2.4.1 Summary
1.2.4.2 Conclusions
1.2.5 Cantata
1.2.5.1 Summary
1.2.5.2 Conclusions
1.2.6 RUSHES
1.2.6.1 Summary
1.2.6.2 Conclusions
1.2.7 Grafema
1.2.7.1 Summary
1.2.7.2 Conclusions
1.2.8 IQCBM
1.2.8.1 Summary
1.2.8.2 Conclusions
1.2.9 Relationship of projects and scientific activity

2 Computer Vision from a Cognitive Point of View
2.1 Mandragora Framework
2.2 Image Processing and AI Approach

3 Domain Identification
3.1 Domain characterization for CBIR
3.1.1 Broadcasting
3.1.1.1 Alternative methods for massive content annotation
3.1.2 Earth Observation, Meteorology
3.1.2.1 Meteorology
3.2 Local features vs. global features in domain identification

4 Proposed Method: DITEC
4.1 Introduction
4.2 General description of the DITEC method
4.2.1 Sensor modeling
4.2.2 Data transformation
4.2.2.1 Functionals
4.2.2.2 Geometrical constraints
4.2.2.3 Quantization effects
4.2.3 Feature extraction
4.2.3.1 Statistical descriptors
4.2.3.2 Cauchy Distribution
4.2.4 Classification
4.2.4.1 Feature Subset Selection in Machine Learning
4.2.4.2 Attribute contribution analysis
4.3 Experimental results
4.3.1 Case study 1: Corel 1000 dataset
4.3.2 Case study 2: Geoeye satellite imagery
4.4 Computational complexity
4.4.1 Computational complexity of the trace transform
4.4.2 Computational complexity of attribute selection and classification
4.4.3 Scalability
4.5 Conclusion of the presented method
4.6 Modified DITEC as local descriptor
4.7 Implementation of DITEC as local descriptor
4.7.1 Trace Transformation
4.7.2 Feature Extraction
4.7.3 DITEC parameters
4.7.3.1 Angular and radial sampling
4.7.3.2 Effects of sampling in the computational cost
4.7.4 Experimental results
4.7.4.1 Geometric Transformations
4.7.4.2 Photometric Transformations
4.7.5 Current status of the local DITEC algorithm design

5 Main Contributions
5.1 Mandragora framework
5.2 DITEC method as global descriptor
5.3 DITEC feature space analysis
5.4 DITEC method as local descriptor
5.5 Other contributions

6 Conclusions and Future Work
6.1 Future work
6.1.1 Collaborative filtering, Big Data and Visual Analytics

II Patents & Publications

7 Publications
7.1 Weather analysis system based on sky images taken from the earth
7.2 A review on EO mining
7.3 Acc. Obj. Tracking and 3D Visualization for Sports Events TV Broadcast
7.4 DITEC: Experimental analysis of an image characterization method based on the trace transform
7.5 Image Analysis platform for data management in the meteorological domain
7.6 Architecture for semi-automatic multimedia analysis by hypothesis reinforcement
7.7 Trace transform based method for color image domain identification
7.8 On the Image Content of the ESA-EUSC-JRC Workshop on Image Information Mining
7.9 Author's other publications

8 Selected Patents
8.1 Method for detecting the point of impact of a ball in sports events
8.2 Author's Other Related Patents

III Appendix and Bibliography

A Consideration on the Implementation Aspects of the trace transform
A.1 Development platforms
A.2 Sampling

B Calculation of the clipping points in a circular region

Bibliography
List of Figures
1.1 Begira scene definition
1.2 Example of the cloud segmentation process
1.3 Rushes content analysis workflow
1.4 Grafema Assets
1.5 Grafema System Workflow
1.6 Grafema System Architecture
1.7 IQCBM System Architecture
1.8 Screenshots of the IQCBM user interface
1.9 Relationship between R&D projects and scientific activity in multimedia content analysis
2.1 Information Retrieval Reference Model
2.2 Mandragora Architecture
2.3 DIKW Pyramid
3.1 Watson DeepQA High-Level Architecture
3.2 Idealized query process decomposition on EO image mining
3.3 Envisat instruments
3.4 General architecture of the meteorological information management system
4.1 DITEC System workflow
4.2 Trace transform, geometrical representation
4.3 Trace transform contribution mask at very high resolution parameters (image resolution: 100x100 px; n = 1000, n = 1000, n = 5000)
4.4 Pixel relevance in the trace transform scanning process with different parameters (n, n, n); original image resolution 384x256
4.5 Trace Transform and subsequent Discrete Cosine Transform of Lenna (Y channel of YCbCr color space)
4.6 Conceptual scheme: DCT matrix transformation into (, k) pair vector
4.7 Statistical properties of all Kurtosis measurements made on the distributions obtained by processing the Corel 1000 dataset
4.8 Examples of probability density distributions and histograms obtained from the samples
4.9 Samples of the Corel 1000 dataset; the dataset includes 256x384 or 384x256 images
4.10 Distance among classes in the Corel 1000 dataset according to misclassified instances
4.11 Distance among the most inter-related classes in the Corel 1000 dataset according to misclassified instances
4.12 Corel 1000 picture corresponding to class Architecture and classified as Mountain
4.13 Corel 1000 precision results with different feature extraction algorithms
4.14 Samples of the satellite footage dataset; 256x256 px patches at different scales
4.15 Distance among classes in the Geoeye dataset according to misclassified instances
4.16 Time performance behavior
4.17 System workflow for DITEC as local feature
4.18 Matching accuracy depending on the number of angular samples
4.19 Matching accuracy depending on the number of radial samples
4.20 Matching accuracy depending on the simultaneous increase of angular and radial sampling
4.21 Computation time depending on the simultaneous increase of angular and radial sampling
4.22 In-plane Rotation Transformation matching results
4.23 Scale Transformation matching results
4.24 Projective Transformation matching results
4.25 Exposure change photometric Transformation matching results
4.26 Trace transform row and column analysis
A.1 DITEC development platform
A.2 Circular patch image
A.3 Result of (θ, ρ) space exploration with Bresenham
A.4 First half of the source image is sampled (blue regions) while areas around the vertical and horizontal axes are not considered
A.5 Second half of the source image is sampled (red and green); these regions are moved to the [π/4, 3π/4] and [5π/4, 7π/4] areas in order to be sampled with the Bresenham algorithm
A.6 Result of (θ, ρ) sampling with the Bresenham algorithm and a single image rotation
A.7 Result of (θ, ρ) pixelwise sampling with image rotation for each angular iteration
A.8 Result of different sampling strategies of the (θ, ρ) space
B.1 Scanline defined in terms of θ and ρ
List of Tables
4.1 List of Trace Transform functionals proposed in [KP01]
4.2 Quantization effects of the trace transform
4.3 Corel 1000 dataset confusion matrix
4.4 Geoeye dataset confusion matrix
Part I
Work Description
If our brains were simple enough
for us to understand them, we'd be
so simple that we couldn't.
Ian Stewart
CHAPTER 1
Introduction
Artificial Intelligence (AI) is probably one of the most exciting knowledge fields,
where even the definition of the term becomes controversial due to the manifold
understandings of intelligence, which remains a hard epistemological problem.
Learning, reasoning, understanding, abstract thought, planning, problem solving
and other related topics are all different aspects that imply intelligence.
The emergence of programmable digital computers in the late 1940s offered a
revolutionary way to experimentally explore new methods for formal reasoning
and logic. However, the initial great expectations of AI did not become reality,
and the prediction made by Herbert A. Simon1 that "machines will be capable,
within twenty years, of doing any work a man can do" still remains a science
fiction topic.
The fashions of AI over the years have moved from automated theorem proving
to expert systems, which were later substituted by behaviour-based robotics
and now seem to find the solution in learning from big data [Lev13]. All these
trends have not been able to meet the expectations that the founders of AI put
on the field [Wan11]. Patrick Winston (director of the MIT Artificial Intelligence
Laboratory from 1972 to 1997) cited the problem of mechanistic balkanization,
with research focusing on ever-narrower specialties such as neural networks or
genetic algorithms. When you dedicate your conferences to mechanisms, there's a
1 Herbert Alexander Simon (June 15, 1916 - February 9, 2001), awarded the ACM Turing Award
(1975) for making basic contributions to artificial intelligence, the psychology of human cognition,
and list processing, and considered one of the founders of AI.
1. INTRODUCTION
tendency to not work on fundamental problems, but rather [just] those problems
that the mechanisms can deal with [Cas11].
However, there has been great scientific and technological advance in many
AI-related domains (formal logic, reasoning, statistics and data mining, genetic
programming, knowledge representation, etc.) that, without satisfying the founda-
tions proposed by Winston or Chomsky, has enabled the creation of technological
solutions for different application fields such as natural language processing,
computer vision, drug design, medical diagnosis, genetics, finance and economy,
user recommendation systems and many others.
1.1 Context of this research activity
The research activity described in this dissertation has mainly been carried out
within the applied research perspective given by both basic and applied research
projects developed at Vicomtech-IK41. Vicomtech-IK4 is an applied research
center located in San Sebastian (Basque Country, Spain) that combines excellence
in pure basic research with its application and transfer to industry. In this
sense, some of these projects have been transferred to industry and their
intellectual property has been protected through patents. In other cases, the
scientific progress achieved within the projects has been published in journals
or conferences.
Knowledge of market needs, the technological state of the art and real
integration (which introduces many constraints coming from the real world) is
combined with the scientific method and basic research activities in collabora-
tion with universities. In this case, the collaboration with the University of the
Basque Country2, and more specifically with the Computer Science and Artificial
Intelligence Department of the Computer Engineering Faculty, has been a key
element for the balanced applied/scientific progress of the research work.
1.1.1 Vicomtech-IK4
Vicomtech-IK4, as an applied research centre, is focused on all aspects related
to multimedia and visual communication technologies along the entire content
production pipeline: from generation, through processing and transmission, to
1 http://www.vicomtech.org
2 http://www.ehu.es
rendering, interaction and reproduction. Vicomtech-IK4 is structured in six de-
partments that offer different views and specific technological solutions around
the aforementioned research activity. These departments are the following:
Digital Television and Multimedia Services
Speech and Natural Language Technologies
eTourism and Cultural Heritage
Intelligent Transport Systems and Engineering
3D Animation and Interactive Virtual Environments
eHealth and Biomedical Applications
The research described in this dissertation has been carried out within the
Digital TV & Multimedia Services department, in strong collaboration with the
department of eHealth and Biomedical Applications. In fact, the problem of
computer vision and multimedia content understanding is one of the main re-
search lines of Vicomtech-IK4. In this case, one of the main problems addressed
in this dissertation is aligned with one of the main difficulties of current multime-
dia management systems in diverse sectors such as broadcasting, remote sensing,
medical imaging, etc.: the huge amount of unannotated data and extremely
broad domains that cannot be explicitly defined.
1.1.1.1 Relation with other Vicomtech-IK4 PhD processes
The research and technological activity performed in this work has been carried
out in strong collaboration with two other PhD processes at Vicomtech-IK4.
These two works analyze and develop other aspects related to the analysis
and understanding of multimedia. More specifically, Marcos [Mar11] studied the
multimedia retrieval problem from a semantic point of view, creating a semantic-
middleware-based approach as an intermediary layer between users' high-level
queries and the system's low-level annotations. Some of the final considerations of
the work carried out by Marcos, and the requirements identified as future work to
feed this semantic middleware from a bottom-up approach, are the basis of the
initial context of this dissertation.
On the other hand, Barandiaran [Bar13] focused his work on the analysis of
local descriptors. The collaboration with Barandiaran has resulted in a local
adaptation of the global descriptor, one of the main contributions proposed
in this dissertation (see Section 4.6). This novel local feature has demonstrated
highly robust characteristics as a local descriptor.
1.1.2 Computer Science and Artificial Intelligence Department
of the Computer Engineering Faculty
The Robotics and Autonomous Systems Group, which belongs to the Computer
Science and Artificial Intelligence Department of the Computer Engineering
Faculty, is very active in two main areas:
Mobile Robotics
Behavior-based control architectures for mobile robots.
Bio-inspired robot navigation.
Use of visual information for navigation.
Machine Learning
Dynamic learning mechanisms
Classifier combination
New paradigms for supervised classification
Optimization problems
The deep knowledge of this group in machine learning science and tech-
nologies has provided the scientific foundations for the more technological work
developed at Vicomtech-IK4. This combination provides a high-potential context
for scientific research.
1.2 R&D Projects
This dissertation work has been carried out based on R&D projects with common
underlying scientific needs and customer-specific requirements. The knowledge
and experience acquired during these projects has driven the general framework
presented as one of the main contributions of this work.
1.2.1 Begira
Title: Diseño y Desarrollo de un Sistema de Seguimiento Preciso de Objetos
en Transmisiones Deportivas (Design and development of a high-accuracy
object tracking system for sports broadcasting).
Project typology: Industrial project partially supported by the Gaitek pro-
gramme.
Company name: G93.
Period: 2005-2009.
1.2.1.1 Summary
Augmented reality projects require a deep knowledge of the scene, which has to
be extracted/updated in real time. In order to ensure the accuracy and real-time
performance of the system, the knowledge must be explicitly defined.
The goal of the Begira project was to develop a single-camera system to track
the ball trajectory and locate the bouncing point for Basque Pelota live TV
transmissions. The main constraints of the system were:
Single camera.
Broadcasting camera (720p@50).
Tracking, positioning and virtual reconstruction under 20 seconds.
Single standard computer for processing purposes.
From an Artificial Intelligence perspective, we can consider it a system
where the knowledge domain is reduced to a single scene (the Basque Pelota
court) and can thus be explicitly defined. The main elements that define this
domain are:
3D environment: A court composed of three planar surfaces (front wall, side wall
and ground).
The relative position of the camera to the court is obtained during a
calibration process by putting a checkerboard on the ground.
Once the camera is calibrated, its position is fixed during the entire
match.
Dynamic objects: There are only 2 types of dynamic objects in the scene:
Players: There can be two or four players. They are much bigger than the
ball and most of the time their lowest part is touching the ground.
Ball: It is white, round and much smaller than the players. Sudden trajec-
tory changes are due to a hit by the players or a bounce. The ball is
so rigid that the bounce can be considered elastic.
According to the domain defined with the aforementioned concepts, a homog-
raphy matrix H is calculated to obtain the camera's extrinsic parameters. Then the
[Two-panel figure, (a) and (b), showing the camera origin, the image point
(xi, yi), R and the Z = 0 plane origin.]
Figure 1.1: Scene definition: ball trajectory samples used to estimate the paramet-
ric curves, and the calculation of the bouncing point on the ground once the center
position of the ball is obtained (crossing point of the two curves).
ball is initially detected and the tracking system follows its trajectory. Abrupt tra-
jectory changes define the limit between the instants before and after the bounce.
Once the two parametric curves are estimated, their crossing point is calculated
on the image. This two-dimensional position (in pixels) is then converted to 3D
space using the inverse of the homography matrix (H^-1). To solve the uncer-
tainty of the 3D position obtained from the 2D projection, the condition Z = 0 is
established for the bouncing point. More details of the project can be found in
Section 7.3.
1.2.1.2 Conclusions
The Begira project is a good example of expert systems applied to image process-
ing and computer vision. The technical goals were successfully achieved and the
results of the project were exploited by the Basque public broadcaster ETB and
the TV content producer G93. However, the knowledge acquired by the system
was so hardcoded that it is very difficult to extend it or integrate it into other,
more general solutions. The good performance and accuracy results rely on its
reduced domain definition and rigid nature.
1.2.2 Skeye
Title: Sistema de análisis meteorológico basado en imágenes del cielo
tomadas desde tierra (Meteorological analysis system based on sky images
taken from the ground).
Project typology: Industrial project supported by the Gaitek programme.
Company name: Dominion.
Period: 2007-2008.
1.2.2.1 Summary
Meteorological stations provide multiple sensor data as well as some more sub-
jective information such as the cloudiness. The goal of Skeye was to provide
an automatic system to accurately estimate the cloudiness factor, avoiding any
human intervention.
As mentioned for Begira, the semantic domain was small and quite straight-
forward to model. The four classes that compose the domain are: cloud, sun, blue
sky and earth. The project was carried out by analyzing the features that were
characteristic of each class, and the scene was defined in terms of a dome with
normalized illumination conditions.
Figure 1.2: Example of the cloud segmentation process (panels (a) and (b)).
1.2.2.2 Conclusions
Similarly to Begira, in this case the feature extraction process provided all the
information needed for a further class assignment by applying specific thresholds.
However, the further integration of the developed system in other domains or
scenes would be a difficult task, since all the development and the selected features
totally depend on the domain definition and scene conditions. More information
about this work can be found in Section 7.1.
1.2.3 SiRA
Title: Diseño y Desarrollo de un Sistema de Reconocimiento de Marcas
Comerciales en Emisiones Televisivas (Design and development of a system
for commercial brand recognition in TV broadcasts).
Project typology: Industrial project supported by the Gaitek programme.
Company name: Vilau.
Period: 2007-2008.
1.2.3.1 Summary
This project is another example of a system based on a reduced semantic domain,
but in this case the approach was more general and some higher-abstraction-level
elements were introduced. The goal of SiRA was to detect logos in TV content in
order to automate advertisement monitoring tasks. This project was also sup-
ported by the Basque Government, and its industrial application was envisioned
by Vilau, a media communication company.
In this case, the constraints in terms of real-time behavior and equipment
were lighter than in Begira. However, the domain was broader: any type of logo
embedded in any type of content, taken from different perspectives.
The approach followed in this case was to first detect a logo candidate, as-
suming that a logo would typically be surrounded by a regular shape (square,
circle, triangle, etc.) and composed of very few colors. Once the logo was detected,
different feature extraction algorithms could be applied in order to compare the
results with the features corresponding to the target logo dataset. Depending
on the extracted features, different distance metrics were applied.
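The "very few colors" heuristic for candidate detection can be sketched as follows; the quantization depth and color-count threshold are hypothetical illustration values, and the function name is this sketch's own rather than SiRA's:

```python
import numpy as np

def few_color_candidate(patch, max_colors=8, bits=3):
    """Flag a region as a potential logo if, after coarse color
    quantization, it contains very few distinct colors.

    `patch` is an HxWx3 uint8 RGB array; `bits` kept per channel and
    `max_colors` are illustrative values, not SiRA's actual parameters.
    """
    q = (patch >> (8 - bits)).reshape(-1, 3)        # drop the low bits
    n_colors = len({tuple(c) for c in q})           # count distinct colors
    return n_colors <= max_colors
```

A flat single-color patch passes the test, while a noisy natural-image patch, which spreads over many quantized colors, does not.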
1.2.3.2 Conclusions
The results of SiRA can be integrated as a new feature in other content analysis
systems. In this case, SiRA would provide information about potential logos exist-
ing in a specific video or still image. Moreover, even if the process itself is carried
out by using low-level operators, the result of SiRA can be considered a set of
high-level features with valuable semantic content, as in general terms the
presence of a logo means that there is a product or an advertisement related to it.
1.2.4 SIAM
Title: Diseño y Desarrollo de un Sistema de Análisis Multimedia de Con-
tenido Audiovisual en Plataformas Web Colaborativas (Design and devel-
opment of a system for multimedia analysis of audiovisual content in
collaborative web platforms).
Project typology: Industrial project supported by the Gaitek programme.
Company name: Hispavista (http://hispavista.com/).
Period: 2009-2010.
1.2.4.1 Summary
The first ideas of this work related to the semantic analysis of multimedia content
were developed in SIAM. The goal of this project was to create content analy-
sis tools to improve the exploitation of large amounts of user-generated content.
The context of the project was www.tu.tv, a YouTube-like video sharing platform
owned by Hispavista. According to this approach, semantic labels can be ob-
tained from unstructured user comments. Then, by finding similar contents, new
untagged content can be assigned to a previous label.
As the content analyzed in SIAM could be any kind of video, the semantic
domain was too broad and complex to be defined, and one of the main prob-
lems was the definition of a semantic unit in a video. The assumption of a video
as a semantic unit is too inconsistent in many cases, as the elements in it can
change over time. Therefore, each video was decomposed into shots, and
each shot was analyzed and labeled. Finally, the entire video would be labeled as
the composition of each shot label.
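The shot decomposition step can be sketched with a standard histogram-difference detector; the bin count and threshold below are illustrative choices of this sketch, not the parameters of the detector actually developed in SIAM:

```python
import numpy as np

def shot_boundaries(frames, threshold=0.5):
    """Return the frame indices where a new shot is assumed to start.

    `frames` is a sequence of 2-D grayscale arrays with values in
    [0, 255]. A boundary is flagged when the total variation distance
    between consecutive frame histograms exceeds `threshold`.
    """
    boundaries, prev = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=64, range=(0, 256))
        hist = hist / hist.sum()                  # normalize to a distribution
        if prev is not None and np.abs(hist - prev).sum() / 2 > threshold:
            boundaries.append(i)                  # abrupt change: new shot
        prev = hist
    return boundaries
```

For instance, a sequence of three dark frames followed by three bright frames yields a single boundary at the fourth frame.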
1.2.4.2 Conclusions
The main outcome of SIAM was the shot-based content analysis model and a shot
boundary detector that has later been used for semantic analysis purposes. More-
over, the potential of user-generated metadata was addressed in this project. We
identified the potential of this amount of unstructured data, which could be com-
plementary to the perfectly organized but expensive-to-populate professional
taxonomies.
1.2.5 Cantata
Title: Content Aware Networked systems Towards Advanced and Tailored
Assistance.
Project typology: ITEA.
Period: 2007-2009.
Consortium: Bosch Security Systems, Philips Electronics Netherlands, Philips
Medical Systems, Philips Consumer Electronics, TU/e, TU Delft, Multitel,
ACIC, Barco, Traficon, VTT, Solid, Hantro, Capacity Networks, I&IMS, Tele-
fonica, Vicomtech-IK4, University Pompeu Fabra, CRP Henri Tudor, Co-
dasystem, Kingston University, University of York, INRIA.
1.2.5.1 Summary
The goal of Cantata was to create a distributed service for content analysis. The ap-
plication field included medical imaging, entertainment and security. Our activity
was focused on the entertainment sector, where the content analysis modules were
connected to user profiles in order to create content recommendation systems.
In this case, the logo detection system was used to provide content informa-
tion to the main content analysis and recommendation system.
1.2.5.2 Conclusions
The recommendation system was intended to combine user activity information,
content metadata and low-level-feature-based information. However, the broad
domain definition required an unaffordable amount of low-level descriptors, and
even the combination of all these descriptors would be a very complex issue. Due
to this complexity, most recommendation systems rely basically on metadata.
1.2.6 RUSHES
Title: Retrieval of mUltimedia Semantic units for enHanced rEuSability.
Project typology: FP6-2005-IST-6.
Period: 2007-2009.
Consortium: Heinrich-Hertz-Institut (DE), University of Surrey (UK),
Athens Technology Centre (GR), Vicomtech (ES), Queen Mary University of
London (UK), Telefonica I+D (ES), FAST Search & Transfer (NO), University
of Brescia (IT), ETB (ES).
1.2.6.1 Summary
The overall aim of RUSHES was to design, implement, validate and trial a system
for both delivery of and access to raw media material (rushes), and the reuse of that
content in the production of new multimedia assets, following a multimodal ap-
proach that combines knowledge extracted from the original content, metadata,
visualization and highly interactive browsing functionality.
The core research issues of RUSHES were:
Knowledge extraction for semantic inference
Automatic semantic-based content annotation
Scalable multimedia cataloging
Interactive navigation over distributed databases
Non-linear querying and retrieval techniques using hierarchic descriptors.
Figure 1.3: Rushes content analysis workflow
1.2.6.2 Conclusions
The RUSHES consortium tried to address the semantic gap by creating a powerful
architecture composed of low-level operators. The workflow designed in Rushes
(Figure 1.3) was able to combine multiple low-level features and multiple types of sources
(video, audio, text). Moreover, the shot was considered the semantic unit of a
video. Due to the fact that different shot boundary operators provide different
shots, extra complexity was added to the metadata model, where each feature
could define its own temporal boundaries.
All the low-level operators were applied to every content item in the database. This
fact introduced a strong limitation on the scalability of the domain. In order to
identify new concepts, more low-level operators might be needed, and as the
dimensionality of the feature space increased, the system became both com-
putationally too demanding and unaffordable for the data mining and ontology
management processes. We presented a potential solution to this problem in
[OMK+09] by splitting the domain into sub-domains that only apply those low-
level feature extraction operators suggested by the domain definition (ontology).
However, this requires prior knowledge of the content, which should itself be
obtained by applying low-level operators. This chicken-and-egg problem will be
one of the key topics of this research work.
1.2.7 Grafema
Title: Grafema: Multimodal content search platform.
Project typology: Basic research project.
Period: 2012.
1.2.7.1 Summary
The goal of the Grafema project was to create a base platform to store, annotate
and retrieve multimedia content of diverse nature. Rather than focusing on the
algorithms to obtain content descriptors or methods for automatic content an-
notation, Grafema was focused on the architectural aspects and the design of a
generic solution to deal with different types of content. In this sense, an asset
could be text, image, audio, video, 3D, or even a combination of these elementary
units. According to this generic description of a digital asset, similarity metrics
must also adapt to each case or combination. As can be observed in Figure 1.4,
assets containing the label tiger can be considered similar if they include this
information in the metadata or if this label is found in any of the elementary
units that compose the content.
The workflow designed for Grafema (Figure 1.5) is based on low-level oper-
ators that are processed independently. The information obtained from these
Figure 1.4: Grafema Assets
operators is then ported to a higher level of abstraction by using data mining
techniques. The obtained information is then introduced into a semantic model
and stored in a database. The similarity of two assets can then be computed ac-
cording to this semantic model, but it is not limited to this metric. An iterative
process starts and enables the calculation of similarity metrics between assets
of the same type that belong to different instances. This iterative process is the
basis of the Grafema architecture (Figure 1.6) and provides a new paradigm of
content search and retrieval, based more on a browsing process than on a pure
text-based search.
Figure 1.5: Grafema System Workflow
1.2.7.2 Conclusions
The results of Grafema have shown the big potential of iterative processes for mul-
timedia searching. Even if the tests have been carried out with datasets limited in
terms of size and domain complexity, the results show that text-based search can
be dramatically improved when datasets include high volumes of multimedia content.
Figure 1.6: Grafema System Architecture
Regarding the state of the art, the annotation and individual metrics, as well
as the unsuitability of most common database solutions for multimedia data, are
still the main drawbacks that limit the potential of these kinds of systems.
1.2.8 IQCBM
Title: Image Query by Compression Based Methods.
Project typology: Industrial project.
Period: 2011.
Consortium: DLR (German Aerospace Agency).
1.2.8.1 Summary
The goal of this project was to create low-level operators and define distance
metrics for satellite imagery that would be applied during the ingestion process of the
delivered streams. The main idea behind these operators was to gather prior char-
acteristic information that could be useful for later retrieval operations. The lack
of knowledge regarding the queries that might be applied during the retrieval pro-
cess made it difficult to define low-level features that were not focused on any
specific aspect.
The domain of Remote Sensing is not as broad as those related to the audio-
visual sector, but it is still too big and complex to be explicitly defined. Moreover,
new definitions and relationships could be dynamically introduced.
[Diagram text: pre-processing, analysis, indexing and query/ranking frameworks;
feature extractors (MPEG-7 VST, TIFF LZW, JPEG/MPEG DCT, JPEG 2000 wavelets);
distance measures (FCD, PRDC, random); MonetDB storage and execution with a
Django adapter and UI; codebook analysis of user classes, complexity and
nano-codebooks.]
Figure 1.7: IQCBM System Architecture
In order to address the lack of prior knowledge, global features were consid-
ered more adequate than local ones. The first algorithm implemented in this
project was based on the codewords provided by a Lempel-Ziv compressor, as sug-
gested by Watanabe et al. [WSS02]. The L0 distance (Equation 1.1) was used as
a metric for the codewords related to each element (in this case, an element is
represented by each of the patches obtained after a tiling process applied to the
multi-resolution satellite imagery).
d_L0 = Σ_{i=1}^{n} |x_i - y_i|^0, where 0^0 = 0 (1.1)
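In practice, Equation 1.1 simply counts the codeword positions at which two equal-length feature vectors differ, since |x_i - y_i|^0 is 1 for any nonzero difference and 0^0 is defined as 0. A minimal sketch (the function name is this sketch's own):

```python
def l0_distance(x, y):
    """L0 distance of Equation 1.1: the number of positions where the
    two equal-length codeword vectors differ (0^0 is taken as 0)."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)
```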
1.2.8.2 Conclusions
The developed system (Figure 1.8) was tested against the Corel 1000 dataset [Cor]
and a subset of the Geoeye imagery [Glo], and obtained good accuracy characteristics.
The length of the feature vector for each item was variable and was an attribute by
itself, as it provides a measure of the complexity of the image. However, in terms
of scalability, the average length (several thousands of codewords) obtained by
this algorithm might become a limitation.
Figure 1.8: Screenshots of the IQCBM user interface
As a result of this project, a deep study of the current trends of the community
was carried out [QGO13]. Moreover, a new global feature extraction algorithm
was developed based on the ideas of Kadyrov et al. [KP98, KP01, KP06].
1.2.9 Relationship of projects and scientific activity
Figure 1.9 shows a summary of the main scientific activity around the aforemen-
tioned projects. It can be observed that different projects share the same research
activities and scientific background. However, this activity sharing does not imply
the reusability of previous developments, as different semantic domains require
specific low-level features, metrics, mining techniques, etc.
In order to minimize these barriers and achieve higher practical reusability, a new
architecture is proposed in this work.
Figure 1.9: Relationship between R&D projects and scientific activity in multimedia
content analysis.
By three methods we may learn
wisdom: First, by reflection, which
is noblest; Second, by imitation,
which is easiest; and third, by ex-
perience, which is the bitterest.
Confucius
CHAPTER 2
Computer Vision from a Cognitive Point of View
The main approach for computer vision tasks has been based on the identifi-
cation of low-level features that can be used to segment, identify or determine
higher abstraction levels. Following the three learning methods stated by Confu-
cius in the previous quotation, we can say that this approach provides knowledge
to the system by imitation. By giving it explicit knowledge based on existing
datasets or contents, which have been used to allow researchers to understand the
relationship between the identification of real-world phenomena and specific
low-level features, the computer vision system just reproduces the experiments
with different datasets. This process offers high-performance results for narrow
domains, but it cannot be reused in other contexts and is not a scalable ap-
proach. The feature space created by these low-level operators easily gets too
complex for manual (and typically linear) thresholding techniques. For those
cases where the behavior of a set of low-level feature extractors is too complex
to model, data mining techniques are applied. In those cases, we can move to
the third level of Confucius' statement. The experience of dealing with this data
(training for supervised classification, and other specific metrics or criteria for
clustering) provides an adaptive behavior within the system to create the regions
or hyperplanes that best fit each specific problem.
The use of ontologies introduces a new way of adding formal explicit knowledge
to the system. This is typically carried out by establishing concepts and
2. COMPUTER VISION FROM A COGNITIVE POINT OF VIEW
Figure 2.1: Information Retrieval Reference Model [Mar11]
relationships among them, thereby defining a domain. One common use of
ontologies is to establish shared vocabularies and taxonomies among scientists
or professionals. However, from a cognitive system perspective, the most pow-
erful characteristic of ontologies is the capability of inference, which creates new
rules that were not explicitly defined. The main drawback of ontologies comes
from the fact that broad, complex domains, such as those related to common
vision understanding, cannot be specifically defined, mainly because of the size,
complexity and fuzziness of this kind of domain.
Content Based Image Retrieval (CBIR) systems can be considered one of the
branches of cognitive vision, since they require the four functionalities considered
the pillars of a cognitive vision system: detection, localization, recognition and
understanding [Ver06]. Marcos et al. propose a reference model that addresses the
use of ontologies for multimedia retrieval purposes [MIOF11]. This work presents
a reference model (Figure 2.1) based on a semantic middleware. The main goal
of this approach is to create a layer to deal with semantic functionalities (e.g.
knowledge extraction, semantic query expansion, etc.).
Marcos proposes in his PhD work [Mar11] the use of the semantic middleware
to automatically generate annotations of the multimedia assets. This approach,
initiated in the Rushes project by using a set of low-level features and applying
fuzzy reasoning to the information provided by those modules, offered good re-
sults for narrow domains, but the system was unable to deal with a big number
of different low-level features, and broad complex domains did not show good
performance. One of the main drawbacks of this architecture was the fact that all
low-level features were considered at the same level when no prior information
was given.
2.1 Mandragora Framework
In order to overcome this scalability drawback, we presented a novel architec-
ture called Mandragora [OMK+09]. This architecture enhances the metadata with
new labels that can be ported to the semantic layer by using a two-step itera-
tive approach. The implicit and explicit knowledge about a certain domain can
be introduced into the system with a combination of classifiers and the semantic
middleware. This combination allows the modeling of bigger and more complex
domains [SASK08] and reduces the semantic gap by connecting low-level features
with high-level hypotheses and reinforcement factors. The reinforcement factors
allow the dimensionality of the domain to be extended and provide the framework
for specific analysis methods.
The main idea behind this two-step approach is to break big domains into sub-domains that are more homogeneous, both semantically and in terms of low-level features. Then, specific feature extractors and semantic definitions can be used with much higher precision. One of the key aspects of this framework is the initial domain estimation: the hypothesis that will be considered by the next layer to launch domain-specific analyzers, which afterwards feed the semantic middleware. If the results of this second step confirm the characteristics of the estimated domain, the hypothesis is accepted and the elements identified in the content are considered descriptors of this specific asset. Otherwise, the process is restarted with a different hypothesis.
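The two-step process just described can be sketched as a hypothesis-and-verify loop. The sketch below is a minimal illustration only: the feature names, domains, weights and functions are invented assumptions and do not reproduce Mandragora's actual modules.

```python
# Illustrative hypothesis-and-verify loop for two-step domain estimation.
# All names, domains and weights here are hypothetical, not Mandragora's API.

def estimate_domain_hypotheses(global_features):
    """Step 1: rank candidate sub-domains from cheap global features."""
    scores = {
        "sports": global_features.get("motion", 0.0) * 0.7
                  + global_features.get("green_ratio", 0.0) * 0.3,
        "news": global_features.get("face_ratio", 0.0),
        "nature": global_features.get("green_ratio", 0.0),
    }
    # Best hypothesis first
    return sorted(scores, key=scores.get, reverse=True)

def run_specific_analyzers(domain, asset):
    """Step 2: domain-specific extractors return a confidence score."""
    analyzers = {
        "sports": lambda a: a.get("field_lines", 0.0),
        "news": lambda a: a.get("studio_layout", 0.0),
        "nature": lambda a: a.get("vegetation", 0.0),
    }
    return analyzers[domain](asset)

def annotate(asset, global_features, threshold=0.5):
    """Accept the first hypothesis confirmed by its own analyzers."""
    for domain in estimate_domain_hypotheses(global_features):
        confidence = run_specific_analyzers(domain, asset)
        if confidence >= threshold:
            return domain, confidence   # hypothesis accepted
    return None, 0.0                    # restart with no confirmed hypothesis

domain, conf = annotate(
    asset={"field_lines": 0.8, "vegetation": 0.2},
    global_features={"motion": 0.9, "green_ratio": 0.6},
)
```

The key point the sketch preserves is that the cheap global estimate only orders the hypotheses; acceptance depends on the domain-specific analyzers confirming them.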
2.2 Image Processing and AI Approach
From an Artificial Intelligence perspective, we can consider the climb up the DIKW Pyramid (Figure 2.3) as the process that our system has to follow from raw
Figure 2.2: Mandragora Architecture for automatic video annotation [OMK+09]
unannotated images to structured content with semantic information. The main issue is the semantic gap between the mathematical representation obtained by the developed operators and the high-abstraction-level concepts that are intended to be discovered by using such low-level features. Smeulders et al. define the semantic gap as: ". . . the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation". According to this definition, we can consider the semantic gap as the distance between data and wisdom in the DIKW pyramid.
The typical approaches are both top-down (ontologically driven approaches
that build domain definitions by creating relationships between high-level concepts) and bottom-up (automatic annotation or labeling approaches that try to discover correspondences between high-level annotations and automatically extracted features [HSL+06]). Both approaches can also be combined in the same process.
Figure 2.3: DIKW Pyramid
Most bottom-up approaches rely on data-mining techniques to move from the low-level mathematical representation to classes at a higher abstraction level. There is an enormous diversity of supervised and unsupervised classification or regression techniques, methods for feature space analysis, algorithms for attribute selection, etc. Thus, each specific problem requires the set of tools and algorithms that best suits its characteristics and requirements (type of attributes and classes, dimensionality, dataset size, computational cost, etc.).
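As a minimal illustration of such a bottom-up pipeline, the sketch below combines a crude attribute-selection step (keeping the attributes whose class means are furthest apart) with a 1-nearest-neighbour classifier. The data, feature names and selection criterion are assumptions chosen for brevity; real systems would use proper statistical criteria and stronger classifiers.

```python
# Toy bottom-up pipeline: select discriminative attributes, then classify.
# Data, feature names and the selection criterion are illustrative only.

def select_attributes(samples, labels, k=2):
    """Keep the k attributes whose per-class means differ the most."""
    n_attrs = len(samples[0])
    def separation(i):
        vals = {}
        for x, y in zip(samples, labels):
            vals.setdefault(y, []).append(x[i])
        means = [sum(v) / len(v) for v in vals.values()]
        return max(means) - min(means)
    ranked = sorted(range(n_attrs), key=separation, reverse=True)
    return sorted(ranked[:k])

def nn_classify(train, labels, attrs, query):
    """1-nearest-neighbour using only the selected attributes."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in attrs)
    best = min(range(len(train)), key=lambda j: dist(train[j], query))
    return labels[best]

# Low-level feature vectors: [edge density, mean hue, texture energy]
train = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.4],   # "urban" samples
         [0.1, 0.8, 0.5], [0.2, 0.9, 0.6]]   # "forest" samples
labels = ["urban", "urban", "forest", "forest"]

attrs = select_attributes(train, labels, k=2)   # texture energy is dropped
pred = nn_classify(train, labels, attrs, query=[0.15, 0.85, 0.5])
```

The attribute-selection step mirrors the point made above: the right subset of tools (here, attributes) depends on the characteristics of the specific problem.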
The difference between stupidity
and genius is that genius has its
limits.
Albert Einstein
CHAPTER 3
Domain Identification
As stated in the previous section, domain identification is one of the key issues of cognitive vision, as it allows the use of contextual information. The current best-performing systems are mainly those where the size and complexity of the domain are relatively low. Deng et al. [DBLFF10] perform a study of the effects of dealing with more than 10,000 categories. The results show that:
Computational issues become crucial in algorithm design.
Conventional wisdom on the relative performance of different classifiers, drawn from a couple of hundred image categories, does not necessarily hold when the number of categories increases.
There is a surprisingly strong relationship between the structure of WordNet and the difficulty of visual categorization.
Classification can be improved by exploiting the semantic hierarchy.
The process carried out by Deng et al. is based on state-of-the-art descriptors such as GIST [OT01] and SIFT [Low99]. The classification process uses Support Vector Machines, and the dataset includes more than 9 million assets.
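The last finding above — that classification improves when the semantic hierarchy is exploited — can be illustrated with a coarse-to-fine scheme: pick the best parent category first, then discriminate only among that parent's children. The hierarchy and scores below are invented for illustration and are not taken from [DBLFF10].

```python
# Coarse-to-fine classification over a toy semantic hierarchy.
# The hierarchy and the scores are illustrative assumptions.

HIERARCHY = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "truck", "bicycle"],
}

def classify_hierarchical(scores_coarse, scores_fine):
    """Pick the best parent, then the best child within that parent.

    Restricting the fine step to one parent's children shrinks the
    candidate set from all leaves to a handful, which is where the
    gain appears as the number of categories grows large.
    """
    parent = max(scores_coarse, key=scores_coarse.get)
    children = HIERARCHY[parent]
    leaf = max(children, key=lambda c: scores_fine.get(c, 0.0))
    return parent, leaf

parent, leaf = classify_hierarchical(
    scores_coarse={"animal": 0.7, "vehicle": 0.3},
    # "car" scores highest overall, but is ruled out by the coarse step
    scores_fine={"dog": 0.5, "cat": 0.4, "car": 0.9},
)
```

Note how the coarse decision overrides the globally highest-scoring leaf: the hierarchy acts as a structural prior on the fine-grained classifier.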
Popular AI development results such as Deep Blue against Kasparov [Dee], commonly considered a great step in AI in which machines were able to beat human minds, are clear cases where the domain and the rules that define it are rather simple, while the combinatorial space derived from them is huge. For those cases,
Figure 3.1: Watson DeepQA High-Level Architecture [FBCC+10]
brute-force algorithms can defeat human experience and heuristic capabilities. In the case of Deep Blue, its domain dependence was so high that even some hardware components were specifically designed for chess-playing purposes.
A step forward was taken by Watson [Wat], which in 2011 won the Jeopardy! prize against former winners. In this case, Watson was able to process natural language by identifying keywords and accessing 200 million pages of structured and unstructured content. As the IBM DeepQA Research Team (the developers of Watson) state when they refer to Watson: "This is no easy task for a computer, given the need to perform over an enormously broad domain, with consistently high precision and amazingly accurate confidence estimations in the correctness of its answers". However, even if the constraints of this task are much harder than for chess playing, apart from the natural language processing module, the task of playing Jeopardy! can be considered an advanced text search engine that does not require prior contextual knowledge, as can be observed in its architectural design (Figure 3.1).
The current state of the art is full of AI approaches that face the same limitation observed in these two examples. They obtain very good performance in a specific narrow domain but fail when the system scales up or when it is applied to a different problem. Current multimedia information retrieval systems are exactly in this situation: contents belonging to specific contexts can be successfully managed, but they have strong limitations in flexibility and scalability.
3.1 Domain characterization for CBIR
The importance of semantic context is very well known in Content Based Image Retrieval (CBIR) [SF91, TS01]. This is especially relevant for broad-domain, data-intensive multimedia retrieval activities such as TV production and marketing, or large-scale Earth observation archive navigation and exploitation. Most modeling approaches rely on local low-level features based on shape, texture, color, etc. The drawback of these methods is that the characterization of the context requires prior contextual information, introducing a chicken-and-egg problem [TMF10]. A possible approach to reduce this dependency involves the exploitation of global image context characterization for semantic domain inference. This prior information on scene context could represent a valuable asset in computer vision for purposes ranging from regularization to the pre-selection of local primitive feature extractors [SWS+00].
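A minimal sketch of this idea — a cheap global descriptor inferring a scene context before any expensive local extractors are committed — might look as follows. The contexts, thresholds and extractor names are assumptions made purely for illustration.

```python
# Use a global colour statistic to pre-select local feature extractors.
# Contexts, decision rules and extractor names are illustrative only.

def global_descriptor(pixels):
    """Mean R, G, B over the image: a crude global context signature."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def infer_context(mean_rgb):
    r, g, b = mean_rgb
    if b > r and b > g:
        return "maritime"        # blue dominates: sea / sky scenes
    if g > r and g > b:
        return "vegetation"
    return "urban"

# Local extractors worth running per inferred context (hypothetical names)
EXTRACTORS = {
    "maritime": ["wave_texture", "horizon_line"],
    "vegetation": ["leaf_texture", "green_index"],
    "urban": ["edge_density", "line_segments"],
}

pixels = [(20, 40, 200), (30, 60, 180), (10, 50, 220)]   # mostly blue image
context = infer_context(global_descriptor(pixels))
selected = EXTRACTORS[context]
```

The point is the ordering of costs: the global pass is O(pixels) and narrows the set of local extractors before the chicken-and-egg dependency on prior context would otherwise bite.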
3.1.1 Broadcasting
The broadcasting sector has experienced a deep transformation with the introduction of digital technologies. All internal workflows have been affected by the fact that content is represented digitally. Regarding Multimedia Asset Management (MAM) systems, before content was digital, all assets were centralized and managed by documentalists/librarians: professionals who, following a rigid taxonomy, were responsible for annotating, storing and retrieving the content. The workflow was therefore organized so that documentalists offered the content management service to editors. Since the digitalization of ingest and delivery processes, editors can directly and concurrently access the content they are looking for. This offers great advantages in terms of efficiency, allowing non-linear editing and minimizing access times. However, this new work style introduces many more inconsistencies in the metadata, since contents are concurrently annotated by users who do not strictly follow a given taxonomy. Moreover, in order to create direct search and retrieval services, content annotations must be richer and better, since editors do not have the knowledge of documentalists to browse among millions of assets. To obtain this improved metadata, manual annotation is too expensive in most cases, and automatic annotation systems are not able to characterize high-abstraction-level categories, especially due to the size and complexity of the broadcasting context.
From a technical point of view, there are many industrial solutions and standards for metadata (SMEF, BMF, Dublin Core, TV-Anytime, MPEG-7, SMPTE Descriptive Metadata, PBCore, MXF DMS-1, XMP, etc.) that offer good retrieval characteristics. However, all these technologies and specifications rely on a previously annotated dataset that, in most practical cases, cannot be populated at an affordable cost.
3.1.1.1 Alternative methods for massive content annotation
The explosion of prosumers and web video portals offers a new way of enriching content with metadata. Most of these platforms offer the possibility of leaving comments that can afterwards be used as annotations. However, these annotations are always unstructured and their confidence is much lower; therefore, they cannot be used directly as a source of metadata.
On the other hand, the speech processing tools that are nowadays being used to create subtitles offer another source of textual information that is very representative of the content. Using the audio channel to create metadata faces the same unstructured-text problem as user comments. However, it offers very rich and highly related text that fits very well with current text-based search engines.
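A minimal sketch of turning such transcript text into annotations usable by a text-based search engine is plain term-frequency keyword extraction. The stop-word list, the example transcript and the scoring below are deliberately naive assumptions, not a production pipeline.

```python
# Naive keyword extraction from an ASR transcript; illustrative only.
import re
from collections import Counter

# Tiny, deliberately incomplete stop-word list for the example
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
             "on", "at", "with", "as", "was"}

def keywords(transcript, top_n=3):
    """Return the top_n most frequent non-stop-words as annotations."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

tags = keywords(
    "The match at the stadium ended with a goal in the final minute, "
    "and the stadium erupted as the goal was replayed."
)
```

Even this crude extraction yields structured, indexable labels ("stadium", "goal", ...) from unstructured speech, which is exactly the fit with text-based search engines claimed above.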
3.1.2 Earth Observation, Meteorology
An extensive review of the state of the art of content-based retrieval in Earth Observation (EO) image archives is presented in Section 7.2. Compared with the broadcasting application field, EO archives deal with even bigger data volumes (approaching the zettabyte)1. The assets they contain are largely under-exploited: up to 95% of records have never been accessed, according to figures reported in conferences. The situation is exacerbated by the growing interest in, and availability of, metric and sub-metric resolution sensors, due to the ever-expanding data volumes and the extreme diversity of content in the imaged scenes at these scales. As in the broadcasting sector, interpreters who manually annotate archived content are expensive and tend to operate in applicative domains with stable,
1 The data volume for the EOC DIMS Archive in Oberpfaffenhofen is projected to be about 2 petabytes in 2013 (Christoph Reck, DLR-DFD, presentation during the ESA EOLib User Requirements workshop, ESRIN, November 17, 2011)
Figure 3.2: Idealized query process decomposition into processing modules and basic operations, based on an adaptation of Smeulders et al. [SWS+00].
well-formalized requirements rather than on the open-ended needs of the remote sensing community at large or of broad efforts like GEOSS [KYDN11].
Regarding the domain, the EO semantic space is much more focused than the one required for broadcasting content. In fact, domain-specific ontologies help to define concepts at a finer granularity. For specific uses, such as disaster management in coastal areas, ontologies for Landsat1 and MODIS2 imagery based on the Anderson classification system [And76] have been developed. However, the semantic gap with the huge amount of data still remains an issue for automatically populating these specific ontologies. A general decomposition of a theoretical query process is depicted in Figure 3.2.
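The finer granularity such land-cover ontologies provide can be sketched as a two-level class lookup. The class tree below is a simplified stand-in invented for illustration; it does not reproduce the actual Anderson scheme or its codes.

```python
# Simplified two-level land-cover hierarchy, standing in for
# Anderson-style classification systems; the classes are illustrative.

LAND_COVER = {
    "Water": ["Lakes", "Streams", "Bays"],
    "Forest Land": ["Deciduous", "Evergreen", "Mixed"],
    "Urban or Built-up": ["Residential", "Industrial", "Transportation"],
}

def refine(level1_class, spectral_hint):
    """Map a coarse Level I class plus a hint to a Level II label."""
    for child in LAND_COVER[level1_class]:
        if spectral_hint.lower() in child.lower():
            return f"{level1_class} / {child}"
    return level1_class          # no finer match: stay at Level I

label = refine("Forest Land", "evergreen")
```

The graceful fallback to the coarse class is the practically useful property: an automatic annotator can always emit a Level I label even when the finer semantics cannot be filled in, which is exactly the gap described above.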
1 http://landsat.gsfc.nasa.gov/
2 http://modis.gsfc.nasa.gov/
A particularity of the EO domain is the diversity of the types of data provided by the instruments installed on a satellite, most of which are affected by noise and distortions produced by the distance, the atmosphere, etc.
Envisat (Environmental Satellite), launched in 2002 and operated by ESA (the European Space Agency), includes the following instruments1 (Figure 3.3):
ASAR: Advanced Synthetic Aperture Radar, operating at C-band. ASAR ensures continuity with the image mode (SAR) and the wave mode of the ERS-1/2 AMI.
MERIS: a programmable, medium-spectral-resolution imaging spectrometer operating in the solar reflective spectral range. Fifteen spectral bands can be selected by ground command, each of which has a programmable width and a programmable location in the 390 nm to 1040 nm spectral range.
AATSR: Advanced Along Track Scanning Radiometer, providing continuity of the ATSR-1 and ATSR-2 data sets of precise sea surface temperature (SST), with levels of accuracy of 0.3 K or better.
RA-2: Radar Altimeter 2, an instrument for determining the two-way delay of the radar echo from the Earth's surface to a very high precision: less than a nanosecond. It also measures the power and the shape of the reflected radar pulses.
MWR: a microwave radiometer for the measurement of the integrated atmospheric water vapour column and cloud liquid water content, as correction terms for the radar altimeter signal. In addition, MWR measurement data are useful for the determination of surface emissivity and soil moisture over land, for surface energy budget investigations to support atmospheric studies, and for ice characterization.
GOMOS: measures atmospheric constituents by spectral analysis of the spec-
tral bands between 250 nm to 675 nm, 756 nm to 773 nm, a