Mixed-reality as a challenge to image understanding and artificial intelligence

Dietrich Paulus, Detlev Droege

11/2005

Fachberichte INFORMATIK, ISSN 1860-4471

Universität Koblenz-Landau, Institut für Informatik, Universitätsstr. 1, D-56070 Koblenz

E-mail: [email protected],

WWW: http://www.uni-koblenz.de/FB4/


KI 2005 Workshop 7

Mixed-reality as a challenge to image understanding and artificial intelligence

September 11th, 2005, Koblenz

Dietrich Paulus, Detlev Droege (Eds.)


Preface

Knowledge representation and use has been a central concern for computer vision for decades. This topic becomes even more important now that it is possible to augment reality through real-time computer graphics in combination with real-time computer vision. As both disciplines need to cooperate, they also need to agree on common representation schemes for the world, objects, functions, actions, etc. Vision and graphics together need the discourse with knowledge representation experts.

Knowledge-based methods are required when images are retrieved from large image collections. Such problems often occur when web technologies are applied; image processing meets semantic web technologies in this context. The first four contributions in this volume are related to this scenario.

The role of knowledge-based processing in computer graphics is presented in one contribution.

The remaining four contributions deal with topics that are related to knowledge representation and perception in general, or with tasks that need to be solved for the construction of systems for augmented reality.

The semantic gap between image processing and knowledge-based analysis still seems to be open in large systems for augmented reality. We hope that the contributions in this volume help to narrow this gap.

Koblenz, September 2005

Bärbel Mertsching and Dietrich Paulus


Organization

This workshop, Workshop 7 of KI 2005, is organized by the Institute of Computational Visualistics, Universität Koblenz-Landau (Dietrich Paulus and Detlev Droege).

Technical Program Chairs

Prof. Dr.-Ing. Dietrich Paulus (Universität Koblenz-Landau)
Prof. Dr. Bärbel Mertsching (Universität Paderborn)

Program Committee

Prof. Dr.-Ing. Gerd Sagerer (Universität Bielefeld)
Prof. Dr.-Ing. Dietrich Paulus (Universität Koblenz-Landau)
Prof. Dr. Steffen Staab (Universität Koblenz-Landau)
Prof. Dr. Stefan Müller (Universität Koblenz-Landau)
Prof. Dr. Thomas Strothotte (Universität Magdeburg)
Prof. Dr. Josef Schneeberger (Schema AG, Nürnberg)

Address

Universität Koblenz-Landau, Koblenz
Institut für Computervisualistik
Universitätsstr. 1
56070 Koblenz

http://www.uni-koblenz.de/icv

+49 (261) 287-2750 (phone)


Table of Contents

Combined Domain Specific and Multimedia Ontologies for Image Understanding . . . 1
Kosmas Petridis, Frederic Precioso, Thanos Athanasiadis, Yannis Avrithis and Yiannis Kompatsiaris

Knowledge-Based Image Analysis Applied to Ornaments in Arts . . . 8
C. Schmidt, C. Schneider, B. Schüler, C. Saathoff, D. Paulus

Diagnostic Reasoning supported by Content-Based Image Retrieval . . . 19
Ch. Münzenmayer, A. Hirsch, D. Paulus, Th. Wittenberg

Visual Scene Memory Based on Multi-Mosaics . . . 27
Birgit Möller, Stefan Posch (University Halle)

The Mental Continuum: Control Models for Virtual Humans in Real World Situations . . . 33
Johannes Strassner, Marion Langer, Stefan Müller

Using Augmented Reality for Interactive Model Acquisition . . . 41
S. Wachsmuth, M. Hanheide, S. Wrede, Ch. Bauckhage

Dependence of Conceptual Representations for Temporal Developments in Video Sequences on a Target Language . . . 47
Aleš Fexa

Outline of a Computational Theory of Human Vision . . . 55
Fridolin Wild


Combined Domain Specific and Multimedia Ontologies for Image Understanding

Kosmas Petridis1, Frederic Precioso1, Thanos Athanasiadis2, Yannis Avrithis2 and Yiannis Kompatsiaris1

1 Informatics and Telematics Institute, GR-57001 Thermi-Thessaloniki, Greece
2 National Technical University of Athens, School of Electrical and Computer Engineering, GR-15773 Zographou, Athens, Greece

Abstract. Knowledge representation and annotation of multimedia documents have typically been pursued in two different directions. Previous approaches have focused either on low-level descriptors, such as dominant color, or on the content dimension and corresponding manual annotations, such as person or vehicle. In this paper, we present a knowledge infrastructure to bridge the gap between the two directions. Ontologies are extended and enriched to include low-level audiovisual features and descriptors. Additionally, a tool for linking low-level MPEG-7 visual descriptions to ontologies and annotations has been developed. In this way, we construct ontologies that include prototypical instances of domain concepts together with a formal specification of the corresponding visual descriptors. Thus, we combine high-level domain concepts and low-level multimedia descriptions, enabling new forms of media content analysis.

1 Introduction

Representation and semantic annotation of multimedia content have been identified as an important step towards more efficient manipulation and retrieval of visual media. Today, new multimedia standards such as MPEG-4 and MPEG-7 provide important functionalities for the manipulation and transmission of objects and associated metadata. The extraction of semantic descriptions and the annotation of content with the corresponding metadata, though, are outside the scope of these standards and are still left to the content manager. This motivates heavy research efforts in the direction of automatic annotation of multimedia content.

Here, we recognize a broad chasm between existing multimedia analysis methods and tools on the one hand and semantic description and annotation methods and tools on the other. State-of-the-art multimedia analysis systems severely limit themselves by resorting mostly to visual descriptions at a very low level, e.g. the dominant color of a picture. However, ontologies that express key entities and relationships of multimedia content in a formal machine-processable representation can help to bridge the semantic gap [1, 2] between automatically extracted low-level numerical features and high-level human-understandable semantic concepts.

Work on semantic annotation [3] currently addresses mainly textual resources [4] or simple annotation of photographs [5]. In the multimedia analysis area, knowledge about multimedia content domains is a promising approach by which Semantic Web technologies can be incorporated into techniques that capture objects through automatic parsing of multimedia content. In [6], ontology-based semantic descriptions


Fig. 1. Ontology Structure Overview

of images are generated based on appropriately defined rules that associate MPEG-7 low-level features with the concepts included in the ontologies. The architecture presented in [7] consists of an audio-visual ontology in compliance with the MPEG-7 specifications and corresponding domain ontologies.

Acknowledging the relevance between low-level visual descriptions and formal, uniform machine-processable representations, we try to bridge the chasm by providing a knowledge infrastructure design focusing both on multimedia-related ontologies and domain-specific structures. The remainder of the paper is organized as follows: in Section 2 we present the general ontology infrastructure design, including a brief description of a tool to assist the annotation process needed for initializing the knowledge base with descriptor instances of the domain concepts in question. A short overview and results from the knowledge-assisted analysis process, which exploits the developed infrastructure and annotation framework, are presented in Section 3. We conclude with a summary of our work in Section 4.

2 Knowledge Representation

Based on the above, we propose a comprehensive ontology infrastructure, the components of which are described in this section. The challenge is that the hybrid nature of multimedia data must necessarily be reflected in the ontology architecture that represents and links the multimedia and content layers. Fig. 1 summarizes the developed knowledge infrastructure.

Overview. Our framework uses RDFS (Resource Description Framework Schema) as its modeling language. This decision reflects the fact that full usage of the increased expressiveness of OWL (Web Ontology Language) requires specialized and more advanced inference engines that are not yet available, especially when dealing with large numbers of instances with slot fillers.

Core Ontology. The role of the core ontology in this overall framework is to serve as a starting point for the construction of new ontologies, to provide a reference point


for comparisons among different ontological approaches, and to serve as a bridge between existing ontologies. In our framework, we have used DOLCE [8] for this purpose.

Prototype Approach. Describing the characteristics of concepts for exploitation in multimedia analysis naturally leads to a meta-concept modeling dilemma: using concepts as property values is not directly possible while avoiding second-order modeling, i.e. staying within the scope of OWL DL. In our framework, we propose to enrich the knowledge base with instances of domain concepts that serve as prototypes for these concepts. This status is modeled by having these instances also instantiate an additional VDO-EXT:Prototype concept from a separate Visual Annotation Ontology (VDO-EXT). Each of these instances is then linked to the appropriate visual descriptor instances. The approach we have adopted is thus pragmatic, easily extensible and conceptually clean.
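As a concrete illustration, the following minimal sketch builds such a prototype instance with rdflib. The toolkit choice and all namespace URIs are assumptions made for illustration; the paper does not publish its RDF serialization.

    # Sketch of the prototype approach: a domain instance that also
    # instantiates VDO-EXT:Prototype and links to a descriptor instance.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    DOMAIN = Namespace("http://example.org/beach#")     # hypothetical
    VDO = Namespace("http://example.org/vdo#")          # hypothetical
    VDO_EXT = Namespace("http://example.org/vdo-ext#")  # hypothetical

    g = Graph()
    # a prototypical instance of the domain concept "Sea" ...
    g.add((DOMAIN.sea_prototype_1, RDF.type, DOMAIN.Sea))
    # ... which also instantiates VDO-EXT:Prototype, marking its role
    g.add((DOMAIN.sea_prototype_1, RDF.type, VDO_EXT.Prototype))
    # link the prototype to a visual descriptor instance
    g.add((DOMAIN.dominant_color_1, RDF.type, VDO.DominantColor))
    g.add((DOMAIN.sea_prototype_1, VDO_EXT.hasDescriptor, DOMAIN.dominant_color_1))

Because the prototype status is expressed through an additional type assertion rather than through concepts as property values, the model stays within first-order modeling.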

Multimedia Ontologies. Multimedia ontologies model the domain of multimedia data, especially the visualizations in still images and videos, in terms of low-level features and media structure descriptions. Structure and semantics are carefully modeled to be largely consistent with existing multimedia description standards like MPEG-7.

Visual Descriptor Ontology. The Visual Descriptor Ontology (VDO) contains the representations of the MPEG-7 visual descriptors; it models concepts and properties that describe the visual characteristics of objects. Although the construction of the VDO is tightly coupled with the specification of the MPEG-7 Visual Part [9], several modifications were carried out in order to adapt the XML Schema provided by MPEG-7 to an ontology and to the data type representations available in RDF Schema. The VDO:VisualDescriptor concept is the top concept of the VDO and subsumes all modeled visual descriptors. It consists primarily of six subconcepts, one for each category that the MPEG-7 standard specifies. These are: color, shape, texture, motion, localization and basic descriptors. Each of these categories includes a number of relevant descriptors that are correspondingly defined as concepts in the VDO.

Multimedia Structure Ontology. The Multimedia Structure Ontology (MSO) models basic multimedia entities from the MPEG-7 Multimedia Description Scheme [10] and mutual relations like decomposition. Within MPEG-7, multimedia content is classified into five types: image, video, audio, audiovisual and multimedia.

Domain Ontologies. In the multimedia annotation framework, the domain ontologies are meant to model the content layer of multimedia content with respect to specific real-world domains, such as sports events like tennis. All domain ontologies are explicitly based on or aligned to the DOLCE core ontology, and thus connected by high-level concepts, which in turn assures interoperability between different domain ontologies at a later stage.

In the context of our work, domain ontologies are created and maintained by content managers or indexers. They are defined to provide a general model of the domain, with a focus on the users' specific point of view. In general, the domain ontology needs to model the domain in such a way that, on the one hand, the retrieval of pictures becomes


more efficient for a user of a multimedia application and, on the other hand, the included concepts can also be automatically extracted from the multimedia layer. In other words, the concepts have to be recognizable by automatic analysis methods, but need to remain comprehensible for a human.

M-OntoMat-Annotizer framework. In order to exploit the ontology infrastructure presented above and annotate the domain ontologies with low-level multimedia descriptors, a tool is necessary. Our implemented framework is called M-OntoMat-Annotizer1 (M stands for Multimedia) [11]. The development was based on an extension of the CREAM (CREAting Metadata for the Semantic Web) framework [4] and its reference implementation, OntoMat-Annotizer2.

For this reason, the Visual Descriptor Extraction (VDE) tool was implemented as a plug-in to OntoMat-Annotizer and is the core component for extending its capabilities and supporting the initialization of domain ontologies with low-level multimedia features. The VDE plug-in manages the overall low-level feature extraction and linking process by communicating with the other components. Using this tool, we build the knowledge base that serves as the primary reference resource for the multimedia content analysis process presented in the next section.

3 Knowledge-Assisted Multimedia Analysis

The Knowledge-Assisted Analysis (KAA) system includes methods that automatically segment images, video sequences and key frames into areas corresponding to salient semantic objects (e.g. cars, road, people, field, etc.), track these objects over time, and provide a flexible infrastructure for further analysis of their relative motion and interactions, as well as object recognition, metadata generation, indexing and retrieval. Recognition is performed by comparing existing semantic descriptions contained in the multimedia-enriched domain ontologies to lower-level features extracted from the signal (image/video), thus identifying objects and their relations in the multimedia content.

A more precise description of the general KAA architecture is given in Fig. 2. The core of the architecture is the region adjacency graph. This graph structure holds the region-based representation of the image during the analysis process. During image/video analysis, a set of atom-regions is generated by an initial segmentation. Each node of the graph corresponds to an atom-region and holds the Dominant Color and Region Shape MPEG-7 visual descriptors extracted for this specific region. The next step of the analysis is to compute a matching distance between each of these atom-regions and each of the prototype instances of all concepts in the domain ontology. This matching distance is evaluated by means of low-level visual descriptors. In order to combine the two current modalities, Dominant Color and Region Shape, into a unique matching distance, we use a neural network

1 see http://www.acemedia.org/aceMedia/results/software/m-ontomat-annotizer.html
2 see http://annotation.semanticweb.org/ontomat/


Fig. 2. Knowledge-assisted analysis architecture

Fig. 3. Holiday-Beach domain results

approach that provides the required distance weighting. Finally, a unique semantic label is assigned to each region, corresponding to the concept with the minimum distance. Spatial relations (such as "above", "below", "is included in", ...) are extracted for each atom-region. Such information can be further used in a reasoning process in order to refine the semantic labelling. This approach is generic and applicable to any domain, as long as new domain ontologies are designed and made available.
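The labelling step can be pictured with the following sketch: each atom-region receives the concept of its nearest prototype under a weighted combination of the two descriptor distances. The fixed weights stand in for the learned neural network weighting described above; all names and data layouts are illustrative assumptions, not the project's actual interfaces.

    import numpy as np

    def combined_distance(region, prototype, w_color=0.6, w_shape=0.4):
        # weighted combination of the Dominant Color and Region Shape
        # distances; the paper learns this weighting with a neural network
        d_color = np.linalg.norm(region["dominant_color"] - prototype["dominant_color"])
        d_shape = np.linalg.norm(region["region_shape"] - prototype["region_shape"])
        return w_color * d_color + w_shape * d_shape

    def label_regions(atom_regions, prototypes):
        # assign to each atom-region the concept of its nearest prototype
        return [min(prototypes, key=lambda p: combined_distance(r, p))["concept"]
                for r in atom_regions]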

As illustrated in Fig. 3, the resulting system output is a segmentation mask outlining the semantic description of the scene. The different colors assigned to the generated atom-regions correspond to the object classes defined in the domain ontology.

4 Conclusion

In this paper, an integrated infrastructure for semantic multimedia content annotation and analysis was presented. This framework comprises ontologies for the description of low-level visual features and for linking these descriptions to concepts in domain ontologies based on a prototype approach. The generation of the visual descriptors and their linking with the domain concepts are embedded in a user-friendly tool, which hides analysis-specific details from the user. Thus, the definition of appropriate visual descriptors can be accomplished by domain experts, without the need for a deeper understanding of ontologies or low-level multimedia representations.

Finally, despite the early stage of the multimedia analysis experiments, first results based on the ontologies presented in this work are promising and show that it is possible to apply the same analysis algorithms to process different kinds of images or video, simply by employing different domain ontologies. Apart from visual descriptions and relations, future work will concentrate on the reasoning process and the creation of rules in order to detect more complex events. The examination of the interactive process between ontology evolution and the use of ontologies for content analysis will also be a target of our future work, in the direction of handling the semantic gap in multimedia content interpretation.

Acknowledgements. This research was partially supported by the European Commission under contract FP6-001765 aceMedia. The expressed content is the view of the authors but not necessarily the view of the aceMedia project as a whole.

References

1. Brunelli, R., Mich, O., Modena, C.: A survey on video indexing. Journal of Visual Communications and Image Representation 10 (1999) 78–112
2. Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1349–1380
3. Handschuh, S., Staab, S., eds.: Annotation for the Semantic Web. IOS Press (2003)
4. Handschuh, S., Staab, S.: CREAM - creating metadata for the semantic web. Computer Networks 42 (2003) 579–598, Elsevier
5. Schreiber, A.Th., Dubbeldam, B., Wielemaker, J., Wielinga, B.: Ontology-based photo annotation. IEEE Intelligent Systems (2001)
6. Hunter, J., Drennan, J., Little, S.: Realizing the hydrogen economy through semantic web technologies. IEEE Intelligent Systems Journal - Special Issue on eScience 19 (2004) 40–47
7. Troncy, R.: Integrating structure and semantics into audio-visual documents. In: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003). (2003)
8. Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L.: Sweetening ontologies with DOLCE. In: Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, Proceedings of the 13th International Conference on Knowledge Acquisition, Modeling and Management, EKAW 2002. Volume 2473 of Lecture Notes in Computer Science, Siguenza, Spain (2002)
9. ISO/IEC 15938-3 FCD Information Technology - Multimedia Content Description Interface - Part 3: Visual. March 2001, Singapore
10. ISO/IEC 15938-5 FCD Information Technology - Multimedia Content Description Interface - Part 5: Multimedia Description Schemes. March 2001, Singapore
11. Bloehdorn, S., Petridis, K., Saathoff, C., Simou, N., Tzouvaras, V., Avrithis, Y., Handschuh, S., Kompatsiaris, I., Staab, S., Strintzis, M.: Semantic Annotation of Images and Videos for Multimedia Analysis. In: Proceedings of the 2nd European Semantic Web Conference (ESWC 2005). (2005)


Knowledge-Based Image Analysis Applied to Ornaments in Arts

C. Schmidt, C. Schneider, B. Schüler, C. Saathoff, and D. Paulus

Institute for Computational Visualistics, Institute for Arts and Sciences, and Institute for Computer Science
University of Koblenz and Landau

http://www.uni-koblenz.de/agas

Abstract. The science of arts knows the three categories architecture, creation (paintings and graphics, among others), and ornaments. In this contribution we describe a project for the automatic analysis of images that contain ornaments. These pictures may be taken from historic buildings, or they may show objects like carpets and furniture. Currently, ornamental parts are selected manually in the input images before they are subjected to the proposed analysis. We describe the image database, the approach to analysis in combination with a knowledge-based image analysis, and the impact on arts and sciences.

1 Motivation

The aim of the project "Analysis of images to classify ornaments" is to establish a base for the identification of ornaments from images according to their structure and origin.

The science of art differentiates the three categories architecture, creative activities (painting, sculpture, graphics), and ornament [1]. Ornaments give decoration an order [2] and can be placed on different things like architecture, vases, and pages of books, among others. Focussing on the category ornaments, an ornament is a pattern painted on a ground. This ground can be an object in architecture, a page in a book or a surface. Ornaments contain structures which repeat several times in various cases, but which may also appear only once. Repetition of a pattern is essential for an ornament. In contrast to creative activities, the ornament is mainly two-dimensional. The ornament is the principle which is transferred from ornament to other categories.

There are several large collections of such ornaments in books, which are taken as templates by artists, architects, and artisans who apply them again, e.g. to buildings or ceilings. This technique has been used for centuries. An example of the category "architecture" is shown in Fig. 1. People in arts have a sophisticated classification scheme for the parts of an ornament. One example of an ornament found on the building in Fig. 1 is shown in Fig. 2.

The integration of detailed knowledge about ornaments and of several pattern recognition algorithms using a knowledge base is a key feature of our approach. Our paper is organized as follows: in Sect. 2 we outline approaches to computer-assisted analysis and retrieval of pictures, as known from the literature. We present an overview


Fig. 1. Palace in Venice    Fig. 2. Quarterfoil of the Palace in Venice

and the architecture of our system in Sect. 3. The image database and preprocessing steps are described in Sect. 4.1, followed by the operators used for the analysis of an ROI (region of interest) in Sect. 4.2. For the analysis, an approach is chosen which uses models of ornaments (Sect. 5). A knowledge base containing information about the structure of ornaments and the parameters of pattern recognition algorithms, together with a controller connecting the knowledge base and the algorithms, are described in Sect. 6. We conclude in Sect. 7.

2 State of the Art

Other approaches interconnecting the science of arts with computer science deal with indexing methods and thesauri of works of art and buildings, as well as their use. Examples of such systems are MIDAS (Marburger Inventarisations-, Dokumentations- und Administrationssystem) and the IMAGO image database (Humboldt University, Berlin, Germany). The field of research engaged in content-based retrieval of image data is applied in heraldry: the goal is to read heraldic pictures and describe them by a multidimensional vector [3]. This is done in the HERON project at the University of Augsburg.1 Our project "Computer Analysis of Ornaments" differs from these with respect to the following features:

– We implement and provide a digital image database which allows for storing a classification scheme of pictures.2

– We establish a system to analyze ornaments. A basic assumption in our project is that an ornament is a two-dimensional digital image. We will use pattern recognition algorithms to find regions of ornaments in a given picture and to

1 HERON project: a DFG project of the University of Augsburg in the Department of Computer Science
2 The image database as well as several operators are available to the public via the internet at http://www.uni-koblenz.de/puma


classify an ornament in the identified region. This will allow us to identify examples of ornaments appearing in particular epochs and geographical areas. Some research on the computational construction of ornaments has already been done, for example in [4]. Another approach is described in [5]. We use methods of both to implement algorithms for region detection and ornament analysis.

– We store the result of the analysis, represented by a feature vector, in the database. Once we gain knowledge about the ornaments in an image, regional analyses concerning the landscape of arts and the artist can be made. This includes ornaments and the use of ornaments in architecture, among others.

– In general, automatic annotation of images has been under research for a rather long time. Basically, multimedia content can be divided into two intertwined layers, the multimedia layer and the content layer. The former corresponds to the low-level characteristics of an image and the latter to its semantic meaning. Most approaches for annotating images concentrate on one of those layers, i.e. they either take into account only low-level features, e.g. for scene classification, or use manual annotation for indexing or further knowledge-based analysis [6]. Lately, approaches try to combine both layers in order to improve results and to give the low-level characteristics meaning. In [7] an integrated approach is used to analyze hydrogen fuel cells. During a training step, rules are learned to automatically recognize regions of interest in fuel cell images. The system is based on an OWL representation of the MPEG-7 standard. In [8] a similar approach is employed for recognizing natural complex objects in images based on frame-based logic. A mapping of numerical data into symbolic data is carried out, and rules are used to request specific additional information. The mapping of low-level features to high-level concepts is thus accomplished by introducing simple objects. In our approach the mapping will be more direct, as the complex objects are described using the low-level characteristics directly. In particular, we will employ more specialized methods in order to achieve better results for the concrete problem. In [9] a semi-automated image annotation system is presented that uses hints given by a user in natural language to guide the analysis procedure. Using a hint like "in the upper left corner there is an L-shaped building", the analysis procedure can prune the search space significantly. In our approach we aim at generating these hints automatically from already derived knowledge and background knowledge about the image and the domain. Finally, in [10] a system based on description logics is used to annotate medical images automatically. The low-level features of semantically meaningful regions are described in order to find similar regions in other images. Using a description logics reasoner, these regions can be classified into the correct class, and based on these findings the overall image is annotated. In this approach, the analysis is basically unidirectional, i.e. no communication between the reasoner and the low-level feature extractors takes place.


3 System Architecture

The main task of our project is to develop computer science methods to analyze the research field of ornaments. Pictures of ornaments and their applications exist in the slide collection of the science of arts institute at the University of Koblenz and Landau. Within the project, pictures are stored in the database and are manually classified according to the sophisticated scheme (Sect. 1). Thus it is possible to compare a picture with others. Furthermore, algorithms are developed which are able to recognize and analyze ornaments, and their results are compared to the ones in the database. So pictures can be categorized along the dimensions time, region, and form. Fig. 3 shows an overview of the system: pictures from the established database are normalized, then the region of interest is extracted, followed by an automatic classification which uses several feature extraction operators and a knowledge base. The controller uses information from the knowledge base to direct the application of the various operators. Results are stored in a classification vector for each picture.

[Fig. 3: the PUMA image database feeds a preprocessing stage; a region of interest is selected (manually or automatically) from the preprocessed image; the controller, connected to the knowledge base, applies feature extraction operators such as NCCF, fractal dimension and symmetry; the classification result is a vector (c1, c2, ..., cn).]

Fig. 3. Overview of the complete system

An image database has been established which consists of up to 1000 pictures, all of which contain ornaments. These pictures are taken from [11] and the slide collection of the institute of arts and sciences at the University of Koblenz and Landau. Every image of the database is individually classified by place, artist, time, and some further features. During the process of automatic classification, a picture is normalized (see Sect. 4.1). Borders (e.g. from scanning) and distortions are eliminated in this


phase. The resulting picture is then also stored in the database. The following steps deal with the recognition of regions in a picture. Here, the first step is image analysis using texture and symmetry operators, resulting in regions where ornaments could be found. The classification is done using operators like a fractal dimension estimation (see Sect. 4.2) and methods like the comparison with computationally constructed ornaments (see Sect. 5). The decision which feature is computed and which model is used for comparison is made by the controller, depending on the rules in the knowledge base. The approach is to analyse a picture and its possible ornaments. There are two main phases during the analysis: region detection and the analysis of a region. Region detection is done manually in the first step because of the topic's complexity, so the focus lies first on the analysis of the region. Basically, we can divide ornaments into two classes, symmetric and non-symmetric ones; both classes have rules for the use of ornaments. We follow the principle of analysis by synthesis (Sect. 5), which is embedded in a bidirectional communication between the image processing algorithms and the controller. The idea is to include in an ontology, for example, information about how different kinds of symmetry relate to different types of ornaments. Furthermore, the knowledge base will contain information about how the input parameters of certain image processing algorithms have to be chosen. Besides these two parts, information about the history of ornaments might be included. The controller uses the knowledge base to determine which feature extraction methods should be run with which parameters in order to obtain an exact classification as fast as possible. The feature extraction methods we plan to use can be divided into two groups: complex operators (e.g. fractal dimension) and simple operators (e.g. color change frequency).

All programs and scripts are implemented as extensions and application modules for the PUMA environment [12]. The knowledge base and the controller are located in a database project of ISWeb.

4 Picture analysis for image analysis

4.1 Preprocessing Steps

As the ornament database contains copyrighted material, it is divided into two parts. One selection of images can be accessed freely in the image database (see footnote 2). For the other part, which is only available for scientific purposes, a password is required.

Especially the pictures originating from slides need to be preprocessed. The main problem we have to solve is the elimination of borders and distortion. The preprocessed image is stored in the image database for further analysis (see Fig. 4).

The second step is the elimination of distortion. One correction possibility is to estimate vanishing points from detected edges of a building, for example. Similar to


Fig. 4. Processing stages for clipping the border (left to right): scanned slide; gradient image; binary edge image; original image with lines framing the searched image; result: corrected image

the previous step, we apply edge detection filters and perform a Hough transformation to find these lines. Afterwards, appropriate parameters are chosen to do a final normalisation.
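As an illustration of this step, the following sketch detects edges and straight candidate border lines with OpenCV. It is not the authors' implementation, and the thresholds are assumptions that would need tuning per slide collection.

    import cv2
    import numpy as np

    def find_border_lines(image_path, hough_threshold=150):
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        # gradient/binary edge image as in Fig. 4
        edges = cv2.Canny(gray, 50, 150)
        # straight lines in (rho, theta) form; near-horizontal and
        # near-vertical candidates frame the actual picture content
        lines = cv2.HoughLines(edges, 1, np.pi / 180, hough_threshold)
        return [] if lines is None else [tuple(l[0]) for l in lines]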

4.2 Operators for Ornaments

Up to now, regions containing ornaments are marked manually in the normalized images resulting from the previous steps. However, we describe here some operators giving us distinctive features of ornaments that can later be used for the automatic detection of those regions. The same operators will be applicable for the final classification of the marked ornaments.

Fractal Dimension. The fractal dimension can be regarded as a real number that expresses the dimension of the embedding space of a set of points.3 It is well known that this feature is mostly independent of the scale and orientation of the set. The fractal dimension cannot be computed directly on binary images, but it is possible to estimate it. We implemented three common estimation approaches, all of which measure the "mass" of the set of points at different scales and then find a linear estimate of the change of mass with size.

– The box dimension approach uses meshes of fixed-size squares to measure mass as the number of squares containing at least one pixel (a sketch of this estimate follows the list).

– The information dimension approach uses similar meshes, but measures the entropy at each scale, based on the probability of a pixel lying in a certain square.

– The correlation dimension approach takes distances between pairs of foreground pixels and measures the number of pairs with a distance smaller than a certain threshold.
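The announced box dimension sketch follows; it mirrors the "linear estimate of the change of mass with size" as the slope of a log-log fit. The input is assumed to be a 2D boolean numpy array with foreground pixels at every tested scale; the box sizes are illustrative.

    import numpy as np

    def box_dimension(binary_image, box_sizes=(2, 4, 8, 16, 32)):
        counts = []
        h, w = binary_image.shape
        for s in box_sizes:
            # count boxes of side s containing at least one foreground pixel
            n = sum(binary_image[y:y + s, x:x + s].any()
                    for y in range(0, h, s) for x in range(0, w, s))
            counts.append(n)
        # slope of log(count) against log(1/size) estimates the dimension
        slope, _ = np.polyfit(np.log(1.0 / np.asarray(box_sizes)), np.log(counts), 1)
        return slope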

Symmetry. One major feature of ornaments is symmetry. In order to gain symmetry information about the image, we follow two basic steps. An overview of the symmetric ornament classes is given in Fig. 4.2.3

[Fig. 4.2: overview of the basic symmetry classifier. Input: a point list (x1, y1), ..., (xn, yn). The input is rejected if no symmetry is available; otherwise the ornament is assigned to one of three symmetry classes: CENTRAL (characterized by axis of reflection and rotation angle), FRIEZE, or PLANE (the latter two returning the group of occurring transformations). The transformations examined are reflection, rotation, sliding reflection and translation.]

The first step applies a Fourier-Mellin transform to the image. In the resulting frequency domain, we are able to detect repetitions of the basic geometric shapes which build the ornament groups. The second step is to reconstruct the numerous affine transformations that cause the positioning of elements in these groups.4

3 Actually, the term comes from the field of fractal theory [13].

Normalized Cross Correlation Function. The normalized cross correlation function (NCCF) is computed as described in [12]. The NCCF is a block matching procedure which can serve as a good measure of similarity between two images. Our own NCCF implementation has again been integrated into the PUMA programming environment [12]. It calculates the results in three different ways, by first using only one pixel and then considering the four- and eight-pixel neighborhoods. It also takes the position of an ornament into account; for this purpose it uses all possible orientations of each model. Matching an image with itself, the result has the value 1.
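A minimal sketch of the normalized cross correlation between two equally sized gray-value patches is given below; the PUMA-specific variants (single pixel versus four- and eight-pixel neighborhoods, orientation search) are not reproduced.

    import numpy as np

    def nccf(patch_a, patch_b):
        a = patch_a.astype(float) - patch_a.mean()
        b = patch_b.astype(float) - patch_b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        # matching an image with itself yields 1.0, as stated above
        return 0.0 if denom == 0 else float((a * b).sum() / denom)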

5 Modelling Ornaments

Fig. 3 deals with the matching of ornaments. While we located ornaments in the images during the previous steps, we will now match computationally generated ornaments against them for further analysis. We call this approach analysis by synthesis and refer to the computationally constructed ornaments as models. For generating models, books containing templates of ornaments are used, which allow us to derive mathematical descriptions [2]. Several models are implemented up to now; the best examples for computational construction are the Gothic quarterfoil and three-pass ornaments. Fig. 5

4 Brian Sanderson already worked on this problem, see http://www.maths.warwick.ac.uk/~bjs/images/patrecog.jpg.


[Fig. 5, construction steps (translated from the drawing): 1. draw a circle with radius r; 2. draw the auxiliary lines h1-h6; 3. construct the centers for the inner circles; 4. draw circular arcs from the tangent point at the rhombus rt to the next tangent point.]

Fig. 5. Quarterfoil: construction drawing (left) - computationally generated model (right)

shows an example of the construction rules and the result of a program generating a Gothic quarterfoil. The program needs the parameters diameter, angle of rotation and line strength. We will extend this model library and its command line interface during the project to get a wide range of models for the analysis.

5.1 Gradient Cross Correlation

To find an ornament in an image, we compute the gradient cross correlation on it. It is based on two steps: after applying a Sobel filter for edge detection, the normalized cross correlation is computed (Sect. 4.2). Figs. 6 and 7 show intermediate processing steps. The result then needs to be compared to the model in Fig. 5 (right).
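A hedged sketch of these two steps, reusing the nccf() sketch from Sect. 4.2, could look as follows; the actual PUMA implementation may differ in both filtering and matching details.

    import cv2
    import numpy as np

    def gradient_cross_correlation(image, model):
        def sobel_magnitude(img):
            gx = cv2.Sobel(img, cv2.CV_64F, 1, 0)
            gy = cv2.Sobel(img, cv2.CV_64F, 0, 1)
            return np.hypot(gx, gy)
        # correlate the edge images instead of the raw gray values,
        # using the nccf() sketch given in Sect. 4.2
        return nccf(sobel_magnitude(image), sobel_magnitude(model))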

Fig. 6. Quarterfoil    Fig. 7. Quarterfoil after edge detection

So we can find, e.g., a quarterfoil within an image and obtain as a result its diameter and orientation. To improve the results, we will establish a mechanism to compare the geometrical structures of the image with those of the model, using a suitable Hough transformation.


6 Knowledge Database and Controller

In our current approach we now establish a knowledge base to direct the analysis of ornaments in images. We hope that this will lead to even better results and less time-consuming computations. There are basically three reasons why we think a knowledge base can contribute to the system.

The first reason lies in the nature of the operators for feature extraction. We intend to use complex and computationally expensive operators as well as simple ones. The operators described in Sect. 4.2 are rather complex and computationally expensive; an operator based on color change frequency is an example of a simple one. Furthermore, not every low-level extractor is suited to provide useful information for every ornament. E.g., the fractal dimension extractor applied to a quarterfoil does not reveal any useful information that can be used to identify a quarterfoil in the image. Finally, the complex operators take a lot of parameters into account which are unknown at the beginning. Their performance increases considerably if some parameters can be supplied using information from other sources. The controller can make use of the knowledge base to optimize the application of operators to an ornament. The knowledge base is supposed to contain information about the typical low-level characteristics of a given ornament. Based on these characteristics, a decision is made which low-level extractor will provide the biggest information gain, i.e. which one specializes the classification the most.
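A deliberately simple, hypothetical sketch of such a controller decision: pick the next operator by an expected-information-gain score. The table-driven scoring is an invention for illustration only; the actual controller will use the rule-based knowledge base described here.

    def next_operator(candidate_ops, gains, hypotheses):
        # gains: dict mapping (operator, ornament_class) -> expected gain,
        # as it might be read from the knowledge base (hypothetical layout)
        return max(candidate_ops,
                   key=lambda op: sum(gains.get((op, h), 0.0) for h in hypotheses))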

Another good reason for using a knowledge base is the availability of extensive and detailed background knowledge about ornaments. There is detailed information about how ornaments were constructed at different times by different schools of artisans. Further, from art history we know a lot about epochs, artists, styles, geographic regions and the relations among them. In a concrete application scenario of the system, some information might also be supplied by the user. If, for example, a user uploads an unknown picture to the system, he may not know how to classify the ornaments inside, but he may know where it is from. Also, some pictures from the image database have already been manually classified.

The last reason we want to mention here is that a knowledge base provides means for the integration of knowledge, querying and inference. Background knowledge, information from different operators, from user input and from previously classified ornaments can all be integrated and used during classification. A major use case for the system might be a user querying for images of artwork from a certain region and time. Information about the origin of an ornament might be inferred using information from the knowledge base, for instance if it turns out that specific ornaments in connection with certain colors only appear in artwork from a certain geographical region. In that way, even more knowledge about arts might be discovered.

Now that we have stated what we want to use the knowledge base for, a few words about what will be in it. The knowledge base should contain an ontology about pieces of art which covers possible features with respect to the feature extrac-


tion operators as well as features regarding the origin. A fragment of the history of arts related to a set of ornaments should be included in another ontology. Finally, knowledge about the parameters and application of the operators needs to be incorporated into the knowledge base. Images (e.g. from the aforementioned image database) will be inserted as instances into the knowledge base, as well as operators, artists, schools of artisans and so forth.

In our proposal, the knowledge base and the controller are basically seen as two tightly integrated modules. The purpose of the controller is two-fold. First of all, it will take care of the overall control of the analysis by using the domain-specific knowledge in the knowledge base. Further, it is also supposed to provide feedback to the low-level extractors through a callback functionality.

7 Conclusion and Future Work

We proposed some methods for the analysis of images containing ornaments which mainly cover preprocessing, region detection, feature extraction, and model matching. Our next step is the combination of these single implementations into one classification system that provides all the described steps through a single user interface. In addition, more experiments with the existing operators have to be carried out, and other operators for feature extraction have to be considered and tested. Also, an ontology and controlling mechanisms for analyzing images containing ornaments will be established.

Acknowledgement

Thanks to the students who contributed to the project: Matthias Dennhardt, Timo Dickscheid and Andrea Fürsich. This project was partially funded by the German state Rheinland-Pfalz under grant 1513.

References

1. Bauer, H.: Eine kritische Einführung in das Studium der Kunstgeschichte. Beck (1976)
2. Meyer, F.S.: Handbuch der Ornamentik. VEB E.A. Seemann Verlag (1997)
3. Balke, W.T.: Untersuchungen zur bildinhaltlichen Datenbank-Recherche in einer Wappensammlung anhand des IBM Ultimedia Managers. Master's thesis, Universität Augsburg (1997)
4. Herfort, A., Klatz, P.: Ornamente und Fraktale. Vieweg Verlagsgesellschaft (1996)
5. Flachsmeyer, J., Feiste, U., Manteuffel, K.: Mathematik und ornamentale Kunstformen. Verlag Harri Deutsch (1990)
6. Hollink, L., Schreiber, G., Wielemaker, J., Wielinga, B.: Semantic annotation of image collections. In: Proceedings of the Second International Conference on Knowledge Capture (K-CAP), Sanibel, Florida, USA (2003)
7. Hunter, J., Drennan, J., Little, S.: Realizing the hydrogen economy through semantic web technologies. IEEE Intelligent Systems Journal - Special Issue on eScience 19 (2004) 40–47
8. Hudelot, C., Thonnat, M.: A cognitive vision platform for automatic recognition of natural complex objects. In: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2003). (2003)
9. Srihari, R.K., Zhang, Z.: Show&Tell: A semi-automated image annotation system. IEEE Multimedia 7 (2000) 63–71
10. Hu, B., Dasmahapatra, S., Lewis, P., Shadbolt, N.: Ontology-based medical image annotation with description logics. In: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2003). (2003)
11. Bittel, K.: Propyläen-Kunstgeschichte. Volume 1. Propyläen-Verlag (1967) PKG-I
12. Paulus, D., Hornegger, J.: Applied Pattern Recognition: A Practical Introduction to Image and Speech Processing in C++. 4th edn. Advanced Studies in Computer Science. Vieweg, Braunschweig (2003)
13. Hausdorff, F.: Dimension und äußeres Maß. Mathematische Annalen (1918) 157–179


Diagnostic Reasoning supported by Content-Based Image Retrieval

Christian Münzenmayer1, Annika Hirsch2, Dietrich Paulus2, and Thomas Wittenberg1

1 Fraunhofer Institut für Integrierte Schaltungen, Am Wolfsmantel 33, D-91058 Erlangen, {mzn,wbg}@iis.fraunhofer.de
2 Arbeitsgruppe Aktives Sehen, Universität Koblenz-Landau, Universitätsstr. 1, D-56070 Koblenz, [email protected]

Abstract. Due to the demographic development and increasing life expectancy in industrial countries, the time a doctor can spend with a patient will decrease dramatically over the next few years. Computer-assisted diagnosis (CAD) systems are one (technological) aspect of a possible solution to these pressing problems in tomorrow's health care system. The central element of our CAD prototype is a case database which contains medical cases consisting of decisive images depicting objects and regions of interest, as well as classifications for these objects. To access the diagnostic knowledge in our database, we apply algorithms known from Content-Based Image Retrieval (CBIR) and color texture analysis. Query point movement and dimension weighting methods, steered by the user's retrieval feedback, have been implemented with a graphical user interface designed for non-specialist end users for easy access to the case database. The system was validated on a comprehensive data set of 482 pre-classified regions from esophageal endoscopy. Simulations assuming perfect observers show that feedback iterations can significantly improve the number of returned relevant cases. The system behaves stably, with increasing correctness, up to a ratio of 20-30% wrong decisions per feedback iteration.

1 Introduction

Due to the demographic development and increasing life expectancy in industrial countries, the time a doctor can spend with a patient will decrease dramatically over the next few years. This means that an ever increasing number of patients has to be treated by a limited number of physicians, who in turn have only limited time and means to find and validate their diagnoses. Under these constraints, the risk of false decisions and of overlooking critical developments will increase.

Computer-assisted diagnosis (CAD) systems are one (technological) aspect of a possible solution to these pressing problems in tomorrow's health care system. Modern CAD systems in general are interactive systems which provide direct diagnostic support by means of case databases, which may also make use of hospital information (HIS) and picture archiving (PACS) systems. One paradigm behind CAD systems is the so-called Case-Based Reasoning (CBR) principle, which does not try to establish abstract mathematical rules but works by finding similarities in the evidence provided by characteristic and representative parameters in images, e.g. from ultrasound, tomography, X-ray microscopy or endoscopy. Therefore, it is of utmost importance to


represent and provide access to the diagnostic knowledge of experts for a wide variety of medical disciplines.

2 State of the art

Over the last decade, multiple commercial and scientific CBIR systems have been developed, also in the medical field. The Query by Image Content (QBIC) system [2] developed by IBM is one of the most widely known commercial CBIR systems. Searches can be applied based on color, contour and texture features. Queries can be formulated by means of a sample image, i.e. query by example (QBE), or a sketch of the image. Already in the early versions, relevance feedback was used to enhance retrieval accuracy. WebSEEk is an online version of VisualSEEk for the internet and the first CBIR system developed by Columbia University [10]. Its particularity is the possibility to capture visual features with spatial relationships. This is implemented by a graphical editor which allows the drawing of geometric primitives to sketch the type of images the user is interested in. Color histograms and texture features are indexed by binary trees. Based on the GNU Image Finding Tool (GIFT) [11], medGIFT is a system for CBIR on medical images developed at the University of Geneva. Some adaptations have been made to GIFT so that the higher gray-level dynamics occurring in radiologic image modalities can be exploited [5]. The Image Retrieval in Medical Applications (IRMA) system is also specialized to radiologic image modalities and allows application-independent classification of X-ray images [4]. A major inspiration for the retrieval part of our system was the Multimedia Analysis and Retrieval System (MARS) developed at the University of Illinois at Urbana [9], which emphasizes the iterative refinement of retrieval by relevance feedback algorithms.

In contrast to the described CBIR systems, our work concentrates on domain-specific images which are to be classified in a pathologic sense, i.e. we do not want to search for, e.g., CT scans of the head in a pool of miscellaneous radiologic images, but for more subtle changes within a domain of images such as endoscopic images of the esophagus. By means of relevance feedback, users have the ability to interact with the CAD system by judging and evaluating the resulting images with a relevance value, thus directing the image search in the appropriate direction. The system has been implemented with a graphical user interface designed for non-specialist end users for easy access to the case database. This concentration on a narrow application field has a great influence on the features used, which therefore have to be optimized. To sum up, the methodology is applicable to a wide range of different modalities, but treats each domain by separate databases with customized features.

The primary objective of our medical application scenario in this work is the early detection of the so-called Barrett's esophagus, which is a pre-malignant state of the epithelium in the upper digestive tract, i.e. near the so-called cardia, the connection between esophagus and stomach. High-resolution digital color images from a


magnifying flexible endoscopy system allow the visualization of fine-granular structures of the gastrointestinal mucous tissue after acid instillation of the mucosa.

3 Knowledge Representation

As mentioned in the abstract, one key element of our CAD system is a reference knowledge database which contains certified medical cases, consisting of representative images depicting objects and regions of interest, as well as classifications for these objects which are based on expert knowledge and additional means such as standard biopsy and histological analysis. Since the information retrieval for the diagnostic support is done by a direct comparison of representative features extracted from the input image, or from interactively marked regions of that image, with reference feature vectors, the organization of our knowledge database consists on the lowest level of a so-called feature database, where each pre-classified reference object is represented by a characteristic feature vector calculated from color texture analysis. For each feature vector, a pointer is stored towards its original image patch or image object, which in turn carries information about the expert classification of the image object and a pointer towards the original image. Finally, for each individual image in the knowledge base, demographic information about the corresponding patient can be stored, as well as pointers to additional external information which might be needed to support the diagnosis. Thus, in a broad sense, the complete knowledge is represented in a semantic network, where the nodes represent images, image objects or image regions and attributes, and where the vertices represent inclusive relations between these objects, such as 'has-a' or 'belongs-to' relations [12].

For storage purposes and persistence, the knowledge database including references to the annotated image data objects can be stored in a generic XML format [3, 13] which is similar to the well-known MPEG-7 standard for image annotation. For the annotation of the images and the connection of expert knowledge with descriptive feature vectors of pathological image regions, an especially designed software package of the Fraunhofer IIS was used. Using this annotation tool, expert physicians were able to perform an interactive segmentation and classification of suspicious lesions in the digitized images.

4 Color Texture based Image Retrieval

Different color and color texture based feature extraction algorithms have been used in image retrieval so far. For the purpose of this work we conducted a few initial comparisons of different algorithms (e.g. color histograms, co-occurrence features, sum- and difference-histograms, local binary patterns) and found the color version of Chen's statistical geometrical features (SGF) [1] to perform advantageously when applied to the classification problem on our database of 482 pre-classified regions.


These 16 statistical measures of the SGF are based on the geometrical properties of connected regions in a series of binary images. These binary images are produced by thresholding operations on the gray scale image under investigation. Geometrical properties like the number of connected regions and their irregularity, together with statistics describing the stack of binary images, are used. A color version of this algorithm, described in [6], combines binary images of different color channels by means of boolean operations and thus captures dependencies between different spectral wavelengths. Finally, a feature vector of 48 scalars is used to measure similarities of images or regions of interest (ROI), respectively. All features are statistically normalized to zero mean and unit variance.

With p ∈ ℝ^L being a feature vector with L elements in the case database and q ∈ ℝ^L the feature vector of the query image, the distance measure is defined as a generalized Euclidean distance by

    d(p, q) = [ (p − q)^T W (p − q) ]^(1/2) .    (1)

In the simplest case the weighting matrix W is the identity matrix I, yielding the Euclidean distance with equal weighting of all features. To implement the feature weighting relevance feedback described in the next section, a diagonal matrix W_D or a fully occupied matrix W_G can be used. The resulting distance of each database vector to the query vector is the ordering criterion for the retrieval result of the CBIR system.
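For concreteness, a small NumPy sketch of this distance computation (our own illustration, not the system's actual code):

import numpy as np
from typing import Optional

def generalized_distance(p: np.ndarray, q: np.ndarray,
                         W: Optional[np.ndarray] = None) -> float:
    """Generalized Euclidean distance d(p, q) = [(p - q)^T W (p - q)]^(1/2).

    With W = I (the default) this reduces to the plain Euclidean distance;
    a diagonal W_D or fully occupied W_G realizes the feature weighting
    used for relevance feedback (Sec. 5)."""
    d = p - q
    if W is None:
        return float(np.sqrt(d @ d))
    return float(np.sqrt(d @ W @ d))

# Ranking the case database against a query vector q:
# order = sorted(range(len(db)), key=lambda i: generalized_distance(db[i], q, W))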

5 Relevance Feedback

Basically, there are two types of relevance feedback which are used in the literature and also implemented in our CAD system. The first one works by optimizing the query vector and is called query point movement (QPM), while the second one weights the feature dimensions and is known as feature dimension weighting (FDM). At this point some notational conventions have to be introduced. We denote by I the set of indices of all feature vectors in the case database. After a query, the set of resulting indices is denoted by I_R ⊂ I. In the feedback process, the user divides this result into the set of relevant indices I_R^+, the irrelevant indices I_R^-, and a neutral set I_R^o about which he does not care. The index sets are complete (I_R = I_R^+ ∪ I_R^- ∪ I_R^o) and mutually exclusive (I_R^+ ∩ I_R^- = ∅, I_R^+ ∩ I_R^o = ∅, I_R^- ∩ I_R^o = ∅).

QPM is the process of calculating a new query vector q_{t+1} based on the retrieval result of the t-th iteration and the previous query vector q_t. In the first iteration the original query vector is used: q_0 = q. Thus, in each refinement step the query vector moves toward the optimal query vector q*. Developed for the purpose of information retrieval, Rocchio's formula [8] is widely used in CBIR applications. The new query vector q_{t+1} is computed as a weighted average of the previous vector q_t and the positive feedback vectors, discounting the negative feedback vectors, as

    q_{t+1} = α·q_t + (β / |I_R^+|) · Σ_{n ∈ I_R^+} p_n − (γ / |I_R^-|) · Σ_{m ∈ I_R^-} p_m    (2)

with weightings α, β and γ. Depending on the weights, the convergence and stability of the QPM can be adjusted. A quantitative evaluation of these parameters can be found in the experimental part of this work. In [7] a modified version of (2) including relevance weights is used.
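A direct transcription of Eq. (2) into NumPy might look as follows (our own sketch; variable names are ours):

import numpy as np

def rocchio_update(q_t, positives, negatives, alpha=1.0, beta=1.0, gamma=0.0):
    """One QPM refinement step following Eq. (2). `positives` and
    `negatives` are the feature vectors indexed by I_R^+ and I_R^-;
    neutral results are ignored. (alpha, beta, gamma) = (1, 1, 0)
    corresponds to the weighting '110' in the experiments below."""
    q_next = alpha * np.asarray(q_t, dtype=float)
    if len(positives) > 0:
        q_next += beta * np.mean(positives, axis=0)
    if len(negatives) > 0:
        q_next -= gamma * np.mean(negatives, axis=0)
    return q_next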

Relevance feedback by FDM is mainly concerned with optimizing the weighting matrix W in (1). A diagonal weighting scheme is proposed in [9]. The diagonal matrix W_D is computed from the inverses of the standard deviations of the features in the positive relevance feedback set. Formally, the diagonal elements of W_D are computed as

    w_ll = (1/σ_l) / ( Σ_l (1/σ_l) )    (3)

where σ_l is the standard deviation of feature l within the relevant vectors. With the diagonal weighting matrix W_D, the feature space is expanded in the direction of features with low variance. Thus, features which show a consistent behaviour within the set of positive feedback vectors receive more weight in the final distance calculation. Note that the neutral and irrelevant vectors are not considered.
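A corresponding sketch of the diagonal weighting (our own illustration; the eps guard is our addition for the zero-variance edge case, which the paper does not discuss):

import numpy as np

def fdm_weights(positives: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Diagonal weighting matrix W_D following Eq. (3): feature l gets
    weight (1/sigma_l) / sum_l (1/sigma_l), computed from the positive
    feedback vectors only (one row per vector)."""
    sigma = positives.std(axis=0) + eps
    inv_sigma = 1.0 / sigma
    return np.diag(inv_sigma / inv_sigma.sum())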

6 Experiments and Results

Validation of our CAD prototype was conducted on a data set of 482 pre-classified regions from high magnification zoom endoscopy. The images originate from an ongoing study to evaluate methods of color texture analysis for the early detection of Barrett's esophagus by classification of different types of mucous tissue inside the esophagus. All images were acquired by a high-resolution magnification endoscope (Olympus GIF Q160Z) after application of acid solution to enhance mucous structures. For each tissue class, irregularly bounded ROIs were classified by clinical experts with histologic confirmation by standardized biopsy. The whole data set includes 390 images with a total of 482 ROIs.

To obtain the accuracies, a leave-one-out scheme with 10 returned images was applied. We simulated a perfect observer by using all returned cases matching the current class label as positive feedback and the others as negative feedback. Later, artificial false decisions based on a random number generator were included.
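A sketch of how such a simulated observer could be implemented (our own illustration; all names are ours):

import random

def simulate_feedback(returned, query_label, false_rate=0.0, rng=None):
    """Split a retrieval result into positive and negative feedback sets.
    `returned` is a list of (index, class_label) pairs; a perfect observer
    uses false_rate = 0.0, and each judgement is flipped with probability
    false_rate to model human error."""
    rng = rng or random.Random(0)
    positives, negatives = [], []
    for idx, label in returned:
        relevant = (label == query_label)
        if rng.random() < false_rate:   # simulated wrong decision
            relevant = not relevant
        (positives if relevant else negatives).append(idx)
    return positives, negatives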

Our first experiment investigates the influence of different weightings in Rocchio's formula (2). Fig. 1(a) shows the classification accuracies for the initial query and 9 successive refinement iterations. The weights are displayed in the order α, β, γ and afterwards normalized to unity. Whenever negative feedback vectors are included (xx1), classification accuracy deteriorates with the first feedback iteration. The trivial case 100, i.e. α = 1, β = 0, γ = 0, remains constant, and including the positive feedback (010 and 110) improves the results significantly.

Fig. 1. Evaluation of QPM and FDM relevance feedback for the Barrett esophagus database with 10 refinement iterations. Each panel plots classification accuracy over the feedback iterations. (a) Different weightings (α, β, γ) for Rocchio averaging query point movement. (b) Feature dimension weighting with different Rocchio weightings. (c) False decision influence evaluated for different noise levels (percentage of false decisions) with diagonal FDM weighting and α = 1, β = 1.


FDM alone (Fig. 1(b), 100) has only a minor impact on the overall accuracy. However, combined with Rocchio QPM using positive feedback vectors, a small improvement can be obtained over pure QPM.

A final experiment is aimed at the stability of the system with respect to false decisions by a human user. Therefore, we used a random number generator and a pre-configured false decision rate to simulate human error rates. The results for FDM combined with QPM (110) at different false decision rates are compiled in Fig. 1(c). Up to a 20% error rate almost nothing changes, up to a 40% error rate the classification accuracy still improves slightly, and with 50% false decisions a deterioration is inevitable.

7 Conclusion

In this work we presented the evaluation of a prototypical CAD system which allows the retrieval of diagnostic information from a case database by means of CBIR and color texture analysis.

We have validated our system on a comprehensive data set of 482 pre-classified regions showing three different types of tissue. With ideal feedback an improvement of 15% in retrieval accuracy could be obtained. Our simulations assuming perfect observers show that feedback iterations can improve the number of relevant cases significantly. The system behaves stably, with accuracy still increasing, up to a ratio of 20-30% wrong decisions per feedback iteration. Thus, it is able to access relevant cases if at least a moderate understanding and judgement of image content on behalf of the user can be assumed.

Therefore, we believe that our approach to support diagnostic reasoning by content-based image retrieval has the potential to provide a benefit for our health care systems. In that direction, a clinical evaluation of our system will be one of the next steps. Another important research direction is how to incorporate the information fed into the system by user feedback, a process also called memory learning, which is expected to further improve the retrieval process.

Acknowledgements

The authors would like to thank PD Dr. B. Mayinger, Krankenhaus Pasing, and PD Dr. S. Mühldorfer, Klinikum Bayreuth, for providing the annotations of the image database.

References

1. Y. Q. Chen, M. S. Nixon, and D. W. Thomas. Statistical geometrical features for texture classification. Pattern Recognition, 28(4):537–552, September 1995.

2. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, and B. Dom. Query by image and video content: The QBIC system. Computer, 28(9):23–32, 1995.

3. M. Grobe, H. Kuziela, C. Münzenmayer, T. Wittenberg, and R. Schmidt. Erstellung von klassifizierten Referenzdatensätzen durch Experten für die Evaluierung von Algorithmen. In U. Boenick and A. Bolz, editors, Proc's DGBMT 2004, volume 49, Ergänzungsband 2, Teil 2 of Biomedizinische Technik, pages 916–917. 38. DGBMT Jahrestagung BMT 2004, 22.–24. September 2004, Technische Universität Ilmenau, 2004.

4. T. Lehmann, B. Wein, D. Keysers, J. Bredno, M. Güld, H. Schubert, and M. Kohnen. Image retrieval in medical applications: The IRMA approach. In VISIM Workshop: Information Retrieval and Exploration in Large Medical Image Collections, Fourth International Conference on Medical Image Computing and Computer-Assisted Intervention, Utrecht, The Netherlands, October 2001.

5. H. Müller, A. Rosset, J. Vallee, and A. Geissbuhler. Comparing feature sets for content based image retrieval in a medical case database. In A. Geissbuhler, editor, Proceedings of the Medical Informatics Europe Conference, St. Malo, France, 2003.

6. C. Münzenmayer, H. Volk, D. Paulus, F. Vogt, and T. Wittenberg. Statistical geometrical features for texture analysis and classification. In 8. Workshop Farbbildverarbeitung, Autorenvorträge, pages 87–94, Ilmenau, 2002. Zentrum für Bild- und Signalverarbeitung e.V. Ilmenau.

7. M. Ortega and S. Mehrotra. Handbook of Video Databases: Design and Applications, volume 8 of Internet and Communications Series, chapter Relevance Feedback in Multimedia Databases, pages 103–109. CRC Press, 2003.

8. J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313–323. Englewood Cliffs, 1971.

9. Y. Rui, T. Huang, and S. Mehrotra. Content-based image retrieval with relevance feedback in MARS. In Proceedings of the IEEE International Conference on Image Processing, pages 815–818, October 1997.

10. J. Smith and S. Chang. Querying by color regions using the VisualSEEk content-based visual query system. Intelligent Multimedia Information Retrieval, pages 23–41, 1997.

11. D. Squire, W. Müller, H. Müller, and T. Pun. Content-based query of image databases: inspirations from text retrieval. Pattern Recognition Letters, 21(13–14):1193–1198, 2000.

12. T. Wittenberg. The need of annotation for reference image data sets. In H. Lemke, K. Inamura, K. Doi, M. Vannier, and A. Farman, editors, Proc's 19th Int. Congress and Exhibition Computer Assisted Radiology and Surgery (CARS) 2005, pages 453–458, 2005.

13. T. Wittenberg, M. Grobe, H. Kuziela, C. Münzenmayer, K. Spinnler, and R. Schmidt. Tools and data structures for content annotation in medical reference images. In Proc. 49. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Innsbruck, 2004.


Visual Scene Memory Based on Multi-Mosaics

Birgit Möller and Stefan Posch

Institute of Computer Science, Martin-Luther-University Halle-Wittenberg, 06099 Halle/Saale, Germany,
{moeller,posch}@informatik.uni-halle.de,
WWW home page: http://www.informatik.uni-halle.de/~posch/AG/

Abstract. Visual data acquired with active cameras yields an important source of information for interactive systems. However, since image sequences usually comprise large data volumes and notable portions of redundant information, their analysis is often difficult. Hence, data structures are required that allow for a compact representation of image sequences. In this paper we introduce our concept of a visual scene memory. The memory is based on mosaic images, enabling compact image sequence representation by fusing all sequence images into one single frame while eliminating redundancies. Since interactive systems put special demands on mosaicing techniques, we developed a new mosaic concept called multi-mosaics, well-suited for use with interactive systems. The memory is focussed on adequate representation of iconic data; however, it is not restricted to it. Rather, higher-level data, particularly motion data as well as data suitable for active camera control, are additionally included, completing the visual scene representation.

1 Introduction

Visual data is one of the most important sources of information for interactive and mobile artificial systems. Active acquisition of this data enables these systems, at least in principle, to autonomously act in dynamically changing environments and to perform intuitive interactions with human communication partners. However, interactive and especially mobile systems usually accommodate only limited resources to store and process data. As a consequence, it is not possible for these systems to store and process all the redundant image data acquired by an active camera. Rather, sophisticated mechanisms for efficient data selection and storage are required that enable the systems to gather visual data and delay analysis as needed by later requirements.

Image sequences contain different kinds of information, dynamic as well as static data. Additionally, the data implicitly cover different levels of abstraction, ranging from pure iconic information over intermediate-level primitives like edges or corners to semantic information like object recognition results. As the level of abstraction increases, the compactness of the structures used to represent this knowledge increases as well. However, the more abstract the data is, the more limited is its applicability (e.g. 3D data for robot navigation or specific features for object recognition purposes).

In this paper we present our concept of a visual scene memory for representing iconic multi-resolution image data. The basis for this memory is given by mosaic images that enable efficient representation of image sequences acquired with active cameras. Such a memory supports a wide variety of possible areas of application due to its unspecialized, low-level data representation. However, since mosaic images extend a camera's field of view in space as well as in time, it is straightforward to enhance the pure iconic representation with additional higher-level data that also might benefit from an extended field of view. Thus, our visual memory also supports representation of higher-level data like motion data as well as feature maps for autonomous scene exploration suitable to control active cameras. This yields a scene representation covering different levels of abstraction.

Using a visual scene memory based on mosaic images with interactive systems puts special demands on the algorithms used. On the one hand, it is essential to support online data integration and easy data updates. Furthermore, an easy to use interface for data access is required which in particular has to support the application of conventional image analysis techniques directly to the data. Fulfilling these requirements has led to the development of a new mosaic image concept called multi-mosaics that enables interactive systems to efficiently represent and analyse image sequence data acquired with active cameras.

2 Mosaic Image Basics

Mosaic images are a widely used approach for efficient representation of image sequence data acquired with active cameras. The basic idea is to warp all images of a given sequence into a common coordinate frame applying suitable transformations (registration). Subsequently, one single mosaic image is constructed from all warped images by fusing their color information (integration). A mosaic image thus extends a camera's field of view in space and time and allows to eliminate redundancies within a given sequence. Consequently, the data volume of a sequence is significantly reduced when represented in terms of a mosaic image, and hence data storage as well as analysis is notably simplified.

Image sequence registration is usually based on a suitable mathematical model for the camera motion. It allows to describe changes between subsequent images of a sequence induced by the camera motion. The complexity of possible models mainly depends on the degrees of freedom of the camera and on the scene structure. In our framework we use stationary but rotating and zooming cameras. Movements of such cameras can be described by a projective motion model. Although such cameras force mobile systems to stay at a fixed position within a scene during data acquisition, most of the time scenes can adequately be modeled by acquiring image data from a few "key positions" within a scene. Thus, it is usually not necessary to allow arbitrary camera movements, which usually cannot be modeled by closed form transformations at all.

The motion of stationary rotating and zooming cameras can be described using homographies with 8 degrees of freedom. During registration, parameters of this model are estimated for each image of a sequence that allow to warp the image into the common coordinate frame. In our system parameter estimation is accomplished with the perspective flow approach [1]. It is based on optical flow computations restricted by the projective motion model. For image integration, new image data is essentially copied region-wise to the final mosaic image. To smooth discontinuities along region boundaries, appropriate blending functions are applied.

3 Multi-Mosaics

As already outlined, using mosaic images with interactive and particularly mobile systems enforces special constraints on the mosaicing algorithms that exclude many existing approaches from being applied directly to this new area of application. Primarily, mosaicing has to be done in online mode. Due to the limited resources of mobile systems, complete image sequences cannot be stored and processed, as e.g. proposed in [2] or [3]. Rather, it is necessary to register and integrate each new image immediately as it becomes available to overcome the need for storing all sequence images explicitly.

Fig. 1. Exemplary multi-mosaic: image data is projected onto a polytopial coordinate frame minimizing distortions while providing Euclidean coordinates.

A second important aspect when representing mosaic images is to choose an appropriate reference frame the sequence images are warped into. Common choices for such frames are, for example, a single plane, a cylinder or a sphere. The latter two choices allow for adequate and distortion-free representations of image data acquired with rotating cameras as used in our approach. With regard to interactive systems, however, representing image data of rotating cameras in spherical coordinates has drawbacks like singularities when representing the complete viewing sphere and the absence of collinearity. Since the vast majority of existing image analysis algorithms depends on Euclidean coordinates, they cannot be applied to mosaic data projected onto spheres. This would severely restrict possible areas of application for the memory. Thus, our approach is based on polytopes that yield piecewise planar approximations of a sphere and, hence, reduce distortions while at the same time providing Euclidean coordinates. Additionally, a mosaic consists of a set of differently scaled polytopes nested into each other to account for adequate representation of multi-resolution data resulting from a zooming camera. According to the current focal length of the camera, the polytope instance is chosen for data projection that minimizes scaling effects. The resulting visual memory data structure consisting of multiple planes and multiple levels of resolution is called a multi-mosaic image (Fig. 1).


Fig. 2. Polytope with focus image plane attached.

Besides providing Euclidean coordinates, multi-mosaics also support efficient online mosaicing. Although the piecewise planar tiles already enable easy registration and integration of new data, we adopt an additional plane, the so-called focus image plane, to further improve the handling of the memory structure and to lessen the influence of discontinuities between neighboring tiles. The focus plane is attached tangentially to the polytope (Fig. 2). New image data is directly registered and integrated into this plane, hence polytope access is omitted. The focus plane traces the camera trajectory, and its position and orientation are updated if the camera orientation differs too much from its current orientation. Only in these situations is image data copied into the multi-mosaic data structure. Hence, the focus plane serves as a kind of mediator between input data and memory. It stores the most recent data for direct access while the polytope itself yields a longer-term iconic memory.

4 Extensions to Higher-Level Data

The multi-mosaics provide an efficient iconic representation of image sequences. They yield a large flexibility in data analysis by supporting the direct application of existing image analysis algorithms. Nevertheless, the representation can be further improved by additionally providing data structures to include higher-level data resulting from intermediate processing steps as well. Employing the extended view of the multi-mosaics to represent these data allows for more flexibility in analyzing image sequences and finally leads to a better exploitation of the available data to improve the capabilities of interactive systems. Our implementation is currently focussed on representing motion information as well as data for guiding active scene exploration, as outlined below. However, other kinds of data can easily be included as well, and preliminary work in this direction has already been done to include object recognition results.

4.1 Motion Data

One kind of higher-level data important for scene analysis and understanding is motion data. Detection of independently moving objects not covered by the global motion model yields the basis for extracting the dynamic data contained in image sequences and, thus, is of high importance for scene understanding. In addition, detecting these movements is important for registration and integration since they often deteriorate parameter estimation and cause integration errors.

To handle moving objects, motion detection and tracking algorithms are thus included in our memory. Independently moving objects are first detected by computing intensity residuals. Subsequently, moving pixels are masked from integration and subsequent registration steps. In addition, they are segmented into regions and connected components which are tracked over time to extract the trajectories of moving objects (Fig. 3).

Fig. 3. Representation of higher-level data: moving objects are represented in terms of their trajectories included within a correspondence graph data structure.

Temporal correspondences of connected components and related trajectories are then represented in an additional data structure, the correspondence graph. Besides encoding the trajectories of moving objects it also allows to derive rudimentary interpretations of scene data [4].
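A minimal sketch of such a correspondence graph (our own, much simplified rendering; the actual structure in [4] is richer):

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NodeId = Tuple[int, int]   # (frame index, connected-component id)

@dataclass
class CorrespondenceGraph:
    """Nodes are connected components of moving pixels per frame; edges
    encode temporal correspondence, so each path through the graph is
    the trajectory of one moving object."""
    nodes: Dict[NodeId, Tuple[float, float]] = field(default_factory=dict)
    edges: List[Tuple[NodeId, NodeId]] = field(default_factory=list)

    def add_component(self, frame: int, comp_id: int,
                      centroid: Tuple[float, float]) -> None:
        self.nodes[(frame, comp_id)] = centroid

    def link(self, prev_node: NodeId, curr_node: NodeId) -> None:
        self.edges.append((prev_node, curr_node))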

4.2 Active Camera Control: Scene Exploration

One important question that has to be answered when adopting active cameras for data acquisition is how to control the camera's movements. Often camera control is done by simulating mechanisms of human visual attention and scene exploration [5]. In doing so, focus points are automatically selected according to local interest measures calculated on the current input image. Compared to these single-view approaches, mosaic images yield a richer source of information for focus point selection. Their temporally and spatially extended field of view makes it possible to exploit all visual data of the scene acquired so far. Therefore, the attention mechanism allows the system to also explore scene parts not currently visible, which is not feasible without an iconic memory. Using an iconic memory may also be of importance if the relevance of a given measure varies in the course of time due to changing requirements, for example via user interaction, and computing all conceivable measures of interest in advance is prohibitive due to their potential number. Given the multi-mosaic, the measures may be computed on demand when the information is actually required. Our multi-mosaic memory also supports the representation of interest measures. This is accomplished with an additional polytope for each interest measure to be represented. Currently, local entropy and motion information are used for focus point selection. However, due to the Euclidean coordinate system of the multi-mosaics, arbitrary image processing algorithms can be applied to the data, allowing for flexible feature extraction.

Fig. 4. Mosaic image captured by autonomous scene exploration. The image clips show regions automatically selected for detailed exploration by camera zooming due to high entropy (blue boxes) or motion (red box), as can be seen from the images on the left.

Figure 4 shows an example mosaic image automatically acquired by active scene exploration. The clips enlarged were automatically chosen to be explored in detail by appropriate camera zoom. The selection was based on high entropy and motion information. This example demonstrates that the new multi-mosaic concept is well-suited not only to support efficient iconic and even higher-level representations of image sequences acquired with active cameras but also to support autonomous active scene exploration. Thus, the visual memory supports interactive systems with an integrated framework for image sequence representation and active data acquisition.

5 Conclusion

The visual memory presented yields a well-suited approach to efficiently represent image data acquired with active cameras. The iconic representation supports a wide variety of possible areas of application; in particular, mobile robots that perform interactions with humans will benefit from such a visual memory [6]. In this setup, the mobile system acquires multi-mosaic images from different positions while waiting for requests from the human communication partner. The mosaics yield a rich source of information that can be exploited to solve specific tasks according to future demands, e.g. object learning. Thereby the memory is not restricted to pure iconic data but also supports the representation of higher-level data. Such a strategy based on a visual memory is superior to collecting data always "just in time" when it is actually required, and thus significantly improves the flexibility of interactive systems acting in everyday life environments.

References

1. Mann, S., Picard, R.: Video orbits of the projective group: A new perspective on image mosaicing. Technical Report 338, MIT Media Laboratory Perceptual Computing Section, Boston, USA (1996)
2. Bishop, G., McMillan, L.: Plenoptic modeling: An image-based rendering system. In: Proc. Int. Conf. on Computer Graphics and Interactive Techniques (SIGGRAPH), Los Angeles, CA (1995) 39–46
3. Shum, H.Y., Szeliski, R.: Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment. Int. Journal of Computer Vision 36 (2000) 101–130
4. Möller, B., Posch, S.: Analysis of object interactions in dynamic scenes. In: Pattern Recognition, Proc. of DAGM Symp. LNCS 2449, Schweiz, Springer (2002) 361–369
5. Wolfe, J.: Visual attention. In De Valois, K., ed.: Seeing. 2nd edn. Academic Press, San Diego, CA (2000) 335–386
6. Möller, B., Posch, S., Haasch, A., Fritsch, J., Sagerer, G.: Interactive object learning for robot companions using mosaic images. In: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Edmonton, Alberta, Canada (2005) to appear.


The Mental Continuum: Control Models for Virtual Humans in Real World Situations

Johannes Strassner1, Marion Langer2, and Stefan Müller2

1 Rhönstr. 7, 64354 Reinheim, Germany, Email: [email protected]
2 University of Koblenz and Landau, Institut für Computervisualistik, Universitätsstraße 1, 56016 Koblenz, Germany

Abstract. Virtual humans serve as configurable interfaces, sensing the virtual world, storing knowledge and using it to perform tasks. For many tasks it is important to know where an object is and what is known about it in order to perform the task. Vision can cluster objects dependent on the task, so the same objects need to be grouped in different ways depending on the situation. Visualization is used to visually construct views of the environment and to maintain the spatial knowledge. We propose the mental continuum for the perception and knowledge pipeline, which allows describing the current knowledge about an object by its level of detail and its form. A virtual human can use this model to independently perceive and forget the where and what forms of knowledge.

1 Introduction

While humans explore the world they perceive their environment and continuously collect information from it. The perceptual process integrates perception with knowledge. It consists of a number of modules, such as perception, action, attending a stimulus and knowledge processing. Apart from other sensory systems, like the auditory and haptic ones, visual sensing is an important input for humans and supports understanding of new situations and tasks. To get the detailed input we are looking for (e.g. in any search), we are forced to direct our gaze directly at the object of interest. This traces back to the fact that humans only have good visual acuity in a small area where the cones are concentrated, the fovea [1]. The cones are responsible for sensing color, while the rods, which are densely spread over the peripheral retina, are highly sensitive to light. Thus, searching for a familiar face in a crowd results in looking at one face after another, bringing every person's face into the all-cone foveal vision to receive enough detail to recognize the person. The rest of the faces in the rod-rich peripheral retina cannot be recognized because of a lack of acuity. Nevertheless, these "things" in the periphery can still be identified as faces and humans.

Virtual humans can be equipped with a number of capabilities which allow them to interact and behave in a virtual environment similarly to humans. They need to perceive the properties of objects and to maintain personal knowledge about the environment in order to perform a task. Task descriptions for synthetic characters use properties of objects to integrate characters into the environment and enable them to search for and interact with objects. Knowledge about the environment is perceived while they explore the environment and can be forgotten again. In this paper we propose the mental continuum for the perception and knowledge cycle. The knowledge about an object is available at a varying level of detail. The continuum captures the details of an object so that there can be, in principle, continuous variations of the details of the object. The design of the continuum has been inspired by Kosslyn's mental images [2], which describe images as functional representations in the human memory. They can capture continuous variations of shape and can be processed by mechanisms similar to real object perception. Images may be less detailed than the corresponding percept because of memory capacity limits.

2 The Mental Continuum

We use the level of detail (LOD) as an underlying representation for the perception cycle. Every module of the perception and knowledge process describes its state on a detail axis. Because all of the modules share the same ground, the perceptual cycle can be easily extended or adapted to special applications, or modules can be exchanged with more sophisticated ones. The knowledge about an attribute of an object is available with varying level of detail. The correct value of an attribute is provided by an author or can be calculated by a procedure. We identified different kinds of knowledge, which we denote as forms of knowledge. Each form is perceived and forgotten independently using a different model matching the characteristics of the form. Forms organize the knowledge into parallel streams, which can be independently implemented and used in a task.

For many tasks it is important to know where an object is and what is known about this object in order to perform the task. Accordingly, we distinguish separate forms describing where an object is located and what an object is. Depending on the LOD, this knowledge is more or less accurate. So a buyer might remember the exact location of a stand on a market place but only roughly the seller's hair color and size. Or he remembers only approximately the area where a restaurant is located, but the exact price of the daily dinner they serve.

Applications consist of a number of situations. Situations provide a constrained view on the world. They regulate how to do a task and integrate the knowledge of the what- and where-forms. Knowledge can be available in all situations or only in some. We denote knowledge which is only available in some situations as "situated". The different forms "where", "what" and "how" build a coordinate system (Figure 1).

3 The Perception System

The system provides a number of modules to perceive, recognize, store and remember an object with a certain level of detail. The LOD is the central mechanism of the system, which allows the modules to handle the data by the same abstract parameter mechanism.


Fig. 1. Where-what-how (wwh) diagram: objects carry where- and what-knowledge; situations and tasks are arranged along the how axis.

A virtual human needs to perceive or know relevant objects in order to perform a task in an environment. But often the environment is only incompletely known, because virtual humans cannot attend to and store everything. What is stored in the memory depends on the perceiver, the task and the environment. The perception knowledge system takes care of the different dependencies and adapts the knowledge using filters. A task can directly use the current knowledge about an object stored in the LOD memory.

We now describe the different modules of the perception and knowledge system (Fig. 2). We use a small number of task primitives to describe activities for actors. The intention is to provide the actor with a list of activities which are directly connected to the state of the attributes of an object. The executor receives a list of task primitives which should be executed. Priorities are assigned to every task primitive. A task primitive can be evaluated by the system if the required attributes are available with a certain level of detail. The evaluation is finished if all attributes are available with the maximum level of detail or the goal is reached. Missing detail is acquired by other task primitives automatically added to the task list with the same priority as the task primitive which needs the information.

Vision groups objects dependent on their gestalt, e.g. several attributes of the objects or their proximity, during a preattentive phase, which is called perceptual grouping [3]. The object definition maps the objects of the environment to a set of objects which depends on their current meaning for the task. The object definition has to be dynamic and usually independent from the rendering-oriented world scene graph.


Fig. 2. Perception and knowledge modules: 3D environment, view visualization, false-color view, object definition, object recognition, filters, task list, task primitives, executor, space objects, object attributes, knowledge forms (KF), and the LOD object memory.

The same objects or parts of objects need to be grouped in different ways according to the situation. To keep it independent from the modelling of 3D objects, it has to be possible to visually build objects also from parts of a 3D object. The view visualization renders every relevant object from the definition phase with a unique false color. This makes it easier to deal with occlusion problems, to separate the different objects appearing in the character's view, or to identify free space for navigation. Objects can also be too small, almost completely occluded or too far away for perception. This results in pixels humans usually cannot separate from the background or that are associated with related objects. Visualization should develop strategies to prevent such "unwanted" pixels.

The recognition defines an upper bound for the level of detail with which an object can be recognized. Objects can be identified by their unique color. But if they are occluded they need to be recognized with less detail.

The filter for the current situation maps the possible knowledge about an object to the appropriate level of detail (LOD). The possible detail of an attribute belonging to an object is a set (LOD1: value1 = weight1, LOD2: value2 = weight2, ..., LODn: valuen = weightn). Filters can reduce the set by iteratively building sub-groups of the possible detail and by defining a position LOD in the current sub-group. The resulting detail can also be described by a normalized weight, a float value between 0.0 and 1.0. The LOD is then the LODi whose weighti is closest to the weight. So it is possible to describe the LOD for an attribute or a number of attributes independently from a set or a sub-group.

A knowledge form is a mechanism which can build an ordered number of details of an attribute. It consists of a subset of the object's attributes and an LOD method. The LOD method produces or provides the appropriate detail for the attributes according to the LOD value. The result is stored in the object memory.

4 Synthetic Visual Perception

Our approach to synthetic vision combines visualization with a mechanism to interpret the view. Visualization allows us to simulate grouping mechanisms similar to preattentive perception [3]. The view mechanism can be used to interactively build and test behaviours which depend on the visualization. This includes e.g. searching for objects or navigating through an environment. Our synthetic vision is based on a model described by Noser et al. [4], which uses false-coloring and dynamic octrees for representing the visual memory. Kuffner et al. [5] removed Noser's octree structure. Peters and O'Sullivan [6] then extended the model by providing different vision modes, which are associated with separate color tables. Using those modes, different levels of information can be expressed. They refer to the two main modes as the distinct mode, in which each object is false-colored, and the grouped mode, in which groups of objects receive one false color. The objects could be grouped e.g. according to luminance, type or shape. Bordeux et al. [7] introduced a perception pipeline which filters the objects of the scene graph. The presented filters concentrated on the object space only. Conde [8] works on the integration of different perception modalities to understand perceptual organization and to create mental representations of reality. Our synthetic vision system uses a quadtree instead of Noser's octree because it better matches objects projected onto a 2D plane or map-like representations of an environment. We also use a false color technique which can visually split objects into parts. We use virtual structures of perceptual scenes which are independent from the scene graph, because objects need to be grouped in different ways according to the task and the perceiver.

4.1 Visualization

Visualization is used to visually construct views of the environment to support virtual humans, agents or other processes. Perceptual grouping of objects is visualized by assigning a unique false color to every object (group). We use these visualizations in two modules in different ways:

1. character's view: A false-colored image of the scene is rendered from the character's point of view to support its vision. By limiting the image size we can restrict the perception to objects which are not almost completely occluded. The resolution depends on the application. For the market place we obtain the best results with images of 64x64 pixels.

2. bird's-eye view: We need to describe where an object is, which depends on the perceived size. Humans are pretty good at guessing the size of an object. For simplification we do not model the visual process to determine the position and size of an object, but use a bird's-eye view, an orthographic image which is rendered from above the environment looking straight down on it. The dimensions are powers of two so it can be recursively partitioned, e.g. to build a quadtree. For the market place we use 256x256 pixels; one pixel corresponds to almost one square meter of the real environment. The pixel coordinates are then used to build the quadtree representing the perceived spatial knowledge of a character.

4.2 View

A view zone model divides the view into different zones. Depending on the area in the field of view, an object is perceived with a certain level of detail and its attributes are stored with the corresponding LOD. The objects are then assigned to a viewing zone. We distinguish three zones: the fovea, the periphery and the outer border of the field of view, which we refer to as the view directing zone. The objects seen in the center, in the foveal zone, can be perceived with the highest LOD. The ones perceived in the periphery can only be perceived up to LOD 2 or less. And the objects of the directing zone are not stored, but direct the character's view to objects of interest, which are used by tasks in the character's task list.

5 Knowledge Forms

We distinguish different types of knowledge, among them spatial knowledge and semantic knowledge.

5.1 Spatial knowledge quadtree

The bird's-eye view image is used to build a dynamic quadtree of the spatial knowledge. E.g., an image of 256x256 pixels can be recursively partitioned up to 8 times. This means the spatial knowledge with the highest detail is assigned to level 8. Figure 3 (left) shows a quadtree up to depth 6 for all objects. Pointers are assigned to the nodes of the quadtree to remember the position of an object. E.g., the pointer of an object which covers only one pixel in the bird's-eye view is stored at level 8 of the quadtree. It has the most accurate spatial knowledge. If the character starts to forget the location of that object, the pointer is moved one level higher (to the parent node). This object then covers four pixels in the character's knowledge map and can no longer be located exactly.
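A simplified sketch of this pointer mechanism (our own illustration):

class SpatialQuadtree:
    """Sketch of the spatial-knowledge quadtree: an object pointer stored
    at depth d locates the object within one cell of a 2^d x 2^d partition
    of the bird's-eye view; depth 8 resolves a single pixel of a 256x256
    image."""

    def __init__(self, max_depth=8):
        self.max_depth = max_depth
        self.depth = {}                       # object id -> current depth

    def perceive(self, obj_id):
        """Seeing the object restores fully accurate localization."""
        self.depth[obj_id] = self.max_depth

    def forget(self, obj_id):
        """Move the pointer one level up (parent node): the object now
        covers four times as many pixels in the knowledge map."""
        if self.depth.get(obj_id, 0) > 0:
            self.depth[obj_id] -= 1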


Fig. 3. Left: Quadtree representation of a market place up to depth 6 for all objects. Right: Spatial forgetting process illustrated with a stand in the market place. Upper row: the location of the stand is remembered and represented in the quadtree up to depth 7 (left); over time the accurate spatial information is forgotten and the stand is only represented up to depth 5 (right). Lower row: after some more time the stand's location is only remembered up to depth 4 (left) and later up to depth 2 (right).

5.2 Semantic knowledge

Semantic knowledge provides a number of attributes for every object describing what an object is. Authors can annotate this knowledge using real world situations. During an annotation, objects are equipped with a number of attributes. An attribute can be available with a variety of levels of detail, and an LOD value is assigned to every level of detail. The semantic objects are stored in a hash table and can be accessed by their unique object id or a name. The following example shows a part of the semantic knowledge for an object:

Tomato {
    ID (21),
    Price   (LOD1: 1.99, LOD2: 1.80-2.20, LOD3: 1-3),
    Quality (LOD1: B, LOD2: A or B)
}

The current semantic knowledge is organized in a hierarchical structure, which we denote as the semantic tree. It exists in parallel to the scene graph. A template for a semantic tree is automatically derived from the scene graph. The template is then modified by an author during an annotation phase before the application is started. The author can, e.g., add objects to the semantic tree which are not explicitly modeled as separate 3D objects. The intention of the semantic tree is to form larger objects by grouping objects according to their attributes, features or containers. We concentrate on expressing knowledge of generalization, e.g. a tomato is a vegetable, a vegetable can be found in a vegetable stand, a vegetable stand is a stand.

6 Perception and Knowledge Cycle: Exploration and Forgetting

Perceiving and forgetting information are driving factors during task completion. Every character is continuously acquiring and forgetting knowledge while it explores the world. Forgetting means that the current knowledge is filtered depending on the time the object was last seen, the frequency with which it has been used, and the character's personal state (e.g. whether that character has a good spatial memory). Thus, whenever the spatial knowledge of an object is to be forgotten further, the depth of the object pointer is decreased (see Figure 3, right). Independently from the spatial knowledge, the semantic knowledge can be partly forgotten. This is done by switching to the next lower LOD. In the example given above, the tomato was initialized with all attributes, and all LODs were available to the character. After some time the character switched from LOD1 to LOD2 for the price, so it only has inaccurate knowledge about the price of that tomato. If a character perceives an object, its exact size is calculated. The more it forgets the object, the more the object grows in the knowledge map. If it is perceived again with full detail, it shrinks to its right size again. The character can be placed or walk on the space in between the objects. To estimate the free space, we build an environment map which is an overlay of all the remembered knowledge.
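A possible shape of one such forgetting step, with all interfaces hypothetical since the paper does not specify them:

def forgetting_step(character, obj_id, now):
    """One decay step of the perception and forgetting cycle (sketch).
    Spatial and semantic knowledge decay independently, as described."""
    idle_time = now - character.last_seen[obj_id]
    usage = max(character.usage_count[obj_id], 1)
    # Frequently used or recently seen objects are retained longer; a
    # character trait (e.g. good spatial memory) scales the thresholds.
    if idle_time > character.spatial_retention * usage:
        character.spatial.forget(obj_id)   # quadtree pointer one level up
    if idle_time > character.semantic_retention * usage:
        # LOD1 is the most detailed level; a higher index means coarser
        # knowledge (e.g. price LOD1: 1.99 -> LOD2: 1.80-2.20)
        current = character.semantic_lod[obj_id]
        character.semantic_lod[obj_id] = min(current + 1,
                                             character.max_lod[obj_id])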

References

1. Goldstein, E.B.: Sensation and Perception. Pacific Grove, CA: Wadsworth (2002)
2. Kosslyn, S.M.: On the ontological status of visual mental images. In: TINLAP-2: Proceedings of the theoretical issues in natural language processing-2 (1978) 167–171
3. Hill, J.R.: Modeling perceptual attention in virtual humans. In: 8th Conf. Computer Generated Forces and Behavioral Representation (1999)
4. Noser, H., Renault, O., Thalmann, D., Thalmann, N.M.: Navigation for digital actors based on synthetic vision, memory, and learning. Computers and Graphics 19 (1995) 7–19
5. Kuffner, J., Latombe, J.: Fast synthetic vision, memory, and learning models for virtual humans. In: Proceedings of CA '99: IEEE International Conference on Computer Animation, Geneva, Switzerland (1999)
6. Peters, C., O'Sullivan, C.: Synthetic vision and memory for autonomous virtual humans. Computer Graphics Forum 21 (2002)
7. Bordeux, C., Boulic, R., Thalmann, D.: An efficient and flexible perception pipeline for autonomous agents. Computer Graphics Forum 18 (1999) 23–30
8. Conde, T., Thalmann, D.: An artificial life environment for autonomous virtual agents with multi-sensorial and multi-perceptive features. Computer Animation and Virtual Worlds 15 (2004)


From Images via Symbols to Contexts: Using Augmented Reality for Interactive Model Acquisition

Sven Wachsmuth+, Marc Hanheide+, Sebastian Wrede+ and Christian Bauckhage∗

+ Bielefeld University, Faculty of Technology, D-33594 Bielefeld, Germany, {swachsmu,mhanheid,swrede}@techfak.uni-bielefeld.de

* York University, Centre for Vision Research, Toronto ON, M3J 1P3, Canada, [email protected]

Abstract. Systems that perform in real environments need to bind their internal state to externally perceived objects, events, or complete scenes. How to learn this correspondence has been a long standing problem in computer vision as well as artificial intelligence. Augmented Reality provides an interesting perspective on this problem because a human user can directly relate displayed system results to real environments. In the following we present a system that is able to bootstrap internal models from user-system interactions. Starting from pictorial representations, it learns symbolic object labels that provide the basis for storing observed episodes. In a second step, more complex relational information is extracted from stored episodes which enables the system to react to specific scene contexts.

1 Introduction

Mixed reality systems combine real world views with views of a virtual environment [1]. In the sub-field of augmented reality, virtual augmentations are added to the real world view of the user. This is typically realized by a setup with a head-mounted device which is equipped with cameras and a display. Most of the research on computer vision in this field is dedicated to the problem of aligning real and virtual objects (cf. e.g. [1, 2]). This is mostly based on pre-defined 3-d CAD models. The AR system is either used to present a virtually changed environment to the user or to support the user in a pre-defined task, e.g. [2]. In the VAMPIRE¹ project we take a different approach in that we focus on the problem of how a system can bootstrap its knowledge about an unknown real environment. By using Augmented Reality techniques, the computer vision system is embodied through the tight interaction with the user. In this kind of scenario, augmentations like bounding boxes, text labels, or arrows are used in order to close the feedback cycle to the user. In turn, the user is able to react based on the augmentations by changing the view or acting in the scene. Thus, the coupling between the user and the vision system is highly dynamic and depends on the interaction history. The learning of visual models based on human feedback has been explored in several different scenarios. Roy uses video data from mother-child interactions in order to learn the association between acoustic and visual patterns [3].

0 The work was partially supported by the European Commission under project id IST-2001-34401.
1 Visual Active Memory Processes and Interactive REtrieval – IST-2001-34401


Fig. 1. An AR based interaction loop is used for bootstrapping the system. (a) AR gear. (b) Memory organization: object models are learned from image patches; symbolic descriptions (stored as XML documents, e.g. an OBJECT hypothesis with CLASS and RELIABILITY) are grounded through object models; contextual models are learned from recorded episodes in grounded sub-scenes. The memory comprises pictorial, feature-based, episodic and conceptual layers.

Steels introduces the term social learning in a scenario where a human teaches different kinds of objects to an Aibo robot [4]. In [5], imitation learning is explored as a social learning and teaching process which aims at socially intelligent robots. Finally, Heidemann et al. [6] present an augmented reality system for interactive object learning which was developed within the VAMPIRE project. However, most systems limit the learning capability to a single aspect, like learning a classifier for an individual object. A more general approach needs to deal with various kinds of data structures and needs to integrate different learning processes in a single framework.

2 AR interaction and formation of memory content

In Fig. 1(a) the scenario of the system is shown. The user is sitting at a regular office table and wears a head-mounted device which is equipped with cameras and a display. Information about recognized objects and results of user queries are visualized using augmented reality (AR). The head of the user is tracked using a CMOS camera and an inertial sensor that are mounted on top of the helmet. The head pose is computed from an artificial landmark that is placed in the scene and defines a global coordinate system. The system is able to detect objects and user activities, like moving an object. It copes with varying lighting conditions as well as cluttered video signals. By selecting from a menu displayed on the right of the field of view, via speech or a mouse wheel, the user can trigger learning sessions or retrieve information.


In order to realize a bootstrapping behavior of the system, starting from image-based representations to symbol-based representations, the organization of memory content plays a key role. The technical basis for storing and retrieving various kinds of information as well as for the coordination of different visual behaviors is provided by the Active Memory Infrastructure, which is also described in [7]. The persistence back-end is the native Berkeley XML DB. Binary data is stored directly in the underlying relational database and is referenced from stored XML documents. Thus, XML provides a unified data model for structured information that is exchanged between system components and stored in the memory.
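To make the unified XML data model concrete, the following minimal sketch (Python, standard library only) stores and queries hypothesis documents shaped like the fragment in Fig. 1(b). The in-memory dictionary stands in for the Berkeley XML DB back-end, and the function and event names are hypothetical illustrations rather than the project's actual API.

import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for the XML memory back-end.
memory = {}

def notify(event, obj_id):
    # Stand-in for memory event notification that triggers e.g. object anchoring.
    print(f"memory event: {event} {obj_id}")

def insert_hypothesis(obj_id, cls, reliability):
    """Build and store an object hypothesis document like the one in Fig. 1(b)."""
    obj = ET.Element("OBJECT")
    hyp = ET.SubElement(obj, "HYPOTHESIS")
    ET.SubElement(hyp, "RELIABILITY", value=str(reliability))
    ET.SubElement(hyp, "CLASS").text = cls
    memory[obj_id] = obj
    notify("insert", obj_id)

def reliable_objects(threshold=0.5):
    """Query: all object classes whose reliability exceeds the threshold."""
    return [doc.find(".//CLASS").text for doc in memory.values()
            if float(doc.find(".//RELIABILITY").get("value")) > threshold]

insert_hypothesis("obj-1", "Cup", 0.6)
print(reliable_objects())  # ['Cup']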

On the conceptual level, we distinguish four different kinds of abstraction layers in the memory representation (see Fig. 1(b)) that are stored using the same infrastructure. On the pictorial layer, images and image patches are temporarily stored. The feature-based layer includes learned object models and configuration data of the object recognition components. In the episodic memory layer, recognition results are stored that have been reliably detected during an interactive session with a user. Finally, the categorical layer consists of a set of contextual models that e.g. describe typical configurations of objects. Each higher layer is grounded in a layer that is nearer to the signal. Object models in the feature-based memory are learned from image patches that are stored during system usage; detected objects and events are related to learned prototypes in the feature space; finally, contextual models are learned from episodic sequences that capture a spatial context, e.g. the user was looking around the writing area of his or her desk.

Interpretation as well as learning processes work asynchronously on the memory representation. They can easily access stored memory items at all abstraction layers and are coordinated through memory event notification [7]; e.g., the object anchoring component is triggered if a new object hypothesis is stored in the memory.

3 Image-based scene decomposition and acquisition of object views

In the Augmented Reality scenario, the user and the system share a common view. The images of the head-mounted cameras are directly shown on the head-mounted stereo display, so that the user sees what the camera records and the system knows which part of the scene the user is focusing on. Two different visual behaviors are used on this pictorial representation level. Mosaicing: In indoor environments, meaningful sub-scenes are typically defined by planes, e.g. a table top, the front side of a shelf, or walls. However, if a sufficient level of image detail is to be kept, these kinds of contextual areas cannot be seen completely in a single view. In [8] we present a unique approach to create mosaics for arbitrarily moving head-mounted cameras. It uses a three-stage architecture. First, we decompose the scene into approximated planes using stereo information, which afterwards can be tracked and integrated into mosaics individually (see Fig. 2).


Fig. 2. Constructing and tracking of planar sub-scenes. The mosaicing approach has constructed three different planar sub-scenes that are stored in the pictorial memory. They were constructed from an image sequence of the head-mounted cameras which is incrementally processed in soft real-time. The user turned his or her head from the right side of the table to the left side. The system has correctly identified the two different desk levels.

This avoids the problem of parallax errors usually arising from arbitrary motion and provides a compact and non-redundant representation of the scene. Each plane defines a coarse spatial context from which contextual models can be learned that interrelate objects that frequently co-occur in such a sub-scene. Object tracking: The acquisition of object models is a key to higher-level descriptions of a scene. For object recognition, an appearance-based VPL classifier [9] is used that can be trained directly from image patches. These are automatically extracted while a user is focusing on the target object. An entropy measure is used in order to segment unknown objects from a more or less homogeneous table plane. In the learning mode of the system, the detected area is augmented onto the view of the user. Once the first view is registered by the system, a data-driven tracking technique [10] is started that provides additional views of the object. Each view that the system collects for learning is checked with the user so that he or she can control the learning process. The patches can be stored in the pictorial memory of the system for a fast online learning of objects as well as a more accurate object learning on a longer time scale [9]. A label is currently given by speech input based on a pre-defined lexicon.
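The entropy-based segmentation step can be illustrated with a short sketch: local grey-value entropy is high on a textured object and low on a more or less homogeneous table plane. The window size, bin count, and threshold below are hypothetical choices, and the code is a simplified stand-in for the procedure actually used in the system.

import numpy as np

def local_entropy(gray, win=16):
    """Shannon entropy of grey-value histograms over non-overlapping windows."""
    h, w = gray.shape
    ent = np.zeros((h // win, w // win))
    for i in range(ent.shape[0]):
        for j in range(ent.shape[1]):
            patch = gray[i*win:(i+1)*win, j*win:(j+1)*win]
            hist, _ = np.histogram(patch, bins=32, range=(0, 256))
            p = hist / hist.sum()
            p = p[p > 0]
            ent[i, j] = -(p * np.log2(p)).sum()
    return ent

def segment_object(gray, win=16, thresh=3.0):
    """Mark windows whose entropy exceeds a (hypothetical) threshold as
    object candidates against a homogeneous table plane."""
    return local_entropy(gray, win) > thresh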

4 Object anchoring and the role of context

Object anchoring links corresponding object hypotheses that are detected at different points in time to the same symbol. This is essential for representing episodes over an extended period of time. In addition to the trajectory information from object tracking, a second strategy is applied for linking that takes the 3-d position of the object hypotheses into account. This position can be estimated based on a self-localization of the cameras [11]. Currently, we assume that each object is lying on a table plane.


(Fig. 3 panels: (a) object localization and anchoring based on 3-d pose, relating the head pose transform Tpose, the object transform Tobject, the table plane Ptable, and the landmark; (b) a Bayesian network for scenery classification, with scene labels such as 'computer' and 'desk' conditioning object nodes such as monitor, keyboard, cup, and sharpener, whose conditional probability tables can be learned from anchored object hypotheses.)

Fig. 3. Contextual models are learned from episodic memory content.

Object hypotheses are fused over time if their 3-d positions are close enough to each other. A Gaussian curve models the probability that two hypotheses refer to the same object (see Fig. 3(a)). For the final classification result, the labels provided by the object recognition component are integrated over a short period of time. Thereby, the reliability value of a specific hypothesis is adapted. Only those hypotheses that have a highly rated reliability value are permanently stored in the episodic memory.
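A minimal sketch of this fusion step is given below, assuming 3-d positions in metres; the spatial tolerance sigma, the acceptance threshold, and the greedy linking policy are hypothetical illustrations rather than the system's exact parameters.

import numpy as np

def same_object_prob(p1, p2, sigma=0.05):
    """Gaussian model of the probability that two 3-d positions belong to
    the same object; sigma (in metres) is a hypothetical spatial tolerance."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return float(np.exp(-d * d / (2.0 * sigma * sigma)))

def fuse(hypotheses, threshold=0.5):
    """Greedily link (position, label) hypotheses to anchors whose mean
    position is close enough; a simplified stand-in for the system's policy."""
    anchors = []  # each anchor is [mean_position, member_list]
    for pos, label in hypotheses:
        for anchor in anchors:
            if same_object_prob(pos, anchor[0]) > threshold:
                anchor[1].append((pos, label))
                anchor[0] = np.mean([p for p, _ in anchor[1]], axis=0)
                break
        else:
            anchors.append([np.asarray(pos, dtype=float), [(pos, label)]])
    return anchors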

Based on episodic data, contextual models can be estimated that represent typical configurations of objects in a sub-scene. For that, we use simple Bayesian networks with discrete conditional probability tables. In Fig. 3(b) a learned parameterization of a Bayesian network is shown. The contextual models in turn can be used to judge certain object hypotheses given their context, as well as to classify more general scene contexts, like 'office table' if a keyboard and a computer mouse have been found. Thus, higher-level categories can be detected that are defined through relations between objects.
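In the spirit of Fig. 3(b), the following sketch computes a scenery posterior from discrete conditional probability tables; the numbers, the two scene labels, and the independence assumption (a naive Bayes reading of the network) are hypothetical.

# Hypothetical CPTs: P(object present | scene), loosely modelled on Fig. 3(b).
CPT = {
    "monitor":   {"computer": 0.86, "desk": 0.52},
    "keyboard":  {"computer": 0.79, "desk": 0.66},
    "cup":       {"computer": 0.42, "desk": 0.77},
    "sharpener": {"computer": 0.48, "desk": 0.74},
}
PRIOR = {"computer": 0.5, "desk": 0.5}

def scene_posterior(observations):
    """observations: dict object -> bool; returns P(scene | observations)."""
    post = {}
    for scene, prior in PRIOR.items():
        p = prior
        for obj, present in observations.items():
            q = CPT[obj][scene]
            p *= q if present else (1.0 - q)
        post[scene] = p
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

print(scene_posterior({"monitor": True, "keyboard": True,
                       "cup": False, "sharpener": False}))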

5 Conclusion and Outlook

In this paper we presented a bootstrapping approach for the acquisition of knowledge in unknown environments. Augmented Reality techniques are used in order to close the interaction loop with the user. This acquisition process combines several visual behaviors that are integrated using the active memory infrastructure. It is shown how the tight coupling with the user can be used in order to acquire grounded higher-level representations. The demonstration system is running on 5 different laptops, allowing a soft real-time behavior. New objects can be learned in about 2–3 minutes, acquiring between 4 and 6 object views. Contextual models are learned on a longer time scale. Parameters of Bayesian networks are estimated from about 5 minutes of regular system usage, where the corresponding scenery label is given by the user. Further system development will focus on a further integration of the mosaiced sub-scenes and the structural learning of contextual models. We believe that the triadic interaction between the system, the human, and the environment provides an ideal basis for pushing the cognitive development of artificial systems to a further level. Augmented reality offers strong interaction patterns for this purpose. On the other hand, cognitive system capabilities will lead to a next generation of assistance technology offering a variety of applications.

References

1. Drascic, D., Milgram, P.: Perceptual Issues in Augmented Reality. In Bolas, M.T., Fisher, S.S., Merritt, J.O., eds.: Stereoscopic Displays and Virtual Reality Systems III. Volume 2653 of SPIE, San Jose, California, USA (1996) 123–134

2. Klinker, G., Ahlers, K., Breen, D., Chevalier, P.Y., Crampton, C., Greer, D., Koller, D., Kramer, A., Rose, E., Tuceryan, M., Whitaker, R.: Confluence of Computer Vision and Interactive Graphics for Augmented Reality. Presence: Teleoperators and Virtual Environments 6 (1997) 433–451

3. Roy, D.: Learning visually grounded words and syntax of natural spoken language. Evolution of Communication 4 (2002)

4. Steels, L., Kaplan, F.: AIBO's first words: The social learning of language and meaning. Evolution of Communication 4 (2001) 3–32

5. Breazeal, C., Buchsbaum, D., Gray, J., Gatenby, D., Blumberg, B.: Learning from and about Others: Towards Using Imitation to Bootstrap the Social Understanding of Others by Robots. Artificial Life (2004) (forthcoming)

6. Heidemann, G., Bekel, H., Bax, I., Ritter, H.: Interactive Online Learning. Pattern Recognition and Image Analysis 15 (2005) 55–58

7. Wachsmuth, S., Wrede, S., Hanheide, M., Bauckhage, C.: An Active Memory Model for Cognitive Computer Vision Systems. Künstliche Intelligenz 19 (2005) 25–31

8. Gorges, N., Hanheide, M., Christmas, W., Bauckhage, C., Sagerer, G., Kittler, J.: Mosaics from Arbitrary Stereo Video Sequences. In: Proc. Pattern Recognition Symposium (DAGM) (2004)

9. Bekel, H., Bax, I., Heidemann, G., Ritter, H.: Adaptive Computer Vision: Online Learning for Object Recognition. In: Proc. Pattern Recognition Symposium (DAGM) (2004)

10. Gräßl, C., Zinßer, T., Niemann, H.: Efficient Hyperplane Tracking by Intelligent Region Selection. In: Proc. IEEE Southwest Symposium on Image Analysis and Interpretation (2004) 51–55

11. Chandraker, M., Stock, C., Pinz, A.: Real Time Camera Pose in a Room. In: Int. Conf. on Computer Vision Systems. Volume 2626 of LNCS (2003) 98–110


Dependence of Conceptual Representations for Temporal Developments in Videosequences on a Target Language

Aleš Fexa

Institut für Algorithmen und Kognitive Systeme (IAKS), Fakultät für Informatik der Universität Karlsruhe (TH), 76128 Karlsruhe, Germany
[email protected]
WWW home page: http://i21www.ira.uka.de/

Abstract. A system developed at the Institut für Algorithmen und Kognitive Systeme (IAKS) [1, 2] uses a conceptual representation of knowledge in order to infer information about situations of agents. The inferred information can be used to facilitate the tracking of agents in a video sequence, to produce synthetic video sequences, or to generate a natural language description of agents' situations. The current system can generate descriptions of situations in English and German (Germanic language group). This contribution investigates how much the system-internal conceptual representation needs to be amended if the natural language text should be generated in a language from a different language group. Czech language text generation has been implemented in the system to provide a representative of a different language group (Slavic language group). It is shown that no changes at the conceptual level are necessary so far in order to generate simple descriptions of situations in the English, German or Czech language.

1 Introduction

The system which has been developed at IAKS [1, 2, 3] uses a conceptual representation as an 'intermediate' representation in order to generate a natural language description of temporal developments in a videosequence. The overall architecture of the system consists of three sub-systems (Fig. 1): a Vision Sub-System (VS), which extracts information about agents from a videosequence, a Conceptual Sub-System (CS), which infers situations of agents, and a Natural Language Sub-System (NS) for text generation.

The system can generate descriptions in English and German (Germanic language group). Czech language text generation (CLTG) has been implemented in addition in order to study a representative of a different language group (Slavic language group). In what follows, the changes to the NS and CS subsystems are discussed which appeared necessary in order to generate the description in a language from a language group which differs from the one for which the system had been designed and implemented originally.

2 Natural Language Text Generation

Figure 1 shows a schema which depicts the process of natural language text generation from video sequence evaluation results.


(Fig. 1 schema: the VS delivers tracking results; the CS comprises situation analysis and inference based on FMTHL, drawing on an SGT, FMTHL rules and facts, a scene model, and basic relations; the NS performs DRS construction and DRS transformation by means of DRS transformers, construction rules, a lexicon, and text generation rules for German, English, and Czech, and finally produces the natural language text.)

Fig. 1. A system schema for the generation of natural language text describing the behavior of moving agents. Components underlaid with gray had to be extended, and components underlaid with black had to be added in order to implement the Czech language text generation. The schema has been taken from [7] and modified.

Information from the VS is first propagated into the 'intermediate' CS, where situations of agents are inferred using a Situation Graph Tree (SGT) [4, 5] and Fuzzy Metric Temporal Horn Logic (FMTHL) inference [4].

The Discourse Representation Theory (DRT) [6] provides the representational formalism used in the NS. A Discourse Representation Structure (DRS) is modified by transformation rules which are applied in a predefined order. The transformation rules which generate text are called text generation rules and are specific for each language. Lexicalisation is implemented using some transformation rules and lexicalisation rules which are stored in a file called 'lexicon'. The conjugations/declinations are performed by morphological rules.
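A toy sketch may clarify the rule pipeline: a DRS-like structure is passed through transformation rules in a predefined order, each of which may emit text. The data structure and the three rules are invented for illustration and are far simpler than the DRT machinery of the actual system.

from dataclasses import dataclass, field

@dataclass
class DRS:
    """A toy discourse representation: referents plus predicate conditions."""
    referents: list = field(default_factory=list)
    conditions: list = field(default_factory=list)
    text: list = field(default_factory=list)

# Hypothetical text generation rules; each inspects the DRS and emits text.
def rule_subject(drs):
    drs.text.append(drs.referents[0])

def rule_verb(drs):
    drs.text.append(next(c for c in drs.conditions if c[0] == "action")[1])

def rule_object(drs):
    for c in drs.conditions:
        if c[0] == "target":
            drs.text.append(c[1])

PIPELINE = [rule_subject, rule_verb, rule_object]  # predefined order

def generate(drs):
    for rule in PIPELINE:
        rule(drs)
    return " ".join(drs.text) + "."

drs = DRS(["the white car"], [("action", "crosses"), ("target", "the intersection")])
print(generate(drs))  # -> "the white car crosses the intersection."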

There are three components which have been written specifically for the German and English language. All are located in the NS. The components are the text generation rules, the morphological rules, and the lexicalisation rules. At least these components have to be designed anew in order to incorporate an additional language, for example Czech.


3 What Had to Be Changed in Order to Generate Czech Sentences?

The three components noted in the previous section had to be written individually. In addition, the Referring Expression Generation (REG) had to be written specifically for the Czech language. How the components had to be implemented or extended for the Czech language is described in the following sections.

3.1 Lexicalisation

There are two components which are used during lexicalisation: transformation rules and lexicalisation rules. So far, the transformation rules for English and German are the same, whereas the lexicalisation rules are specific for each language. The transformation rules which have been embedded into lexicalisation did not need to be changed. The lexicalisation rules had to be written specifically for the Czech language.

Templates for the Czech lexicalisation rules have been taken from English, and the English lemmas have been replaced by the corresponding Czech ones. If the substitution of lemmas could not be accomplished, a German version of the rule has been tried, or the structure of the rule has been modified. An example of using a German pattern is the noun 'Bernhardstrasse', which is translated into English as 'Bernhard Street', but whose original German name 'Bernhardstrasse' should be used in Czech. An example of structure modification is provided by the verb 'back up', which consists of two words in English, of a compound word in German ('zurück|setzen'), but of only one word in Czech ('couvat').

The English, German, and Czech lexicalisation rules stay together in one file (the lexicon). It has been considered to separate these lexicalisation rules into three language-specific lexica, because the common lexicon would get long and difficult to manage when many additional languages have to be incorporated in the future. Another advantage of this 'separation' would be the possibility to implement the lexicalisation rules specifically for each language. The separation has not been implemented during the implementation of the Czech language text generation, however, because the lexicon is still manageable for three languages, and because the structure of the lexicalisation rules could remain the same for all three languages.

3.2 Morphology

There have been two possibilities for implementing a morphological generator for the Czech language. The first one was to use the same system of morphological rules as for English and German. This would be reasonable for very small domains with few words only, but difficult already for moderate domains, because Czech is a highly inflectional language (tables with declinations and conjugations of paradigms cover about 100 pages in [8]). Although the same morphological categories as in English and German (namely Part of Speech, Gender, Case, Tense, and Possessor's Gender) could be used also for the current Czech corpus (i.e. the ground-truth Czech language description), the number of morphological values and the number of irregularities would make the implementation of the inflexion by the current system of morphological rules very complicated.

The alternative to the system of morphological rules was to use the morphological generator presented in [8, 9], which is a general system designed for a full morphological synthesis of the whole Czech language. Because the morphological generator is much more powerful, both with respect to the number of morphological categories and to the number of words which can be handled correctly, it has been incorporated into our system and used for the generation of word forms.

Note that a few changes in the text generation rules had to be accomplished due to the morphology. Prepositions and adverbs have been supposed to be undeclinable in the current English and German corpuses. Although there happens to be no declination of prepositions and adverbs in the current Czech corpus, in general prepositions and adverbs can be declinated in the Czech language. Specifically, the prepositions might undergo so-called vocalization (i.e. (not) adding a vowel to the end of the preposition), and the adverbs might be negated and have different grades. The different morphological categories for the prepositions and for the adverbs necessitated that the text generation rule which has been used to generate a verb with a preposition or an adverb had to be 'split' into a modified text generation rule for the generation of a verb with a preposition and a new transformation rule for the generation of a verb with an adverb.

3.3 Text Generation Rules

Text generation rules are individual for each language. The English text generation rules have been used as the initial 'templates' for the creation of the Czech text generation rules. There are several reasons why the English rules have been preferred over the German ones. The most important one is that the Czech word order, specified by the order of the Czech text generation rules, is closer to the English one.

Although the basic word order in Czech sentences is the same as in English and German (subject–predicate–object), a few changes in the ordering of the rules had to be accomplished in order to produce more complicated sentences correctly.

As has already been mentioned in the previous section, some modifications had been necessary due to the more complicated Czech morphology, with the result that one rule had to be 'split' into multiple ones.

The most significant differences with respect to the English and German rules resulted, however, from the different algorithm for REG which is implemented by the text generation rules. The algorithm for the Czech REG is described in the following section.


3.4 Referring Expression Generation

The English (and German) REG has been quite simple so far. The first occurrence of the subject has been referred to by a definite description, all subsequent occurrences by a pronoun. Objects which have been of the same type as the subject have been referred to by an indefinite description the first time, and by a definite description in all subsequent references. Objects which have been of a different type than the subject have been referred to by a definite description all the time.

The Czech REG, however, has to be different. Czech has neither an indefinite nor a definite article. The demonstrative pronoun 'ten' (lit. 'that'), however, can sometimes be used in a similar way to the English 'the'. Another difference is that the frequency of pronouns and 'zero references' (i.e. no explicit reference to the referent) is higher in the Czech corpus than in the corresponding English and German corpuses.

[10, 11, 12, 13, 14] provide examples for a discussion of the REG component and of pronominalization ((not) using pronouns as referring expressions). The approaches use different algorithms for the selection of referring expressions, but all of them use features like 'distance from the last reference', 'degree of ambiguity', 'time-shift', and 'topic-shift' in order to decide which referring expression should be used. Although the models have been designed for English, the philosophy of these models, and specifically their more or less explicit focus modelling, has been found to be applicable to the Czech corpus as well. Because applicability to other Czech corpuses is expected, the model for the Czech REG has been based on these investigations.

The REG of subjects has been simple in the Czech corpus, because there is only one unique subject throughout all the discourses studied so far in our context. The first occurrence has been referred to by the subject name, all other occurrences by a 'zero reference'. The REG of objects has been more complicated (see Fig. 2; a code transcription of these rules is sketched after the figure). The algorithm models the reader's focus in order to decide which referring expression should be used. Rule 1 handles the case when the referent is referred to for the first time. Because this is the first reference, the referent is not in the reader's focus, and the full referent name has to be used. Rule 2a handles the case when the referent is already 'out of focus'. If this is the case, then a demonstrative pronoun + the referent name has to be used. Rules 2b and 2c model the case when there is a competing referent in a 'strong focus'. If this is the case, a demonstrative pronoun + the referent name has to be used in order to avoid an ambiguity. A pronoun should be used in all other cases, because the object is assumed to be in the reader's focus, and redundant information would only bother the reader.

4 Discussion

Examples of natural language text generation from two video sequences are given in Fig. 3. The first video sequence is called 'dtneu05' and shows an inner-city intersection.


1. If this is the first occurrence in the text, then use the referent name.

2. If this is a subsequent occurrence in the text, then

(a) if the object has not been referred to in the current or the previous sentence, then use a demonstrative pronoun + the referent name;

(b) if there is an ambiguity and the competing referent has been referred to at least three times in the previous four sentences, then use a demonstrative pronoun + the referent name;

(c) if there is an ambiguity and the competing referent has been referred to by [a demonstrative pronoun +] the referent name at least once in the last two sentences, then use a demonstrative pronoun + the referent name;

(d) else use a pronoun.

Fig. 2. The algorithm used for the Czech REG of objects.
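The decision procedure of Fig. 2 translates almost directly into code. The sketch below assumes a simple mention history per referent and abbreviates rule 2c by testing only for a recent mention of the competitor, regardless of its form; the function name, the demonstrative 'ten', and the placeholder pronoun are illustrative.

def referring_expression(referent, competitor, history, sent_idx):
    """Czech REG for objects following Fig. 2. history maps a referent to the
    list of sentence indices in which it was mentioned; competitor is a
    competing referent of the same type, or None."""
    mentions = history.get(referent, [])
    if not mentions:                                   # rule 1: first occurrence
        return referent
    if all(i < sent_idx - 1 for i in mentions):        # rule 2a: out of focus
        return "ten " + referent
    if competitor is not None:
        comp = history.get(competitor, [])
        if len([i for i in comp if sent_idx - 4 <= i < sent_idx]) >= 3:
            return "ten " + referent                   # rule 2b: strong focus
        if any(sent_idx - 2 <= i < sent_idx for i in comp):
            return "ten " + referent                   # rule 2c: recent naming
    return "ono"                                       # rule 2d: pronoun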

The second one is the 'tankstelle' video sequence, which shows a situation at a gas station. The two video sequences have already been used for the generation of English and German text descriptions in previous investigations [2, 7, 3]. The Czech language text description, which is new here, is generated correctly. The sentences are grammatically correct, all words are inflected appropriately, and the correct referring expressions are used.

The new and changed components of the whole system are underlaid in Fig. 1. As can be seen, the only components that had to be designed anew are the text generation rules for Czech, which also comprise the changes in REG and morphology. The lexicon has been extended with the Czech lexicalisation rules. All changes are located entirely in the NS. No changes at the CS have thus been necessary in order to generate the descriptions in the Czech language. Note also that the components that had to be changed are exactly the ones which had been language specific already for the English and German language.

5 Conclusion

The system which has been developed at IAKS [1, 2, 3] can generate natural language descriptions from video input. The architecture of the system consists of the VS, which extracts information from a video, the CS, which infers situations of agents, and the NS, which generates descriptions of agents' situations in the English or German language. In this contribution, we investigated to which extent the CS had to be changed if the description had to be generated in a language from a different language group. Czech language text generation has been implemented, and it has been shown that no changes at the CS have been necessary in order to extend the generation of descriptions in English or German (Germanic language group) to the Czech language (Slavic language group).

Acknowledgement

Discussions with and comments by Prof. H.–H. Nagel are gratefully acknowledged.


Example 'dtneu05':
English: The white car comes in from the Bernhard Street. It crosses the intersection. It turns left into the Chapel Street.
German: Das weisse Fahrzeug kommt aus der Bernhardstrasse. Es ueberquert die Kreuzung. Es faehrt links in die Kapellenstrasse.
Czech: Bílé auto přijíždí z Bernhardstrasse. Přejíždí přes křižovatku. Zahýbá doleva do Kapellenstrasse.

(Schema of movement for car 'G' at the gas station, with positions fs_2, fs_5, fs_6: frames 300–418, G goes to the gas station and stops at fs_6; frames 1780–2252, G stands behind T, leaves fs_6 and the gas station.)

Example 'tankstelle':
English: The car drives to the first pump. Now it has reached the first pump. It moves on to the second pump. Now it has reached the second pump. It stops. Now it drives off to the exit. Therefore it has to overtake another car. It backs up. It stops. Now it drives towards the other car. It overtakes the other car. It leaves the other car. It drives to the exit. Now it has reached the exit.
German: Das Fahrzeug faehrt zu der ersten Zapfsaeule. Jetzt hat es die erste Zapfsaeule erreicht. Es faehrt weiter zu der zweiten Zapfsaeule. Jetzt hat es die zweite Zapfsaeule erreicht. Es haelt an. Jetzt faehrt es zu der Ausfahrt weiter. Dazu muss es ein anderes Fahrzeug ueberholen. Es setzt zurueck. Es haelt an. Jetzt faehrt es auf das andere Fahrzeug zu. Es ueberholt das andere Fahrzeug. Es verlaesst das andere Fahrzeug. Es faehrt zu der Ausfahrt. Jetzt hat es die Ausfahrt erreicht.
Czech: Auto jede k první pumpě. Dojelo k ní. Jede dále k druhé pumpě. Dojelo k ní. Zastavuje. Jede k výjezdu. Kvůli tomu musí předjet jiné auto. Couvá. Zastavuje. Jede k tomu autu. Předjíždí ho. Odjíždí od něj. Jede k výjezdu. Dojelo k výjezdu.

Fig. 3. Two examples of text generated for the 'dtneu05' video sequence (first line of images) and the 'tankstelle' video sequence (second line of images, with a screen shot and a schema of movement for car 'G'). The images have been taken from [7, 2] and modified slightly.

References

1. Nagel, H.H.: Steps toward a Cognitive Vision System. AI Magazine 25:2 (Summer 2004) 31–50

2. Gerber, R.: Neustrukturierung der Generierung von Text aus Ergebnissen der Bildfolgenauswertung. Interner Bericht (in German), Institut für Algorithmen und Kognitive Systeme, Fakultät für Informatik der Universität Karlsruhe (TH), 76128 Karlsruhe, Germany (2004)

3. Gerber, R., Nagel, H.H.: Discourse Representation Theory for Generating Text from Video Input. Draft (2005)

4. Schäfer, K.: Unscharfe zeitlogische Modellierung von Situationen und Handlungen in Bildfolgenauswertung und Robotik. Volume 135 of Dissertationen zur Künstlichen Intelligenz (DISKI). Infix-Verlag, Sankt Augustin, Germany (1996) (Dissertation, Fakultät für Informatik der Universität Karlsruhe (TH), 76128 Karlsruhe, Germany, July 1996)

5. Arens, M.: Repräsentation und Nutzung von Verhaltenswissen in der Bildfolgenauswertung. Volume 287 of Dissertationen zur Künstlichen Intelligenz (DISKI). Akademische Verlagsgesellschaft AKA GmbH, Berlin, Germany (2004) (Dissertation, Fakultät für Informatik der Universität Karlsruhe (TH), 76128 Karlsruhe, Germany, July 2004)

6. Kamp, H., Reyle, U.: From Discourse to Logic. Kluwer Academic Publishers, Dordrecht Boston London (1993)

7. Gerber, R.: On Switching the Discourse Domain for Text Generation from Videos. Cognitive Vision System – Final Report (Draft of 30 November 2004) 347–362


8. Hajič, J.: Disambiguation of Rich Inflection (Computational Morphology of Czech). Univerzita Karlova v Praze, Nakladatelství Karolinum (Charles University in Prague, The Karolinum Press), Ovocný trh 5, 11636 Prague 1, Czech Republic (2004)

9. Hajič, J.: Czech "Free" Morphology (2000–2001). The Czech "Free" Morphology Homepage, see http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Morphology/index.html

10. Grosz, B., Joshi, A., Weinstein, S.: Centering: A framework for modelling the local coherence of discourse. Computational Linguistics 21 (1995) 203–225

11. McCoy, K.F., Strube, M.: Generating Anaphoric Expressions: Pronoun or Definite Description? In Cristea, D., Ide, N., Marcu, D., eds.: Proceedings of The Relation of Discourse/Dialogue Structure and Reference Workshop, Association for Computational Linguistics, 3 Landmark Center, East Stroudsburg, PA 18301 USA (1999) 63–71

12. Henschel, R., Cheng, H., Poesio, M.: Pronominalization revisited. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING '00), Morgan Kaufmann Publishers, 340 Pine Street, 6th Floor, San Francisco, CA 94104 USA (2000) 1:306–312

13. Reiter, E., Dale, R.: Building Natural Language Generation Systems. Cambridge University Press, Cambridge/UK (2000)

14. Callaway, C.B., Lester, J.C.: Pronominalization in Generated Discourse and Dialogue. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, 3 Landmark Center, East Stroudsburg, PA 18301 USA (2002) 88–95


Outline of a Computational Theory of Human Vision

Fridolin Wild

Department of Information Systems and New Media, Vienna University of Economics and Business Administration (WUW), Augasse 2-6, A-1090 Vienna, Austria, [email protected]

Abstract. Human vision is a powerful yet highly efficient processing system. Drawing on an extensive review of empirical findings and theoretical foundations of human visual-spatial perception originating from various disciplines, results of an in-depth analysis of the human visual perception process will be presented. Several stages in the visual pathway will be identified that segment into processes in eye and retina, control processes for exploration and sampling, processes in the primal visual cortex, feature detection, and, furthermore, object recognition processes. Based on the review, a computational theory of human vision will be drafted, with emphasis on developing an architectural model. The model introduced will couple enhanced perceptrons, over an intermediate layer which is responsible for controlling exploration and sampling, to primal feature decomposition modules. A feature-based router is then responsible for distributing this preprocessed input to concept detector components. The architecture will allow both bottom-up and top-down data flows. Moreover, it will facilitate 'lazy' processing by introducing means for focused, concept-driven attention.

1 Introduction

Human visual-spatial perception is an active process of the brain, starting with primal processing already on the eye's retina. Researchers from various disciplines (including, among others, psychology, cognitive science and neurology) have up to now fairly well succeeded in segmenting this process into several different stages. Vision itself, however, remains an eternal conundrum.

Visual-spatial perception (or 'vision') can be defined as the process1 of building an internal representation of an object, a scene, an event, and simply any concept (or compilation thereof) in the mind of the beholder. This encompasses entities or relations that are believed to exist in an external reality and that can be derived by processing reflected light rays (or an absence thereof).

Besides the visual characteristics of the objects, this process is shaped by the human and individual anatomy (cf. Pfeifer's 'embodiment' theses, [2]), prior experiences, the current task and context, expectations, aims, and self-regulatory strategies (for the latter cf. [3, 4]). To rephrase this in a nutshell: already existing activities2 and the ease of traversal of neuron-connecting axons3 both drive the spreading of activations in the human brain's neural net. The vision process is, so to say, fundamentally 're-constructive' in nature (cf. [5]).

1 Including the result of this process, cf. [1].
2 Which nodes are already active and are emitting/relaying action potentials?
3 Which trajectory is inhibitory, which is facilitating?


In the remaining sections of this paper, the author will first analyse the stages of the human vision process in more detail and, second, draft the outline of a model emulating the human approach to perception.

2 Human Vision

Starting with a section on primal processing in the eye and the retina, a section on exploration and sampling through the motor movements of the eye will follow. Moreover, an overview of the workflows to and in the primal visual cortex will be given. These findings will result in a section on (atomic) feature detection, which will then be investigated under the premises of (human) object recognition theories.

2.1 Eye and Retina

The human eye is sensitive to light waves in the range between 400nm (violet) and 700nm (red). The eye works in a manner similar to a camera. The cornea bends the light beams through the pupil, the opening in the iris, to the retina at the back of the eye. The iris thereby acts just like the aperture of a camera, contracting when exposed to bright light (thus letting less light in) and expanding when experiencing little light. The lens is responsible for focusing light onto the retina, thereby projecting the picture upside down.

The retina contains two types of photoreceptors. The first type are the cones, which respond to colours (being either red-, green- or blue-sensitive) and are mostly situated in the centre of the retina, the so-called fovea (cf. [6]). The second type of receptors are the rods, which respond to brightness and can only be found outside the fovea. The eye performs detailed distinctions only in the fovea, which covers approximately the size of a thumb nail at arm's length in the field of sight. Outside the fovea, acuity decreases tremendously. However, stimuli in these peripheral areas are processed, and sensitivity to peripheral stimuli can even be enhanced by training (cf. [7]).

The photoreceptors are connected through bipolar cells to ganglion cells, which communicate the sensory excitation to the brain (see figure 1, cf. [8], [9], [10]). Usually several rods and cones, together with horizontal cells, converge onto one bipolar cell, and several bipolar cells again converge into one ganglion. Figure 2 shows (in simplified form) how on-bipolar cells and off-bipolar cells inhibit respectively strengthen the activations of surrounding cells (cf. [11]).

Horizontal cells and amacrine cells transmit signals laterally. Depending on the input receptors, the signals are merged and converted into different (colour) contrast signals. Amacrine cells similarly show an antagonistic behavior. In some cases, however, they react only to stimulus changes or show phasic behavior.

All receptors of a retinal ganglion cell form the receptive field of this cell. The larger this field is, the fuzzier the perceived picture will be, as the origins of impulses cannot be exactly identified. In the fovea these receptive fields are very small and, accordingly, the resolution is very high.


Ganglions bundle input from the above-mentioned layers into receptive fields, either of an on-centre, an off-centre or an on-off-centre field type. Moreover, they can be distinguished into transient (sensitive only to changes) and sustained (i.e. sensitive constantly throughout the stimulation) cells. Taking into account these different behaviors, ganglion cells can be functionally distinguished, for example according to their sensitivity to colour antagonisms, luminance, movement, directions, specific spatial frequencies, and others (cf. [8]).
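A classical computational stand-in for such concentric on-centre/off-centre receptive fields is the difference-of-Gaussians model, sketched below; the kernel size and widths are hypothetical choices, and the plain correlation loop is for illustration only.

import numpy as np

def dog_kernel(size=15, sigma_c=1.0, sigma_s=3.0):
    """Difference-of-Gaussians kernel: an excitatory centre minus an
    inhibitory surround, a standard model of an on-centre receptive field."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    centre = np.exp(-r2 / (2 * sigma_c**2)) / (2 * np.pi * sigma_c**2)
    surround = np.exp(-r2 / (2 * sigma_s**2)) / (2 * np.pi * sigma_s**2)
    return centre - surround

def ganglion_response(image, kernel):
    """Valid-mode 2-d correlation of a grey-value image with the kernel;
    positive responses mimic on-centre cells, negative ones off-centre cells."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out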

Fig. 1. Cross-Section of the Retina (Kolb).
Fig. 2. Inhibition and Excitation on the Retina (Pichler).

2.2 Exploration and Sampling

The sampling process with which the eye explores the field of sight consists of fixations and saccades. Saccades are the rapid eye movements (approx. 25ms duration) with which the high-resolution central field is pointed to the area of desire. Fixations (and slow eye movements) of relatively long duration (300ms) follow these extremely quick redirections.

Movement and focus are controlled by 'motor maps' (residing in the Colliculi Superiores), a kind of neural sketch-pad activated from the retina itself and from regions of higher processing and cognition. Retinal activation in these motor maps, for example, aligns the eye's orientation towards hard contrast changes. Activation from higher processing stages is top-down responsible e.g. for integrating expectations or tasks.

2.3 The Primal Visual Cortex

The primal visual cortex (V1) gets its input from the Chiasma Opticum, the crossing of the nerve cells attached to the ganglions, midway between the retina and the primal visual cortex at the back of the brain.

The input is organised in two times three neuron layers, responsible for red-green antagonism, yellow-blue antagonism and luminance antagonism, separately for each eye. Only about 10–20% of the stimuli from the retinal ganglion cells reach the cortex.


Fig. 3. Cell type connectivity.
Fig. 4. Hypercolumns.

The cortex combines both parts again; both hemispheres are connected by the Corpus Callosum (see figure, cf. [12]). The primal visual cortex (V1) is partitioned into six layers. The fourth layer receives input of the concentric receptive fields originating from the retina. From there, several cells converge onto simple cells. Simple cells are thus sensitive to bar-shaped (bars, lines, edges) light stimuli of a specific orientation. Simple cells converge onto complex cortex cells with receptive fields similar to those of simple cells (see figure 3). However, the complex cells cover a larger field of the retina (thus generating positional invariance) and they fire most to moving lines (cf. [13]). Furthermore, other, more complex types of cells have been identified in the cortex, for example hypercomplex cells that respond to lines of a specific length or to combinations of orientations. A layer 4c input cell can be wired with many different simple cells.

The cortex is organised in hypercolumns: position columns are organised retinotopically, i.e. their spatial distribution resembles the distribution on the retina. Ocular dominance columns have a pinwheel structure (for the orientation sensitivity, the so-called orientation columns) and, at their centres, a colour-responsive blob (see figure 4).

2.4 Feature Detection

Starting with the processing in the hypercolumns, channels can be assumed which split the visual input data along their spatial frequencies into nine different channels (see figure 5, cf. [14], [15], [16]). This acts just like a biological realisation of a coarse Fourier analysis. Figure 6 shows on the left the original picture, which is separated as described into its frequency channels. A revisualisation of the first four low-frequency channels on the right shows that the human vision system perceives the dotted line on some of the channels as if it were connected4. Not all features are processed simultaneously; e.g. colour is usually processed quicker than shape, and shape again quicker than movement ([17]).
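Such a channel decomposition can be approximated by radial band-pass filtering in the Fourier domain, as sketched below; the uniform band edges are a hypothetical simplification of the channels reported in [14].

import numpy as np

def frequency_channels(gray, n_channels=9):
    """Split a grey-value image into n radial spatial-frequency bands via the
    2-d FFT; summing all bands reconstructs the original image."""
    f = np.fft.fftshift(np.fft.fft2(gray))
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2.0, xx - w / 2.0)
    edges = np.linspace(0.0, r.max() + 1e-9, n_channels + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (r >= lo) & (r < hi)
        bands.append(np.fft.ifft2(np.fft.ifftshift(f * mask)).real)
    return bands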

From the hypercolumns in the primal visual cortex, activations are spread to the other areas of the brain (including the other areas of the visual cortex).

4 Wertheimer, in his investigations into Gestalt laws, called this the law of proximity.


Fig. 5. Frequency Channels (Ginsburg).

Fig. 6. Original and revisualisation of four low frequency channels (Ginsburg).

Fig. 7. ‘What’- and ‘where’-stream (Grossberg).


Further down, a 'what'- and a 'where'-stream can be differentiated (see figure 7, cf. [18]). The 'what'-stream is responsible for object recognition; the 'where'-stream localises where these objects and events are. The 'what'-stream is assumed to be allocentric, i.e. object-centered, with a basically retinotopic organisation or at least resemblance, whereas the 'where'-stream is organised egocentrically, i.e. observer-centered.

2.5 Object Recognition

Object recognition theories have a long tradition in research on the human vision process (cf. [19], [20], [21], [22], [23], [12], [24], [18]). According to Ullman ([25]), object recognition theories must be able to cope with four variability effects: photometric effects, context effects5, effects of changing perspectives, and effects of shape morphs6. By combining the above-mentioned stream processing method with Guided Search (an enhanced Feature Integration Theory, cf. [21], [22] and [23]), these obstacles can be overcome. The coupling processes are based on feature bundles and thereby guide attention bottom-up. The other way round, expectations lead to a higher pre-activation of expected features (in the expected locations). Retinotopic feature maps split features according to their types, which are then processed by the functional modules (as described in [18]). From there, stimuli associate to the brain areas where memories reside. The short-term memory is responsible for keeping relevant areas active (for more information on memory see [26], [27], [28], [29], [30], [31], [32]).

3 A Computational Model of Human Vision

3.1 Enhanced Perceptrons: e-Ganglions

The first layer of an artificial vision system emulating human vision consists of perceptron-like (cf. [33]) electronic ganglions (e-ganglions) that split the concentric field of sight of a retina-equivalent into smaller, overlapping fields of sight of ganglion-equivalents. All major types of circular receptive fields of the human ganglions7 have to be modeled in varying sizes. The various types of behavior can be imitated with pattern matching algorithms. As a side-effect, some photometric effects can already be eliminated in this layer. In figure 8, the overlapping receptive fields are represented as petals of the blossom unfolding around the fixation points.

3.2 Intermediate Layer I: Receptive Field Controller

The functional feature decomposition components register at the receptive field controller in order to limit their input to a specified selection of all available data fired by the e-ganglions.

5 For example, whenever an object is partly hidden by another object.
6 For example, a sitting vs. a standing person.
7 Luminance, red-green, and blue-yellow antagonism with on-centre, off-centre, and on-off-centre behavior, both in sustained and transient activation mode.


Fig. 8. Model: Architecture.

These components can send down impulses (coming e.g. from higher areas of cognition or from their own processing results) that impede or enhance the activations from the e-ganglions.

The receptive field controller acts as a router organising and forwarding the e-ganglion output to the feature decomposition components (and between them). Moreover, the controller sends data to the 'motor' maps component to influence exploration and sampling. Microtremors (very small movements of the fixation point) ensure that pattern matching is not too sensitive to absolute positioning.

3.3 Primal Feature Decomposition

Fed from the receptive field controller, primal feature separation takes place in the feature decomposition components. These, for example, imitate the frequency band filtering mechanisms or the orientation splitting in the hypercolumns of the visual cortex. Here again, it is important not only to support bottom-up processing, but to facilitate top-down communication of activations in order to drive exploration and sampling. This is especially necessary as, in a fixation period, only a small (thumbnail-sized) clipping of the potential stimulus material of a complete image will be processed.


The activations originating from the receptive fields of the e-ganglions, which are agglomerated by the mechanisms of the band-width filtering process (and others), are used to stimulate parts of models of concepts through the next intermediate layer, the intelligent feature router. When reaching a certain threshold level in one of these components, they lead to a complete activation of the matching model and to the recognition of a corresponding label, if the concept has already been learnt. The compartments of the 'what'- and 'where'-stream described above in figure 7 rest mainly in this layer. Some of them, however, have to be interconnected or even serialized. For example, the output of the four low-frequency channels (as depicted in the 'G' component in figure 8 resp. figure 6) converges onto a primal feature detector that serves as a preprocessor for a figure-ground separator component.

3.4 Intermediate Layer II: An Intelligent Feature Router

By introducing an intermediate bidirectional permissive layer between the crude feature detection mechanisms and the components which emulate a specific partial concept activation mechanism of the human long-term memory, the way for a distributed architecture is paved: without understanding what actually is processed, a feature router can be installed that forwards potentially interesting material to specific detector components (in parallel or consecutively). The other way round, detector components can send feature requests: they can emit (spatially bound) mark-up of interesting features or regions within the field of sight, top-down to the controller and the 'motor' map component. The router may be rule-based, pattern-based or based on a trainable neural net.
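A minimal sketch of such a router is given below; the class, the predicate-based routing, and the request book-keeping are hypothetical illustrations of the rule-based variant.

from collections import defaultdict

class FeatureRouter:
    """A minimal rule-based router: detector components subscribe with a
    predicate over feature bundles; matching bundles are forwarded to them,
    and detectors can post top-down requests for regions of interest."""

    def __init__(self):
        self.routes = []                   # (predicate, detector) pairs
        self.requests = defaultdict(list)  # detector name -> requested regions

    def register(self, predicate, detector):
        self.routes.append((predicate, detector))

    def forward(self, bundle):
        for predicate, detector in self.routes:
            if predicate(bundle):
                detector(bundle)

    def request(self, detector_name, region):
        """Top-down: a detector marks a region of interest for the controller."""
        self.requests[detector_name].append(region)

router = FeatureRouter()
router.register(lambda b: b.get("orientation") == "vertical",
                lambda b: print("edge detector gets", b))
router.forward({"orientation": "vertical", "position": (12, 40)})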

A message channel ensures that information gained in one primal feature or concept detector component can be accessed by the others.

3.5 Concept Detector Components

Concept detector components receive input from various primal feature detectors (similarly to the components of the 'what'- and 'where'-stream). By driving attention and focus top-down, first assumptions can be discarded or asserted later on in the perception process. As the retinotopic perception favors lazy, spot-oriented processing, many object recognition problems (see above) can be avoided, especially context effects and shape morph effects. Several alternatives arise considering the basic working process of a concept detector component: static and dynamic (i.e. learning via a neural net) pattern matching mechanisms compete with graph-based feature segmentation methods (cf. [34], [24]).

3.6 Controlling Exploration: Motor Maps

Motor maps are fed from both intermediate layers. They are not retinotopic but have an egocentric field-of-sight resolution. They control where the next saccade points and which location will be fixated.


The motor map component especially has to be coupled to components of a 'where'-stream equivalent.

3.7 Learning an Ontology

To develop new concept detector components in a system implementing this architectural model, the resulting (semantics-free) output of the feature decomposition layer needs to be analysed. Especially, revisualising low-frequency filter band information helps in the identification of coarse shapes. Combined with information from the 'motor' map component that enables spatial localisation, this can be used to find rules, build pattern algorithms or train a neural net capable of matching the visual representations of the concept of desire.

4 Conclusion and Future Work

The human vision process has been described in detail. A model imitating this vision process has been outlined. However, an important aspect of human vision, the clocking, i.e. timing constraints in processing, remains uncovered. Furthermore, the 'grammar' (e.g. graphs) and 'vocabulary' (e.g. visual variables) of human vision at the interface of primal feature detection and concept detection need further investigation. Ways of automatic visual learning (e.g. based on MPEG-7 shape mark-up) will additionally guide future research.

References

1. Scaife, M., Rogers, Y.: External cognition: how do graphical representations work? International Journal of Human-Computer Studies 45 (1996) 185–213

2. Pfeifer, R., Scheier, C.: Understanding Intelligence. MIT Press, Cambridge, MA (2001)

3. Drewniak, U.: Lernen mit Bildern in Texten. Waxmann, Münster, Germany (1992)

4. Weidenmann, B.: Psychische Prozesse beim Verstehen von Bildern. Verlag Hans Huber, Bern (1988)

5. Pöppl, E.: Informationsverarbeitung im menschlichen Gehirn. Informatik Spektrum 16 (2002) 427–437

6. Mallot, H.: Sehen und die Verarbeitung visueller Information. Vieweg, Braunschweig (2000)

7. Gramopadhye, A., Madhani, K.: Visual search and visual lobe size. In: IWVF 4, LNCS 2059, Berlin, Springer (2001)

8. Kolb, H., Fernandez, E., Nelson, R.: Webvision. The Organisation of the Retina and Visual System. John Moran Eye Center of the University of Utah, Salt Lake City (2005) http://webvision.med.utah.edu/

9. Kolb, H.: How the retina works. American Scientist 91 (2003) 28–35

10. Murch, G.: Human factors of color displays. In: Advances in Computer Graphics, Berlin, Eurographics (1986)

11. Pichler, P.: Prinzipien der Bildverarbeitung im visuellen System des Menschen (2002) http://www.informatik.uni-ulm.de/ni/Lehre/SS02/Proseminar_CV/ausarbeitungen2/ppichler.pdf

12. Hoffman, D.: Visuelle Intelligenz. Klett-Cotta, Stuttgart (2001)

13. Vilis, T.: The physiology of the senses. Transformations for perception and action (2005) http://www.med.uwo.ca/physiology/courses/sensesweb/

14. Ginsburg, A.: Spatial filtering and visual form perception. In Boff, Kaufman, Thomas, eds.: Handbook of Perception and Human Performance, Vol. II: Cognitive Processes and Performance. (1986) 1–41

15. Höger, R.: Speed of processing and stimulus complexity in low-frequency and high-frequency channels. Perception 26 (1997) 1039–1045

16. Höger, R.: Raumzeitliche Prozesse der visuellen Informationsverarbeitung. Scriptum Verlag, Magdeburg (2001)

17. Zeki, S.: Inner Vision. Oxford University Press, London (1999)

18. Grossberg, S.: How does the cerebral cortex work? Development, learning, attention, and 3d vision by laminar circuits of visual cortex. In Grossberg, S., ed.: Behavioral and Cognitive Neuroscience Reviews. (2003)

19. Marr, D.: Vision. Freeman, San Francisco (1982)

20. Biederman, I.: Recognition-by-components: A theory of human image understanding. Psychological Review 94 (1987) 115–147

21. Treisman, A., Gelade, G.: A feature-integration theory of attention. Cognitive Psychology 12 (1980) 97–136

22. Treisman, A.: Features and objects. Quarterly Journal of Experimental Psychology 40A (1988) 201–237

23. Wolfe, J.: Moving towards solutions to some enduring controversies in visual search. TRENDS in Cognitive Science 7 (2003) 70–76

24. Gärdenfors, P.: Conceptual Spaces. The Geometry of Thought. MIT Press, Cambridge, MA (2000)

25. Ullman, S.: The visual recognition of three-dimensional objects. In Meyer, Kornblum, eds.: Attention and Performance XIV, Cambridge, MA, MIT Press (1993)

26. Baddeley, A.: The fractionation of working memory. Proceedings of the National Academy of Sciences of the USA 93 (1996) 13468–13472

27. Baddeley, A.: Episodic memory. New directions in research. Oxford University Press, Oxford, UK (2002)

28. Barsalou, L., Simmons, K., Barbey, A., Wilson, C.: Grounding conceptual knowledge in modality-specific systems. TRENDS in Cognitive Science 7 (2003) 84–91

29. Nairne, J.: The myth of the encoding-retrieval match. Memory 10 (2002) 389–395

30. Braver, T., Cohen, J.: Working memory, cognitive control and the prefrontal cortex. Cognitive Processing 7 (2001) 25–55

31. Curtis, C., D'Esposito, M.: Persistent activity in the prefrontal cortex during working memory. TRENDS in Cognitive Science 7 (2003) 415–423

32. Engelkamp, J.: Gedächtnis für Bilder. In Sachs-Hombach, K., Rehkämper, K., eds.: Bild – Bildwahrnehmung – Bildverarbeitung, Wiesbaden, Universitäts-Verlag (1998)

33. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization in the brain. In Cummins, Cummins, eds.: Minds, Brains, and Computers, Oxford, Blackwell (2000 (1958))

34. Feldman, J.: What is a visual object? TRENDS in Cognitive Science 7 (2003) 252–256

Page 73: Fachberichte INFORMATIK · Mixed-reality as a challenge to image understanding and artificial intelligence Dietrich Paulus, Detlev Droege 11/2005 Fachberichte INFORMATIK ISSN 1860-4471
Page 74: Fachberichte INFORMATIK · Mixed-reality as a challenge to image understanding and artificial intelligence Dietrich Paulus, Detlev Droege 11/2005 Fachberichte INFORMATIK ISSN 1860-4471

Available Research Reports (since 2000):

2005

11/2005 Dietrich Paulus, Detlev Droege. Mixed-reality as a challenge to image understanding and artificial intelligence.

10/2005 Jürgen Sauer. 19. Workshop Planen, Scheduling und Konfigurieren / Entwerfen.

9/2005 Pascal Hitzler, Carsten Lutz, Gerd Stumme. Foundational Aspects of Ontologies.

8/2005 Joachim Baumeister, Dietmar Seipel. Knowledge Engineering and Software Engineering.

7/2005 Benno Stein, Sven Meier zu Eißen. Proceedings of the Second International Workshop on Text-Based Information Retrieval.

6/2005 Andreas Winter, Jürgen Ebert. Metamodel-driven Service Interoperability.

5/2005 Joschka Boedecker, Norbert Michael Mayer, Masaki Ogino, Rodrigo da Silva Guerra, Masaaki Kikuchi, Minoru Asada. Getting closer: How Simulation and Humanoid League can benefit from each other.

4/2005 Torsten Gipp, Jürgen Ebert. Web Engineering does profit from a Functional Approach.

3/2005 Oliver Obst, Anita Maas, Joschka Boedecker. HTN Planning for Flexible Coordination of Multiagent Team Behavior.

2/2005 Andreas von Hessling, Thomas Kleemann, Alex Sinner. Semantic User Profiles and their Applications in a Mobile Environment.

1/2005 Heni Ben Amor, Achim Rettinger. Intelligent Exploration for Genetic Algorithms – Using Self-Organizing Maps in Evolutionary Computation.

2004

12/2004 Manfred Rosendahl. Objektorientierte Implementierung einer Constraint basierten geometrischen Modellierung.

11/2004 Urs Kuhlmann, Harry Sneed, Andreas Winter. Workshop Reengineering Prozesse (RePro 2004) — Fallstudien, Methoden, Vorgehen, Werkzeuge.

10/2004 Bernhard Beckert, Gerd Beuster. Formal Specification of Security-relevant Properties of User-Interfaces.

9/2004 Bernhard Beckert, Martin Giese, Elmar Habermalz, Reiner Hähnle, Andreas Roth, Philipp Rümmer, Steffen Schlager. Taclets: A New Paradigm for Constructing Interactive Theorem Provers.

8/2004 Achim Rettinger. Learning from Recorded Games: A Scoring Policy for Simulated Soccer Agents.

7/2004 Oliver Obst, Markus Rollmann. Spark — A Generic Simulator for Physical Multi-agent Simulations.

6/2004 Frank Dylla, Alexander Ferrein, Gerhard Lakemeyer, Jan Murray, Oliver Obst, Thomas Röfer, Frieder Stolzenburg, Ubbo Visser, Thomas Wagner. Towards a League-Independent Qualitative Soccer Theory for RoboCup.

5/2004 Peter Baumgartner, Ulrich Furbach, Margret Groß-Hardt, Thomas Kleemann. Model Based Deduction for Database Schema Reasoning.

4/2004 Lutz Priese. A Note on Recognizable Sets of Unranked and Unordered Trees.

3/2004 Lutz Priese. Petri Net DAG Languages and Regular Tree Languages with Synchronization.

2/2004 Ulrich Furbach, Margret Groß-Hardt, Bernd Thomas, Tobias Weller, Alexander Wolf. Issues Management: Erkennen und Beherrschen von kommunikativen Risiken und Chancen.

1/2004 Andreas Winter, Carlo Simon. Exchanging Business Process Models with GXL.

2003

18/2003 Kurt Lautenbach. Duality of Marked Place/Transition Nets.

17/2003 Frieder Stolzenburg, Jan Murray, Karsten Sturm. Multiagent Matching Algorithms With and Without Coach.

16/2003 Peter Baumgartner, Paul A. Cairns, Michael Kohlhase, Erica Melis (Eds.). Knowledge Representation and Automated Reasoning for E-Learning Systems.

15/2003 Peter Baumgartner, Ulrich Furbach, Margret Groß-Hardt, Thomas Kleemann, Christoph Wernhard. KRHyper Inside — Model Based Deduction in Applications.

14/2003 Christoph Wernhard. System Description: KRHyper.

13/2003 Peter Baumgartner, Ulrich Furbach, Margret Groß-Hardt, Alex Sinner. 'Living Book' :- 'Deduction', 'Slicing', 'Interaction'.

12/2003 Heni Ben Amor, Oliver Obst, Jan Murray. Fast, Neat and Under Control: Inverse Steering Behaviors for Physical Autonomous Agents.

11/2003 Gerd Beuster, Thomas Kleemann, Bernd Thomas. MIA - A Multi-Agent Location Based Information Systems for Mobile Users in 3G Networks.

10/2003 Gerd Beuster, Ulrich Furbach, Margret Groß-Hardt, Bernd Thomas. Automatic Classification for the Identification of Relationships in a Metadata Repository.

9/2003 Nicholas Kushmerick, Bernd Thomas. Adaptive information extraction: Core technologies for information agents.

8/2003 Bernd Thomas. Bottom-Up Learning of Logic Programs for Information Extraction from Hypertext Documents.

7/2003 Ulrich Furbach. AI - A Multiple Book Review.

6/2003 Peter Baumgartner, Ulrich Furbach, Margret Groß-Hardt. Living Books.

5/2003 Oliver Obst. Using Model-Based Diagnosis to Build Hypotheses about Spatial Environments.

4/2003 Daniel Lohmann, Jürgen Ebert. A Generalization of the Hyperspace Approach Using Meta-Models.

3/2003 Marco Kogler, Oliver Obst. Simulation League: The Next Generation.

2/2003 Peter Baumgartner, Margret Groß-Hardt, Alex Sinner. Living Book – Deduction, Slicing and Interaction.

1/2003 Peter Baumgartner, Cesare Tinelli. The Model Evolution Calculus.

2002

12/2002 Kurt Lautenbach. Logical Reasoning and Petri Nets.

11/2002 Margret Groß-Hardt. Processing of Concept Based Queries for XML Data.

10/2002 Hanno Binder, Jerome Diebold, Tobias Feldmann, Andreas Kern, David Polock, Dennis Reif, Stephan Schmidt, Frank Schmitt, Dieter Zöbel. Fahrassistenzsystem zur Unterstützung beim Rückwärtsfahren mit einachsigen Gespannen.

9/2002 Jürgen Ebert, Bernt Kullbach, Franz Lehner. 4. Workshop Software Reengineering (Bad Honnef, 29./30. April 2002).

8/2002 Richard C. Holt, Andreas Winter, Jingwei Wu. Towards a Common Query Language for Reverse Engineering.

7/2002 Jürgen Ebert, Bernt Kullbach, Volker Riediger, Andreas Winter. GUPRO – Generic Understanding of Programs, An Overview.

6/2002 Margret Groß-Hardt. Concept based querying of semistructured data.

5/2002 Anna Simon, Marianne Valerius. User Requirements – Lessons Learned from a Computer Science Course.

4/2002 Frieder Stolzenburg, Oliver Obst, Jan Murray. Qualitative Velocity and Ball Interception.

3/2002 Peter Baumgartner. A First-Order Logic Davis-Putnam-Logemann-Loveland Procedure.

2/2002 Peter Baumgartner, Ulrich Furbach. Automated Deduction Techniques for the Management of Personalized Documents.

1/2002 Jürgen Ebert, Bernt Kullbach, Franz Lehner. 3. Workshop Software Reengineering (Bad Honnef, 10./11. Mai 2001).

2001

13/2001 Annette Pook. Schlussbericht "FUN - Funkunterrichtsnetzwerk".

12/2001 Toshiaki Arai, Frieder Stolzenburg. Multiagent Systems Specification by UML Statecharts Aiming at Intelligent Manufacturing.

11/2001 Kurt Lautenbach. Reproducibility of the Empty Marking.

10/2001 Jan Murray. Specifying Agents with UML in Robotic Soccer.

9/2001 Andreas Winter. Exchanging Graphs with GXL.

8/2001 Marianne Valerius, Anna Simon. Slicing Book Technology — eine neue Technik für eine neue Lehre?

7/2001 Bernt Kullbach, Volker Riediger. Folding: An Approach to Enable Program Understanding of Preprocessed Languages.

6/2001 Frieder Stolzenburg. From the Specification of Multiagent Systems by Statecharts to their Formal Analysis by Model Checking.

5/2001 Oliver Obst. Specifying Rational Agents with Statecharts and Utility Functions.

4/2001 Torsten Gipp, Jürgen Ebert. Conceptual Modelling and Web Site Generation using Graph Technology.

3/2001 Carlos I. Chesñevar, Jürgen Dix, Frieder Stolzenburg, Guillermo R. Simari. Relating Defeasible and Normal Logic Programming through Transformation Properties.

2/2001 Carola Lange, Harry M. Sneed, Andreas Winter. Applying GUPRO to GEOS – A Case Study.

1/2001 Pascal von Hutten, Stephan Philippi. Modelling a concurrent ray-tracing algorithm using object-oriented Petri-Nets.

2000

8/2000 Jürgen Ebert, Bernt Kullbach, Franz Lehner (Eds.). 2. Workshop Software Reengineering (Bad Honnef, 11./12. Mai 2000).

7/2000 Stephan Philippi. AWPN 2000 - 7. Workshop Algorithmen und Werkzeuge für Petrinetze, Koblenz, 02.-03. Oktober 2000.

6/2000 Jan Murray, Oliver Obst, Frieder Stolzenburg. Towards a Logical Approach for Soccer Agents Engineering.

5/2000 Peter Baumgartner, Hantao Zhang (Eds.). FTP 2000 – Third International Workshop on First-Order Theorem Proving, St Andrews, Scotland, July 2000.

4/2000 Frieder Stolzenburg, Alejandro J. García, Carlos I. Chesñevar, Guillermo R. Simari. Introducing Generalized Specificity in Logic Programming.

3/2000 Ingar Uhe, Manfred Rosendahl. Specification of Symbols and Implementation of Their Constraints in JKogge.

2/2000 Peter Baumgartner, Fabio Massacci. The Taming of the (X)OR.

1/2000 Richard C. Holt, Andreas Winter, Andy Schürr. GXL: Towards a Standard Exchange Format.