
Audiovisual Attention via Synchrony

Diploma Thesis

Applied Computer Science Group
Faculty of Technology
Bielefeld University

by
Matthias Rolf

Supervisors: Prof. Dr.-Ing. Franz Kummert

Dr.-Ing. Marc Hanheide

Bielefeld, June 17, 2008


Acknowledgements

During my thesis work, I had support from many different people to whom I owe thanks. First of all, I have to thank my supervisors Prof. Dr.-Ing. Franz Kummert and Dr.-Ing. Marc Hanheide. In particular I want to thank Marc Hanheide, who advised me in an exemplary manner and asked the right questions at the right time. Additionally I owe thanks to Dr. Katharina Rohlfing, who was a great support through the entire thesis work, Dr.-Ing. Yukie Nagai, whose saliency map implementation I could use, and Dipl.-Inf. Lars Schillingmann, who supported me in the use of the motionese video corpus. Furthermore I want to thank Marco, Sebastian, Mirco, Daniel, Tim, Anna and Edgar for carefully reading my thesis and giving valuable hints for debugging it. A principal thank-you is owed to my parents, who supported me through my entire life and made this thesis possible in the first place.


Declaration of Authorship

I herewith declare that I am the sole author of this thesis and that I used nothing but the specified resources and means.

Matthias Rolf
Bielefeld, June 17, 2008


Contents

Acknowledgements

Declaration of Authorship

Contents

List of Figures

List of Tables

1 Introduction

1.1 Motivation
1.2 Goals
1.3 Overview

2 Basics

2.1 Attention
2.1.1 What is Attention?
2.1.2 Visual Attention
2.1.3 Cross-Modal Attention

2.2 Saliency Maps
2.2.1 Motivation and History
2.2.2 Computational Saliency Maps
2.2.3 Saliency Maps as an Attention-Model

2.3 Multimodal Integration and Synchrony
2.3.1 Importance of temporal Synchrony
2.3.2 Synchrony in Perception
2.3.3 Synchrony in Attention
2.3.4 Computational Synchrony Detection
2.3.5 Synchrony vs. Correlation

3 Synchrony for Audio-Visual Attention

3.1 Choice of Computation Model
3.2 Modifications
3.3 Global Synchrony


4 Features for Synchrony
4.1 Introduction

4.1.1 Investigated Features
4.2 Detection Experiment

4.2.1 Method
4.2.2 Results
4.2.3 Discussion

4.3 Localization Experiment
4.3.1 Method
4.3.2 Results
4.3.3 Discussion

4.4 Conclusions

5 Synchrony in multimodal Motherese
5.1 Introduction

5.1.1 Motherese
5.1.2 Motionese
5.1.3 Multimodal Motherese

5.2 Quantitative Analysis of Synchrony
5.2.1 Method
5.2.2 Results
5.2.3 Discussion

5.3 Spatial Analysis of Synchrony
5.4 Conclusions

6 Audiovisual Attention
6.1 Feature Weighting Schemes

6.1.1 Individual Synchrony Weighting
6.1.2 Global Synchrony Weighting

6.2 Spatial Modulation
6.3 Outlook

7 Conclusions

Bibliography

A Detection Results


List of Figures

2.1 Visual-search tasks
2.2 Preattentive and attentive vision
2.3 Experiment on endogenous spatial attention
2.4 Itti's saliency map model
2.5 Static features for Saliency Maps
2.6 Focus of Attention and Inhibition of Return
2.7 Experiment by Vroomen on exogenous multimodal attention
2.8 Hershey & Movellan: Example of sound-source localization
2.9 Hershey & Movellan: Localization results
2.10 Slaney & Covell: Synchrony Detection

3.1 Comparison window vs. exp. smoothing
3.2 Filters for Hershey Mutual Information

4.1 Prewitt filter
4.2 Results for an artificial testvideo
4.3 Setup Detection Experiment
4.4 Results Detection Experiment (int, diff, flow)
4.5 Results Detection Experiment (oHor, oVert, sal)
4.6 Setup Localization Experiment

5.1 Motion modification in Child-directed Communication
5.2 Synchrony in multimodal Motherese (Gogate)
5.3 Investigated scenarios of multimodal motherese
5.4 Sobel filter and Gradient strength
5.5 Comparison adult-directed/child-directed
5.6 Spatial Distribution of Saliency and Synchrony
5.7 Spatial Distribution of Saliency and Synchrony

6.1 Feature Weighting via global Synchrony
6.2 Study on disturbances in HRI

A.1 Results Detection Experiment int
A.2 Results Detection Experiment diff
A.3 Results Detection Experiment oVert
A.4 Results Detection Experiment oHor
A.5 Results Detection Experiment RG


A.6 Results Detection Experiment BY
A.7 Results Detection Experiment flowXp
A.8 Results Detection Experiment flowXn
A.9 Results Detection Experiment flowYp
A.10 Results Detection Experiment flowYn
A.11 Results Detection Experiment sal


List of Tables

4.1 List of investigated image features
4.2 Results of the localization experiment

5.1 Hypothesis: more Synchrony in child-directed communication
5.2 Hypothesis: Correlation between AA/AC across subjects


1 Introduction

The foundations of human perception were already an issue for Aristotle and Plato [24]. Beyond the physical foundations, huge progress in the understanding of human perception skills has been made within the last 50 years. Psychology has uncovered many insights on human vision (e.g. on color, contrast and depth perception, see [63] chap. 4), hearing (e.g. speech perception [14]) and other senses. In parallel, computer science has developed computational models for perception as well as recognition tasks. Though still short of human performance, advances have been made e.g. on stereo vision [7], visual object recognition [30, 56] and speech recognition [30, 39]. Research from the 1960s to the 1980s was mainly concerned with the separation of anatomically and functionally different modules in mind and brain [18]. In parallel, the reproduction of such modules in computer science was investigated, e.g. for the modalities vision and hearing.

1.1 Motivation

This view of strictly separated modules – and thus senses – does not do justice to the fact that the real world is perceived in a heavily multimodal manner. For example, when driving a car, one sees the street and other cars, hears motors or maybe sirens, and feels wheel and gearshift as well as pressure and vibrations on the pedals. Since the 1990s it has increasingly been realized that not just the isolated modalities, but their interplay plays a crucial role [18]. Indeed, this multimodality is not a problem for perception, as one might expect with respect to the computational complexity and the need of relating the modalities to each other. Rather, multimodality is very useful and provides the basis for a unitary, integrated understanding of our environment [2]. In that way, stimuli from one modality can help interpret stimuli from another modality that are ambiguous by themselves. A well studied example is lip-reading: the view on the mouth can enhance the recognition performance of the heard words [3, 15].

While accepting the usefulness of multimodal stimuli, the problem of relating modalities to each other remains. When only seeing a moving face and only hearing a voice, it might be obvious that these two stimuli belong together. However, we are seeing, hearing and feeling more than one object or event at the same time, and not all events and objects are present at all our senses. Thus the problem is to determine which, e.g., visual stimulus has a counterpart in other senses and which one does not. Knowledge about the environment can help in doing so, since adults know that faces and voices typically belong together [2]. When no prior knowledge is available, the timing between modalities can serve as a fundamental cue. For example, a crashing glass produces a sharp stimulus in vision and hearing, appearing at the same time. Also, voice and


lip movement share rhythmic temporal patterns – i.e. they are synchronous in time. Thus stimuli from different modalities can be related to each other when occurring at the same time or enduringly sharing a temporal pattern.

The detection of such related stimuli and their integration into unitary, multimodal concepts was believed to occur only when processing within each modality is completed [18]. That means that all stimuli are first processed in a purely unimodal manner, followed by an integration. In contrast, recent research [2, 18, 60] suggests that integration across modalities already occurs before each stimulus is fully processed unimodally. In particular, synchrony between modalities has been demonstrated to affect attention. Attention allows us to concentrate on stimuli that are considered to be important. Also, stimuli that are strong or unexpected can catch attention in a reflex-like manner. In this way synchronous multimodal stimuli can attract attention. For example, we can locate and attend to sound sources due to the temporal synchrony of visual and auditory stimuli.

Findings in developmental psychology reveal the fundamental role of multimodal synchrony and its guidance of attention [2]. In several studies, young infants have been demonstrated to detect multimodal synchrony and preferentially attend to synchronous stimuli. In doing so, synchronous multimodal information provides an important cornerstone of the development of both perception and cognition in a constructivist manner [1]. For learning about their environment, infants can (i) passively observe their environment, (ii) actively explore it or (iii) benefit from a teacher in a social learning context. In a social learning scenario, temporal synchrony can (consciously or unconsciously) be used by the teacher in order to direct the infant's attention [1] and highlight e.g. relations between spoken words and visual percepts [21].

1.2 Goals

This brief introduction already gives a clear indication that (i) multimodal integration via temporal synchrony happens even preattentively in humans and that (ii) this process provides an important basis for learning and development. However, there is no known computational system that makes use of multimodal synchrony to guide attention in a purely data-driven, bottom-up manner.

Therefore the major goal of this thesis is to explore the use of multimodal synchrony for an attention system. More concretely, visual attention shall be guided via audiovisual synchrony. This combination is particularly worth investigating, since vision and hearing provide a huge amount of information even without physical contact to the observed events and objects. Moreover, computational visual attention is easy to interpret in terms of gaze and objects in the visual scene. As a basis for visual attention I use saliency maps [29], which provide a well-investigated and powerful foundation of visual bottom-up attention. The basic idea is to extend or modify saliency maps in such a way that attention is preferentially focused on visual stimuli that are synchronous with the heard sound. In the absence of any audiovisual synchrony the system should behave like purely visual saliency maps.


Besides the vision of such an attention system, I pursue three concrete subgoals in this thesis. The first subgoal is to find out which features yield significant information about synchrony. Here I focus on the investigation of diverse image features. Those features are tested on two tasks: the general ability to discriminate between synchronous and asynchronous conditions as well as the spatial localization of synchrony within the visual scene. Secondly, I investigate the use of synchrony in a social learning scenario in terms of child-directed communication. I follow and test the hypothesis that teachers provide additional learning cues by means of synchrony towards young infants. While that account has already been addressed on a more abstract level [21], it is unclear whether this additional synchrony can be detected on the signal level. Thirdly, I demonstrate and discuss several ways of integrating synchrony into saliency maps. Here I will remain on an argumentative level.

1.3 Overview

The next chapter introduces the basic concepts and methods used in this thesis. In section 2.1, I discuss the most important insights on attention in general, on visual attention and in particular on cross-modal influences. Section 2.2 contains a technical and conceptual overview of saliency maps as the foundation of visual attention in this thesis. Finally, I discuss the role of multimodal synchrony in section 2.3. In particular, I introduce computational methods for signal-level synchrony that have been applied to the combination of audio and video. In chapter 3, I discuss the computational synchrony method used for this thesis: I adopt a known method, described in section 2.3.4, and propose several modifications and extensions. Chapter 4 deals with the investigation of image features for synchrony. The experiments on detection and localization are described in sections 4.2 and 4.3. The second subgoal – the investigation of a social learning scenario – is handled in chapter 5. After giving a comprehensive overview of research on child-directed communication, I demonstrate that an increased synchrony can indeed be detected when parents interact with their infants. Chapter 6 contains several sketches for an integration of saliency maps and audiovisual synchrony, as well as an outlook on possible evaluation studies. Finally, I summarize the results and newly arisen questions in chapter 7.


2 Basics

2.1 Attention

The goal of this thesis is to develop an audiovisual attention system. Thus the most obvious questions are:

• What does attention mean?

• How can multimodal (e.g. audiovisual) stimuli affect attention?

This section gives an overview of the issues and ongoing research concerning these two questions that are most important for this thesis.

2.1.1 What is Attention?

In 1890, psychology pioneer William James stated "Everyone knows what attention is" [31]. Though everyone might have an intuition, it is very hard to set up a definition incorporating all empirical knowledge, and still there is no satisfying definition of attention [62]. The common ground is: our senses collect a huge amount of information as we see, hear and feel our environment. The term "attention" refers to the fact that we are only aware of a small subset of this information. Attention covers a rather diverse set of selective processes, deciding which information is important and worth extensive processing while only performing a limited analysis on all other information [62].

Some examples clearly show the diversity of the term "attention". Humans can attend to . . .

• a single modality like vision or touch, at cost of other modalities,

• one certain aspect within a modality, e.g. color in vision,

• a task to solve [62],

• a certain object or

• a location in space – even in a multimodal fashion [17].

Such a selection enables us to focus only on relevant facts – in particular for survival ([63], chapter 5.1.1). When facing a danger, a human or animal should give high priority to information related to that danger. Also in harmless situations the selection of information is important, because our brains are not able to process all incoming information in a meaningful way.


Filter Theory and Extensions After behaviorism had denied unobservable cognitive processes like attention, psychology returned to these issues during the so-called "cognitive revolution" in the 1950s. In 1958 Donald Broadbent introduced his influential Filter Model of attention [6]. Broadbent's experiments focused on dichotic listening: subjects were hearing different word sequences in each ear. They were instructed to attend to one of these sequences and repeat the words aloud. The anticipated effect was that the subjects were not able to remember the unattended sequence of words. Interestingly, the subjects did not even notice significant changes in the unattended stream, like the tape running in reverse or switching from English to German. However, they noticed changes from a male to a female speaker.

Broadbent inferred that incoming information is processed to some degree on a preattentive level, i.e. before subjects are aware of that information. According to Broadbent's theory, attention then works as a selective filter, handling a huge amount of information by (i) blocking most of the information considered to be unwanted and (ii) passing wanted information to consciousness. Information rated as unwanted is buffered for a short period of time, so it can still be processed in a delayed way. Broadbent's theory is strongly influenced by mathematical Information Theory, treating attention as a communication channel with limited bandwidth. It is limited in the amount of information and by the fact that it cannot arbitrarily switch between different sources of information. Hence subjects are not able to perceive the sound in both ears consciously.

Though influential, Broadbent's theory could not explain several observations. A first modification was proposed by Anne Treisman in 1960. When listening to distinct stories in each ear, subjects tended to (unconsciously) replace words from the attended story with words from the unattended story if they made more sense. This could not be explained within a strict filter model, since the information from the other story should be completely ignored. Treisman concluded that attention operates via attenuation, gradually weighting the importance of information rather than making a binary decision. In a second extension, Ulric Neisser (1967) noted that, apart from the sensory input, background knowledge is crucial to attention. Information is filtered using prior learning and combined with that knowledge before reaching consciousness ([63], chapter 5.1.5).

Endogenous vs. Exogenous Whenever arguing about attention, an important consideration has to be made about the direction of control [32]. First of all, humans are able to consciously direct attention, e.g. when instructed to do so. This endogenous (also referred to as voluntary or top-down) direction of control is driven by the expectation [17] of what information might be useful. When handling a specific task that allows such prediction, higher cognitive functions cause concentration on the kind of information that is potentially useful for completing the task. Examples include visual search tasks, i.e. when people seek a visual target whose appearance is known in advance. By definition, this mechanism cannot direct attention towards unexpected stimuli. The shift to new information is done by the exogenous


Figure 2.1: Left: Certain features like color, size and orientation cause an immediate pop-out. A different shape like the "2" among the "5"s causes no pop-out. Center: conjunctions of features (like the black cross) take more effort to be detected. Right: the ease of detection depends on the difference between target and distractor. Slight differences take more effort to be found. Illustrations from [62]

(also referred to as reflexive or bottom-up) direction of control, which considers changing or outstanding stimuli to be important [17]. An important function of this mechanism is alerting [49]. Critical information (e.g. about a predator) causes a general alert state and enables humans and animals to react quickly.

2.1.2 Visual Attention

A lot of research has been done on visual attention. A common experimental framework for visual attention is the visual search task [62]. In a visual search task an observer seeks a target object among a set of distracting objects. For example, the observer could be instructed to find a red item among several blue items (see figure 2.1 left). Insights into visual attention can be gained because some tasks are observed to be completed in a very fast and efficient manner, while others are not. The complexity of a task can be measured using the subject's reaction time, which is the time from presenting the task until subjects can tell whether the item is present or not. The task can e.g. be varied in (i) the kind of discriminating feature (like color or orientation), (ii) the number of distracting items or (iii) the optical contrast between target and distractors. For target features like color, orientation and size the target objects immediately pop out and are detected in a very efficient manner due to exogenous attention mechanisms (figure 2.1 left) — efficient in that more distracting items do not complicate the task. A different shape is not detected that efficiently: the single "2" amongst the "5"s cannot be seen immediately (figure 2.1 left) and the performance depends on the number of distractors. A broad discussion of those features can be found in [62]. A more complex task – i.e. one that takes more time – is finding items via conjunctions of features (see figure 2.1 center). This requires more cognitive effort and endogenous control. Also the reaction time depends on the contrast between target and distractors. Slight contrasts (figure 2.1 right) take more time than prominent differences like red/blue.


Figure 2.2: According to Feature Integration Theory, primitive visual features are automatically computed on a preattentive stage, followed by an attentive combination and recognition. Illustration from [62]

Spatial Cueing Paradigm An important paradigm in visual attention research – known as Spatial Cueing – is that visual attention is basically space-based. There is a so-called Focus of Attention (FoA) located in the visual field that subsumes the attended information. This FoA is often described as a spotlight [22] that highlights important information.

An important theory following that paradigm was proposed by Anne Treisman [57] in 1980: the Feature Integration Theory. The main proposition is that visual perception consists of two functionally independent and subsequent processing stages (see figure 2.2). On a first, preattentive stage a set of features is computed over the entire visual field. The features are computed independently and in parallel, each creating one feature map that encodes the presence of the feature for each location. The assumption that this computation always happens over the entire visual field explains the immediate pop-out of items in visual search — which only occurs when a single feature discriminates the target from the distractors. Since the computation time does not depend on the number of distractors, neither does the reaction time.

On a second, attentive stage the combination and joint processing of those features takes place. According to Feature Integration Theory, every combination of features (like representing a colored shape) requires attention. Thus a conjunction of features causes no pop-out in a visual search task, and items have to be scanned in a serial, attentive manner [51], focusing on one item at a time.

Newer empirical results show major drawbacks of Treisman's theory, e.g. that some preattentive features cannot guide attention (i.e. do not cause pop-outs in visual search tasks) or that the strict architecture of two subsequent stages is not reasonable itself [62]. Though there are more elaborate visual attention theories, they are not discussed here since the Feature Integration Theory has the highest practical relevance for this thesis. It provides the idea of preattentive feature extraction and yields the basis for saliency maps, discussed in section 2.2.

Other Paradigms Besides the space-based approach there are two other paradigms present in the literature: object-based visual attention claims that attention is not bound


Figure 2.3: Subjects are told to fixate the center point. A visual or auditory stimulus is presented at one of four speakers/lights. Illustration from [17]

to arbitrary locations in space but to perceptual objects. As noted in [22], these approaches do not necessarily exclude each other, since objects are always present at a certain location in space. Additionally, attempts to distinguish the approaches suffer from vague and diverging interpretations of "object-based". Finally, feature- or dimension-based approaches propose that selection is restricted to a maximum number of features that can be used for discrimination at the same time [44].

2.1.3 Cross-Modal Attention

For a long time the research mainstream on selective attention focused on a single modality at a time [11, 17]. However, we are facing a multimodal environment and perceiving objects and events in a multimodal manner. Thus it is reasonable to assume that attention also works in a multimodal way when attending to an object that is perceived through several senses. In fact there is strong evidence for diverse cross-modal interactions in attention.

The most basic interaction is selectively attending to one modality while ignoring others. Examples include focusing on sounds during a telephone call or ignoring all other stimuli when smelling an interesting odor, while most of the time vision is the dominating modality for humans. While this link was already mentioned in 1893 by Wilhelm Wundt, more recent results suggest a close multimodal linkage in spatial attention. Figure 2.3 illustrates an experiment indicating a link between vision and hearing for endogenous spatial attention. Subjects were asked to fixate a point in the middle while perceiving a visual or auditory stimulus at the left/right bottom/top corner of a board in front of them. They should discriminate whether the stimulus appeared on the top or at the bottom. They were informed about the most likely side the stimulus would appear on, and in fact they produced better and faster discriminations on the expected side. They even performed better when expecting a visual stimulus but perceiving an auditory stimulus on the expected side, which suggests a shared spatial focus for vision and hearing.

Charles Spence and Jon Driver review several of those links in [17], also showing interdependences between vision/touch and hearing/touch, and results for exogenous spatial attention (also shown in [11]). Recently there has also been neurobiological evidence for multimodal interdependences between attentional and early, preattentive cortical brain areas that were assumed to work purely unimodally [10].


Figure 2.4: Left: An image is decomposed into several feature maps. Difference-of-Gaussians across several scales is applied to each feature. The resulting maps are normalized and fed into a unique map of saliency. A winner-take-all mechanism with inhibition of return produces a serial shift of a focus of attention. Right: Example of original image, conspicuity maps per feature and the final saliency map. Illustration from [28]

2.2 Saliency Maps

The investigation of cognitive science concepts by means of computational models has become common since the beginning of the artificial intelligence movement in the 1950s. This is first of all a proper means of verifying or discarding theories originating in psychology or neuroscience. On the other hand, artificial systems can take direct advantage of those computational models. In the same way attention – and especially visual attention – has been studied in the last decades. Saliency Maps are nowadays a widely used [20, 53] model for visual bottom-up attention, first conceptually described by Koch & Ullman in 1985 [37].

2.2.1 Motivation and History

Already Treisman’s Feature Integration Theory described a system of feature maps,combined into a master map of locations. Activities in this master map encode thepresence of salient discontinuities in feature activity [51]. For example a single redobject among other blue objects (see section 2.1.2) is reflected by a discontinuity incolor. The red object is salient in being outstanding from its surrounding. Indeed,those saliency maps were not only described by psychology but as well found in neuro-physiologic studies of primate brains [23]. In a behavioral study it was shown thatonly a small subset of visual stimuli was directly represented in primate brains and

11

Page 24: Audiovisual Attention via Synchrony · and depth-perception, see [63] chap. 4), hearing (e.g. speech perception [14]) and other senses. In parallel, computer science developed computational

that saliency maps (located in the parietal cortex) had significant influence on whatwas represented.

Though conceptually described in 1985, the first implementation of saliency maps was reported in 1996 [46]. After several extensions, the now commonly used model was proposed by Laurent Itti in 1998 [29] and further discussed in [28].

2.2.2 Computational Saliency Maps

Itti’s model is designed to handle static color images. For a single input image, onesaliency map is computed out of several image features. Then a focus of attention (seesection 2.1.2) is placed in the image, serially shifting over the image. Three groupsof features are used to detect saliency: colors, intensity and orientations. For eachgroup of features one so called conspicuity map is computed. Afterwards the threeconspicuity maps are combined into a single, unified saliency map. These separatesteps (illustrated in figure 2.4) are now described in detail.

Modelling Saliency When an image is presented to the system, at first a Gaussian pyramid of the image is computed. A Gaussian pyramid consists of several scales of the same image, where scale zero is the unfiltered, original-size image. Each subsequent scale is computed from the scale above by first filtering with a Gaussian kernel – typically of size 5x5. This yields a low-pass filtered, i.e. smoothed, image that is afterwards sampled down to half of its original size [8]. This pyramid is computed up to scale eight – which means a horizontal and vertical size reduction of 1/256.
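As a rough illustration, the pyramid construction can be sketched in a few lines of Python (using numpy and scipy rather than the implementation underlying this thesis; the kernel width and the use of strided subsampling are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(feature_map, num_scales=9):
    """Gaussian pyramid: scale 0 is the original single-channel map,
    each further scale is low-pass filtered and downsampled by factor 2."""
    pyramid = [feature_map.astype(np.float64)]
    for _ in range(1, num_scales):
        smoothed = gaussian_filter(pyramid[-1], sigma=1.0)  # ~5x5 Gaussian smoothing
        pyramid.append(smoothed[::2, ::2])                  # keep every 2nd row/column
    return pyramid
```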

On each scale a set of features is computed. The first feature is intensity or brightness. If r, g and b encode the red, green and blue value of a certain pixel, then the intensity is I = (r + g + b)/3. Color information is obtained by computing the red-green and blue-yellow color differences. Normalized color maps are computed with

R = r − (g + b)/2 for red,
G = g − (r + b)/2 for green,
B = b − (r + g)/2 for blue and
Y = (r + g)/2 − |r − g|/2 − b for yellow.

Now the differences R − G and B − Y are used as features for the saliency map. The last group of features consists of Gabor filters [13] providing information about orientation. Gabor filters are parameterized with an angle θ, which lets them react to edges with that orientation. For saliency maps, four orientations θ ∈ {0°, 45°, 90°, 135°} are used. All features are illustrated in figure 2.5.
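A minimal sketch of the per-pixel intensity and color-opponency features on one scale (the orientation maps would additionally require Gabor filtering, which is only indicated in a comment; the exact value ranges are not taken from the thesis):

```python
import numpy as np

def basic_features(rgb):
    """Per-pixel intensity and color-opponency features on one scale.
    `rgb` is an HxWx3 float array with channels r, g, b."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    R = r - (g + b) / 2.0                        # red
    G = g - (r + b) / 2.0                        # green
    B = b - (r + g) / 2.0                        # blue
    Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b  # yellow
    # the orientation maps would additionally apply Gabor filters at
    # theta in {0, 45, 90, 135} degrees (e.g. skimage.filters.gabor)
    return intensity, R - G, B - Y               # intensity, RG and BY opponency
```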

Conspicuity is now measured as a center-surround difference using the Gaussian pyramids: pixel values from different scales are subtracted. That yields high (absolute) values when the feature value differs from its surrounding, which is encoded in the more downsampled scale [8]. When ∘(i) is a feature on scale i, the following cross-scale differences are computed for every combination of scales c ∈ {2, 3, 4} and s = c + δ, δ ∈ {3, 4}:


Figure 2.5: Top left: original color image. Top right: intensity image. Bottom left: RG contrast, where gray is neutral. Bottom right: vertical orientation image, where gray indicates zero filter response.

I(c, s) = |I(c) ⊖ I(s)|
RG(c, s) = |(R(c) − G(c)) ⊖ (R(s) − G(s))|
BY(c, s) = |(B(c) − Y(c)) ⊖ (B(s) − Y(s))|

And for every angle θ of orientation:

O(c, s, θ) = |O(c, θ) ⊖ O(s, θ)|

Here ⊖ denotes the difference between images at different scales. That operation differs from a normal difference, because several pixels in one image correspond to the same pixel in the smaller image; the smaller image is therefore brought to the larger size before subtracting.

The resulting images encode local conspicuity (or saliency) with respect to a certain feature, scale and size of the surrounding. For six combinations of scales and four orientation angles, 42 of those maps are computed.
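One possible realization of the cross-scale difference ⊖ is to upsample the coarser "surround" scale to the size of the finer "center" scale and take the absolute pixel-wise difference; the following sketch assumes the pyramid from the earlier snippet and uses bilinear interpolation as one plausible choice:

```python
import numpy as np
from scipy.ndimage import zoom

def center_surround(pyramid, c, s):
    """Cross-scale difference |pyramid[c] (-) pyramid[s]|: the coarser
    surround scale s is upsampled to the size of the finer center scale c
    before the point-wise subtraction."""
    center, surround = pyramid[c], pyramid[s]
    factors = (center.shape[0] / surround.shape[0],
               center.shape[1] / surround.shape[1])
    surround_up = zoom(surround, factors, order=1)   # bilinear upsampling
    return np.abs(center - surround_up)

# e.g. the six intensity maps I(c, s) for c in {2,3,4}, s = c+3, c+4:
# I_maps = [center_surround(I_pyr, c, c + d) for c in (2, 3, 4) for d in (3, 4)]
```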

Integration across Features After computing cross-scale differences for each feature, the resulting maps have to be combined into a unique saliency map. This task involves two problems:

1. The features represent modalities that are not directly comparable. Each feature has a specific range and distribution.


2. Since 42 maps are combined, objects salient in one map can be masked by noise or less salient structures in other maps.

Itti and Koch propose a normalization operator N(·) for handling both problems. Applied to each map, it promotes maps with a small number of significant peaks and suppresses maps with no globally significant maxima. They describe the operator in three steps (a small code sketch follows the list):

1. Normalizing the values of the map to a fixed range [0..M]. This eliminates all differences in the amplitude of different features, but can reinforce noise.

2. Finding all local maxima in the map and computing their average value m̄.

3. Multiplying each value in the map with (M − m̄)². If the global maximum significantly differs from the average local maximum, this produces numerical dominance over other maps.
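A small sketch of the operator N(·) following the three steps above; the local-maximum neighborhood size and M are assumptions, not values prescribed by Itti and Koch:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(conspicuity, M=1.0, neighborhood=7):
    """Normalization operator N(.): rescale to [0, M], then weight the map
    by (M - m_bar)^2, where m_bar averages the other local maxima."""
    cmap = conspicuity - conspicuity.min()
    if cmap.max() > 0:
        cmap = cmap / cmap.max() * M                       # step 1: scale to [0, M]
    is_local_max = (cmap == maximum_filter(cmap, size=neighborhood)) & (cmap > 0)
    others = cmap[is_local_max & (cmap < cmap.max())]      # local maxima except the global one
    m_bar = others.mean() if others.size else 0.0          # step 2: average local maxima
    return cmap * (M - m_bar) ** 2                         # step 3: promote/suppress
```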

Using this operator, the single maps are first combined into one so-called conspicuity map per feature channel. For this, each map is normalized and added to the conspicuity map. Due to the different map sizes, cross-scale addition ⊕ is used:

Ī = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(I(c, s))    (2.1)

C̄ = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} [ N(RG(c, s)) + N(BY(c, s)) ]    (2.2)

Ō = Σ_{θ ∈ {0°, 45°, 90°, 135°}} N( ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(O(c, s, θ)) )    (2.3)

The maintained separation between those channels is motivated by the hypothesis that similar maps compete for saliency while different modalities contribute independently to the final saliency map:

S = 1/3 · ( N(Ī) + N(C̄) + N(Ō) )    (2.4)

The resulting saliency map (see figure 2.4 for an example) topographically encodes the local saliency/outstandingness of each region over the entire image. A saliency map indicates what regions in the image are potentially interesting and thus worth attending to.
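Under the assumption that the cross-scale maps have already been computed, equations (2.1)-(2.4) could be realized roughly as follows (reusing normalize_map from the sketch above; the common resolution chosen for cross-scale addition is a free choice here):

```python
from scipy.ndimage import zoom

def resize_to(feature_map, shape):
    """Resample a map to a common target shape (cross-scale addition)."""
    factors = (shape[0] / feature_map.shape[0], shape[1] / feature_map.shape[1])
    return zoom(feature_map, factors, order=1)

def saliency_map(I_maps, RG_maps, BY_maps, O_maps_by_theta, target_shape):
    """Combine the 42 cross-scale maps into conspicuity maps and the final
    saliency map S = 1/3 (N(I) + N(C) + N(O)), following eqs. (2.1)-(2.4).
    Each *_maps argument holds the six cross-scale difference maps of that
    feature; O_maps_by_theta maps each angle theta to its six maps."""
    add = lambda maps: sum(resize_to(normalize_map(m), target_shape) for m in maps)
    I_bar = add(I_maps)                                           # eq. (2.1)
    C_bar = add(RG_maps) + add(BY_maps)                           # eq. (2.2)
    O_bar = sum(normalize_map(add(maps))                          # eq. (2.3)
                for maps in O_maps_by_theta.values())
    return (normalize_map(I_bar) + normalize_map(C_bar)
            + normalize_map(O_bar)) / 3.0                         # eq. (2.4)
```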

Overt Attention So far, saliency maps can model covert attention. Covert attention refers to system-internal shifts of attention and thus processing priority. Essentially saliency maps indicate exactly that priority. In contrast, overt attention describes physical shifts of receptors towards a region of interest. This can be a movement of e.g. eyes, head or hands [17]. For vision, overt attention can easily be modeled by


Figure 2.6: Dynamic Inhibition of Return in the saliency map causes a shifting Focus of Attention. Illustration from [28]

a focus of attention and an eye gaze directed at that focus. An obvious possibility for choosing that FoA is a winner-take-all mechanism, i.e. all attention is focused on the single most salient region. However, this is not sufficient to explore complex visual scenes, since there can be more than one interesting region and there is no guarantee that the most salient region is really interesting at all.

One biologically and psychologically plausible way of solving that problem is Inhibition of Return (IoR) [36]. Once a region has been focused, the degree of interest for that region decreases and other regions become more interesting. Itti and Koch suggest a neural network, arranged in 2D over the saliency map, as an implementation of IoR. Neurons get excited by the corresponding location of the saliency map, causing the fastest increase of activity within the most salient region. When the first neuron's activity reaches a specified threshold, the FoA is shifted towards that location. Simultaneously all neurons are reset to a neutral level, invoking a new competition for the FoA. Additionally, a local inhibition around the FoA is established as the IoR mechanism. This allows other, less salient regions to obtain the FoA. Figure 2.6 illustrates that serial shift of the FoA over time. Already focused areas in the saliency map are inhibited and appear as black circles in the map visualization.
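The following sketch replaces the neural-network dynamics with a plain argmax loop, which only captures the behavioral idea of winner-take-all plus Inhibition of Return; the inhibition radius and number of shifts are arbitrary assumptions:

```python
import numpy as np

def foa_sequence(saliency, num_shifts=5, inhibition_radius=20):
    """Simplified winner-take-all with Inhibition of Return: repeatedly
    pick the most salient location and suppress a disc around it."""
    s = saliency.copy()
    yy, xx = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    foci = []
    for _ in range(num_shifts):
        y, x = np.unravel_index(np.argmax(s), s.shape)   # winner takes all
        foci.append((y, x))
        mask = (yy - y) ** 2 + (xx - x) ** 2 <= inhibition_radius ** 2
        s[mask] = 0.0                                    # inhibition of return
    return foci
```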

Dynamic Scenes The model described so far uses a single image, which is scanned in a serial manner over several timesteps. However, this scheme can easily be transferred to sequences of images (videos). The computation of features and the saliency map can be done successively on each frame. Afterwards, one FoA can be located per frame [45]. Thereby additional features can be used to do justice to the now dynamic stimuli. Commonly used features are optical flow [58] and difference images [41].
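For illustration, a difference image as one such dynamic feature can be computed per frame pair and then fed through the same pyramid/center-surround pipeline as the static features (a trivial sketch, not the exact feature definition used later in this thesis):

```python
import numpy as np

def difference_image(prev_frame, curr_frame):
    """Per-pixel absolute intensity change between consecutive frames,
    usable as an additional (motion) feature channel."""
    return np.abs(curr_frame.astype(np.float64) - prev_frame.astype(np.float64))
```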

2.2.3 Saliency Maps as an Attention-Model

Reviewing the computation model, saliency maps can be classified with respect to several notions of attention. First, saliency maps were mainly intended to work purely visually throughout their development and several extensions. Nevertheless, saliency maps were recently applied to audio-spectrogram data [33]. In fact, the usage as a multimodal attention model is restricted by the unique topographic map that fixates a single spatial reference system. This also clarifies that saliency maps (used for visual attention) follow the Spatial Cueing paradigm and in particular the Feature Integration Theory (see 2.1.2 and 2.2.1). However, multimodal cues can be integrated even if attention is purely visual. Several ways of such integration will be discussed in chapter 6.

Intrinsically, saliency maps provide bottom-up/exogenous control of attention. Both covert and overt attention are fully determined by the incoming data. However, there are reasonable (but limited) ways to involve top-down/voluntary control. First, features can be learned or modulated. Second, the recombination of both scales and features can be weighted or modified in a more general way. Itti and Koch overview several existing approaches in [28].


2.3 Multimodal Integration and Synchrony

We permanently perceive our environment in a multimodal manner. For example, when talking face to face, we can both hear and see the other person. When grasping for an object, we see our hand moving, while proprioception provides information about arm-joint positions, until we finally feel the object in the hand. In that way humans perceive many objects and events multimodally. This observation evokes questions about how humans are able to handle information from different senses. In this context, multimodal integration means that information from different modalities is combined towards a single, multimodal representation. For example, when listening to a person while observing his or her mouth, we do not perceive those stimuli to be separate, but as a unique event – just as they originate from the same physical source.

2.3.1 Importance of temporal Synchrony

It is not surprising that adult humans are able to link a heard voice with a seen face, since adults know that those percepts belong together. In fact, adults must have learned this knowledge just as they have learned that a breaking glass causes a crashing sound, and learned about other events and objects. This ability – binding stimuli from different modalities – is only possible because stimuli from different modalities depend on each other in two ways:

• Multimodal percepts are temporally correlated. Voice shares a rhythmic pattern with the moving mouth. The breaking glass occurs in close temporal proximity to its crashing sound.

• Also they are spatially correlated. We can locate stimuli via vision, hearing and touch. Thus if visual and auditory stimuli share the same origin, humans are (in general) able to detect that spatial correspondence [18].

The most general form is temporal correlation — also referred to as synchrony. Though unique physical events always happen in space and time, we are not always able to deal with spatial correlation. First of all this is due to fundamental restrictions of several senses.

• Senses like smell or taste possess hardly any spatial information.

• Hearing, as the major distance-bridging sense, does not always provide accurate spatial localization. Dull, low-frequency tones can hardly be localized due to the ratio of ear distance and wavelength of the sound.

Nevertheless, vision, hearing and touch can be assumed to provide spatial information in many relevant situations. Still, recognizing a spatial correspondence remains difficult, because the spatial coordinate systems of different senses cannot be assumed to be aligned a priori: vision, hearing and touch each have their own spatial reference system. Relations between those systems must be known when considering spatial correlation across modalities. Even harder, the correspondences change with every body movement. When moving the eyes, the coordinate system of vision is shifted against the other coordinate systems. The same is true for head movements regarding hearing, and for movements in general regarding touch [17]. Thus relations between different spatial reference systems must also be learned, for which only temporal synchrony cues remain.

2.3.2 Synchrony in Perception

Binding stimuli from different modalities using temporal synchrony is useful for interpreting incoming signals. For example, lip reading can enhance the recognition of spoken words [3, 15]. Nevertheless, binding via temporal synchrony can also illusionarily distort perception away from the true physical properties. Methodically, such illusions are well suited to identify links in perception between modalities.

A well known multimodal illusion is the McGurk effect [40], discovered by Harry McGurk and John MacDonald in 1976. Subjects were shown a video tape of a person speaking the syllables /ga-ga/. However, the audio track was manipulated, containing the syllables /ba-ba/. Since the consonant /b/ can only be articulated with a short lip closure, and /g/ only without, those percepts do not fit together. Unconsciously faced with that conflict, 98% of all subjects reported to perceive /da-da/, which is phonetically similar to /ba-ba/ and would invoke lip movements similar to /ga-ga/.

Another well studied example [25] is known as the (spatial) ventriloquism effect, which is a mislocation of sound towards a simultaneously presented visual cue. Ventriloquists speak – widely avoiding visible mouth movements – while moving the mouth of a dummy. When both patterns are synchronous, one perceives the sound to originate from the dummy. Such mislocation has also been found for tactile stimuli influenced by vision [48].

The variety of such effects clearly indicates that integration across modalities is the rule rather than the exception in real perception situations [18]. Also, integration occurs early within cognitive processing, since cues from one modality affect how or where a stimulus from another modality is perceived.

2.3.3 Synchrony in Attention

In particular – and importantly for this thesis – temporal synchrony of modalities has been demonstrated to be influential on spatial attention. Since synchrony is fully and exclusively determined by incoming stimuli, such guidance occurs within exogenous attention.

An experiment by Vroomen and de Gelder [60] demonstrates that auditory cues can guide attention towards simultaneously presented visual stimuli (figure 2.7). Subjects should detect and localize a diamond-shaped structure within a stream of visual displays. A tone from a fixed location was presented concurrently with each display. Subjects were told that a single tone could differ from all others (e.g. a high tone among low tones). The results showed that subjects performed better in detecting the


Figure 2.7: A rapid sequence of visual displays was presented to subjects. Thereby they should detect the diamond-shaped structure. A tone was played simultaneously with each display. One tone could differ from all others. Illustration from [18]

diamond shape when the unique tone was played simultaneously with the diamond shape. The effect also occurred when the temporal relation between the unique tone and the diamond shape was entirely non-predictive, thus excluding any voluntary use of the coincidence. Driver and Spence refer to that effect as pop-out [18], corresponding to the purely visual pop-out effects already described in 2.1.2.

The fundamental role of temporal synchrony for multimodal attention, perception and learning is confirmed by findings in developmental psychology. Bahrick et al. review several experiments on synchrony guiding overt attention in [2]. They refer to amodal information as information redundantly spread across several modalities and presented in a temporally synchronous way. In contrast, modality-specific information can only be perceived in one modality. For example, a bouncing soccer ball shares rhythm and tempo – as amodal information – across vision and hearing. On the other hand, the color of the ball is specific to vision while pitch and timbre are specific to sound. In the experiments, multimodal events are presented to infants. When infants are habituated to the event (measured according to looking level) the stimulus is changed. Insights can be gained through subsequent changes in overt attention. In this way, 3-month-old infants paid attention to the changing tempo of a tapping toy hammer presented visually and auditorily. This change was not noticed in a purely unimodal (visual or auditory) presentation. The other way around, changes in modality-specific properties were only noticed in unimodal presentations, but not in multimodal situations. Based on those results, Bahrick et al. set up the Intersensory Redundancy Hypothesis: during early infancy, attention is especially tuned to amodal information in multimodal perception of events and to modality-specific aspects in unimodally perceived events [2]. In particular, this claims that temporal cues like rate and rhythm


Figure 2.8: Sound-source localization using Hershey & Movellan's algorithm. While a person on either the left or right side is speaking, corresponding pixels yield high mutual information, visualized with highlights in the images. Illustration from [26]

are unattended in unimodal percepts.

Together these empirical results clearly show that temporal synchrony of multimodal stimuli can guide attention towards those stimuli. Therefore the integration and synchrony detection have to happen early in cognitive processing and have to rely on rather primitive features, available in preattentive computation.

2.3.4 Computational Synchrony Detection

Effects of synchrony on exogenous attention, but also ventriloquism, highlight that multimodal integration is (partially) done in a preattentive manner [17] — thus on the same level on which preattentive visual features like color and intensity are computed. Though the effects of temporal synchrony have been well studied, the literature on psychology and cognitive neuroscience clearly lacks a concrete and common definition of such temporal synchrony itself. On the level of discrete events with precise timing, synchrony can easily be defined in terms of interval intersections and time differences. Those criteria are not a priori applicable to signal-level data.

Computer scientists have been developing (audio-video) synchrony detection algorithms since the pioneering work by Hershey and Movellan in 2000 [26]. An overview of commonly used features, preprocessing steps and correspondence measures can be found in [5].

Hershey & Movellan (2000) Motivated by findings on ventriloquism, Hershey and Movellan build on the scenario of sound-source localization: in the presence of an audio signal, the physical source of that signal shall be localized in a concurrent video signal. It is assumed that the pixel values corresponding to the sound source share temporal patterns with the audio signal. Those patterns shall be uncovered by a statistical analysis of each single pixel and the sound signal over time.

They use a model that is in general independent of the kind and number of acoustic and visual features. They refer to a(t) ∈ R^n as the acoustic features over time and to v(x, y, t) ∈


Figure 2.9: The x-coordinate of the estimated sound-source location over time (solid line). The true sound location changes as the two persons alternate speaking (dashed line). Illustration from [26]

R^m as the visual features per pixel over time. The statistical analysis is restricted to a small time window of length s that is shifted over time, so that the past s video frames are analyzed together with the corresponding audio signal. The set of data at all coordinates (x, y), reaching s frames into the past from the current frame (with index k at time t_k), is denoted as S = (a(t_l), v(x, y, t_l))_{l=k-s+1,...,k}. Samples in that set are assumed to originate from a joint Gaussian process (A(t_k), V(x, y, t_k)) ∼ N_{n+m}(µ_{A+V}(x, y, t_k), Σ_{A+V}(x, y, t_k)). That is, each sample is drawn independently from a normal distribution with mean µ_{A+V}(x, y, t_k) and covariance matrix Σ_{A+V}(x, y, t_k), which thus describes both the acoustic and the visual features and their interdependence. The parameters of this normal distribution can be estimated from the data samples:

\mu_{A+V}(x, y, t_k) = \frac{1}{s} \sum_{l=0}^{s-1} \begin{pmatrix} a(t_{k-l}) \\ v(x, y, t_{k-l}) \end{pmatrix} \tag{2.5}

\Sigma_{A+V}(x, y, t_k) = \frac{1}{s-1} \sum_{l=0}^{s-1} \left( \begin{pmatrix} a(t_{k-l}) \\ v(x, y, t_{k-l}) \end{pmatrix} - \mu_{A+V}(x, y, t_k) \right)^2 \tag{2.6}

As a measure of correspondence – or synchrony – the mutual information is computed using the prior estimates:

I(A(t_k); V(x, y, t_k)) = -\frac{1}{2} \log \left( \frac{|\Sigma_{A+V}(x, y, t_k)|}{|\Sigma_A(t_k)| \cdot |\Sigma_V(x, y, t_k)|} \right) \tag{2.7}

where Σ_A(t_k) and Σ_V(x, y, t_k) are those parts of the joint covariance matrix Σ_{A+V}(x, y, t_k) that only contain the audio-audio and video-video (co-)variances. In the special case of n = m = 1, this formula can be simplified using the Pearson correlation coefficient ρ:


Figure 2.10: Slaney & Covell: FaceSync. Left: contribution of each pixel to the global correlation. Right: performance of synchrony detection. The highest correlation is obtained for a zero delay between audio and video. Illustration from [54]

I(A(t_k); V(x, y, t_k)) = -\frac{1}{2} \log \left( 1 - \rho^2(x, y, t_k) \right) \tag{2.8}

\rho(x, y, t_k) = \frac{\mathrm{cov}(A, V)}{\sqrt{\mathrm{var}(A) \cdot \mathrm{var}(V)}} = \frac{\sigma_{A,V}(x, y, t_k)}{\sqrt{\sigma_A(t_k) \cdot \sigma_V(x, y, t_k)}} \tag{2.9}

where σ_A, σ_V and σ_{A,V} denote the now scalar (co-)variances.

As this computation is done at every pixel in an image, the result is an image containing mutual information values – a so-called mixelgram [50]. The time window is thereby shifted by one, so that each video frame obtains a new mixelgram.
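The per-pixel analysis is straightforward to sketch in code. The following is a minimal illustration for the case n = m = 1, assuming Python with NumPy; the function name, the default window length s and the small constant eps are illustrative choices, not taken from [26]:

import numpy as np

def mixelgram(audio_energy, video_feature, s=16, eps=1e-12):
    """Per-pixel mutual information between a scalar audio feature and a
    per-pixel visual feature over the last s frames (n = m = 1 case).

    audio_energy:  array of shape (T,)       -- one value per frame
    video_feature: array of shape (T, H, W)  -- one feature image per frame
    Returns an (H, W) mutual-information map for the latest frame."""
    a = audio_energy[-s:]                       # (s,)
    v = video_feature[-s:]                      # (s, H, W)

    a_c = a - a.mean()                          # centered audio window
    v_c = v - v.mean(axis=0)                    # centered video window, per pixel

    var_a = (a_c ** 2).sum() / (s - 1)                          # scalar
    var_v = (v_c ** 2).sum(axis=0) / (s - 1)                    # (H, W)
    cov_av = (a_c[:, None, None] * v_c).sum(axis=0) / (s - 1)   # (H, W)

    # Squared Pearson correlation per pixel, then Eq. (2.8).
    rho2 = cov_av ** 2 / (var_a * var_v + eps)
    return -0.5 * np.log(1.0 - np.clip(rho2, 0.0, 1.0 - eps))

Applied once per incoming frame with the window shifted by one, this yields one mixelgram per video frame.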

Hershey and Movellan tested this approach with an experiment on sound-source localization: among two persons in the video, the algorithm should locate the sound source while the persons were speaking alternately (Figure 2.8). Only the intensity of the video pixels and the average acoustic energy were used as features (n = m = 1), within a time window of length s = 16 at 30 frames per second. To reduce the effects of noise, a threshold was applied to the mixelgram, so that all pixels with a lower mutual information were set to zero. A single position – the sound source – was iteratively estimated per frame via the center of gravity combined with a Gaussian influence function [26]. The localization results are illustrated in figure 2.9. In fact, the estimated sound location – and thus the amount of mutual information – is concentrated on the currently speaking person.

Canonical Correlation Analysis A widely used tool for multimodal analysis is Canonical Correlation Analysis (CCA) [5, 34, 54]. In contrast to Hershey & Movellan's method, CCA handles all pixels together and thus makes use of the correlations between pixels. The vector containing all visual information is now denoted as v(t) ∈ R^{m·H·W}, where H and W are the height and width of the video frames. Each vector a(t) and v(t) is projected into a one-dimensional space by multiplication with projection vectors w_a and w_v. The canonical correlation is then the Pearson correlation between


the scalar variables a(t)·wa and v(t)·wv:

\rho(t_k) = \frac{w_v^T \, \Sigma_{V,A}(t_k) \, w_a}{\sqrt{w_v^T \, \Sigma_V(t_k) \, w_v \cdot w_a^T \, \Sigma_A(t_k) \, w_a}} \tag{2.10}

w_a and w_v are found by maximizing the absolute value of ρ(t_k). In terms of sound-source localization, the projection vectors can be interpreted as the weight with which each feature, computed on each pixel, contributes to the global correlation expressed in ρ(t_k). Thus the synchrony of a single pixel and feature is expressed by the corresponding entry of the projection vector.
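For illustration, one way such projection vectors can be obtained is via whitening and a singular value decomposition. The following sketch assumes Python/NumPy; the ridge term reg and all names are illustrative, and the approach is only practical for moderate visual dimensions (such as a small mouth region), since Σ_V grows quadratically with the number of visual features:

import numpy as np

def canonical_correlation(A, V, reg=1e-6):
    """A: (T, n) audio features, V: (T, d) flattened video features.
    Returns the first canonical correlation and projection vectors (w_a, w_v)."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    T = A.shape[0]

    Saa = A.T @ A / (T - 1) + reg * np.eye(A.shape[1])   # audio covariance
    Svv = V.T @ V / (T - 1) + reg * np.eye(V.shape[1])   # video covariance
    Sav = A.T @ V / (T - 1)                              # cross covariance

    # Whiten both blocks; the strongest singular direction of the whitened
    # cross covariance then gives the first canonical pair.
    La_inv = np.linalg.inv(np.linalg.cholesky(Saa))
    Lv_inv = np.linalg.inv(np.linalg.cholesky(Svv))
    U, s, Vt = np.linalg.svd(La_inv @ Sav @ Lv_inv.T, full_matrices=False)

    w_a = La_inv.T @ U[:, 0]
    w_v = Lv_inv.T @ Vt[0]
    return s[0], w_a, w_v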

Canonical correlation has been used to measure the synchrony of audio with video data containing only the mouth of a speaking person in FaceSync [54]. For this special setup, Slaney and Covell used a previously learned projection of the video intensity data and acoustic features like MFCCs, LPCs and spectrograms. The approach was tested on detecting synchrony, i.e. on discriminating whether audio and video are synchronous or not. For that task they shifted the audio data against the video data in time. Assuming a priori synchronous streams, audio and video get gradually out of synchrony with a higher delay between them. A good detection performance thus yields a sharp correlation maximum at zero delay. Slaney and Covell report the best performance for MFCC audio features (see figure 2.10).

Kidron et al. [34] applied CCA to sound-source localization, highlighting an important problem of CCA: the choice of the projection vectors w_a and w_v is not unique if there are more feature dimensions than samples, which is the usual case in sound-source localization. A common method is to choose the solution vectors with minimum Euclidean length (l2-norm). Kidron et al. remark that this method tends to spread weights across more pixels than necessary and propose the use of the minimum l0-norm, which directly corresponds to the number of non-zero entries of a vector. Thus the localization is restricted to a minimum number of pixels.

2.3.5 Synchrony vs. Correlation

As soon as computational methods exist, it has to be asked to what extent they correspond to the original concepts in psychology. However, the psychology literature not only lacks a formalized concept of synchrony [26] but also contains many subtle differences in the interpretation of “synchrony”. A strict notion of synchrony is that it is synonymous with simultaneousness or simultaneity [55, 59]: different events start and end at the same time. In that fashion, two modalities can be said to be in synchrony when events are (repeatedly) presented at the same time in each modality. According to a prominent perspective in the literature, temporal synchrony is (as a concept) distinct from, e.g., a shared tempo or rhythm between different modalities [2, 38]; however, stimuli that are synchronous must share the same rhythm and tempo. Differences can arise, for example, from situations where a fixed temporal delay exists between the modalities. Another view of “synchrony” is that different modalities are used in parallel, but concrete stimuli are not necessarily simultaneous [47]. Most of the literature, however, lacks any definition of synchrony and merely suggests that synchrony


is a generic term for a close temporal relation. For this thesis I adopt the strict definition – that simultaneousness and synchrony are the same.

Nevertheless, on the signal level there are a priori no events, but a continuous stream of continuous values. Here correlation can be seen as a direct adaptation of synchrony to continuous signals: stimuli gain high correlation when the signal values decrease or increase simultaneously in several modalities. In fact, stimuli can also gain gradual correlation when a small temporal delay is introduced between them, but only when the delay is smaller than the basic period length of the stimuli. Noteworthy, this notion is (conceptually) to some extent orthogonal to the synchrony between events: (i) events in two modalities do not necessarily gain correlation, due to the concrete shape of the continuous signal, and (ii) correlation can also be present in the absence of anything considered an event. However, the differences depend on both the definition of an event and the choice of features. Events can, e.g., be defined as peaks in the signal values [42]. On the other hand, one can also use features that directly indicate events.


3 Synchrony for Audio-Visual Attention

The computational detection of audio-visual synchrony is the fundamental task within this thesis. In this chapter I present and discuss the approach used for this purpose.

3.1 Choice of Computation Model

Computational approaches to audio-visual synchrony detection (discussed in 2.3.4) fall into two groups. On the one hand, global methods like CCA jointly process all video and audio features; in particular, they process all image locations/pixels together and make use of correlations between them. On the other hand, local methods like Hershey's approach process each pixel separately. It is reasonable to assume that global methods can yield better results than local methods, since they make use of more information. Nevertheless, global methods have a critical drawback: the computation has quadratic complexity in space and at least quadratic complexity in time, because covariances have to be estimated between all pairs of features/pixels. Due to memory constraints, Kidron et al. report a practical bound of 3000 visual features [34]. Beyond hard memory constraints, computation time is a critical aspect: this thesis clearly aims at a system capable of realtime processing. As global methods fail due to their performance, I use the method proposed by Hershey & Movellan [26] as the basis for this thesis. Further motivation for this choice comes from a study presented in [50], in which Prince et al. demonstrate that Hershey's algorithm can basically model infants' abilities of synchrony detection.

3.2 Modifications

According to Hershey's method, the correlation between audio and video is estimated over a time window. In order to yield one mixelgram per frame, this time window has to be shifted by one and the correlation has to be recomputed. Though this is possible in constant time with respect to the window length s, it requires bookkeeping of the data: the past s frames have to be buffered. When t_k is the timestamp of the current data a(t_k) and v(x, y, t_k), all past data up to t_{k−s+1} contributes equally to the mutual information, while data from t_{k−s} and before is ignored. Though a rough choice of s can be argued for when facing a specific task, the exact value is completely arbitrary. Nevertheless it causes a hard horizon that is reflected in discontinuities of the mutual information over time (see figure 3.1).



Figure 3.1: Left: Exemplary signals over time: an impulse on the audio channel followed by a step in the video signal, sampled at 25 Hz. Right: Mutual information obtained with a hard window (s = 10) and with exponential smoothing (α = 0.1). The hard window causes a discontinuity when the window loses the audio impulse. In the absence of noise, an enduring correlation is established by exponential smoothing.

Statistical Estimation I propose an estimation using exponential smoothing over time instead of a hard time window. A constant weight α is given to the current data. Mean and variances are estimated by recursively combining the new data with all past data, which receives the weight (1 − α) (compare equations 2.5 and 2.6):

\mu_{A+V}(x, y, t_k) = \alpha \cdot \begin{pmatrix} a(t_k) \\ v(x, y, t_k) \end{pmatrix} + (1 - \alpha) \cdot \mu_{A+V}(x, y, t_{k-1}) \tag{3.1}

\Sigma_{A+V}(x, y, t_k) = \frac{1}{1 + \alpha} \left( \alpha \cdot \left( \begin{pmatrix} a(t_k) \\ v(x, y, t_k) \end{pmatrix} - \mu_{A+V}(x, y, t_{k-1}) \right)^2 + \Sigma_{A+V}(x, y, t_{k-1}) \right) \tag{3.2}

The estimates from the last timestep t_{k−1} are thereby already weighted, resulting in an effective weight of α · (1 − α)^i for the sample at t_{k−i}. Note that the computation of the mutual information from Σ_{A+V}(x, y, t_k) is not affected.

First of all, this approach makes bookkeeping of past data unnecessary. Second, it promotes a smooth estimation over time: though the choice of α is still heuristic, it does not cause discontinuities in time. Admittedly, smoothing causes a theoretically infinite horizon, which is not desirable; in practice, however, the influence of arbitrarily old samples on the correlation is effectively bounded by noise.
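For the scalar feature case (n = m = 1), the recursive update can be sketched as follows (Python/NumPy assumed; the class layout and the constant eps are illustrative, not a definitive implementation):

import numpy as np

class SmoothedCorrelation:
    """Recursive estimation of per-pixel audio-video correlation via
    exponential smoothing (Eqs. 3.1/3.2, scalar feature case)."""

    def __init__(self, shape, alpha=0.05, eps=1e-12):
        self.alpha, self.eps = alpha, eps
        self.mu_a = 0.0
        self.mu_v = np.zeros(shape)
        self.var_a = 0.0
        self.var_v = np.zeros(shape)
        self.cov_av = np.zeros(shape)

    def update(self, a, v):
        """a: scalar audio energy of the current frame, v: (H, W) feature image.
        Returns the (H, W) mutual-information map for this frame."""
        al = self.alpha
        da = a - self.mu_a            # deviations from the *previous* means,
        dv = v - self.mu_v            # as required by Eq. 3.2
        # covariance update of Eq. 3.2, applied entry-wise
        self.var_a = (al * da * da + self.var_a) / (1.0 + al)
        self.var_v = (al * dv * dv + self.var_v) / (1.0 + al)
        self.cov_av = (al * da * dv + self.cov_av) / (1.0 + al)
        # mean update, Eq. 3.1
        self.mu_a = al * a + (1.0 - al) * self.mu_a
        self.mu_v = al * v + (1.0 - al) * self.mu_v

        rho2 = self.cov_av ** 2 / (self.var_a * self.var_v + self.eps)
        return -0.5 * np.log(1.0 - np.clip(rho2, 0.0, 1.0 - self.eps))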

Signal Significance Pearson's correlation and mutual information are used as measures of interdependence and indicate the significance of that correlation. It is noteworthy that they do not take into account the significance of the signal itself, since they are invariant to shifts and scalings of the feature values. In fact, most pixels in an image


Figure 3.2: Top left: Original RGB frame from a test video. Top right: Mutual information obtained with image intensity and acoustic energy as features and α = 0.05 at 25 fps. Background noise causes intensive correlation artifacts. In the illustration, white corresponds to a mutual information of 0.51, which is a Pearson correlation of ±0.8; black indicates zero correlation. Middle left: A threshold of T_V = 10 on the video variance. Static background pixels are widely excluded. Middle right: An additional morphological erosion removes remaining background pixels and also outstanding noise pixels in regions with activity. Bottom left: Threshold on the correlation, ρ = 0.4 / I = 0.087. The interesting face region and many background regions are unaffected. Bottom right: A higher threshold of ρ = 0.6 / I = 0.223 removes the face region while some background artifacts remain.


are usually static apart from noise, thus providing no significant change over time. Nevertheless, those pixels can cause high correlation just by chance (see figure 3.2).

I propose a two-stage filter process to exclude insignificant visual stimuli and noise. The first stage excludes pixels without activity. As a measurement of activity I use the variance over time of each pixel; note that this variance is already available from the correlation estimation. If the variance of a pixel is below a specified threshold T_V, its mutual information is set to zero. Figure 3.2 illustrates the effect: large areas of stationary background are filtered out. Still, there is notable noise in regions that must be considered active. This noise results in single, outstanding pixels with high mutual information (figure 3.2, middle left). These single-pixel distortions are effectively handled by the second filter stage: a morphological erosion. Each pixel value is replaced by the minimum value of its direct neighborhood. Thereby, single outstanding pixels are completely erased, while massive regions of mutual information are retained (figure 3.2, middle right).
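A sketch of this two-stage filtering, assuming SciPy is available; the 3x3 neighborhood and the threshold value are illustrative:

import numpy as np
from scipy.ndimage import grey_erosion

def filter_mixelgram(mi, var_v, t_var):
    """mi:    (H, W) mutual-information map (mixelgram)
    var_v: (H, W) per-pixel temporal variance of the visual feature
    t_var: scalar variance threshold T_V
    Returns the filtered mixelgram."""
    # Stage 1: suppress pixels without activity.
    filtered = np.where(var_v < t_var, 0.0, mi)
    # Stage 2: grayscale erosion over the direct 3x3 neighborhood erases
    # single outstanding pixels but keeps solid regions.
    return grey_erosion(filtered, size=(3, 3))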

Hershey and Movellan propose a threshold applied to the mutual information itself [26]. Note that such a threshold cannot exclude insignificant stimuli (figure 3.2, bottom row).

3.3 Global Synchrony

An important empirical task is to compare different videos or parameter settings with respect to the degree of detected synchrony. Global synchrony detection mechanisms naturally provide such a measurement, since a single correlation value is computed over all available features and pixels. When using a local method, the mixelgram has to be broken down into a single scalar measurement of synchrony. Prince et al. already addressed this problem [50]. They suggest to first filter the mixelgram with a 15x15 Gaussian filter, followed by a 3x3 Sobel edge-detection filter; subsequently all pixel values are summed up to a scalar measure. The Gaussian filter is thereby intended to reduce noise. In fact this is inappropriate on a noisy image, since the values are only blurred and lose nothing of their mass. Here I instead use the threshold on the video variance and the morphological erosion (introduced in the last section), which have been shown to handle noise effectively.

The first and most obvious measure is the arithmetic mean on the filtered mixelgram:

S_{\mathrm{all}}(t_k) = \frac{1}{H \cdot W} \sum_{x=1}^{W} \sum_{y=1}^{H} I(x, y, t_k) \tag{3.3}

This method gives equal weight to each pixel – both pixels with and without activity. Thus the result highly depends on the size of the currently active image regions, i.e. on the number of pixels passing the filters. When the existence of a motion source can be assumed, it is reasonable to ignore the pixels that were set to zero during filtering. The second synchrony measure is therefore the average of all positive, i.e. non-zero, pixels in the filtered mixelgram:

S_{\mathrm{pos}}(t_k) = \frac{1}{|P(t_k)|} \sum_{(x,y) \in P(t_k)} I(x, y, t_k), \qquad P(t_k) = \{(x, y) : 0 < I(x, y, t_k)\} \tag{3.4}
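Both measures are straightforward to compute from a filtered mixelgram; a minimal sketch (Python/NumPy assumed):

import numpy as np

def global_synchrony(mi):
    """mi: (H, W) filtered mixelgram. Returns (S_all, S_pos) of Eqs. 3.3/3.4."""
    s_all = mi.mean()                      # every pixel counts, active or not
    positive = mi[mi > 0]                  # P(t_k): pixels that passed the filters
    s_pos = positive.mean() if positive.size else 0.0
    return s_all, s_pos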


4 Features for Synchrony

This chapter deals with the investigation of several image features for the proposed synchrony detection mechanism. The audio energy (the average squared sample value) is thereby fixed as the feature for the audio stream. Audio energy serves as a general-purpose feature indicating loudness: any kind of qualitative information about the sound is lost, but no application-specific assumptions have to be made about the kind of sound.

4.1 Introduction

The most commonly used features for audio-video correlation are intensity/grayscale images [26, 50, 54]. Also, difference images have been used as a basis for correlation analysis [34]. Beyond such methods using single pixels, holistic methods have been applied [5] that take whole image regions into account. However, there is still a need for a broad, quantitative analysis of different features for audio-video correlation. Since different features show heavily different behavior in their responses over both time and space, it is not a priori clear which one is appropriate for relating to an audio signal. As sound can be assumed to originate from a dynamic source of motion in the image, it seems reasonable to investigate image features that express this motion [5, 9]. It is thereby important to note that the use of dynamic features is not necessary for a correlation – even though correlation is driven by the dynamics. In fact, also features computed statically on a single frame yield a change over time that can be used for correlation.

4.1.1 Investigated Features

To provide a basis for comparisons, intensity images are used as a feature in this thesis. Though they have been used with some success in several works, it is hard to interpret a correlation of grayscale values with acoustic energy. An artificial example of such signals is shown in figure 3.1. Acoustic energy typically provides sharp maxima while staying close to zero for the rest of the time. A single, short movement in the

Vertical-edge mask:      Horizontal-edge mask:
 -1   0  +1               +1  +1  +1
 -1   0  +1                0   0   0
 -1   0  +1               -1  -1  -1

Figure 4.1: Linear filter masks of the Prewitt filter. Left: filter responsive to verticaledges. Right: filter responsive to horizontal edges.


Name     Description                                 Variance Threshold
int      Image intensity                             5
diff     Difference images (absolute values)         1
rg       Red-green distance                          5
by       Blue-yellow distance                        5
oVert    Vertical-edge image (Prewitt filter)        5
oHor     Horizontal-edge image (Prewitt filter)      5
flowXp   Optical flow, x-component, positive part    0
flowXn   Optical flow, x-component, negative part    0
flowYp   Optical flow, y-component, positive part    0
flowYn   Optical flow, y-component, negative part    0
sal      Saliency map, using all above features      Mean pixel variance

Table 4.1: List of all features investigated for their use in audio-visual correlation.

image causes some intensity values to increase and some to decrease and remain at that level afterwards. The two features thus show very different characteristics in time. Difference images exhibit a more similar temporal behavior, as they yield high positive peaks when motion is present while remaining at low numerical values without motion. Here the absolute difference D between two succeeding intensity images I is taken into account as a feature:

D(x, y, t) = \left| I(x, y, t) - I(x, y, t-1) \right|

Thus acoustic energy can be correlated with the presence of motion. Nevertheless, difference images are limited in expressing motion, since they do not take into account the strength or speed of motion in the image, but only the per-pixel change of intensity. A better approach to motion is given in terms of optical flow. Optical flow finds the geometric displacement of each pixel between two images. When applied to two succeeding frames of a video sequence, the result is a 2d velocity vector (v_x(x, y, t_k), v_y(x, y, t_k)) for each pixel. This vector indicates that the pixel found at position (x, y) in the current image (t_k) was located at (x − v_x(x, y, t_k), y − v_y(x, y, t_k)) in the last frame (t_{k−1}). Thus the vector expresses the velocity of the geometric movement between two frames. A correlation with acoustic energy can then easily be interpreted as the coincidence of fast or slow movements in the image with (loud) sounds.
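To make the feature choices concrete, the following sketch computes several of the per-pixel features from table 4.1 for a pair of consecutive frames (OpenCV and NumPy assumed; the dense Farnebäck flow is used here only as a stand-in for the Horn & Schunck algorithm employed later in this thesis, and all parameter values are illustrative):

import cv2
import numpy as np

def frame_features(prev_gray, gray):
    """prev_gray, gray: consecutive grayscale frames as uint8 (H, W) arrays.
    Returns a dict of per-pixel feature images for audio-video correlation."""
    g0 = prev_gray.astype(np.float32)
    g1 = gray.astype(np.float32)

    feats = {"int": g1}
    feats["diff"] = np.abs(g1 - g0)          # absolute difference image D

    # Dense optical flow, split into positive/negative x- and y-parts
    # (Farneback here as a stand-in for Horn & Schunck).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0], flow[..., 1]
    feats["flowXp"], feats["flowXn"] = np.maximum(vx, 0), np.maximum(-vx, 0)
    feats["flowYp"], feats["flowYn"] = np.maximum(vy, 0), np.maximum(-vy, 0)

    # Prewitt edge images (cf. figure 4.1).
    kv = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32)
    kh = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=np.float32)
    feats["oVert"] = cv2.filter2D(g1, -1, kv)
    feats["oHor"] = cv2.filter2D(g1, -1, kh)
    return feats   # rg, by and the saliency map are omitted for brevity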

A further idea in this thesis is to investigate saliency maps as a feature for audio-video correlation. Saliency expresses the (hypothetical) importance of an image region in relation to its surroundings. It seems possible that image regions corresponding to a sound source become more salient when such sound is emitted. It shall therefore be investigated whether this assumption is reflected in real video sequences and yields a significant correlation. As saliency maps do justice to a broad bandwidth of image


features, all those features used in saliency maps are investigated. This includes intensity images, but also the color differences RG and BY and orientation images (see section 2.2.2). Thereby I use an existing saliency map implementation [45] that contains some slight modifications and extensions to the original model proposed by Itti & Koch:

• For the orientation images, Prewitt filters are used instead of Gabor filters. Prewitt filters (see figure 4.1) are linear filters responsive to vertical or horizontal edges and are very similar to the well-known Sobel filter.

• Difference images between succeeding frames are used to take motion into account. They are used with the same computational scheme as intensity images: (i) difference of Gaussians, (ii) normalization and (iii) cross-scale addition (compare equations 2.1 and 2.4).

• Optical flow is additionally used. The flow is computed with the algorithm proposed by Horn & Schunck [27]. The resulting vector field is split into its x- and y-components, and each component is split into its positive and negative part, thus yielding four maps indicating movement upwards, downwards, left and right. The four maps are integrated into the saliency map in the same way as the four orientation maps (compare equations 2.3 and 2.4).

Originally, the difference-of-Gaussians computation is done on scales c ∈ {2, 3, 4} and s = c + δ, δ ∈ {3, 4}. Since small movements (e.g. of the mouth) cannot be resolved on these coarse scales, I use the finer scales c ∈ {1, 2, 3} and δ ∈ {1, 2} for the experiments described in this chapter.

The complete set of investigated features is shown in table 4.1. For the variance-threshold method proposed in section 3.2, an individual threshold has to be chosen for each feature, since the numerical range and the distribution of values differ largely among the features. All thresholds (also listed in table 4.1) were chosen very conservatively, but high enough to filter out the typical background noise contained in the investigated scenes. Difference images thereby need a lower threshold than the corresponding intensity signal: for instance, a step in the intensity signal provides a higher temporal variance than the corresponding impulse in the difference signal. The optical flow components were used without any threshold – which is the same as a threshold of zero. The flow estimation itself is not very sensitive to noise, and a threshold has proven unnecessary on the test videos; noise infections of single pixels are still effectively handled by the subsequent morphological erosion. For saliency maps, it is a priori difficult to define a reasonable threshold, since saliency values have a very high dynamic range that also depends on the global scene setting. Therefore the threshold was chosen automatically for every individual frame: per frame, the threshold is set to the average variance σ_V(x, y, t_k) over all pixels (x, y) ∈ [1..W] × [1..H] of the image:

T_V(t_k) = \frac{1}{H \cdot W} \sum_{x=1}^{W} \sum_{y=1}^{H} \sigma_V(x, y, t_k)

This process guarantees that in each frame the pixels with the highest variance pass the filter.



Figure 4.2: The artificial test video contains a black rectangle that moves from left to right after staying on the left side for 2 seconds. The movement takes 5 frames and is accompanied by a beep. With difference images as feature, the expected synchrony maximum appears at zero delay.

4.2 Detection Experiment

The fundamental task of any synchrony mechanism is to discriminate situations with and without synchrony – i.e. synchrony detection. While local methods – and thus Hershey's method – perform detection on each isolated pixel, it is also important to decide on a global level whether synchrony is present or not. In section 3.3, I proposed two methods to break down a mixelgram into a scalar value expressing the global synchrony in each video frame. Using these methods, different features can be compared in their detection performance by taking into account situations that are a priori known or assumed to be synchronous or asynchronous. A high amount of mutual information should be obtained for known synchronous conditions and a low amount for asynchronous ones.

4.2.1 Method

It is hardly possible to determine a numerical ground truth for the degree of synchrony contained in a data set. This complicates the comparison between distinct videos or settings, as it is not clear what the result should be. To avoid this problem I use an approach proposed by Slaney & Covell [54] (see also fig. 2.10). The idea is to use a single video (containing e.g. a speaking person) and introduce a delay between the audio and the video channel. Synchrony within the entire video is measured for each delay. As audio and video are assumed to be in synchrony a priori, an algorithm should deliver a high result for a delay of zero. As audio is increasingly delayed against video (positive delay) or vice versa (negative delay), the algorithm should yield lower values


(a) The filtered mixelgram shows correlation at the mouth.

(b) Without any filtering (left), much mutual information is distributed on the background. Still, without erosion (right), noise and minimal movements have a significant impact.

(c) Little mutual information is found on the mouth in this situation.

(d) Here most mutual information is found on the shirt.

Figure 4.3: Setup and example mixelgrams from the detection experiment.


of synchrony, since both streams can no longer be assumed to be synchronous. When plotting the delay against the measured synchrony, the result should be a significant peak at a delay of zero.

For this investigation, a scalar value must be assigned to a whole video for specified features and parameters. In order to break a single frame down to a scalar, I use the proposed measurement S_pos(t_k) (see equation 3.4). It is more appropriate than S_all(t_k) under the conditions of this experiment, since the existence of a motion source is given and the measurement directly expresses the average synchrony within that motion source rather than its size. The values of S_pos(t_k) are averaged over the frames to yield a scalar measure of synchrony for an entire video:

S_{\mathrm{pos}} = \frac{1}{N - k_0 + 1} \sum_{k=k_0}^{N} S_{\mathrm{pos}}(t_k) \tag{4.1}

Here N is the number of frames in the video and k_0 > 1 is an offset from the first frame. This offset is necessary since the exponentially smoothed estimation of means and variances needs some frames to become robust. In particular, the existence of just two samples at t_2 constitutes a perfect correlation for non-constant patterns. This corresponds to a singular, infinite mutual information that would numerically dominate the rest of the video.
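The delay evaluation can be sketched as follows. It reuses the routines sketched in chapter 3 (SmoothedCorrelation, filter_mixelgram, global_synchrony); those names, the shifting convention and the threshold value are illustrative:

import numpy as np

def delay_curve(audio_energy, video_frames, delays, alpha=0.05, k0=26):
    """Average S_pos of a whole video for each audio-video delay (in frames).

    audio_energy: (T,) per-frame audio energy
    video_frames: (T, H, W) per-frame visual feature images
    delays:       iterable of integer delays; positive = audio delayed
    Returns a dict mapping each delay to the average S_pos."""
    results = {}
    T = len(audio_energy)
    for d in delays:
        # Shift audio against video by d frames and keep only the overlap.
        a = audio_energy[max(-d, 0):T - max(d, 0)]
        v = video_frames[max(d, 0):T - max(-d, 0)]
        corr = SmoothedCorrelation(v.shape[1:], alpha=alpha)
        s_pos_values = []
        for k in range(len(a)):
            mi = corr.update(a[k], v[k])
            if k >= k0:                              # skip the burn-in frames
                mi = filter_mixelgram(mi, corr.var_v, t_var=5.0)
                s_pos_values.append(global_synchrony(mi)[1])
        results[d] = float(np.mean(s_pos_values))
    return results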

The method was tested on an artificial test video (see fig. 4.2). In the video, a five-frame-long movement is presented simultaneously with a beep tone of the same duration. Indeed, the procedure (here with difference images as feature) yields the most mutual information for a delay of zero. The values decrease smoothly as long as movement and beep tone still overlap, and stay close to zero for larger delays.

For the actual experiment, six videos with different subjects were recorded. Each subject is sitting directly in front of the camera (see fig. 4.3). Subjects were instructed to count from one to ten loudly and clearly. They were not instructed to take care of synchrony in any way, but were told not to gesture. Due to different rates of speech, the videos vary in length between 9 and 17 seconds; the total amount of video data amounts to 74 seconds. The videos were recorded with an HDV camera standing on a tripod. The original video resolution is 1440x1080 at 25 Hz and was downsampled to 536x424 for the experiments. Audio was sampled at 48000 Hz. A frame evaluation offset of k_0 = 26 was used, thus leaving out exactly the first second of each video. The videos were cut such that the first utterance was just not contained in this time span. Hence the amount of used data amounts to 68 seconds of video.

4.2.2 Results

In the experiment, S_pos was evaluated for all six videos, all listed features and delays between −25 and +25 frames. A delay of +25 frames means that the audio is delayed by one second, since the videos have a framerate of 25 fps; a delay of −25 means that the video frames are delayed by one second. Figures 4.4 and 4.5 show the most relevant results. The plots show the result curves for all six



Figure 4.4: Detection results for intensity images (top), difference images (middle) and the positive y-component of the optical flow (bottom). The average mutual information S_pos is plotted against the introduced delays. The single videos are shown in green and the average of those curves in red.



Figure 4.5: Detection results for vertical edges (top), horizontal edges (middle) and saliency maps as feature (bottom). The average mutual information S_pos is plotted against the introduced delays. The single videos are shown in green and the average of those curves in red.


videos on their own and the average of all six curves. All results discussed here were computed with α = 0.05. Furthermore, all features were also tested with α = 0.1. Both values roughly fall into the range of the window sizes s = 16 (used in [26]) and s = 32 (used in [34]), though values of s and α are not directly comparable.

The most obvious observation is that no feature yields a sharp synchrony maximum at a delay of zero. This can partially be explained by the structure of the video data: all six subjects counted in a heavily rhythmic manner. Thus periodic maxima can be expected, e.g. for a delay that shifts the spoken word “one” to the visual percept of “two”, the spoken “two” to the visual “three”, and so on.

Intensity yields a smooth, periodic structure for each video (fig. 4.4, top). Averaged over all videos, a local maximum can be found at a delay of +3; however, there are stronger maxima at −7 and at +13. The global maxima of the single videos vary largely in their position and show no trend towards a delay of zero. The plots for difference images (fig. 4.4, middle) and optical flow (fig. 4.4, bottom) show an unsteady behavior: peaks are unsystematically distributed across the plot for both features. The optical flow thereby yields fewer, but more significant peaks per video. Noteworthy, both features attain lower mutual information than intensity images do. Only flowYp is shown here, which expresses downward movements, e.g. of the lower jaw when opening the mouth; the other flow components show comparable characteristics.

The average curves for oVert and oHor (fig. 4.5, top and middle) show characteristics similar to the intensity image results: a smooth, periodic shape with a maximum at +3 among other maxima. For oVert the adjacent maxima are weaker than for intensity images; the opposite holds for oHor. Saliency maps provide a rather poor performance (fig. 4.5, bottom): the plot shows properties similar to difference images and optical flow. Noteworthy, diff, flow and sal not only show unsystematic, noisy behavior, but also generally lower mutual information values than int, oVert and oHor.

Results between those extremes were found for RG and BY. Both features provide a temporal and spatial structure similar to intensity images, since all of them are continuous transforms of the original RGB values of each pixel. The full set of results for all features and α ∈ {0.1, 0.05} can be found in appendix A.

Some mixelgrams from the experiment are shown in figure 4.3. In some situations, mutual information is precisely concentrated on the mouth (top row). Thereby the two filter stages prove to be very useful: the second row of images shows the completely unfiltered mixelgram. The variance threshold perfectly removes noise in stationary background regions of the image, and the morphological erosion further filters minimal torso movements and noise within active image regions like the face. However, in some situations no significant mutual information is detected on the face region (see the third and fourth image rows in fig. 4.3). In particular, highly structured clothes cause much activity and can attract mutual information.

The experimental setup has also been tested with larger thresholds and without erosion, both with worse results. Further, S_all was tested as measurement. The results for each video are thereby comparable, but the videos diverge in their total amount of mutual information, which complicates the comparison between them.


4.2.3 Discussion

In their original experiment, Slaney and Covell report a very significant discrimination between synchronous and asynchronous conditions [54]. In direct comparison to their results, discrimination clearly fails for all features in this experiment. However, such a comparison is misleading. Slaney and Covell used video data that contains only the mouth region of a speaking person. For this they used previously trained visual features and highly speech-specific acoustic features. They used CCA as the synchrony detection algorithm, which directly yields a scalar estimate of synchrony for a whole time window of frames (see section 2.3.4). Since the videos contain only the mouth without any lateral movements, they could use a single time window that spans the whole video. Also, they used only a single test video of 9 seconds length. Hence the results presented in this thesis must be considered with the additional challenge of integrating mutual information across pixels, frames and videos.

When considering the content of the videos, it is remarkable that the overall synchrony contains mutual information at the mouth or head, but also on the torso. Torso movements heavily interfere with the desired mouth synchrony in some test videos (see fig. 4.3). This partially hides insights into the temporal structure between mouth movements and sound, as they were gained by Slaney and Covell. Nevertheless, the task is realistic with respect to the scenario of this thesis: an algorithm should also detect synchrony in the presence of concurrent movements, and it should deal with synchrony between any kind of visual and auditory stimuli, not just mouths and voices. Retrospectively, only the concrete task for the subjects has to be criticized. The instruction to count from one to ten caused heavily rhythmic patterns throughout all six videos. This was originally intended to provide a very simple task as a baseline for further investigations. However, it induces periodic maxima of synchrony along different delays. As the positions and distances of those maxima vary between the subjects, this significantly complicates the interpretation of the results.

Despite the more complex setting of this experiment, important insights can be gained about the algorithm and the features. Difference images and optical flow as dynamic features clearly fail to discriminate between synchronous and asynchronous conditions in this setup; the same holds for saliency maps. No definite conclusions can be drawn about intensity and edge images. Taken together, no expressive discrimination performance could be attained in this experiment. Among the discussed issues, also the power of Hershey's algorithm itself must be considered as a reason: since it makes no use of correlations among different image pixels, it must be considered less powerful than e.g. CCA.

4.3 Localization Experiment

The detection setup in the last section deals with the question whether synchrony between audio and video is present or not. Such a measurement can e.g. indicate the presence of a sound source in the video stream; to satisfy this task, mixelgrams have to be reduced to a scalar value of synchrony. However, the most natural question a


mixelgram can answer is where a sound source is located within an image. In fact, this ability makes a mixelgram valuable for an attention system, since attention can then be guided towards probable locations of sound sources. The experiment described in this section deals with the question how well the image features perform when mixelgrams are used to locate a sound source.

4.3.1 Method

To evaluate the localization performance of their algorithm, Hershey and Movellan used an estimation of a single position (S_x, S_y)^T in the image that reflects the location of a sound source [26]. This location is estimated by iteratively computing the center of gravity of the mixelgram and applying a Gaussian influence function around it; thereby the location converges to an area with a high mass of mutual information. The trajectory of that position was qualitatively compared to a ground-truth trajectory (see fig. 2.9). This method has two major drawbacks. First, the location estimation does not take into account the absolute values of single pixels, but only the mass within the Gaussian influence function; thereby huge regions with low mutual information can outperform smaller regions with much higher mutual information values. Second, the evaluation does not take into account the robustness of the estimation: when facing two mixelgram regions with a similar mass of mutual information, small changes in pixel values can cause volatile changes of the estimated position.

The approach used for this thesis is inspired by the work of Kidron et al. [34], [35]. The central idea is to compare the amount of mutual information within image regions that are known a priori. For example, one can compare the region that is known to contain a sound source against the rest of the image. To obtain a challenging task, a visual distractor is placed in the scene that causes motion in the image but (intentionally) no synchrony with the sound. This approach directly addresses the second problem, since not only the region with the most mutual information is determined, but also the relation between the regions. The first problem can be addressed with the choice of the scalar measurement for each region. Here I use both S_all(t_k) and S_pos(t_k): S_all(t_k) corresponds to the absolute mass of mutual information when the compared image regions have the same size, and can thus make predictions for a location estimator like the one used by Hershey. S_pos(t_k) expresses the average mutual information of active (i.e. moving) pixels and thus the degree of synchrony independent of the size of the moving objects.

For this experiment I used recordings of the same six subjects as in the detection experiment. Overall, the video data amounts to 32 seconds, with 5-7 seconds per video. Again, the videos were recorded with a resolution of 1440x1080 pixels at 25 Hz and audio sampled at 48000 Hz; the videos were downsampled to 536x424 for the experiments. In the setup, the subjects are sitting on the right side of the image while talking into the camera (see fig. 4.6). A ventilator is placed on the left side and used as a distractor (as also used in [42]). The ventilator produces noise in the image as a result of its rotor movement, produces a regular movement as the rotor is turned around the vertical axis, and moreover causes acoustic noise. In fact the ventilator is also a sound source, but due to its purely noisy sound there should be no statistically


(a) Good localization performance can be gained in this situation: much mutual information is concentrated on the face and much less on the ventilator.

(b) The same mixelgram without filtering (left) and with the activation filter but without erosion (right).

(c) Here most mutual information is distracted onto slight hand movements.

(d) In some situations the ventilator attracts the most mutual information.

Figure 4.6: Setup and example mixelgrams from the localization experiment.


significant relation to the video signal. Hence most mutual information should be concentrated on the human subject.

The setup is intentionally designed to contain well-separated and precisely defined regions that can be compared with respect to their synchrony. Here the sound source and the distractor are well separated on the left and right side of the image. To yield a scalar measure of synchrony per region, I use S_all(t_k) and S_pos(t_k), where S^{left|right}_{all|pos}(t_k) denotes the value on the corresponding side of each frame. As a measure of the localization performance I use the proportion of synchrony contained in the right side of each frame, and its average over all frames of the video:

L_{\mathrm{all|pos}}(t_k) = \frac{S^{\mathrm{right}}_{\mathrm{all|pos}}(t_k)}{S^{\mathrm{left}}_{\mathrm{all|pos}}(t_k) + S^{\mathrm{right}}_{\mathrm{all|pos}}(t_k)} \tag{4.2}

L_{\mathrm{all|pos}} = \frac{1}{N} \sum_{k=1}^{N} L_{\mathrm{all|pos}}(t_k) \tag{4.3}

The values of this proportion range from 0.0, where all mutual information is concentrated on the left side, up to 1.0, where mutual information is only contained in the right side of the video. Hence high values of L_all and L_pos indicate a good localization performance.

Remarkably, this differs from the detection evaluation in that the comparison is first done within each frame and then integrated over the video. This scheme prevents frames with high mutual information on both sides from dominating frames with less mutual information. It is therefore needless to ignore the first frames as in the detection experiment; in fact, each frame is equally reflected in the localization performance.
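Computing these proportions from a sequence of filtered mixelgrams is straightforward; a minimal sketch (Python/NumPy assumed; the image is split at its horizontal center, so an even frame width is assumed):

import numpy as np

def localization_scores(mixelgrams):
    """mixelgrams: (N, H, W) filtered mutual-information maps of a video.
    Returns (L_all, L_pos) of Eqs. 4.2/4.3, comparing right vs. left image half."""
    l_all, l_pos = [], []
    for mi in mixelgrams:
        left, right = np.hsplit(mi, 2)               # split at the image center
        s_all = (left.mean(), right.mean())
        s_pos = tuple(h[h > 0].mean() if (h > 0).any() else 0.0
                      for h in (left, right))
        l_all.append(s_all[1] / (s_all[0] + s_all[1] + 1e-12))
        l_pos.append(s_pos[1] / (s_pos[0] + s_pos[1] + 1e-12))
    return float(np.mean(l_all)), float(np.mean(l_pos))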

4.3.2 Results

The full set of results is shown in table 4.2. All values are averaged over the six videos. The first remarkable result is that L_all does not exceed 0.5 for a single feature. Further, L_pos has higher numerical values than L_all for all features. With respect to the different features, oHor yields the best performance for both measurement methods and for both α = 0.05 and α = 0.1. Good results are also obtained for int, rg and oVert. Only those features have values of L_pos significantly higher than 0.5, which indicates higher synchrony values on the right side of the video. Similar to the detection experiment, optical flow and saliency do not provide a good performance. Compared to the detection experiment, difference images perform quite well in this setup; however, their values of L_pos are too close to 0.5 to attest a good performance.

Again, the two-stage filtering of the mutual information visibly enhances the results (see fig. 4.6). In particular, the erosion reduces the impact of the ventilator, since noise is robustly filtered.


             L_all                       L_pos
Feature      α = 0.05     α = 0.1        α = 0.05     α = 0.1
int          0.424        0.392          0.544 (2.)   0.568 (2.)
diff         0.428 (3.)   0.404 (3.)     0.501        0.506
rg           0.465 (2.)   0.440 (2.)     0.536 (3.)   0.521
by           0.268        0.230          0.448        0.372
oVert        0.352        0.346          0.524        0.539 (3.)
oHor         0.478 (1.)   0.444 (1.)     0.613 (1.)   0.625 (1.)
flowXp       0.124        0.110          0.317        0.291
flowXn       0.123        0.111          0.355        0.342
flowYp       0.165        0.153          0.468        0.481
flowYn       0.163        0.150          0.442        0.440
sal          0.208        0.208          0.330        0.329

Table 4.2: Results of the localization experiment (averaged over the six videos). Values above 0.5 are shown in black, values between 0.25 and 0.5 in gray, and values below 0.25 in light gray. For each of L_all/L_pos and α ∈ {0.05, 0.1}, the three best features are marked.

4.3.3 Discussion

This experiment has uncovered several important insights about the different features and the task of sound-source localization. The resulting values of L_all and L_pos are easy to interpret, since high values directly indicate a good performance. Thereby edge images show the best results, followed by intensity images. The best result for L_pos was 0.625 for oHor, indicating that the mutual information values on the right side were on average 1.67 times higher than on the left side. Hence a robust estimation of the sound-source location is possible in this scenario. However, a localization estimate like the one proposed by Hershey & Movellan is attracted by the mass of mutual information, and the results for L_all show that the center of gravity is located on the left side, which would mislead the localization. Hence the results show that localization should consider the highest values of mutual information, not the region with the highest mass.

The setup used for this experiment contains two challenges. First, the ventilator produces high temporal variability in the left half of the images; the corresponding image areas thus pass the variance threshold and can potentially contain a huge mass of mutual information. Second, the right side of the video also contains slight torso or arm movements. In most situations these movements cause low mutual information values, leading to a decreased value of L_pos and higher values of L_all. Nevertheless, these movements also attain high mutual information in some situations (see fig. 4.6).


4.4 Conclusions

The goal of the experiments presented in this chapter was to find good image features for audiovisual correlation. Taken together, edge images provided the best performance; intensity images also performed reasonably. An important outcome of the experiments is that temporal high-pass features (i.e. difference images and optical flow) failed to achieve a significant discrimination between synchronous and asynchronous stimuli, although it has been argued [5, 9] that such dynamic features are better suited to audio and in particular speech signals. In fact, difference images performed well on an artificial test video (see fig. 4.2). This observation is fully consistent with the argumentation in the literature, but contrasts with the weak performance on real video sequences. My hypothesis on these outcomes is that both features fail due to temporal noise – which is only contained in the real videos. Though temporal noise is also present in static features, it is reinforced by dynamic features. Edge images, on the other hand, are a spatial high-pass, reinforcing the spatial noise contained in each frame. Here the morphological erosion has proven to be a proper means against noise. However, such filtering cannot easily be translated to temporal noise filtering: the mutual information computation itself would have to be radically changed, whereas spatial filtering can be applied separately after the mutual information detection is done. Saliency maps also provided no expressive correlation. Two problems can be identified for saliency maps: first, they cannot resolve small movements, e.g. of the mouth, due to the downsampling procedure. Second, the normalization operation introduces a global influence of single pixel values; changes at any position in an image can thus interfere with each pixel's audio correlation.

Even for the best-performing features, the results were not brilliant in either experiment. This leads to the observation that Hershey's method is quite limited in expressing synchrony. A psychological obstacle in the interpretation is that the mixelgrams can look very reasonable even for inappropriate features: in fact, any kind of movement causes some degree of mutual information to appear. In the presence of only one motion source this can be misleading: when most mutual information in a frame is concentrated on that source of motion, one tends to interpret this as indicating synchrony. However, the absolute values of mutual information have to be considered – not just their distribution within the frame.


5 Synchrony in multimodal Motherese

The second subgoal of this thesis is to investigate the use of synchrony in a social learning scenario in terms of child-directed communication. Here I first give a comprehensive overview of relevant research on child-directed communication in the next section. The basic hypothesis is that tutors (parents in this case) provide additional learning cues through synchrony. The hypothesis is tested by comparing the degree of audiovisual synchrony between adult- and child-directed communication (section 5.2). In order to use such synchrony cues provided by a tutor, it is important to understand what is synchronous and when; therefore I discuss several examples of the spatial distribution of synchrony in section 5.3. I close the chapter with some conclusions and an outlook on possible future investigations in section 5.4.

5.1 Introduction

As children learn about their environment, they not only passively observe their surroundings but also interact with them. A special aspect of human learning is thereby the tutoring by parents, or caregivers in general [52]. Though tutoring can help to structure the environment, the interpretation of tutoring signals is still a complex problem. It has been shown that parents modify their behavior in child-directed communication, providing cues that ease interpretation and learning.

5.1.1 Motherese

The best-studied modification in child-directed communication is the modification of speech, known as motherese or child-directed speech (CDS). Parents use a simplified syntax and meaning and make more pauses between propositions [16]. Also the prosody is modified compared to adult-directed communication, creating the highly melodic character of motherese: parents e.g. emphasize important words with exaggerated pitch peaks. While this is also generally used in adult-directed speech, it was shown to be applied more consistently in CDS [19]. Motherese thereby has three important functions [12]: first, it arouses the child's attention and directs it to important parts of the speech signal. Second, it eases the interpretation of emotional signals. Third, it highlights the linguistic structure of the speech signal. The focus shifts between these functions as infants grow up and become more adept [12]. In parallel, motherese also becomes more complex, adapting to the development of the child [16].


Figure 5.1: In adult-directed communication, motion trajectories follow smooth, economic paths. In child-directed communication parents typically perform more angular, expansive motion trajectories. Illustration from [52].

5.1.2 Motionese

Apart from speech, it was recently shown that parents use modified patterns of motion in child-directed communication. In the style of “motherese”, this phenomenon is called motionese. Such modifications in action were first described by Brand et al. in 2002 [4]. They investigated the scenario of object demonstration: parents should show another person how to interact with an object. The characteristics of motion were compared between the demonstration towards infants at the age of 6-13 months and the demonstration towards other adults. They showed (inter alia) that actions were performed in closer proximity to infants and that parents used broader, more expansive movements when communicating with infants. For their analysis they used a manual, subjective encoding of those parameters. The first investigation with objective parameters was presented by Rohlfing et al. in 2006 [52]. Here an automatic 3D body-tracking was used to reconstruct and analyze movement trajectories during object demonstrations. It was found that parents make longer pauses between movements and that movements follow less round trajectories (see fig. 5.1) when objects are demonstrated to 8-11 month old infants.

These results suggest that motionese enhances and guides the infant's attention through expansive gestures [4] and helps to decompose sequences of actions through pauses and more angular movements [52]. In fact these modifications appear to be similar to those in child-directed speech. Brand et al. therefore hypothesized that both motherese and motionese are examples of a broader child-directed communication scheme.

5.1.3 Multimodal Motherese

Beyond modifications in speech and motion alone, their interplay has also been investigated. Though communication via speech and gestures is also coordinated in adult-directed situations, it has been questioned whether parents provide additional cues towards children in such multimodal situations [52]. When infants are, for example, learning new relations between spoken words and objects, they have to identify the seen object – among other objects – that is related to the heard word. A synchronous presentation of word and object can help to relate the stimuli from vision and hearing. Gogate et al. investigated this synchrony between object presentation and verbal labeling [21]. In their study, mothers were asked to teach their children novel nouns and verbs. Using a manual coding system, they discriminated situations in which target words were presented (i) synchronously to a motion of the referent object, (ii) asynchronously to an object motion, (iii) without any object motion and (iv) while the infant was holding the referent object. The evaluation (see fig. 5.2) clearly shows that parents most often used a synchronous presentation towards prelexical infants (5-8 months old). This observation is consistent with the hypothesis that synchrony is used to highlight the word-object relations. Further, the results show that the relative rate of synchronous presentations decreases for older infants. This suggests that the multimodal communication behavior towards infants is adapted to the infants' increasing ability to find out word-object relations on their own [21].

Figure 5.2: Parents most often present a novel word in synchrony with a motion of the referent object. The effect is most intense for prelexical children and decreases with higher lexical abilities of the infants. Illustration from [21].

The use of temporal synchrony to guide young infants' attention to the correct objects is highly plausible in the context of the Intersensory Redundancy Hypothesis [2] (see also section 2.3.3). Similar to child-directed speech [12], it has been demonstrated that multimodal temporal synchrony can both arouse and direct attention.

5.2 Quantitative Analysis of Synchrony

Gogate et al. demonstrated that parents tend to produce high synchrony between visual and auditory stimuli in child-directed communication. It has been reasonably argued that such synchrony can be used to guide attention towards important stimuli – e.g. objects related to spoken words. However, Gogate et al. used a manual, subjective coding scheme, expressing that a spoken word and an object motion appear within the same timespan. It is thereby problematic to make a binary decision about both synchrony or asynchrony and motion or no motion. Here I aim at an objective measurement of synchrony in multimodal motherese, using the methods proposed in this thesis. The use of signal correlation provides a well formalized, objective approach that yields gradual information about the degree of synchrony. In contrast to Gogate's study, I primarily compare adult- and child-directed communication instead of different ages of infants. The basic question investigated in this section is: is there more signal-level synchrony in child-directed communication?

Figure 5.3: Investigated scenarios of multimodal motherese: parents demonstrate different object interactions to either their child or their partner. (a) Cup stacking, (b) wooden bricks, (c) bell, (d) salt shaker.

5.2.1 Method

The concrete scenario investigated in this experiment involves parents demonstrating four different interactions with objects: stacking cups, assembling wooden bricks, ringing a bell and using a salt shaker (see fig. 5.3). Parents demonstrated these tasks to both their infant and their partner. The infants' ages range from 8 to 30 months, further divided into groups of 8-11, 12-17, 18-23 and 24-30 months.

Investigated videos The original video corpus (also used in [52]) contains 660 videos of the four used tasks. The analysis was restricted to setups that contain both mother and father, demonstrating the interaction to both their partner and their child (four runs). Further, only videos with an existing speech annotation were used. The remaining 444 videos still contain a diverse set of disturbances, including the experimenter walking through the scene, the infant pulling the tablecloth down, verbal interaction with the experimenter and shouting from the infant. In order to overcome at least the acoustic disturbances automatically, a subset of videos was chosen with respect to the temporal overlap of parent speech and noise from the infant and the experimenter. Per demonstrated task and age group, the three parental couples were chosen whose videos consistently contained the fewest acoustic overlaps. In total, the selection process aimed at 192 videos, equally distributed over 4 tasks and 4 groups of infant ages, each with 3 parental couples in 4 runs. For the salt shaker demonstration towards 12-17 month old infants, only 2 couples could be used due to missing speech annotations (188 remaining videos). Further, two videos (one 8-11 months, bell; one 12-17 months, salt shaker; both child-directed) were excluded due to a corrupted audio track. As a direct comparison was impossible for these two videos, the corresponding adult-directed demonstrations were also excluded, yielding a final number of 184 analyzed videos. The videos thereby show 24 different parental couples and thus 48 different subjects.

All videos were analyzed in the original resolution of 720x576 pixels at 25 fps. The audio tracks contain mono sound, sampled at 44100 Hz.

Measurement The estimation of synchrony for each video is based on the measurements S_all and S_pos (see section 4.2). As the videos still contain visual and acoustic disturbances, the speech annotation was used to yield expressive results – i.e. only those frames that show the parent speaking contribute to the average. Hence a scalar estimate of parental synchrony S^parent_all|pos is assigned to each video. However, the results presented in the last chapter suggest that such scalar estimates are not always expressive. Although the localization experiment showed that truly synchronous stimuli gained high mutual information, still a large amount of mutual information was distributed over asynchronous stimuli. In the last chapter the goal was to find out about good features and parameters, using conditions that were known to be synchronous or not. Here the goal is to find out about the true synchrony contained in the scenes. Therefore I introduce an additional normalization step: in a separate computation the original audio track is replaced by Gaussian white noise. Then synchrony is measured in the same way as with the original audio track – in particular over the same frames that show the parent speaking. The result S^noise_all|pos provides a baseline for the synchrony with the original audio track. Since S^noise_all|pos is per definition a stochastic measure, the average across n instantiations of Gaussian noise is used (here n = 10). In this experiment I use the ratio between both estimates as the measure of synchrony contained in a video:

    S^{relative}_{all|pos} = \frac{S^{parent}_{all|pos}}{\frac{1}{n} \sum_{i=1}^{n} S^{noise}_{all|pos}(i)}    (5.1)

This measure indicates the mutual information gain of the original audio track, relative to pure audio noise – and thus how synchronous audio and video are. Note that both S^noise_all and S^noise_pos are neither zero nor constant across different videos – nor is their expectation over different instantiations of noise. The independent Gaussian distribution of the white noise indeed provides no mutual information to any other probability distribution. However, the algorithm deals with sampled values of both the audio and the video signal, thus correlation can occur by chance. Also, the video signals do practically not fulfill the two assumptions made in the correlation computation: that the samples are (i) independently chosen (ii) from a Gaussian distribution. The normalization therefore potentially mitigates the impact of the violation of these assumptions in real video sequences.

Figure 5.4: Left and middle: Sobel filter masks for vertical and horizontal edges,

    [ -1  0  +1 ]      [ +1  +2  +1 ]
    [ -2  0  +2 ]      [  0   0   0 ]
    [ -1  0  +1 ]      [ -1  -2  -1 ]

Right: integrated gradient strength image.
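To make the normalization concrete, the following sketch computes S^relative as in equation (5.1). It assumes a function measure_synchrony(video, audio, frames) that returns the scalar estimate S_all or S_pos over the given speech frames; this function, its signature and the variable names are placeholders rather than the thesis code.

    import numpy as np

    def relative_synchrony(video, audio, speech_frames, measure_synchrony,
                           n=10, seed=0):
        """Ratio of parental synchrony to the average synchrony obtained with
        Gaussian white-noise audio tracks (eq. 5.1)."""
        rng = np.random.default_rng(seed)
        s_parent = measure_synchrony(video, audio, speech_frames)
        s_noise = [measure_synchrony(video, rng.standard_normal(len(audio)),
                                     speech_frames) for _ in range(n)]
        return s_parent / np.mean(s_noise)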

Features The experiments presented in the last chapter indicated that edge images provided the best discrimination between synchronous and asynchronous stimuli. However, that investigation was split into a vertical and a horizontal edge detection. To gain a unified result, I use gradient-strength images |∇I| in this experiment, which integrate horizontal and vertical edges:

    |\nabla I(x, y)| = \sqrt{\left(\frac{\partial I}{\partial x}(x, y)\right)^{2} + \left(\frac{\partial I}{\partial y}(x, y)\right)^{2}}    (5.2)

The estimation of the x- and y-derivatives (∂I/∂x and ∂I/∂y) is done with Sobel filters. The resulting image contains high values for edges of any orientation (see fig. 5.4). Additionally, intensity images are used in this experiment.
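A minimal sketch of this feature computation, assuming OpenCV is available; the function and variable names are mine and the kernel size is an assumption:

    import cv2
    import numpy as np

    def gradient_strength(gray: np.ndarray) -> np.ndarray:
        """Integrated gradient-strength image from Sobel derivatives (eq. 5.2)."""
        dx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)  # dI/dx (vertical edges)
        dy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)  # dI/dy (horizontal edges)
        return np.sqrt(dx ** 2 + dy ** 2)

    # Hypothetical usage on a single greyscale video frame:
    frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
    edges = gradient_strength(frame.astype(np.float32))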

5.2.2 Results

The goal of this experiment is to compare synchrony in adult- and child-directed communication. For this purpose, each video showing an adult-adult (AA) interaction is analyzed together with the corresponding adult-child (AC) video. Figure 5.5 exemplarily shows the synchrony results for gradient-strength as feature, α = 0.05 and S^relative_pos as measure. Each point in the plots corresponds to a pair of an AA and an AC video, where the synchrony in the AA video is plotted on the x-axis and the synchrony in the AC video on the y-axis.

Figure 5.5: The plots show the synchrony results for gradient-strength as feature, α = 0.05 and S^relative_pos as measure. Synchrony in each adult-adult video is plotted against the synchrony in the corresponding adult-child video (diagonal y = x shown for reference). All plots show the same data set, where the second plot separates the infant-age groups (8-11, 12-17, 18-23, 24-30 months) and the third the different object interactions (cup stacking, bell, wooden bricks, salt shaker). The median is shown as a big point for each category.

Settings                Median AC   Median AA   Significance level
int,   α = 0.05, all    4.91        3.84        > 0.1
int,   α = 0.05, pos    3.86        3.09        0.001
sobel, α = 0.1,  pos    2.31        2.00        0.001
sobel, α = 0.02, pos    2.18        1.96        0.1
sobel, α = 0.05, all    3.79        3.04        0.001
sobel, α = 0.05, pos    2.68        2.32        0.001

Table 5.1: Comparison between adult-directed and child-directed communication for intensity images and Sobel-based gradient strength as features, different values of α and the measures S^relative_all and S^relative_pos. The median for AC is higher than for AA in all settings. The last column shows the significance w.r.t. the null hypothesis H_0: P(Sync_AA < Sync_AC) = P(Sync_AC < Sync_AA) = 0.5.

The first observation is that all but three videos gained synchrony values above 1.0. That means the video signals gained higher mutual information with the original audio track than with audio noise. Hence a real synchrony could be detected.

For a direct comparison between AA and AC conditions, the main diagonal (i.e. x = y) is shown in the plot. A point above this diagonal indicates that more synchrony is found in the child-directed interaction than in the corresponding adult-directed situation. Indeed most points (here 62 out of 92) lie above the diagonal. Both median and mean show higher synchrony for the child-directed situation. With this parameter setting the median synchrony is 2.32 for AA videos and 2.68 for AC videos. The significance of this effect was tested with a two-tailed sign test. The sign test between paired random variables (a_i, b_i), i = 1..N, tests the null hypothesis H_0: P(A < B) = P(A > B) = 0.5. Here the null hypothesis is that synchrony in AC has the same probability to be higher or lower than in the corresponding AA situation. On the dataset presented here, this null hypothesis can be rejected with high significance (error probability p < 0.001). The effect can be reproduced across diverse parameter settings (see table 5.1). In all tested settings, the median of AC synchrony exceeds the median of AA synchrony. The effect also reaches significance in most of the settings.
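A sketch of this paired, two-tailed sign test using scipy's exact binomial test (the variable names are placeholders; ties are discarded, as is standard for the sign test):

    import numpy as np
    from scipy.stats import binomtest

    def paired_sign_test(sync_aa: np.ndarray, sync_ac: np.ndarray) -> float:
        """p-value for H0: P(Sync_AC > Sync_AA) = P(Sync_AC < Sync_AA) = 0.5."""
        diff = sync_ac - sync_aa
        n_pos = int(np.sum(diff > 0))
        n_neg = int(np.sum(diff < 0))
        return binomtest(n_pos, n_pos + n_neg, p=0.5,
                         alternative="two-sided").pvalue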

A comparison between different infant ages (see fig. 5.5, middle) does not provide a unique trend. The highest difference between AA and AC conditions can be found in the older infant groups (18-23 and 24-30 months). The youngest group (8-11 months) still shows a trend (p < 0.1) for more synchrony in child-directed communication. For the group of 12-17 month old infants, the median shows less synchrony in AC than in AA, though still 13 out of 21 video pairs have more synchrony in the AC condition. Contrary to this trend towards a higher AA-AC difference for older infants, the highest synchrony estimates in both AA and AC conditions are found in the middle groups (12-17 and 18-23 months).


Settings                Spearman's ρ   Significance level
int,   α = 0.05, all    0.342          0.01
int,   α = 0.05, pos    0.418          0.01
sobel, α = 0.1,  pos    0.378          0.01
sobel, α = 0.02, pos    0.227          0.05
sobel, α = 0.05, all    0.473          0.01
sobel, α = 0.05, pos    0.480          0.01

Table 5.2: Spearman rank correlation coefficients for all tested parameter settings. All settings show a significant positive correlation between synchrony in adult- and child-directed situations.

With respect to the different interaction tasks (fig. 5.5, bottom), the wooden bricks scenario shows a significant (p < 0.01) trend towards more synchrony in child-directed communication. For all scenarios the median of AC synchrony is higher than the AA median.

An additional observation in the results is that synchrony in child-directed situations is positively correlated with the synchrony in the corresponding adult-directed situations. That means that parents who produce high synchrony in AA situations also tend to produce high synchrony in AC situations. The effect is measured with the Spearman rank correlation coefficient. Analogously to Pearson's correlation, positive correlation is indicated by values between 0.0 and 1.0, but the measure is more robust to outliers. For the settings shown in figure 5.5, Spearman's correlation is 0.480. The effect is significant w.r.t. the null hypothesis that the variables are uncorrelated (p < 0.01 with a two-tailed t-test). This effect, too, can be reproduced across several parameter settings (see table 5.2).
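The corresponding correlation check can be sketched with scipy (again with placeholder variable names); scipy's spearmanr also returns the two-tailed p-value based on a t-distribution:

    from scipy.stats import spearmanr

    # sync_aa, sync_ac: per-pair synchrony estimates as in the sign test above
    rho, p_value = spearmanr(sync_aa, sync_ac)
    print(f"Spearman rho = {rho:.3f}, two-tailed p = {p_value:.4f}")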

5.2.3 Discussion

The results presented in this section give a clear indication that multimodal motherese indeed involves a higher synchronization between gestures (or movement in general) and speech. Though this effect was also described by Gogate et al., it is remarkable that it can be detected even on the signal level. The finding that more synchrony is detected throughout all four investigated tasks underlines this observation. The comparison between different infant-age groups exposes two entirely unexpected trends. That the highest differences between AA and AC are observed for the older infants contrasts with the observation from Gogate et al., who found the strongest effects for the youngest children. Additionally, the highest degrees of both AA and AC synchrony are found in the middle age groups. In particular, there are high differences in the AA situations among the groups, although in all those cases two adults are talking to each other and it is not immediately clear how this communication can be affected by the age of their child. For now there is no obvious explanation for the differences between the infant-age groups. However, it is not clear how significant – and in particular how robust across different parameter settings – these observations are. These observations do not diminish the overall results that more synchrony is found in child-directed situations and that synchrony is correlated between AA and AC conditions. In particular, the correlation observation holds despite the higher synchrony in the mid-age groups. In fact, a positive correlation can be found even within each group of infant ages.

5.3 Spatial Analysis of Synchrony

If multimodal motherese provides additional learning cues due to synchrony, it is important to understand what these cues actually indicate. In this section I discuss some exemplary scenes with respect to the spatial distribution of mutual information in the video sequences. Motionese has already been investigated w.r.t. the spatial distribution of visual saliency [45]. Thereby the same cup stacking demonstrations towards 8-11 month old infants were investigated as used in this thesis. The most salient image position in each frame was categorized as parent's face, parent's hands, demonstrated object or any other image location. It was shown that different motion patterns in adult- and child-directed communication caused higher saliency on demonstrated objects in child-directed situations. This thesis deals with attention through audiovisual synchrony and its potential integration into a saliency-based attention model. It is therefore important to understand where mutual information is located in such demonstration scenarios and how it differs from purely visual cues.

A comparison between attention by saliency and by synchrony can generally be done in two ways: first, the entire saliency map (or the mixelgram) can be interpreted in terms of covert attention. As the potential importance of each image region is encoded in those maps, one can directly compare e.g. the face and the hand of a subject w.r.t. their importance relative to each other. A more condensed view can be gained in terms of overt attention: each saliency map and each mixelgram is reduced to a single attended position – a focus of attention. For saliency maps this is simply the position with the highest value. For mixelgrams I apply a 15x15 Gaussian filter before locating the maximum position. The localization experiment described in section 4.3 showed that a simple focus on high mutual information values is more appropriate than a center-of-gravity estimate of a sound source location. However, a pure maximum-pixel detection is not reasonable, since such a pixel does not necessarily reflect a robust maximum in the image region. Here the Gauss filter yields a smooth spatial behavior. The pixel with the highest value can thus be assumed to reflect a robust maximum.
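A minimal sketch of this focus-of-attention extraction from a mixelgram, assuming OpenCV; the 15x15 kernel follows the text, while letting OpenCV derive the sigma from the kernel size is my assumption:

    import cv2
    import numpy as np

    def focus_of_attention(mixelgram: np.ndarray) -> tuple[int, int]:
        """(x, y) position of the maximum of the Gauss-filtered mixelgram."""
        smoothed = cv2.GaussianBlur(mixelgram.astype(np.float32), (15, 15), 0)
        y, x = np.unravel_index(np.argmax(smoothed), smoothed.shape)
        return int(x), int(y)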

The analysis of two exemplary videos is shown in figures 5.6 and 5.7. In terms of covert attention, the average saliency map and mixelgram over time were computed. Thereby only those frames that go along with speech from the parent contribute to the average. Also, the located maxima of saliency and mutual information during parental speech are visualized. Here the maximum location within each frame contributes a smooth spot to the visualization. Frequently attended locations appear brighter, since the spots are overlaid.


Figure 5.6: Subject demonstrating cup stacking towards an infant (8-11 months old). (a) Exemplary frames with positions of the mutual information (red) and saliency (green) maximum. (b) Average mixelgram during parent speech and maximum mutual information positions. (c) Average saliency during parent speech and maximum saliency positions.


Figure 5.7: Subject demonstrating wooden bricks towards an infant (12-17 months old). (a) Exemplary frames with positions of the mutual information (red) and saliency (green) maximum. (b) Average mixelgram during parent speech and maximum mutual information positions. (c) Average saliency during parent speech and maximum saliency positions.


The first video shows the demonstration of cup stacking towards an 8-11 month old infant (fig. 5.6). A high amount of mutual information is concentrated on the face, but also on the pullover. The richly structured texture of the pullover causes much activity in the image features due to minimal body movements. The highest average saliency is primarily found at the shoulders, where the subject's pullover sharply contrasts with the background. In most frames the maximum mutual information is located on the subject's face, whereas the maximum saliency is mostly found in the subject's action space. Some exemplary frames are shown in figure 5.6a. In some frames the global mutual information maximum is located directly on the cups shown to the infant. Obviously, the cups are no source of sound in these situations, but provide synchrony due to the interplay of the parent's speech and motion.

The second video shows a demonstration of the wooden bricks towards a 12-17 month old infant (fig. 5.7). On average, both saliency and mutual information show high values on the face and in the right hand's action space. The maximum mutual information is mostly found in the action space and – contrary to the first video – less often on the face. Also the saliency maxima are mainly restricted to this area.

The shown distributions and example frames suggest that mutual information can indeed be used to find interesting image locations. Gogate et al. found that object motion is often used synchronously to a word label in multimodal motherese. Though the correlation analysis is performed on an entirely different level, this is consistent with the observation that high mutual information values can be found on shown objects during parental speech. However, the two analyzed videos cannot serve as compelling evidence for a general communication scheme. Here future investigations could e.g. pick up the scheme used in [45], counting occurrences of the maximum mutual information on the face, hands and demonstrated objects throughout an entire set of videos. In particular, the relation between mutual information cues and purely visual saliency cues should be investigated. If audiovisual synchrony truly is an additional, meaningful cue in multimodal motherese, there should be significant statistical differences between saliency and mutual information maxima.

5.4 Conclusions

Taken together, the results presented in this chapter give a clear indication that synchrony cues in multimodal motherese exist and that they can be detected on the signal level – in particular with the methods used in this thesis. Beyond human-human communication this raises questions about a concrete use in an artificial system [52]. It has been argued that cues from child-directed communication help to guide attention towards important parts of either the speech signal or the visual scene [4, 16, 21]. Here mutual information can directly serve as a spatial cue towards important visual stimuli and can therefore guide attention. In order to understand these multimodal cues in child-directed communication, further investigations should address not only the spatial, but also the temporal characteristics of synchrony. One could e.g. investigate which kinds of utterances involve high multimodal synchrony. It can be hypothesized that, if multimodal motherese is used to arouse attention, high synchrony should be found when the infant's attention slips off and the parent tries to get it back.


6 Audiovisual Attention

The basic idea of this thesis is to guide visual attention through audiovisual synchrony. A system (e.g. a robot) making use of such attention mechanisms should thus preferentially attend to sound sources, but also be sensitive to multimodal cues in tutoring situations, as discussed in the last chapter. Mutual information images (mixelgrams) can directly be used for this purpose, as mutual information expresses synchrony and thus a potentially interesting stimulus. The system could therefore simply attend to the image location with the highest mutual information. However, this approach is only reasonable if there is any synchrony in the scene. When there is for instance no sound at all, or only sound sources that cannot be seen, this approach is not able to determine interesting locations in the visual scene. Generally, audiovisual mutual information does not take into account any purely visual cue and is therefore not sufficient for a visual attention system. Here, saliency maps provide a good basis for any kind of purely visual stimuli. In this chapter I discuss several ways to integrate audiovisual synchrony with saliency maps. I sketch three different approaches that should clearly be seen as suggestions.

An integration of audiovisual synchrony and saliency maps should fulfil three basic requirements: first, in the presence of audiovisual synchrony, the system should focus on those visual stimuli that are in synchrony with the sound. Second, if there is no audiovisual synchrony, the system should behave like purely visual saliency maps. Third, there should be no binary decision whether synchrony is present or not. Instead the system should smoothly fade between both modes depending on the degree of synchrony present in the scene.

Both mutual information and saliency provide a potential degree of interest for each location in the visual field. Thus both of them – and also an integration of both – fall into the Spatial Cueing paradigm. The gradual attention on each location is thereby fully determined by the incoming stimuli, which can be classified as bottom-up or exogenous attention. It has to be clarified that this approach yields a visual attention. There is no indication about the importance of different auditory stimuli. However, it is an audiovisual system approach, as audio is used for guidance.

6.1 Feature Weighting Schemes

A simple way to influence saliency maps towards a wanted behavior is to weight the contributions of different features to the overall saliency map [28]. Originally, conspicuity maps from different features are normalized and contribute equally to the saliency map (compare equation 2.4). Modifications of the feature weights have e.g. been suggested to deal with visual search tasks [61]. Here, features can be weighted to focus on stimuli that provide synchrony to the heard sound.

6.1.1 Individual Synchrony Weighting

As attention shall be guided towards synchronous stimuli, a direct approach is to weight each feature depending on its individual synchrony to the heard sound. That means mutual information is measured for each feature, feeding into a scalar estimate of synchrony – for example Sync_I(t) for intensity images, Sync_C(t) for colors and so on. Those features with high mutual information then gain higher weights w(t) within the saliency map. For the concrete choice of weights for each feature, one has to take into account the behavior when there is no synchrony present in the scene. In this situation the weights should be stable and for instance the same for all features. One way to achieve this is to scale the synchrony estimates with a factor β and add a constant (e.g. 1) for the weights. The computation scheme is symmetric for all used features (here only shown for intensity I and color C):

    S = \big( w_I(t) \cdot \mathcal{N}(I) + w_C(t) \cdot \mathcal{N}(C) + \ldots \big)

    w_I(t) = 1 + \beta \cdot Sync_I(t)
    w_C(t) = 1 + \beta \cdot Sync_C(t)
    \vdots

When no synchrony is present, all features contribute with weight 1 to the saliency map – which corresponds to purely visual saliency maps. Features with the highest synchrony also gain the highest weights. The synchrony estimates S_all(t) and S_pos(t) used in this thesis typically have small numerical values (≪ 1). Therefore β should be ≫ 1 to achieve a reasonable impact of synchrony on the weights.
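A sketch of this weighting scheme, assuming the normalized conspicuity maps and the per-feature synchrony estimates are already available; all names and the default β are placeholders:

    import numpy as np

    def individually_weighted_saliency(conspicuity: dict[str, np.ndarray],
                                       synchrony: dict[str, float],
                                       beta: float = 100.0) -> np.ndarray:
        """Saliency map with per-feature weights w_f(t) = 1 + beta * Sync_f(t).
        Without synchrony all weights are 1, i.e. the purely visual saliency map."""
        first = next(iter(conspicuity.values()))
        saliency = np.zeros_like(first, dtype=float)
        for name, cmap in conspicuity.items():
            saliency += (1.0 + beta * synchrony.get(name, 0.0)) * cmap
        return saliency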

This approach naturally answers the question of which feature shall be used to detect synchrony – all features in this case. However, the results presented in chapter 4 indicate that not all features are appropriate for the detection of audiovisual synchrony. Also the dynamics of such a system are not trivial to understand and predict, since for e.g. five features (intensity, color, orientation, difference images and optical flow) there are five estimates of synchrony. It is a priori not clear how these five measures would behave in relation to each other.

6.1.2 Global Synchrony Weighting

The weighting of features due to their individual synchrony directly realizes the idea of focusing on visual stimuli that provide synchrony to the sound. However, the problems described in the last paragraph suggest that an indirect approach to feature weighting might be more appropriate. Here the idea is to weight the different features (or more precisely the corresponding conspicuity maps) depending on a global measure of synchrony Sync(t) that is computed on an appropriate feature (e.g. edge images).


Figure 6.1: Feature weighting based on a global synchrony measure. (a) RGB image from a video stream. (b) Normalized conspicuity maps N(I), N(C), N(O), N(D), N(F) for intensity, color, orientation, difference images and optical flow. (c) Saliency maps with different weights for the static features (intensity, color and orientation) and dynamic features (difference images and optical flow): w_stat = 5/6, w_dyn = 1/6 (left); w_stat = 1/2, w_dyn = 1/2 (middle); w_stat = 1/6, w_dyn = 5/6 (right). In situations without synchrony, high weight can be assigned to static features (left). In the presence of synchrony, the weighting promotes the dynamic features (right). A neutral, symmetric weighting is shown in the middle.


Since synchrony is always caused by the dynamics in a scene, more weight can be given to dynamic features (difference images, optical flow) in situations with high audiovisual synchrony. When there is no synchrony, the system can focus on static features like intensity, color and orientation. In fact, this view is consistent with the Intersensory Redundancy Hypothesis [2], discussed in section 2.3.3. The IRH claims that infants preferentially attend to amodal information in the presence of synchrony or redundancy across modalities. As amodal information, Bahrick et al. refer to e.g. temporal, dynamic patterns. In the absence of redundancy across modalities, infants focus on modality-specific information like color.

Generally, this approach can be denoted with two weights w_stat(t) and w_dyn(t) for the static and dynamic features. The weights are multiplied with the normalized conspicuity maps of the corresponding features to focus on either static or dynamic features:

    S = w_{stat}(t) \cdot \big( \mathcal{N}(I) + \mathcal{N}(C) + \mathcal{N}(O) \big) + w_{dyn}(t) \cdot \big( \mathcal{N}(D) + \mathcal{N}(F) \big)

Thereby w_dyn(t) shall exceed w_stat(t) in situations with high synchrony. The following scheme lets w_dyn(t) rise close to 1.0 and w_stat(t) decrease close to 0.0 in such situations. When the synchrony measure is close to zero, w_dyn(t) gets close to 0.0 and w_stat(t) becomes close to 1.0:

    w_{stat}(t) = e^{-\beta \cdot Sync(t)}
    w_{dyn}(t) = 1 - w_{stat}(t)

As Sync(t) becomes arbitrarily large – mutual information measures have no theoretical upper bound – e^(−β·Sync(t)) gets close to zero for positive values of β. This forces low weight on static and high weight on dynamic features. The opposite holds when Sync(t) gets close to zero, as e^(−β·Sync(t)) gets close to 1.0. Here β can be chosen to regulate the sensitivity for synchrony. An example of the possible impact of this approach is shown in figure 6.1.
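Under the same assumptions as before (precomputed conspicuity maps; names and the default β are placeholders), the global weighting can be sketched as:

    import numpy as np

    def globally_weighted_saliency(static_maps, dynamic_maps,
                                   sync: float, beta: float = 100.0) -> np.ndarray:
        """w_stat = exp(-beta*Sync), w_dyn = 1 - w_stat: high synchrony promotes
        the dynamic features, zero synchrony puts all weight on the static ones."""
        w_stat = float(np.exp(-beta * sync))
        w_dyn = 1.0 - w_stat
        return w_stat * sum(static_maps) + w_dyn * sum(dynamic_maps)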

6.2 Spatial Modulation

A principal limitation of feature weighting schemes is that the spatial information gathered by the synchrony detection gets entirely lost. When, for example, two persons are in the visual scene – both wearing comparable clothes and both moving – but only one of them is speaking or making noise, feature weighting cannot guide attention towards the speaking person. Therefore the spatial information contained in a mixelgram can also be used directly. Here the idea is not to weight features, but to weight locations in the saliency map. The simplest possible way is to derive a joint map of attention A by multiplying the values of saliency S and mutual information I at each image location:

    A(x, y) = S(x, y) \cdot I(x, y)

This guides attention towards synchronous stimuli, but is not appropriate in the absence of synchrony. In particular, all static regions in the images would yield a value of 0.0, since they are filtered out in the mutual information images. Again, it is desirable that in the absence of any synchrony the system behaves like purely visual saliency maps:

    A(x, y) \xrightarrow{Sync(t) \to 0} S(x, y)
    A(x, y) \xrightarrow{Sync(t) \to \infty} S(x, y) \cdot I(x, y)

This behavior can for instance be achieved with the following scheme:

    A(x, y) = S(x, y) \cdot \Big( e^{-\beta \cdot Sync(t)} \cdot C + \big(1 - e^{-\beta \cdot Sync(t)}\big) \cdot I(x, y) \Big)

Here the mixelgram I is faded against a constant value C (e.g. C = 1), depending on the synchrony Sync(t). For low synchrony values, the saliency map is only multiplied with a constant value. For increasing synchrony, A results in a multiplication of saliency and mutual information. Again, β can be used to regulate the sensitivity to synchrony and should be set ≫ 1.0 to achieve a reasonable impact.
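A sketch of this spatial modulation, assuming a saliency map, a mixelgram and a global synchrony estimate are given (names and default values are my assumptions):

    import numpy as np

    def attention_map(saliency: np.ndarray, mixelgram: np.ndarray,
                      sync: float, beta: float = 100.0, c: float = 1.0) -> np.ndarray:
        """Fade the mixelgram against the constant C depending on Sync(t):
        low synchrony -> plain saliency map (times C),
        high synchrony -> saliency multiplied by mutual information."""
        fade = float(np.exp(-beta * sync))
        return saliency * (fade * c + (1.0 - fade) * mixelgram)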

6.3 Outlook

For now, the integration of saliency maps and audiovisual mutual information sketched in this chapter is purely conceptual. Since such an integrated attention system shall be used on a robot platform, it has to prove its utility. In human-robot interaction (HRI) a robot should for instance attend to its human interaction partner. However, when considering bottom-up (and particularly purely visual) attention systems, there is a broad set of stimuli that can potentially distract the robot's attention from the human. Such disturbances not only complicate any task fulfilment for the robot, but can also confuse the human as the robot's gaze slides away.

Figure 6.2: Study on disturbances in HRI: subjects should teach tasks to a simulated robot. The only feedback from the robot was its gaze (right), which was controlled by a saliency map system (center). The study dealt with the reactions of the participants when the robot's attention was distracted by a virtual visual distractor. Illustration from [43].

In a study on HRI, Muhl and Nagai [43] investigated how people react to such disturbances that cause the robot's attention to slip off. In their study, subjects were asked to teach a task to a simulated robot (see fig. 6.2). The only feedback from the robot was its gaze, which was controlled by a saliency map system. The robot's attention was distracted by a virtual stimulus that was not visible to the participants. As soon as the persons noticed that the robot's attention was distracted, they used various strategies to regain it, including approaching the robot, but also acoustic strategies. Even though they knew that the robot could not receive acoustic input, some subjects talked to the robot or made noise by knocking on the table.

This experimental setup provides several challenging possibilities to evaluate an audiovisual attention system. The most obvious question is how sensitive the system is to visual distractors at all. However, this question does not require an interaction and is very similar to the question of localization performance discussed in section 4.3. Concerning the interaction between human and robot, an interesting question is whether the human strategies to re-arouse the robot's attention actually work with an audiovisual attention system. Even if the robot's attention is successfully (from the experimenter's perspective) distracted, it is plausible to assume that audiovisual synchrony cues can help to re-arouse it through speech or intentional noise. Moreover, if this re-arousal works, it is worth investigating the further interaction. For instance, it is imaginable that humans make more use of multimodal cues once they have noticed that these are valuable for the communication.


7 Conclusions

The goal of this thesis was to explore the use of multimodal synchrony on the signal level for an attention system. Many questions could be answered, but new questions arose as well. Methodically, the choice of Hershey & Movellan's mutual information method for synchrony detection was necessary with respect to the computational effort of other known methods. As an extension of that algorithm, I proposed a filtering process that excludes insignificant visual stimuli and has proven to be valuable across the experiments. The investigation of different image features has shown that the best performance could be gained with edge and intensity images. This is an important finding since dynamic features like difference images were a priori argued to be more appropriate for a relation to audio data. In particular, the localization experiment showed that Hershey's method is basically able to discriminate synchronous and asynchronous stimuli. However, the numerical differences in the mutual information are not very large. Here, future investigations could deal with further extensions and modifications of the algorithm. For example, a normalization against noise as used in chapter 5 could be directly integrated into the algorithm. Also a direct temporal filtering of the signals may be worth a future investigation, since temporal noise seems to be a major problem for the correlation-based synchrony detection. An important problem that has not been addressed in this thesis is that mutual information is bound to fixed locations in the image. This is problematic when objects or persons move across large areas of the visual scene, since mutual information gained at one position is not transported with the object motion but stays fixed on each pixel. Here, effort could be spent e.g. on the usage of tracking methods to bind mutual information to objects instead of fixed image locations.

Beyond the detection and localization of known sound sources, the approach has been tested in a social learning scenario. Here I have demonstrated that synchrony is not only found on the tutoring person's face – which is the sound source in the case of speech. Rather, a high degree of mutual information is found on the hands and moved objects during gestures and object demonstrations. This is fully consistent with the original hypothesis that tutors synchronize gestures and speech in order to guide attention towards relevant stimuli. The investigation of child-directed communication – multimodal motherese – in these tutoring situations gives a clear indication that parents use a higher degree of synchronization towards their infants than towards other adults. These results are consistent with a previous study by Gogate et al. [21], who showed a higher synchrony between object motions and spoken word labels in child-directed communication. However, they used a manual, subjective scheme for their evaluations. Here it is highly remarkable that a higher synchrony can not only be found with an objective measurement, but in particular on the signal level. In a broader context, adaptations in child-directed communication have also been shown in isolation for speech (motherese) and gestures (motionese). These modifications have been hypothesized to have three major functions: the arousal of attention, the guidance of attention and the structuring of complex information. The hypothesis that the synchronization between speech and gestures has the same three functions leads to several predictions that can be tested in further studies. If multimodal synchrony functions to arouse the infant's attention, an increased synchrony should be found when the infant's attention slips off from the parent or relevant objects. This function is highly plausible since infants have been demonstrated to preferentially attend to stimuli that are synchronous across modalities. If synchrony is used to guide attention towards relevant stimuli, high mutual information should systematically be found on demonstrated objects. The examples shown in chapter 5.3 already indicate this trend. However, a quantitative analysis should give more insights. In particular, a systematic comparison to purely visual cues like saliency could confirm that audiovisual synchrony is indeed an additional, useful cue in these situations. The hypothesis that synchrony functions to structure the complex flow of information leads to different predictions. If synchrony helps to partition complex actions, there should be statistical differences in the degree of synchrony between the start, accomplishment and end of distinct (sub-)actions. Another aspect is the referencing between audio and video. Gogate et al. explicitly investigated this referencing, as they evaluated the relation between a motion of an object and the articulation of the corresponding object label. Therefore the amount of signal-level synchrony could also be compared between object labels or pronouns like “this” and other words that do not explicitly refer to a visual stimulus.

In chapter 6, I sketched several approaches to integrate audiovisual mutual information and visual saliency into a unified attention system. Generally, these approaches can join the benefits of saliency maps, which have proven an appropriate means in various works, with those of audiovisual synchrony. However, the proof that these integration concepts work and yield a useful system is still outstanding. Since the mutual information computation can easily be done in real time, an integrated attention system can be productively used in a robot system and tested in a real interaction.

Taken together, the results of this thesis indicate that audiovisual synchrony is a valuable cue for an attention system and is worth further investigation.


Bibliography

[1] Bahrick, L.E., Lickliter, R. Intersensory redundancy guides early perceptual and cognitive development. Advances in Child Development and Behavior, 30:153–187, 2002.

[2] Bahrick, L.E., Lickliter, R. & Flom, R. Intersensory Redundancy Guides the Development of Selective Attention, Perception, and Cognition in Infancy. Current Directions in Psychological Science, 2004.

[3] Binnie, C.A., Montgomery, A.A. & Jackson, P.L. Auditory and Visual Contributions to the Perception of Consonants. Journal of Speech, Language, and Hearing Research, 17(4):619–630, 1974.

[4] Brand, R.J., Baldwin, D.A. & Ashburn, L.A. Evidence for 'motionese': modifications in mothers' infant-directed action. Developmental Science, 5(1):72–83, 2002.

[5] Bredin, H., Chollet, G. Measuring audio and visual speech synchrony: methods and applications. In IET International Conference on Visual Information Engineering, pages 255–260, 2006.

[6] Broadbent, D. Perception and Communication. Pergamon Press, London, 1958.

[7] Brown, M.Z., Burschka, D. & Hager, G.D. Advances in Computational Stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):993–1008, August 2003.

[8] Burt, P.J. & Adelson, E.H. The Laplacian Pyramid as a Compact Image Code. IEEE Transactions on Communications, 31(4):532–540, 1983.

[9] Chibelushi, C.C., Deravi, F. & Mason, J.S.D. A review of speech-based bimodal recognition. IEEE Transactions on Multimedia, 4(1):23–37, March 2002.

[10] Ciaramitaro, V.M., Buracas, G.T. & Boynton, G.M. Spatial and cross-modal attention alter responses to unattended sensory information in early visual and auditory human cortex. Journal of Neurophysiology, August 2007.

[11] Clark, A. Cross-modal cuing and selective attention. In Conference on “Individuating the Senses”. Oxford University, December 2004.

[12] Cooper, R.P., Abraham, J., Berman, S. & Staska, M. The Development of Infants' Preference for Motherese. Infant Behavior and Development, 20(4):477–488, 1997.

[13] Daugman, J.G. Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7):1169–1179, 1988.

[14] Diehl, R.L., Lotto, A.J. & Holt, L.L. Speech Perception. Annual Review of Psychology, 55:149–179, February 2004.

[15] Dodd, B. The role of vision in the perception of speech. Perception, 6(1):31–40, 1977.

[16] Dominey, P.F., Dodane, C. Indeterminacy in language acquisition: the role of child directed speech and joint attention. Journal of Neurolinguistics, 17(2-3):121–145, 2004.

[17] Driver, J., Spence, C. Crossmodal attention. Current Opinion in Neurobiology, 8:245–253, 1998.

[18] Driver, J., Spence, C. Multisensory Perception: Beyond modularity and convergence. Current Biology, 10(20):R731–R735, 2000.

[19] Fernald, A. & Mazzie, C. Prosody and Focus in Speech to Infants and Adults. Developmental Psychology, 27(2):209–221, 1991.

[20] Gao, D., Vasconcelos, N. Discriminant Saliency for Visual Recognition from Cluttered Scenes. In Proceedings of Neural Information Processing Systems, 2004.

[21] Gogate, L.J., Bahrick, L.E. & Watson, J.D. A Study of Multimodal Motherese: The Role of Temporal Synchrony between Verbal Labels and Gestures. Child Development, 71(4):878–894, July/August 2000.

[22] Goldsmith, M. What's in a Location? Comparing Object-Based and Space-Based Models of Feature Integration in Visual Search. Journal of Experimental Psychology: General, 127(2):189–219, 1998.

[23] Gottlieb, J.P., Kusunoki, M. & Goldberg, M.E. The representation of visual salience in monkey parietal cortex. Nature, 391(6666):481–484, 1998.

[24] Guski, R. Wahrnehmen – ein Lehrbuch. Kohlhammer, Stuttgart, 6th edition, 1996.

[25] Hairston, W.D., Wallace, M.T., Vaughan, J.W., Stein, B.E., Norris, J.L., Schirillo, J.A. Visual Localization Ability Influences Cross-Modal Bias. Journal of Cognitive Neuroscience, 15(1):898–929, 2003.

[26] Hershey, J. & Movellan, J. Audio-vision: Using audio-visual synchrony to locate sounds. Advances in Neural Information Processing Systems, 2000.

[27] Horn, B.K.P. & Schunck, B.G. Determining Optical Flow. Technical Report AIM-572, Massachusetts Institute of Technology, April 1980.

[28] Itti, L., Koch, C. Computational Modelling of Visual Attention. Nature Reviews Neuroscience, 3(2):194–203, March 2001.

[29] Itti, L., Koch, C. & Niebur, E. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.

[30] Jain, A.K., Duin, R.P.W. & Mao, J. Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, January 2000.

[31] James, W. The Principles of Psychology. Henry Holt and Co., New York, 1890.

[32] Jonides, J. Voluntary vs. Automatic Control over the Mind's Eye's Movement. In Attention and Performance, volume 9, pages 187–203, 1981.

[33] Kayser, C., Petkov, C., Lippert, M. & Logothetis, N. Mechanisms for Allocating Auditory Attention: An Auditory Saliency Map. Current Biology, 15(21):1943–1947, 2005.

[34] Kidron, E., Schechner, Y., Elad, M. Cross-Modal Localization via Sparsity. IEEE Transactions on Signal Processing, 55:1390–1404, 2005.

[35] Kidron, E., Schechner, Y., Elad, M. Pixels that sound. In IEEE Computer Vision & Pattern Recognition, volume 1, pages 88–96, 2005.

[36] Klein, R.M. Inhibition of Return. Trends in Cognitive Sciences, 4(4):138–147, 2000.

[37] Koch, C., Ullman, S. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4:219–227, 1985.

[38] Lewkowicz, D.J. The development of intersensory temporal perception: An epigenetic systems/limitations view. Psychological Bulletin, 126(2):281–308, 2000.

[39] Lippmann, R.P. Speech recognition by machines and humans. Speech Communication, 22(1):1–15, July 1997.

[40] McGurk, H., MacDonald, J. Hearing lips and seeing voices. Nature, 264:746–748, December 1976.

[41] Michel, P., Gold, K. & Scassellati, B. Motion-based robotic self-recognition. Proceedings on Intelligent Robots and Systems, 3:2763–2768, 2004.

[42] Monaci, G., Vandergheynst, P. Audiovisual Gestalts. In Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, 2006.

[43] Muhl, C., Nagai, Y. Does Disturbance Discourage People from Communicating with a Robot? In Proceedings of the 16th International Symposium on Robot and Human Interactive Communication, pages 1137–1142, 2007.

[44] Muller, H.J., O'Grady, R.B. Dimension-Based Visual Attention Modulates Dual-Judgment Accuracy in Duncan's (1984) One- Versus Two-Object Report Paradigm. Journal of Experimental Psychology: Human Perception and Performance, 26(4):1332–1351, 2000.

[45] Nagai, Y. & Rohlfing, K.J. Can Motionese Tell Infants and Robots “What To Imitate”? In Proceedings of the 4th International Symposium on Imitation in Animals and Artifacts, April 2007.

[46] Niebur, E. & Koch, C. Control of selective visual attention: Modeling the where pathway. Advances in Neural Information Processing Systems, 6(6666):802–808, 1996.

[47] Oviatt, S. Ten myths of multimodal interaction. Communications of the ACM, 42(11):74–81, 1999.

[48] Pavani, F., Spence, C., Driver, J. Visual Capture of Touch: Out-of-the-Body Experiences With Rubber Gloves. Psychological Science, 11(5):353–359, September 2000.

[49] Posner, M.I. The Attention System of the Human Brain. Annual Review of Neuroscience, 13:25–42, 1990.

[50] Prince, C.G., Hollich, G.J., Helder, N.A., Mislivec, E.J., Reddy, A., Salunke, S. & Memon, N. Taking Synchrony Seriously: A Perceptual-Level Model of Infant Synchrony Detection. In Proceedings of the Fourth International Workshop on Epigenetic Robotics, 2004.

[51] Quinlan, P.T. Visual Feature Integration Theory: Past, Present, and Future. Psychological Bulletin, 129, September 2003.

[52] Rohlfing, K.J., Fritsch, J., Wrede, B. & Jungmann, T. How can multimodal cues from child-directed interaction reduce learning complexity in robots? Advanced Robotics, 2006.

[53] Rutishauser, U., Walther, D., Koch, C., Perona, P. Is bottom-up attention useful for object recognition? In Proceedings of Computer Vision and Pattern Recognition, 2004.

[54] Slaney, M., Covell, M. FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks. In NIPS, pages 814–820, 2000.

[55] Spence, C., Squire, S. Multisensory Integration: Maintaining the Perception of Synchrony. Current Biology, 13(13):R519–R521, 2003.

[56] Tarr, M.J. & Bulthoff, H.H. Image-based object recognition in man, monkey and machine. Cognition, 67(1-2):1–20, July 1998.

[57] Treisman, A., Gelade, G. A Feature-Integration Theory of Attention. Cognitive Psychology, 12:97–136, 1980.

[58] Vijayakumar, S., Conradt, J., Shibata, T. & Schaal, S. Overt visual attention for a humanoid robot. Proceedings on Intelligent Robots and Systems, 4:2332–2337, 2001.

[59] Virsu, V., Oksanen-Hennah, H., Vedenpää, A., Jaatinen, P. & Lahti-Nuuttila, P. Simultaneity learning in vision, audition, tactile sense and their cross-modal combinations. Experimental Brain Research, 186(4):525–537, April 2008.

[60] Vroomen, J., de Gelder, B. Sound enhances visual perception: cross-modal effects of auditory organization on vision. Journal of Experimental Psychology: Human Perception and Performance, 26(5):1583–1590, 2000.

[61] Wolfe, J.M. Guided search 2.0: A revised model of visual search. Psychonomic Bulletin & Review, 1(2):202–238, 1994.

[62] Wolfe, J.M., Horowitz, T.S. What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience, 5:495–501, 2004.

[63] Zimbardo, P.G. Psychologie. Springer Verlag, 6th edition, 1995.


A Detection Results


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "int, α=0.05" and "int, α=0.1".]

Figure A.1: Detection results for intensity images with α = 0.05 (top) and α = 0.1 (bottom).
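All figures in this appendix share the same layout: one curve per analyzed spot over the tested audio-delays, plus their average, with one feature and one α value per panel. The following Python/matplotlib snippet is only a minimal sketch of how such a panel could be drawn; it is not the code used for this thesis, and the spot data, labels, and file name are invented for illustration.

# Hypothetical sketch (not the thesis code): one detection curve per spot
# over the audio-delay in frames, plus their average, as in Figure A.1.
import numpy as np
import matplotlib.pyplot as plt

delays = np.arange(-30, 31)                      # audio-delay in frames
# invented toy data: one detection value per spot and delay
rng = np.random.default_rng(0)
spots = {f"spot {i}": 0.02 * np.exp(-(delays / 10.0) ** 2)
                      + 0.005 * rng.random(delays.size)
         for i in range(5)}

fig, ax = plt.subplots()
for name, values in spots.items():
    ax.plot(delays, values, color="0.7", linewidth=0.8)   # individual spots
average = np.mean(list(spots.values()), axis=0)
ax.plot(delays, average, color="black", linewidth=2.0, label="Average")

ax.set_xlabel("Audio-delay in frames")
ax.set_ylabel("Detection value")                 # assumed axis label
ax.set_title("int, α=0.05")                      # panel title as in Fig. A.1
ax.legend()
plt.savefig("figure_a1_top.png", dpi=150)        # illustrative file name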


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "diff, α=0.05" and "diff, α=0.1".]

Figure A.2: Detection results for difference images with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "oVert, α=0.05" and "oVert, α=0.1".]

Figure A.3: Detection results for vertical-edge images with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "oHor, α=0.05" and "oHor, α=0.1".]

Figure A.4: Detection results for horizontal-edge images with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "RG, α=0.05" and "RG, α=0.1".]

Figure A.5: Detection results for RG color difference images with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "BY, α=0.05" and "BY, α=0.1".]

Figure A.6: Detection results for BY color difference images with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "flowXp, α=0.05" and "flowXp, α=0.1".]

Figure A.7: Detection results for the positive x-component of the optical flow with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "flowXn, α=0.05" and "flowXn, α=0.1".]

Figure A.8: Detection results for the negative x-component of the optical flow with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "flowYp, α=0.05" and "flowYp, α=0.1".]

Figure A.9: Detection results for the positive y-component of the optical flow with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "flowYn, α=0.05" and "flowYn, α=0.1".]

Figure A.10: Detection results for the negative y-component of the optical flow with α = 0.05 (top) and α = 0.1 (bottom).


[Plots not reproduced: one detection curve per spot plus their average over the audio-delay in frames (-30 to 30); panels "sal, α=0.05" and "sal, α=0.1".]

Figure A.11: Detection results for saliency maps as feature with α = 0.05 (top) and α = 0.1 (bottom).
