Use of Visual Information in Experimental End-of-Speech ... · Stimuli: Presentation Concatenate 12...

Use of Visual Information in ExperimentalEnd-of-Speech Detection

R.J.J.H. van Son, Wieneke Wesseling, and Louis C.W. Pols

ACLC/IFA, University of Amsterdam, The [email protected]

OAP, 20 June 2008

van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 1 / 16

Introduction

Introduction: Visual cues in conversations

Quantify the use of visual cues in (experimental) conversations

Humans use visual cues, eg, lips, gaze, brows, head, hands

Interesting for understanding language and for technical use

Studying the processing of ”conversations” in the lab

When do human subjects use visual cues and how does it help them

Method: Shadowing pre-recorded dialogs with minimal responses


Stimuli

Stimuli

20 informal dialogs of 900 s each, 5 hours total

Informal and unrestricted dialogs (lively speech)

69 kWords, 13669 utterances, 5752 Turn Switches (simplified)

Transcribed, utterance aligned

Annotated for dialog “function” and Gaze direction


Stimuli

Stimuli: Examples

Example frame of stimuli

Videos were synchronized frame-by-frame


Stimuli

Stimuli: Presentation

Concatenate 12 framents of 5 minutes (310 s) each

2 minute practise fragment (Audio + Video)

2 minute pause

Repeat four times:

AV: Audio + VideoA: Audio onlyV: Video only

2 minute pauses after every four fragments (20 minutes)

Fragments were pseudo-randomized

Never present a subject two fragments from the same dialog


Stimuli

Stimuli: Task and analysis

30 Subjects were asked to “shadow” the dialog

Subjects should utter minimal responses whenever they “felt like it”

Record responses with laryngograph, segment automatically

RT defined with respect to nearest utterance end

Works surprisingly well, over 10,000 responses

Response delays comparable to turn-switch delays

RT experiment with Real, Correct, and Spurious, Error, responses

Model Spurious responses with Randomized responses


RT model

Reaction-Times

Random Walk to thresshold for Central (decision) component

Perceptual and Motor: deterministic response-time t0 = tp + tm

Central decision making component: integration-time τ = 1α

RT ∼ τ + t0

RT Variance ∼ τ3

Sigman, M and Dehaene, S (2005). Parsing a cognitive task: a characterization of the mind’s bottleneck.. PLoS Biol, 3(2):e37.


http://www.unicog.org/publications/sigman_plos_2005.pdf

Results

Results: Distribution of Turn and Response delays

A mixture of Spurious and Correct responses

Turns: Orignal speaker turns switch delays

AV audio-visual, A audio-only, and V visual-only

Randomized delays model Spurious responses


Results

Results: Separating Correct and Spurious responses

Delays −−> seconds

Re

sp

on

se

s (

co

rre

cte

d)

−2.0 −1.6 −1.2 −0.8 −0.4 0.0 0.4 0.8 1.2 1.6 2.0

06

01

20

20

02

80

36

0 Turns (N=2509 ~ 85%)

AV (N=1836 ~ 56%)

A (N=1663 ~ 44%)

V (N=722 ~ 25%)

Expectation Maximization with one PDF fixed

Filled symbols: Estimated Correct delays (% correct)

Open symbols: Randomized delays


Results

Results: Experimental values

Three interesting values

Fraction of Correct responses ⇒ Precision

Estimated Average Response delay ⇒ RT

Estimated variance ⇒ τ3 or “cognitive load”

But: Statistics are problematic

Visual cues: Gaze

Annotated Gaze direction of the speaker is a correlate of visual attention

Gaze: Speaker starts to look towards the listener

Nogaze: Other utterances


Results

Results: Gaze versus Nogaze utterances


Responses (

corr

ecte

d)

−2.0 −1.6 −1.2 −0.8 −0.4 0.0 0.4 0.8 1.2 1.6 2.0

030

70

110

160

210

Turns (N=784 ~ 85%)

AV (N=728 ~ 64%)

A (N=548 ~ 40%)

V (N=377 ~ 35%)

Gaze


Responses (

corr

ecte

d)

−2.0 −1.6 −1.2 −0.8 −0.4 0.0 0.4 0.8 1.2 1.6 2.0

030

70

110

160

210

Turns (N=1705 ~ 83%)

AV (N=1127 ~ 52%)

A (N=1161 ~ 47%)

V (N=360 ~ 19%)

Nogaze

High numbers of spurious responses, overlapping delay distibutions

Filled symbols: Estimated Correct delays

Open symbols: Randomized delays


Results

Results: Estimated Response Counts

Total versus Correct responses Experimental conditionE

stim

ate

d S

NR

(dB

) −

−>

Turns AV A V

−8

−6

−4

−2

02

46

8

Gaze

Nogaze

Correct versus Spurious responses, as Signal-to-Noise ratio (dB)

Left: High numbers of spurious responses

Right: Gaze has better SNR, except for Audio-only (A)


Results

Results: Estimated Reaction Times

Experimental condition

Estim

ate

d m

ea

n R

T −

−>

se

c

Turns AV A V

0.0

0.1

0.2

0.3

0.4

0.5

Gaze

Nogaze

Estimated mean RT

Gaze longer than Nogaze even without visual cues (A)

Compounding factors? (eg, utterance length)


Results

Results: Estimated Standard Deviation and τGaze/τNogazeS

tandard

Devia

tion −

−>

Gaze None Gaze None Gaze None Gaze None

Turns AV A V

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Gaze

Nogaze

SD: Gaze versus Nogaze

Rela

tive T

au −

−>

Gaze/N

ogaze

Turns AV A V

0.9

11.1

1.2

1.3

τGaze/τNogaze = 3

qS2

Gaze/S2Nogaze

Congnitive “load” increases when visual Gaze cues are used

AV, V have higher Sd and integration constant τ with Gaze (∼ 20%)

Original Turns have high processing load, not affected by Gaze


Conclusions

Conclusions

Visual cues are sufficient and helpful for experimental TRP detection

Visual cues almost sufficient for TRP detection (V ∼ 25% Correct)

Visual information without Gaze has little effect (AV∼A)

Visual correlates of attention, eg, Gaze:

improve precision (∼ 2dB SNR, V∼A)increase cognitive load (τ ∼ +20%)

Original dialog turn switches (Turns)

induce a higher cognitive loadunaffected by Gazevisual attention only one factor among many


Conclusions

Thank You


Use of Visual Information in Experimental End-of-Speech ... · Stimuli: Presentation Concatenate 12...

Documents

Transcript of Use of Visual Information in Experimental End-of-Speech ... · Stimuli: Presentation Concatenate 12...