Use of Visual Information in Experimental End-of-Speech ... · Stimuli: Presentation Concatenate 12...
Transcript of Use of Visual Information in Experimental End-of-Speech ... · Stimuli: Presentation Concatenate 12...
Use of Visual Information in ExperimentalEnd-of-Speech Detection
R.J.J.H. van Son, Wieneke Wesseling, and Louis C.W. Pols
ACLC/IFA, University of Amsterdam, The [email protected]
OAP, 20 June 2008
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 1 / 16
Introduction
Introduction: Visual cues in conversations
Quantify the use of visual cues in (experimental) conversations
Humans use visual cues, eg, lips, gaze, brows, head, hands
Interesting for understanding language and for technical use
Studying the processing of ”conversations” in the lab
When do human subjects use visual cues and how does it help them
Method: Shadowing pre-recorded dialogs with minimal responses
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 2 / 16
Stimuli
Stimuli
20 informal dialogs of 900 s each, 5 hours total
Informal and unrestricted dialogs (lively speech)
69 kWords, 13669 utterances, 5752 Turn Switches (simplified)
Transcribed, utterance aligned
Annotated for dialog “function” and Gaze direction
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 3 / 16
Stimuli
Stimuli: Examples
Example frame of stimuli
Videos were synchronized frame-by-frame
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 4 / 16
Stimuli
Stimuli: Presentation
Concatenate 12 framents of 5 minutes (310 s) each
2 minute practise fragment (Audio + Video)
2 minute pause
Repeat four times:
AV: Audio + VideoA: Audio onlyV: Video only
2 minute pauses after every four fragments (20 minutes)
Fragments were pseudo-randomized
Never present a subject two fragments from the same dialog
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 5 / 16
Stimuli
Stimuli: Task and analysis
30 Subjects were asked to “shadow” the dialog
Subjects should utter minimal responses whenever they “felt like it”
Record responses with laryngograph, segment automatically
RT defined with respect to nearest utterance end
Works surprisingly well, over 10,000 responses
Response delays comparable to turn-switch delays
RT experiment with Real, Correct, and Spurious, Error, responses
Model Spurious responses with Randomized responses
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 6 / 16
RT model
Reaction-Times
Random Walk to thresshold for Central (decision) component
Perceptual and Motor: deterministic response-time t0 = tp + tm
Central decision making component: integration-time τ = 1α
RT ∼ τ + t0
RT Variance ∼ τ3
Sigman, M and Dehaene, S (2005). Parsing a cognitive task: a characterization of the mind’s bottleneck.. PLoS Biol, 3(2):e37.
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 7 / 16
Results
Results: Distribution of Turn and Response delays
A mixture of Spurious and Correct responses
Turns: Orignal speaker turns switch delays
AV audio-visual, A audio-only, and V visual-only
Randomized delays model Spurious responses
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 8 / 16
Results
Results: Separating Correct and Spurious responses
Delays −−> seconds
Re
sp
on
se
s (
co
rre
cte
d)
−2.0 −1.6 −1.2 −0.8 −0.4 0.0 0.4 0.8 1.2 1.6 2.0
06
01
20
20
02
80
36
0 Turns (N=2509 ~ 85%)
AV (N=1836 ~ 56%)
A (N=1663 ~ 44%)
V (N=722 ~ 25%)
Expectation Maximization with one PDF fixed
Filled symbols: Estimated Correct delays (% correct)
Open symbols: Randomized delays
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 9 / 16
Results
Results: Experimental values
Three interesting values
Fraction of Correct responses ⇒ Precision
Estimated Average Response delay ⇒ RT
Estimated variance ⇒ τ3 or “cognitive load”
But: Statistics are problematic
Visual cues: Gaze
Annotated Gaze direction of the speaker is a correlate of visual attention
Gaze: Speaker starts to look towards the listener
Nogaze: Other utterances
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 10 / 16
Results
Results: Gaze versus Nogaze utterances
Delays −−> seconds
Responses (
corr
ecte
d)
−2.0 −1.6 −1.2 −0.8 −0.4 0.0 0.4 0.8 1.2 1.6 2.0
030
70
110
160
210
Turns (N=784 ~ 85%)
AV (N=728 ~ 64%)
A (N=548 ~ 40%)
V (N=377 ~ 35%)
Gaze
Delays −−> seconds
Responses (
corr
ecte
d)
−2.0 −1.6 −1.2 −0.8 −0.4 0.0 0.4 0.8 1.2 1.6 2.0
030
70
110
160
210
Turns (N=1705 ~ 83%)
AV (N=1127 ~ 52%)
A (N=1161 ~ 47%)
V (N=360 ~ 19%)
Nogaze
High numbers of spurious responses, overlapping delay distibutions
Filled symbols: Estimated Correct delays
Open symbols: Randomized delays
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 11 / 16
Results
Results: Estimated Response Counts
Total versus Correct responses Experimental conditionE
stim
ate
d S
NR
(dB
) −
−>
Turns AV A V
−8
−6
−4
−2
02
46
8
Gaze
Nogaze
Correct versus Spurious responses, as Signal-to-Noise ratio (dB)
Left: High numbers of spurious responses
Right: Gaze has better SNR, except for Audio-only (A)
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 12 / 16
Results
Results: Estimated Reaction Times
Experimental condition
Estim
ate
d m
ea
n R
T −
−>
se
c
Turns AV A V
0.0
0.1
0.2
0.3
0.4
0.5
Gaze
Nogaze
Estimated mean RT
Gaze longer than Nogaze even without visual cues (A)
Compounding factors? (eg, utterance length)
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 13 / 16
Results
Results: Estimated Standard Deviation and τGaze/τNogazeS
tandard
Devia
tion −
−>
Gaze None Gaze None Gaze None Gaze None
Turns AV A V
0.0
0.1
0.2
0.3
0.4
0.5
0.6
Gaze
Nogaze
SD: Gaze versus Nogaze
Rela
tive T
au −
−>
Gaze/N
ogaze
Turns AV A V
0.9
11.1
1.2
1.3
τGaze/τNogaze = 3
qS2
Gaze/S2Nogaze
Congnitive “load” increases when visual Gaze cues are used
AV, V have higher Sd and integration constant τ with Gaze (∼ 20%)
Original Turns have high processing load, not affected by Gaze
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 14 / 16
Conclusions
Conclusions
Visual cues are sufficient and helpful for experimental TRP detection
Visual cues almost sufficient for TRP detection (V ∼ 25% Correct)
Visual information without Gaze has little effect (AV∼A)
Visual correlates of attention, eg, Gaze:
improve precision (∼ 2dB SNR, V∼A)increase cognitive load (τ ∼ +20%)
Original dialog turn switches (Turns)
induce a higher cognitive loadunaffected by Gazevisual attention only one factor among many
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 15 / 16
Conclusions
Thank You
van Son et al. (ACLC/CLST) Use of Visual Information OAP 20/06/2008 16 / 16