MLMI, May 2, 2006
Video Analysis
Content Extraction
A Multimodal Analysis of Floor Control in Meetings
Lei Chen, Mary Harper, Amy Franklin, R. Travis Rose, Irene Kimbara, Zhongqiang Huang, Francis Quek
Multimodal Study: Floor Control
An underlying mechanism controls how the floor is distributed among participants; understanding floor control in dialogs and meetings helps in discerning their structure.
In a meeting for which the current floor holder is known, it is interesting to predict:
- Whether floor control will change
- Who the next floor holder will be
We investigate multimodal cues for floor control in two VACE meetings.
Importance of Floor Control
The floor holder represents a primary thread in summarizing meetings, so identification of the primary channels (audio and visual) is important:
- Camera focus
- Special-purpose signal processing
More natural, human-like conversational agents:
- Using human conversational principles related to the distribution of floor control
Automatic meeting analysis:
- Floor control information can contribute to revealing the topic flow and interaction patterns that emerge during meetings.
Prior Work
Conversation Analysis Research: Sacks et al. (1974) posited that a conversation is built on turn constructional units (TCUs), complete units with respect to intonation contours, syntax, and semantics. A transition relevance place (TRP) raises the likelihood that another speaker can take over the floor and start speaking. Many cues are used by participants to predict the end of TCUs (Duncan, 1972; Argyle and Cook, 1974).
Dialog-based Research: prosody (Caspers, 2000; Wichmann and Caspers, 2001), gaze (Novick et al., 1996), dialog acts (Shriberg et al., 2004)
Multiparty Meetings: Padilha and Carletta (2002, 2003), Novick (2005), Vertegaal et al. (2001)
Meeting Collections: the ISL audio corpus, the ICSI audio corpus, the NIST audio-visual corpus, and the MM4 audio-visual corpus
VACE-II Meeting Room Data
Data and Annotations
Two VACE meetings:
Jan. 07: foreign weapon testing (41.6 minutes, 5 participants, 9,871 words)
March 18: scholarship selection (44.4 minutes, 5 participants, 7,547 words)
Multimodal annotation
January 7th Excerpt
March 18th Excerpt
Word and SU Annotations
Words (Purdue and U Chicago):
- Segment the audio into speech and non-speech chunks (IHM)
- Transcribe speech chunks using the LDC Quick Transcription (QTR) guidelines
- Obtain time alignments for all words, given their pronunciations, using ASR
SUs (Purdue): use the EARS MDE annotation specification V6.2
- SU segmentation
- Type: statement, question, backchannel, incomplete
- A hidden-event LM was used to automatically generate SU hypotheses, which were then hand-corrected
The Anvil interface allowed us to view time-aligned transcripts, while consulting audio and video cues, for annotating the sentences in the meetings.
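The hidden-event approach above can be caricatured in a few lines: an SU boundary is treated as a hidden token between words and hypothesized wherever it scores highly enough. This is only an illustrative sketch; the probability table, function name, and threshold are invented, and a real hidden-event LM scores boundaries with n-gram context and forward-backward decoding rather than a lookup table.

```python
# Toy illustration of the hidden-event idea behind the automatic SU
# hypotheses: an SU boundary is treated as a hidden token between words
# and hypothesized wherever its probability is high enough. The
# probabilities below are made up for illustration only.

BOUNDARY_PROB = {  # illustrative P(SU boundary follows this word)
    "yeah": 0.9, "okay": 0.8, "right": 0.7, "the": 0.01, "and": 0.05,
}

def hypothesize_sus(words, threshold=0.5):
    """Split a word stream into candidate SUs at likely boundaries."""
    sus, current = [], []
    for word in words:
        current.append(word)
        if BOUNDARY_PROB.get(word, 0.1) > threshold:
            sus.append(" ".join(current))
            current = []
    if current:  # flush any trailing partial SU
        sus.append(" ".join(current))
    return sus

print(hypothesize_sus("yeah okay the meeting starts and right".split()))
```

In the real pipeline these automatic hypotheses were only a starting point; annotators hand-corrected them, as noted above.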
Gesture and Gaze Annotations
Gesture and gaze coding was done with MacVissta under Mac OS X. This display and annotation tool supports the simultaneous display of multiple MPEG-4 videos (representing different camera angles) and enables the annotator to select an appropriate view from any of the videos to produce more accurate gaze/gesture coding.
10 cameras were used to record the meeting participants from different viewing angles, supporting the annotation of each participant’s gaze direction and gestures.
Annotators had access to time aligned word transcriptions and all of the videos when producing gaze and gesture annotations.
Gaze Annotations
Gaze was annotated by researchers in the McNeill Lab at U. Chicago.
The gaze target plus start and end times were marked:
- Based on markup of major saccades (the intervals between fixations)
- At a resolution of ~3 video frames (insufficient for micro-saccades)
- Space was segmented into areas and objects, which we collapsed into: each participant, paper, table, whiteboard, neutral space, and other
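Collapsing the fine-grained gaze targets into these six categories amounts to a simple lookup; a minimal sketch, in which the raw target labels are invented for illustration (only the six collapsed categories come from the annotation scheme):

```python
# Map raw gaze targets onto the six collapsed categories. The raw
# labels here are illustrative assumptions, not the actual annotation
# vocabulary; anything unmapped falls into "other".

COLLAPSE = {
    "participant_C": "participant", "participant_D": "participant",
    "agenda": "paper", "notes": "paper",
    "table": "table", "whiteboard": "whiteboard",
    "ceiling": "neutral space",
}

def collapse(raw_target):
    """Return the collapsed category for a raw gaze target."""
    return COLLAPSE.get(raw_target, "other")

print(collapse("agenda"))      # paper
print(collapse("coffee_mug"))  # other
```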
Gesture Annotations
Gesture was annotated by researchers in the McNeill Lab at U. Chicago.
Gesture annotations used in our investigations:
Emblematic gestures: e.g., "thumbs up" means "good" in some cultures.
Four gesticulation types were annotated and used in our investigations:
- Metaphoric: e.g., gestures containing smooth, continuous motions (such as sweeping, arcing, or dragging) for continuous change
- Iconic: e.g., "and he bends it way back" while making an iconic gesture of appearing to grip something and pull it back
- Deictic: gestures used to point to entities during communication
- Beat: simple rhythmical hand motions
Note that fidgets and instrumental movements are excluded.
Floor Annotations
Six types of floor annotations (Purdue):
Control: Who has control of the floor and which participants comprise the floor
Sidebar: Used to represent sub-floors that have split off from the main thread of the meeting; again, we record who has control and which participants are involved
Backchannel: An SU type involving utterances like "yeah" that are spoken while another participant controls the floor
Challenge: An attempt to grab the floor
Cooperative: An utterance inserted into the middle of the floor controller's utterance (like a backchannel but with propositional content)
Other: Other vocalizations, e.g., self-talk, that do not contribute to any current floor thread
The Anvil interface allowed us to view time-aligned transcripts and SU annotations, while consulting audio and video cues, for annotating the floor events in the meetings.
Cooperative Example
Challenge Example
Questions
Audio:
- How frequently do verbal backchannels occur in meetings?
- Are discourse markers (e.g., right, so, well) used more frequently at the beginning, middle, or end of a control event?
Gaze:
- When a holder finishes his/her turn, does he/she gaze at the next floor holder more often than at other potential targets?
- When a holder takes control of the floor, does he/she gaze at the previous floor holder more often than at other potential targets?
- Do we observe frequent mutual gaze breaks between two adjacent floor holders during a floor change?
Gesture:
- How frequently does the previous floor holder make floor-yielding gestures, such as pointing to the next floor holder?
- How frequently does the next floor holder make floor-grabbing gestures to gain control of the floor?
Measurement Study
Goals:
- To gain insight into the mechanisms governing floor control in meetings
- To identify useful multimodal cues for an automatic floor control identification system
Measurements:
- Basic meeting statistics
- Speech events: verbal backchannels, discourse markers (DMs)
- Gaze events: gaze distribution at floor transitions, the meeting manager's gaze
- Gesture events
[Bar chart: Jan 7 Meeting — cumulative duration (sec) of control events (Control, Challenge, Backchannel, Sidebar-Control, Cooperative) by participant (C, D, E, F, G)]
[Bar chart: March 18 Meeting — cumulative duration (sec) of control events (Control, Challenge, Backchannel, Sidebar-Control, Cooperative) by participant (C, D, E, F, G)]
[Bar chart: Jan 7 Meeting — control event counts (Control, Challenge, Backchannel, Sidebar, Cooperative) by participant (C, D, E, F, G)]
[Bar chart: March 18 Meeting — control event counts (Control, Challenge, Backchannel, Sidebar, Cooperative) by participant (C, D, E, F, G)]
Floor Transition Types
Change: there is a clear floor transition between two adjacent floor holders, with some gap between the adjacent floors.
Overlap: there is a clear floor transition between two adjacent floor holders, but the next holder begins talking before the previous holder stops speaking.
Stop: the previous floor holder clearly gives up the floor, and there is no intended next holder, so the floor is open to all participants.
Self-select: without being explicitly yielded the floor by the previous holder, a participant takes control of the floor.
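The four transition types can be characterized by a handful of timing and yielding conditions; a sketch of how they might be derived automatically (the function name, argument layout, and decision order are our own; only the four categories come from the slides):

```python
# Sketch: classify the transition between two adjacent floor events
# from their timing plus whether the floor was explicitly yielded.

def classify_transition(prev_end, next_start, yielded, has_next_holder=True):
    """Classify a floor transition.

    prev_end        -- time (sec) the previous holder stops speaking
    next_start      -- time (sec) the next holder starts speaking
    yielded         -- True if the previous holder yielded the floor
    has_next_holder -- False if the floor is left open to all
    """
    if not has_next_holder:
        return "Stop"         # floor given up with no intended next holder
    if not yielded:
        return "Self-select"  # a participant takes the floor unprompted
    if next_start < prev_end:
        return "Overlap"      # next holder starts before previous stops
    return "Change"           # clean hand-over with a gap

print(classify_transition(10.0, 10.8, yielded=True))  # Change
print(classify_transition(10.0, 9.5, yielded=True))   # Overlap
```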
Distribution of Floor Transition Types
[Bar chart: Distribution of Floor Transition Types — counts of each transition type (Change, Overlap, Stop, Self-Select) in the Jan 7 and March 18 meetings]
Verbal Backchannels and Nods
SU Type Distribution and Nod Frequency (counts):

Event Type    Jan 7  March 18
Statement       551       597
Question        102       127
Incomplete       86       109
Backchannel     330       281
Nods            509       304
Discourse Markers
Event (plus Portion)               Meeting    # DMs  Total Duration (s)  DM/sec
Challenge                          Jan 7         22      26.79            0.82
                                   March 18      12      18.65            0.64
Short Control (< 2 sec)            Jan 7         20      52.41            0.38
                                   March 18      42     111.04            0.38
Control Beginning (first 0.5 sec)  Jan 7         58      70               0.83
                                   March 18      73      82.5             0.88
Control Ending (last 0.5 sec)      Jan 7         13      70               0.19
                                   March 18      13      82.5             0.16
Control Middle                     Jan 7        304    2155.5             0.14
                                   March 18     184    2092.67            0.09
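The DM/sec column is simply the DM count divided by the total duration. Recomputing the Jan 7 figures as a sanity check (the dict layout and loop are ours; the pairing of row labels to numbers follows our reading of the flattened slide table):

```python
# Recompute DM rate = (# DMs) / (total duration in seconds) for the
# Jan 7 meeting, using the counts and durations from the slide.

jan7_dms = {  # event portion -> (# DMs, total duration in seconds)
    "Challenge": (22, 26.79),
    "Short Control (< 2 sec)": (20, 52.41),
    "Control Beginning (first 0.5 sec)": (58, 70.0),
    "Control Ending (last 0.5 sec)": (13, 70.0),
    "Control Middle": (304, 2155.5),
}

for portion, (n_dms, duration) in jan7_dms.items():
    print(f"{portion}: {n_dms / duration:.2f} DM/sec")
```

Note that the beginning and ending portions share the same total duration (0.5 sec taken from each control event), so the higher beginning rate reflects genuinely more DMs there, not a normalization artifact.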
Gaze Patterns: Current to Next Holder
[Bar chart: Eye Gaze of Current Floor Holder at Floor Transition — counts by meeting (Jan 7, March 18) and transition type (Change, Overlap, Stop); gaze targets: Next Holder, Manager, Others, No one]
[Bar chart: Eye Gaze of Next Floor Holder at Floor Transition — counts by meeting (Jan 7, March 18) and transition type (Change, Overlap, Self-Select); gaze targets: Prior Holder, Manager, Others, No one]
Mutual Gaze Break
Meeting Manager Role
The ostensible meeting manager for each meeting is participant E; however, participant E in the March 18 meeting does not appear to embrace that role.
In the Jan 7 meeting, there were 53 floor exchanges (Change and Overlap only) in which E was neither the previous nor the next floor holder. In these 53 cases, E gazes at the next floor holder 21 times. If we rule out the cases where other participants also look at the next floor holder, E still gazes at the next floor holder 11 times (20.75%), suggesting that the gaze of the meeting manager plays a role in predicting the next floor holder.
In the March 18 meeting, there are 100 cases in which E is not a floor holder. In these 100 cases, E gazes at the next floor holder only 6 times; in fact, E tends to gaze largely at his papers or the whiteboard.
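The percentages follow directly from the counts on this slide; a quick check (the variable names are ours, the counts come from the slide):

```python
# Reproduce the meeting-manager gaze percentages from the slide counts.

jan7_bystander_cases = 53   # Jan 7 exchanges where E held neither floor
jan7_gazes_at_next = 21     # E gazes at the next floor holder
jan7_exclusive_gazes = 11   # ... and no other participant does

print(f"Jan 7 gaze-at-next rate: {jan7_gazes_at_next / jan7_bystander_cases:.1%}")
print(f"Jan 7 exclusive rate:    {jan7_exclusive_gazes / jan7_bystander_cases:.2%}")  # 20.75%

mar18_cases, mar18_gazes = 100, 6
print(f"Mar 18 gaze-at-next rate: {mar18_gazes / mar18_cases:.1%}")
```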
Gestures for Yielding and Grabbing the Floor
[Bar chart: Distribution of Floor Yielding and Grabbing Gestures — counts by meeting (Jan 7, March 18) and transition type (Change, Overlap, Stop, Self-Select); gesture types: Yield, Grab]
Conclusions
Presented a floor control annotation specification and conducted an analysis of two VACE meetings.
Identified some multimodal cues that will be helpful for predicting floor control events:
- DMs occur frequently at the beginning of a floor
- The previous holder often gazes at the next floor holder, and vice versa, during floor transitions
- The mutual gaze break patterns previously observed in dialogs are also found in the Jan 7 meeting
- An active meeting manager plays a role in floor transitions
- Gestures, especially floor-capturing gestures, play a role in floor transitions
Acknowledgements
Discussions with David McNeill and Susan Duncan at U. Chicago, Liz Shriberg at ICSI/SRI, and Felicia Roberts at Purdue University.
This work was supported by: ARDA VACE II, DARPA EARS and GALE.