The Games Corpus: Design, implementation and annotation
Agustín Gravano
[email protected]
Spoken Language Processing Group, Columbia University
"The Games Corpus" - Agustín Gravano - Columbia University 2
The Games Corpus
1. Design and Implementation
2. Annotation
"The Games Corpus" - Agustín Gravano - Columbia University 3
The Games Corpus
1. Design and Implementation
2. Annotation
"The Games Corpus" - Agustín Gravano - Columbia University 4
Experiment Design
• Goal: study the relation between the down-stepped contour and:
  • Information status
  • Syntactic position
  • Discourse position
• Spontaneous speech
  • Both monologue and dialogue
"The Games Corpus" - Agustín Gravano - Columbia University 5
Experiment Design
• Three computer games.
• Two players, each on a different computer.
• They collaborate to perform a common task.
• Totally unrestricted speech.
"The Games Corpus" - Agustín Gravano - Columbia University 6
Cards Game #1
[Figure: game screens for Player 1 (Describer) and Player 2 (Searcher)]
• Short monologues
• Vary frequency and order of occurrence of objects on the cards.
"The Games Corpus" - Agustín Gravano - Columbia University 7
Cards Game #2
[Figure: game screens for Player 1 (Describer) and Player 2 (Searcher)]
• Dialogue
• Vary frequency and order of occurrence of objects on the cards.
"The Games Corpus" - Agustín Gravano - Columbia University 8
Objects Game
[Figure: game screens for Player 1 (Describer) and Player 2 (Searcher)]
• Dialogue
• Vary target and surrounding objects (subject and object position).
"The Games Corpus" - Agustín Gravano - Columbia University 9
Games Session
• Repeat 3 times:
  • Cards Game #1
  • Cards Game #2
• Short break (optional)
• Repeat 3 times:
  • Objects Game
• Each subject participated in 2 sessions.
• 12 sessions
"The Games Corpus" - Agustín Gravano - Columbia University 10
Subjects
• Postings:
  • Columbia’s webpage for temporary job ads.
  • Craigslist (http://www.craigslist.org)
    • Category: Gigs / Event gigs
• Problem: people are unreliable.
  • ~50% did not show up, or cancelled on short notice.
"The Games Corpus" - Agustín Gravano - Columbia University 11
Subjects
• Possible solutions:
  • Give precise instructions to e-mail ALL required info: name, native speaker?, hearing impairments?, etc.
  • Ask for a phone number. Call them and explain why it is so important for us that they show up (or cancel with adequate notice).
  • Increase the pay after each session.
    • Example: $5, $10, $15 instead of $10, $10, $10.
"The Games Corpus" - Agustín Gravano - Columbia University 12
Recording
• Sound-proof booth
• 2 subjects + 1 or 2 confederates.
• Head-mounted mics.
• Digital Audio Tape (DAT): one channel per speaker.
• Wav files
  • One mono file per speaker.
  • Sample rate: 48000 Hz
  • Downsampled to 16000 Hz (but kept original files!)
  • ~20 hours of speech
  • 2.8 GB (16k)
"The Games Corpus" - Agustín Gravano - Columbia University 13
Logs
• Log everything the subjects do to a text file. Example:
  17:03:55:234 BEGIN_EXECUTION
  17:04:04:868 NEXT_TURN
  17:04:31:837 RESULTS 97 points awarded.
  17:04:38:426 NEXT_TURN
  17:05:03:873 RESULTS 92 points awarded.
  ...
• Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
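A minimal sketch of how such a log could be cut into smaller units. It assumes that each line is "HH:MM:SS:mmm EVENT [details]", that NEXT_TURN opens a turn and that the following RESULTS line closes it. The slides do not state these semantics, and the function names are hypothetical.

```python
from datetime import timedelta


def parse_log_line(line):
    """Split 'HH:MM:SS:mmm EVENT [details]' into (time, event, details).

    Returns None for lines that do not start with a four-field timestamp.
    """
    stamp, _, rest = line.strip().partition(" ")
    fields = stamp.split(":")
    if len(fields) != 4 or not all(field.isdigit() for field in fields):
        return None
    hours, minutes, seconds, millis = map(int, fields)
    event, _, details = rest.partition(" ")
    time = timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)
    return time, event, details


def split_into_turns(lines):
    """Return one (start, end) pair per turn.

    Assumption: NEXT_TURN opens a turn and the next RESULTS closes it.
    """
    turns, start = [], None
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue  # skip blank or malformed lines
        time, event, _ = parsed
        if event == "NEXT_TURN":
            start = time
        elif event == "RESULTS" and start is not None:
            turns.append((start, time))
            start = None
    return turns


log = """17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
..."""

turns = split_into_turns(log.splitlines())
for start, end in turns:
    print(start, end, (end - start).total_seconds())
```

The timestamps are clock times, so mapping a turn onto the audio would also need the clock time at which the recording started, which the log excerpt does not show.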
"The Games Corpus" - Agustín Gravano - Columbia University 14
The Games Corpus
1. Design and Implementation
2. Annotation
"The Games Corpus" - Agustín Gravano - Columbia University 15
Speech Processing Tools
Praat http://www.praat.org
WaveSurfer http://www.speech.kth.se/wavesurfer
Transcriber http://trans.sourceforge.net
"The Games Corpus" - Agustín Gravano - Columbia University 16
Orthographic Tier - Method 1
"The Games Corpus" - Agustín Gravano - Columbia University 17
Orthographic Tier - Method 1
• Problems
  • Very stressful
  • Time consuming
• Separate transcription from alignment.
"The Games Corpus" - Agustín Gravano - Columbia University 18
Orthographic Tier - Method 2
1. Transcribe chunks using a web interface.
"The Games Corpus" - Agustín Gravano - Columbia University 19
Orthographic Tier - Method 2
1. Transcribe chunks using a web interface.
2. Align each chunk automatically.
3. Concatenate all chunks.
4. Correct the alignment by hand using Praat, WaveSurfer or similar.
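Step 3 can be sketched as follows, assuming each chunk comes with its start offset in the full recording and that the automatic aligner returns word times relative to the chunk start. The slides describe neither the aligner nor its output, so the data layout and the function name are hypothetical.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk alignments into one list with absolute times.

    chunks: list of (chunk_start, words), where words is a list of
    (start, end, word) with times in seconds relative to chunk_start.
    """
    merged = []
    for chunk_start, words in sorted(chunks, key=lambda chunk: chunk[0]):
        for start, end, word in words:
            # Shift chunk-relative times by the chunk's offset in the file.
            merged.append((round(chunk_start + start, 3),
                           round(chunk_start + end, 3),
                           word))
    return merged


# Made-up example: two chunks, given out of order.
chunks = [
    (10.0, [(0.2, 0.5, "okay")]),
    (0.0, [(0.1, 0.4, "the"), (0.4, 0.9, "lion")]),
]
merged = concatenate_chunks(chunks)
for start, end, word in merged:
    print(start, end, word)
```

Rounding to milliseconds keeps the concatenated file stable across runs. A wrong chunk offset shifts every word in that chunk, which is one reason the hand correction in step 4 is still needed.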
"The Games Corpus" - Agustín Gravano - Columbia University 20
Orthographic Tier - Method 2
• Advantages
  • Transcription task is very comfortable.
  • Most of the alignment task is done automatically.
  • Only fine-grained hand corrections are needed.
• Problems
  • Overhead: chunking, automatic alignment, concatenation.
  • Error prone! It is easy for humans to overlook errors in the automatic alignment.
"The Games Corpus" - Agustín Gravano - Columbia University 21
Orthographic Tier - Method 3
1. Transcribe the whole file, using:
  • a regular audio player (e.g., Windows Media Player), and
  • a regular plain-text editor (e.g., Notepad).
2. Use WaveSurfer to align the words.
  • “Load text labels” function
  • Check out:
    • Spectrogram settings
    • Customizable shortcuts
"The Games Corpus" - Agustín Gravano - Columbia University 22
Orthographic Tier
• Transcription guidelines
  • capital letters
  • abbreviations
  • disfluencies
  • mmhm, uhhuh, gotcha, etc.
• Alignment guidelines
  • boundaries
• http://www.cs.columbia.edu/~agus/games (username/password = speech/lions)
"The Games Corpus" - Agustín Gravano - Columbia University 23
Too many cooks…
• Concurrency problem
• File locking webpage
  • Annotators lock a file before working on it, and release it when done.
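The lock/release protocol can be illustrated with a small in-memory registry. This is only a sketch of the idea: the slides do not say how the locking webpage was built, and the class, method and annotator names are hypothetical.

```python
class FileLockRegistry:
    """Tracks which annotator currently holds each file."""

    def __init__(self):
        self._owners = {}  # file name -> annotator holding the lock

    def lock(self, filename, annotator):
        """Grant the lock if the file is free or already held by annotator."""
        owner = self._owners.setdefault(filename, annotator)
        return owner == annotator

    def release(self, filename, annotator):
        """Release the lock; only the current owner may do so."""
        if self._owners.get(filename) != annotator:
            return False
        del self._owners[filename]
        return True

    def owner(self, filename):
        """Return the annotator holding the file, or None if it is free."""
        return self._owners.get(filename)


registry = FileLockRegistry()
print(registry.lock("s08.cards.3.B.words", "annotator_1"))     # granted
print(registry.lock("s08.cards.3.B.words", "annotator_2"))     # refused
print(registry.release("s08.cards.3.B.words", "annotator_1"))  # released
```

A real webpage would also need persistent storage and atomic updates so that two simultaneous requests cannot both obtain the same lock.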
"The Games Corpus" - Agustín Gravano - Columbia University 24
Annotation: Cue Words
• okay, mmhm, uhhuh, right, etc.
• Acknowledgment, Backchannel, Segment Beginning, Segment End, etc.
• Developed an ad-hoc application in Java.
  • Bad idea!!! Development took too long.
• Instead, use Praat (or other general-purpose tool).
  • For simple, specific tasks, Praat is not difficult to learn.
  • Create a file with empty points at the middle point of the words that need to be labeled.
  • Annotators only label those words, safely ignoring the rest.
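Generating the empty points could look roughly like this, assuming the orthographic tier is available as (start, end, word) intervals. The cue-word set below holds only the examples named on the slide, and the function name is hypothetical. Writing the result out as a Praat file is left aside.

```python
# Examples from the slide; the full cue-word list is not given there.
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}


def empty_cue_points(words, cue_words=CUE_WORDS):
    """Return (time, "") pairs at the midpoint of every cue word.

    words: list of (start, end, word) intervals, times in seconds.
    The empty string is the label the annotator will fill in.
    """
    points = []
    for start, end, word in words:
        if word.lower() in cue_words:
            midpoint = round((start + end) / 2, 3)
            points.append((midpoint, ""))
    return points


# Made-up word tier for illustration.
words = [(0.0, 0.4, "okay"), (0.4, 0.9, "the"), (0.9, 1.5, "Right")]
points = empty_cue_points(words)
for time, label in points:
    print(time, label)
```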
"The Games Corpus" - Agustín Gravano - Columbia University 25
Other Annotations
• Turn switches
  • Smooth switches, interruptions, backchannels, etc.
  • The labeler received a Praat file with empty turns.
• Prosody
  • ToBI Labeling Conventions: Tones and Break Indices.
• Questions
  • Identification, form and function.
"The Games Corpus" - Agustín Gravano - Columbia University 26
Guidelines for Guidelines
• Web based (password protected)
• Highlight recent changes
• Avoid long lists: categorize, trees.
"The Games Corpus" - Agustín Gravano - Columbia University 27
Files
games/data/session_NN/sNN.GAME.P.Y.ext
• NN = 01..12
• GAME = {cards, objects}
• P = 0..3 if GAME=cards, 0..1 if GAME=objects
• Y = {A, B}
• ext = {wav, words, tones, breaks, misc, turns, …}
"The Games Corpus" - Agustín Gravano - Columbia University 28
Files
Examples:
games/data/session_08/
  s08.cards.3.B.wav
  s08.cards.3.B.words
  s08.cards.3.B.misc
  …
  s08.objects.1.A.wav
  s08.objects.1.A.words
  s08.objects.1.A.misc
  …
games/data/session_11/…
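A naming scheme like sNN.GAME.P.Y.ext is easy to validate mechanically. The sketch below encodes only the ranges listed on the slide. The group names `p` and `y` simply mirror the slide's placeholders, since their meaning is not spelled out, and the function name is hypothetical.

```python
import re

# sNN.GAME.P.Y.ext, following the pattern given on the slide.
FILENAME_RE = re.compile(
    r"^s(?P<session>\d{2})\.(?P<game>cards|objects)"
    r"\.(?P<p>\d)\.(?P<y>[AB])\.(?P<ext>\w+)$"
)


def parse_corpus_filename(name):
    """Return the fields of a corpus file name, or None if it is invalid."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    session, p = int(fields["session"]), int(fields["p"])
    max_p = 3 if fields["game"] == "cards" else 1  # P = 0..3 or 0..1
    if not 1 <= session <= 12 or p > max_p:
        return None
    fields["session"], fields["p"] = session, p
    return fields


print(parse_corpus_filename("s08.cards.3.B.wav"))
print(parse_corpus_filename("s08.objects.3.A.words"))  # P out of range
```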
"The Games Corpus" - Agustín Gravano - Columbia University 29
Files Format
• All files (except *.wav) are saved as plain text, in the WaveSurfer format:
  • Start End Value (for interval tiers)
  • Time Value (for point tiers)
• Advantages
  • Human-readable.
  • Very easy to process.
• Problems
  • Consistency
  • Rounding
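To illustrate the "very easy to process" point, here is a minimal reader for both line layouts. It assumes whitespace-separated fields with times in seconds, and that a label may be empty or contain spaces. The sample lines are made up and the function names are hypothetical.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    intervals = []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        value = parts[2].strip() if len(parts) == 3 else ""  # empty label
        intervals.append((float(parts[0]), float(parts[1]), value))
    return intervals


def parse_point_tier(text):
    """Parse 'Time Value' lines into (time, value) tuples."""
    points = []
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        value = parts[1].strip() if len(parts) == 2 else ""
        points.append((float(parts[0]), value))
    return points


# Made-up sample content for a .words and a .tones file.
words = parse_interval_tier("0.000 0.412 okay\n0.412 0.987 the lion\n")
tones = parse_point_tier("0.206 H*\n0.950 L-L%\n")
print(words)
print(tones)
```

Nothing in the format itself forces an interval's end to equal the next interval's start, which is where the consistency and rounding problems can creep in.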
"The Games Corpus" - Agustín Gravano - Columbia University 30
Files Format
• Other formats:
  • XML
    • General-purpose mark-up language.
    • <TAG attribute=“value”> … </TAG>
    • Solves problems like consistency and rounding.
    • Not human-readable, harder to process.
  • Praat
    • Not human-readable, hard to process.
    • Also has the consistency problem.
"The Games Corpus" - Agustín Gravano - Columbia University 31
Scripts
• So far, we have needed dozens of Perl scripts.
• Examples:
  • Convert between Praat and WaveSurfer formats.
  • Create a Praat file with empty cue-word (CW) labels, turns, etc.
  • Find typos, missing labels, and other errors.
  • Unify notation (e.g., “mm-hmm” → “mmhm”).
  • Check consistency of files.
  • …
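Two of these script types are sketched below in Python rather than Perl. The original scripts are not shown in the slides, so the mapping table, the checks performed and all names here are assumptions. Only the “mm-hmm” to “mmhm” example comes from the slide.

```python
# The slide's example; other entries would be project-specific.
CANONICAL = {"mm-hmm": "mmhm"}


def unify_notation(word, canonical=CANONICAL):
    """Map a known spelling variant to its canonical form."""
    return canonical.get(word.lower(), word)


def find_inconsistencies(intervals, tolerance=1e-6):
    """Report intervals that are empty, reversed, or overlap the previous one.

    intervals: list of (start, end, value) in file order.
    Returns a list of (index, description) pairs.
    """
    problems = []
    previous_end = None
    for index, (start, end, _value) in enumerate(intervals):
        if end <= start:
            problems.append((index, "non-positive duration"))
        if previous_end is not None and start < previous_end - tolerance:
            problems.append((index, "overlaps previous interval"))
        previous_end = end
    return problems


# Made-up tier with two deliberate errors.
tier = [(0.0, 0.5, "okay"), (0.4, 0.9, "mm-hmm"), (0.9, 0.9, "right")]
print([unify_notation(word) for _, _, word in tier])
print(find_inconsistencies(tier))
```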
"The Games Corpus" - Agustín Gravano - Columbia University 32
Back-up!
• Back up wav files only once (too heavy), in different places (DVD, 3+ computers).
• Back up everything else (plain text: light) periodically and automatically.
  • Configure “cron” to make a backup copy every 8 hours.
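A periodic text-only backup might be sketched like this, assuming the light files are simply everything that is not a .wav. The directory names, archive name and script name are hypothetical, and the backup directory should live outside the corpus directory so that archives do not end up inside later archives.

```python
import tarfile
import time
from pathlib import Path

# A crontab entry along these lines would run the script every 8 hours
# (script path is hypothetical):
#   0 */8 * * * python3 /path/to/backup_games.py


def backup_text_files(corpus_dir, backup_dir):
    """Archive every non-wav file under corpus_dir into a timestamped .tar.gz.

    Returns the path of the archive that was written.
    """
    corpus_dir, backup_dir = Path(corpus_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = backup_dir / f"games-{stamp}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(corpus_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() != ".wav":
                # Store paths relative to the corpus root.
                archive.add(path, arcname=str(path.relative_to(corpus_dir)))
    return archive_path
```

Calling `backup_text_files("games", "/backups/games")` would write one archive per run. Old archives are never deleted here, so a real setup would also want some pruning policy.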
"The Games Corpus" - Agustín Gravano - Columbia University 33
Timeline
• Orthographic tier first!
[Figure: annotation stages laid out along a time axis: design + implementation, orthographic tier, cue words, prosody (ToBI), turn switches]