The Aboriginal Child Language Acquisition Project (ACLA): … · 2015-07-28 · fieldwork tools...
The Aboriginal Child Language Acquisition Project (ACLA):
fieldwork, media annotation and cataloguing
Patrick McConvell (AIATSIS) and
Jane Simpson (University of Sydney)
EthnoER Workshop, 15 February 2006
Communities and languages
http://www.linguistics.unimelb.edu.au/research/projects/ACLA/
ACLA data
• Three communities, with one fieldworker and one indigenous researcher in each community
• 6-8 pre-school focus children in each community
• Two six-week visits per community per year
• Video data collected over 3 years
• Four or five sessions with each focus child
• Includes data from interlocutors at a range of ages
• Includes structured, semi-structured and naturalistic data
Complex objects
• Session
– Information about the session [people, relationships, recorder, activities…]
– Transcripts
– Field notes
• Associated stuff
– Recorded on video and audio over part of, one or more tapes
Recording and analysing field data
• Recording
• Naming conventions
• Transcribing
• Cataloguing
• Annotating and coding
• Digitising and archiving
• Searching and counting
SCHISM: Essential criteria for fieldwork tools generally
1. Simple to use
2. Cheap
3. Hard to break
4. Installable by user
5. Support network of users/maintainers who reply promptly to questions
6. Mac/Windows/UNIX compatibility
And when do we need it? NOW
Essential for a recording project: NAME CONVENTIONS
• 1. consistent naming conventions for sessions, media and transcripts
• 2. consistent naming conventions which show the links between parts of a set of data (the session, medium, the transcript, the digital copies of the medium, the backups)
• 3. consistent naming conventions for digital versions of data sets which allow efficient backup (reflecting directory structure)
NAMING CONVENTIONS
• (i) contentful (human readable)
• (ii) machine readable (i.e. must have delimiters, and cannot have weird characters or uninterpretable blanks)
• (iii) reflect the directory structure on the central repository
• (iv) in terms of length, a bearable pain threshold for the human enterers
• (v) identical on media, digital files, in the catalogue and on central repository
ACLA NAMING CONVENTIONS
FM-05-030_A.cha
• FM- researcher [directory]
• -05- year [subdirectory]
• -030- tape no. [subdirectory]
• _A. session [subdirectory]
• .wav/.cha/.mov type [bottom level]
PROBLEM: invisibility of sessions covering more than one tape
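A minimal sketch of how a name such as FM-05-030_A.cha could be checked and mapped onto the directory levels listed above. The field widths (two letters, two digits, three digits, one letter) are assumptions read off the single example, and the function names are hypothetical, not part of any ACLA tool.

```python
import re
from pathlib import PurePosixPath

# Pattern for names like "FM-05-030_A.cha": researcher, year, tape number,
# session letter, file type. Field widths are assumed from the one example.
NAME_RE = re.compile(
    r"^(?P<researcher>[A-Z]{2})-(?P<year>\d{2})-(?P<tape>\d{3})"
    r"_(?P<session>[A-Z])\.(?P<ext>wav|cha|mov)$"
)

def parse_name(filename):
    """Split a file name into its parts, or return None if it does not conform."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None

def repository_path(filename):
    """Build the path implied by the name: researcher/year/tape/session/file."""
    parts = parse_name(filename)
    if parts is None:
        raise ValueError(f"non-conforming name: {filename!r}")
    return PurePosixPath(
        parts["researcher"], parts["year"], parts["tape"], parts["session"], filename
    )

print(repository_path("FM-05-030_A.cha"))  # FM/05/030/A/FM-05-030_A.cha
print(parse_name("FM007.A"))               # None: no delimiters, so not parseable
```

Because the tape number sits above the session in this hierarchy, a session that runs over two tapes has no single home, which is the problem noted above.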
Essential tool for a recording project: CATALOGUE
• 1. catalogue of sets of data
• 2. method for inheriting identity of metadata on parts of a set of data [e.g. header on transcript should be a subset of the catalogue entry for that set of data]
• 3. backup of sets of data together with catalogue
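One way to read point 2 is as a subset check: every field in a transcript header should appear, with the same value, in the catalogue entry for that data set. The sketch below assumes both are available as simple dictionaries; the field names and values are hypothetical and chosen only for illustration.

```python
def header_mismatches(header, catalogue_entry):
    """Return header fields that are missing from, or differ from, the catalogue entry.

    Each problem is reported as field -> (header value, catalogue value or None).
    """
    problems = {}
    for field, value in header.items():
        if field not in catalogue_entry:
            problems[field] = (value, None)
        elif catalogue_entry[field] != value:
            problems[field] = (value, catalogue_entry[field])
    return problems

# Hypothetical records; field names and values are illustrative only.
catalogue_entry = {
    "session": "FM-05-030_A",
    "date": "2005-03-01",
    "recorder": "FM",
    "activities": "picture book",
}
header = {"session": "FM-05-030_A", "date": "2005-03-02"}

print(header_mismatches(header, catalogue_entry))
# {'date': ('2005-03-02', '2005-03-01')}
```

An empty result means the header is a subset of the catalogue entry; anything else flags a header that has drifted from the catalogue.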
Essential tool for a recording project: TRANSCRIPTION TOOL
• 1. Slowing sound down
• 2. Viewing wave-form while transcribing
• 3. Transcript metadata linked to catalogue
• 4. Time-coding linking transcript and media - linked for transcription and playback
%snd:"FM007.A"_130572_136895
• 5. Ability to configure conversation transcript like a drama script (new line for each utterance)
• 6. Provision for noting overlapping speech
• 7. Mixture of fonts, including phonetic fonts
Essential tool for analysis of recording: ANNOTATION TOOL
• 1. Consistency checking of codes
• 2. Automated coding [e.g. Shoebox, MOR interlinear glossing]
*FBP: jarrei
%mor: adv|=that+way@32:darrei .
*FCE: ah gon ?
%mor: fill|=ah@32:ah v:intran|=go@32:gu ?
• 3. Alignment of tiers
• 4. Searching on parallel tiers
• 5. Mixture of fonts including phonetic fonts
Essential tool for analysis of data: SEARCH and CALCULATING TOOL
• 1. Searching for codes and combinations of codes, including in the context of parallel tiers
• 2. Counting the number of instances of codes or combinations of codes
• 3. Allowing multiple files to be searched and counted simultaneously
• 4. Outputting the results in a form amenable to further calculation, e.g. Excel
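The four requirements can be illustrated with a small sketch that counts category codes on %mor tiers across several transcripts and writes the result as CSV, which Excel can open. It assumes a code is the part of a %mor token before the "|", as in the FBP/FCE example shown for the annotation tool; the file names a.cha and b.cha are hypothetical, and this is not how CLAN itself is implemented.

```python
import csv
import io
from collections import Counter

def count_mor_codes(transcripts):
    """Count category codes (the part before '|') on %mor lines, per transcript."""
    rows = []
    for name, text in transcripts.items():
        counts = Counter()
        for line in text.splitlines():
            if line.startswith("%mor:"):
                for token in line[len("%mor:"):].split():
                    if "|" in token:
                        counts[token.split("|", 1)[0]] += 1
        rows.extend((name, code, n) for code, n in sorted(counts.items()))
    return rows

def to_csv(rows):
    """Render (file, code, count) rows as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file", "code", "count"])
    writer.writerows(rows)
    return out.getvalue()

# Transcript texts held in memory for the example; file names are hypothetical.
transcripts = {
    "a.cha": "*FBP:\tjarrei\n%mor:\tadv|=that+way@32:darrei .\n",
    "b.cha": "*FCE:\tah gon ?\n%mor:\tfill|=ah@32:ah v:intran|=go@32:gu ?\n",
}
print(to_csv(count_mor_codes(transcripts)))
```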
Essential tool for collaborative work
CENTRAL REPOSITORY
• Longevity
• Access for all participants
• Catalogue
• Easy upload and download
We thank APAC and Stu Hungerford for providing a central repository for ACLA.
How does ACLA stack up?
• WANTED: someone to clean up after us, a data curator, someone who cares
ACLA and NAME CONVENTIONS
• We did try to get consistent names: Researcher name - tape ID - session number - format
• BUT we only understood the importance of having hierarchically structured names with machine-usable delimiters for automating backup after the researchers had already labelled
– their tapes
– their transcripts
– their digital files
• AND we failed to get across the importance of consistent naming conventions which show the links between parts of a set of data.
• RESULT: ESS needed…
ACLA and CATALOGUE
• We worked with David Penton, Baden Hughes and Steven Bird on an open source catalogue
• Details: http://www.cs.mu.oz.au/research/lt/projects/acla-db/
• Reference:
Hughes, Baden, Penton, David, Bird, Steven, Bow, Catherine, Wigglesworth, Gillian, McConvell, Patrick and Simpson, Jane. 2004. Management of metadata in linguistic fieldwork: experience from the ACLA project. 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal. http://eprints.unimelb.edu.au/archive/00001406/
ACLA and CATALOGUE
Good things
• Helped us clarify the links between kinds of information and meta-data in the catalogue
Details: http://www.cs.mu.oz.au/research/lt/projects/acla-db/database.png
• Researchers can upload and overwrite old versions of entries onto the central repository, after working off-line on the database. This was done by making each researcher’s machine into a little web-server which can talk directly to the central database.
• In principle the central catalogue could be on the central repository at ANU’s APAC mass storage.
ACLA and CATALOGUE
BUT…
• Localised database problem: The PHP software requires access to various system files, and this access has to be specified. BUT every time the system software is updated, the addresses change, and the user has to use sudo in Terminal to re-establish the links. Too dangerous.
• New machines: A user requires professional help to re-install the database when they switch to another machine.
• Non-customisable searches: Users cannot define the combinations of fields to search on.
ACLA and CATALOGUE
Where to?
• Reluctantly: At end of current project
1. Export current catalogue as XML
2. Install Microsoft Access on users’ machines
3. Import catalogue into Microsoft Access
4. Somehow put web version of MS Access catalogue onto APAC
Looking for alternative ideas….
ACLA and Central Repository
Currently:
Media, transcripts stored at APAC Mass storage, ANU
1. Uploaded and downloaded by FTP
2. Nearline storage
3. Not a long-term archive
4. We haven’t put a catalogue on the repository
5. We need to clean up our file names and directory structure
Wish-list….
1. Need catalogue on repository linked to media
2. Streaming video and audio (e.g. localised version of Talkbank Viewer)
3. Working out a way for coding and determining access restrictions on viewing and using data
4. Select and download partial complex objects, e.g. extract of transcript with associated extract of media file
CLAN
• 1. CLAN is a tool in CHILDES, the Child Language Data Exchange System.
• 2. The idea is for researchers to mark up their data in similar ways so as to share it. This is a scheme called CHAT.
• 3. CLAN provides a means for transcribing, annotating, analysing, searching and quantifying data formatted in the CHAT system
• 4. Anyone can access the public marked-up data and streaming media stored on the central CHILDES database using CLAN.
• 5. CLAN is now rewritten to reflect an XML schema.
CLAN as A GENERAL TOOL
http://childes.psy.cmu.edu/
1. Simple to use: SOME BITS, SOME BITS NOT
2. Cheap: Free
3. Hard to break: Crashes occasionally but not unbearably
4. Installable: YES
5. Support network: WONDERFUL BUT NOT IN OZ
6. Mac/Windows: YES, but some functions are only available on Windows
CLAN as Transcription tool
• Mixture of fonts, including phonetic: YES, UNICODE UTF8
• Provision for overlaps: YES but clunky
• Ability to configure conversation transcript like a drama script (new line for each utterance): YES
• Time-coding linking transcript and media - linked for transcription and playback: YES - but time-consuming in multi-party fast speech. AND CLAN allows time-coding, but doesn’t require it.
• Metadata linked to catalogue: Metadata required, not linked
• Viewing wave-form while transcribing: The manual suggests you can’t do this while viewing video, but after the workshop Pat McConvell discovered you CAN have ‘sonic mode’ displaying the waveform together with video mode.
• Slowing sound and video down: NO
CLAN Time-coding example
• *SKT: im _kayi milk iya na . %mov:"SDETextract31.1.06"_3383_5272
• *SET: ye, yu giv _it im xx. %mov:"SDETextract31.1.06"_5272_7609
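The media link on each utterance can be unpacked mechanically. The sketch below assumes the pattern seen in the examples (%mov or %snd, a quoted media name, then start and end offsets separated by underscores) and assumes the offsets are milliseconds; the function name is hypothetical.

```python
import re

# Matches media links such as  %mov:"SDETextract31.1.06"_3383_5272
TIMECODE_RE = re.compile(
    r'%(?P<kind>snd|mov):"(?P<media>[^"]+)"_(?P<start>\d+)_(?P<end>\d+)'
)

def parse_timecode(text):
    """Extract the media link from a transcript line, or return None if absent."""
    match = TIMECODE_RE.search(text)
    if match is None:
        return None
    start, end = int(match["start"]), int(match["end"])
    return {
        "kind": match["kind"],
        "media": match["media"],
        "start_ms": start,   # offsets assumed to be milliseconds
        "end_ms": end,
        "duration_ms": end - start,
    }

line = '*SKT: im _kayi milk iya na .%mov:"SDETextract31.1.06"_3383_5272'
print(parse_timecode(line)["duration_ms"])  # 1889
```

The same pattern also matches the audio form, e.g. %snd:"FM007.A"_130572_136895.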
CLAN metadata given as HEADERS
• @Begin
• @Participants: SET Evonne Thompson, SLT Travis Thompson, SKT Khelsie Thompson, SMM Mikey Nappa, SD Researcher
• @Birth of SKT: 16-FEB-2002
• @Age of SKT: 2:6
• @Tape Location: SD044.A.DV
• @Date: 10-AUG-2004
• @Activities: Medical kit, baby doll.
• @Comment: All of SD044A.cha
• Focus Child SKT 2:6
• Focus interactant SET (20-34)
• *SKT: im _kayi milk iya na .
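Headers of this kind are regular enough to read into a lookup table, which is a first step towards comparing them with a catalogue. The sketch below assumes one header per line in the form "@Name: value" and a comma-separated participant list of "CODE Name" items; the function names are hypothetical.

```python
def parse_headers(lines):
    """Collect '@Name: value' header lines into a dictionary keyed by name."""
    headers = {}
    for line in lines:
        if line.startswith("@") and ":" in line:
            key, value = line[1:].split(":", 1)
            headers[key.strip()] = value.strip()
    return headers

def participant_codes(headers):
    """Split the @Participants value into {speaker code: name or role}."""
    result = {}
    for item in headers.get("Participants", "").split(","):
        code, _, name = item.strip().partition(" ")
        if code:
            result[code] = name
    return result

# A few header lines in the form shown on the slide.
lines = [
    "@Begin",
    "@Participants: SMM Mikey Nappa,SD Researcher",
    "@Age of SKT: 2:6",
    "@Tape Location: SD044.A.DV",
    "@Date: 10-AUG-2004",
]
headers = parse_headers(lines)
print(headers["Tape Location"])    # SD044.A.DV
print(participant_codes(headers))  # {'SMM': 'Mikey Nappa', 'SD': 'Researcher'}
```

Splitting on the first colon only keeps values such as the age 2:6 intact.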
CLAN as ANNOTATION TOOL
• Searching on parallel tiers: YES, but cross-tier searching is restricted. Allows filters for searching tiers.
• Alignment of tiers: YES. Horizontal display of successive utterances and their dependent tiers. Alignment of speaker tier as main tier and dependent tiers (but no dependencies between dependent tiers).
• Automated coding: YES, post-processing with ‘MOR’ command comparing and inserting from lexicon file
• Consistency checking of codes: YES, post-processing with ‘Check’ command comparing to file of codes
MOR Tier in CLAN: example
*SKT: im _kayi milk iya na .
%mor: pro|@32:3SG:im case:poss|@4:_kari n:inanimate=milk@32:milk adv|=here@32:iya dis:interj|=now+focus@32:na.
*SET: ye, yu giv _it im xx.
%mor: dis:interj|=yes@32:ye pro|@32:2SG:yu v:tran|=give@32:giv suf|@32:TRANS:_im pro|@32:3SG:im ?| xx
MOR Tier lexicon
• Works by post-processing a transcribed file with reference to a lexicon with entries that can be polysemous.
a {[scat n:kin]} "@1:whatever"
b {[scat n]}
c {[scat n]}
c {[scat n:kin]}
d {[scat v:aux]}
_a {[scat suf]} "@2:whatever"
• allows several lexicons to be open at once - essential for multilingual data
• ACLA partly automated MOR coding by importing an existing Shoebox Kriol lexicon and marking it up
CLAN as SEARCH and COUNTING TOOL
• CLAN was designed for analysis of conversational data - initially children, later Conversation Analysis and Bilingual conversation
• Counting the number of instances of codes or combinations of codes was important to child language studies; generally field linguists were interested in searching for examples but not in counting instances. Now linguists in general are interested in frequency of occurrence.
• CLAN has 51 Analysis commands which can be run over large corpora. These are commonly needed prefabricated searches and counts.
• CLAN allows multiple files to be searched and counted simultaneously.
• BUT, metadata can’t be easily used as a filter in searching - e.g. age of participants
CLAN as COUNTING TOOL
some of the commands
• COMBO Searches for complex string patterns [e.g. across tiers].
• KWAL Searches for word patterns and prints the line [concordance].
• FREQ Computes the frequencies of the words in a file or files.
• STATFREQ Formats the output of FREQ for statistical analysis
• GEM Finds areas of text that were marked with GEM markers (user defined note).
• MLU Computes the mean length of utterance.
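To make the last measure concrete: mean length of utterance is the total length of a speaker's utterances divided by the number of utterances. The sketch below is a simplified word-based version, assuming one utterance per main-tier line beginning with "*CODE:"; it is not CLAN's MLU command, which has its own counting rules and options, and the function names are hypothetical.

```python
def utterance_length(main_line):
    """Count the words on a main-tier line, ignoring bare punctuation tokens."""
    _, _, text = main_line.partition(":")
    words = [token.strip(".?!,") for token in text.split()]
    return len([word for word in words if word])

def mean_length_of_utterance(lines, speaker=None):
    """Average words per utterance, for one speaker code or for all speakers."""
    lengths = []
    for line in lines:
        if not line.startswith("*"):
            continue  # skip dependent tiers and headers
        code = line[1:].partition(":")[0]
        if speaker is None or code == speaker:
            lengths.append(utterance_length(line))
    return sum(lengths) / len(lengths) if lengths else 0.0

lines = [
    "*SKT: im _kayi milk iya na .",
    "%mor: pro|@32:3SG:im",
    "*FCE: ah gon ?",
]
print(mean_length_of_utterance(lines))         # 3.5
print(mean_length_of_utterance(lines, "SKT"))  # 5.0
```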
SCHISM ?
1. Can we get a cataloguing system?
2. Can we get a hygienic way of transmitting metadata between transcript files (e.g. CLAN headers) and the catalogue?
3. Can we get an interchange system between the transcription tools that preserves data?
4. Does it make sense to build on existing tools, rather than build new ones - e.g. SHOEBOX interlinearisation is better than CLAN interlinearisation - is there any way of bolting a SHOEBOX interlineariser onto CLAN?
5. Can we get access to, and extract partial complex objects from, a central repository of transcripts and streaming video/audio linked to the catalogue of our media?
Thank you!