The Aboriginal Child Language Acquisition Project (ACLA): … · 2015-07-28 · fieldwork tools...

The Aboriginal Child Language Acquisition Project (ACLA): fieldwork, media annotation and cataloguing Patrick McConvell (AIATSIS) and Jane Simpson (University of Sydney) EthnoER Workshop 15 February 2006

The Aboriginal Child Language Acquisition Project (ACLA):

fieldwork, media annotation and cataloguing

Patrick McConvell (AIATSIS) and Jane Simpson (University of Sydney)

EthnoER Workshop 15 February 2006

Communities and languages

http://www.linguistics.unimelb.edu.au/research/projects/ACLA/

ACLA data

• Three communities with one fieldworker and one indigenous researcher in each community
• 6-8 pre-school focus children in each community
• Two six-week visits per community per year
• Video data collected over 3 years
  – four or five sessions with each focus child
  – includes data from interlocutors at a range of ages
  – includes structured, semi-structured and naturalistic data

Complex objects

• Session
  – Information about the session [people, relationships, recorder, activities…]
  – Transcripts
  – Field notes

• Associated stuff
  – Recorded on video and audio over part of, one or more tapes

Recording and analysing field data

• Recording
• Naming conventions
• Transcribing
• Cataloguing
• Annotating and coding
• Digitising and archiving
• Searching and counting

SCHISM: Essential criteria for fieldwork tools generally

1. Simple to use
2. Cheap
3. Hard to break
4. Installable by user
5. Support network of users/maintainers who reply promptly to questions
6. Mac/Windows/UNIX compatibility

And when do we need it? NOW

Essential for a recording project: NAME CONVENTIONS

• 1. consistent naming conventions for sessions, media and transcripts

• 2. consistent naming conventions which show the links between parts of a set of data (the session, medium, the transcript, the digital copies of the medium, the backups)

• 3. consistent naming conventions for digital versions of data sets which allow efficient backup (reflecting directory structure)

NAMING CONVENTIONS

• (i) contentful (human readable)

• (ii) machine readable (i.e. must have delimiters, and cannot have weird characters or uninterpretable blanks)

• (iii) reflect the directory structure on the central repository

• (iv) in terms of length, a bearable pain threshold for the human enterers

• (v) identical on media, digital files, in the catalogue and on central repository

ACLA NAMING CONVENTIONS

FM-05-030_A.cha
• FM- researcher [directory]
• -05- year [subdirectory]
• -030- tape no. [subdirectory]
• _A. session [subdirectory]
• .wav/.cha/.mov type [bottom level]

PROBLEM: invisibility of sessions covering more than one tape
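
A minimal sketch of how a name such as FM-05-030_A.cha could be split into its parts and mapped onto the directory hierarchy described above. The regular expression, the function names and the exact directory layout are assumptions made for illustration; they are not ACLA's actual scripts.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: researcher initials, 2-digit year, 3-digit tape number,
# session letter, and one of the three file types named on the slide.
NAME_RE = re.compile(
    r"(?P<researcher>[A-Z]+)-(?P<year>\d{2})-(?P<tape>\d{3})"
    r"_(?P<session>[A-Z])\.(?P<ext>wav|cha|mov)"
)


def parse_name(filename):
    """Split a file name into its labelled parts, or raise ValueError."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"name does not follow the convention: {filename!r}")
    return match.groupdict()


def repository_path(filename):
    """Build the researcher/year/tape/session path that the name implies."""
    parts = parse_name(filename)
    return PurePosixPath(
        parts["researcher"], parts["year"], parts["tape"], parts["session"], filename
    )


print(repository_path("FM-05-030_A.cha"))
```

Because the name alone determines the path, a backup script can place or verify files without consulting the catalogue. A session that spans two tapes still has no single name under this scheme, which is the problem noted above.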

Essential tool for a recording project: CATALOGUE

• 1. catalogue of sets of data

• 2. method for inheriting identity of metadata on parts of a set of data [e.g. header on transcript should be a subset of the catalogue entry for that set of data]

• 3. backup of sets of data together with catalogue
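
The "subset" requirement in point 2 can be checked mechanically. The sketch below assumes that both the transcript header and the catalogue entry are available as simple field-to-value mappings; the field names and values are hypothetical, chosen only to illustrate the check.

```python
_MISSING = object()  # sentinel so that a catalogued value of None is still "present"


def header_mismatches(header, catalogue_entry):
    """Return the header fields that are absent from, or differ from, the catalogue entry."""
    problems = {}
    for field, value in header.items():
        catalogued = catalogue_entry.get(field, _MISSING)
        if catalogued is _MISSING:
            problems[field] = "missing from catalogue"
        elif catalogued != value:
            problems[field] = f"header has {value!r}, catalogue has {catalogued!r}"
    return problems


# Hypothetical data for illustration only.
catalogue_entry = {"Date": "10-AUG-2004", "Tape Location": "SD044.A.DV", "Recorder": "SD"}
header = {"Date": "10-AUG-2004", "Tape Location": "SD044.B.DV"}

print(header_mismatches(header, catalogue_entry))
```

An empty result means the header is a true subset of the catalogue entry; anything else is a candidate for correction in one place or the other.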

Essential tool for a recording project: TRANSCRIPTION TOOL

• 1. Slowing sound down

• 2. Viewing wave-form while transcribing

• 3. Transcript metadata linked to catalogue

• 4. Time-coding linking transcript and media - linked for transcription and playback

%snd:"FM007.A"_130572_136895

• 5. Ability to configure conversation transcript like a drama script (new line for each utterance)

• 6. Provision for noting overlapping speech

• 7. Mixture of fonts, including phonetic fonts

Essential tool for analysis of recording: ANNOTATION TOOL

• 1. Consistency checking of codes
• 2. Automated coding [e.g. Shoebox, MOR interlinear glossing]

*FBP: jarrei
%mor: adv|=that+way@32:darrei .
*FCE: ah gon ?
%mor: fill|=ah@32:ah v:intran|=go@32:gu ?

• 3. Alignment of tiers
• 4. Searching on parallel tiers
• 5. Mixture of fonts including phonetic fonts

Essential tool for analysis of data: SEARCH and CALCULATING TOOL

• 1. Searching for codes and combinations of codes including in the context of parallel tiers

• 2. Counting the number of instances of codes or combinations of codes

• 3. Allowing multiple files to be searched and counted simultaneously.

• 4. Outputting the results in a form amenable to further calculation e.g. Excel
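
As a rough illustration of requirements 2 to 4, the sketch below counts matches of a code pattern on the %mor lines of several transcripts and writes the counts as CSV, which Excel can open. The function names, the in-memory `transcripts` mapping and the sample lines are assumptions for illustration; this is not how CLAN itself is implemented.

```python
import csv
import io
import re
from collections import Counter


def count_codes(transcripts, code_pattern):
    """Count regex matches on %mor lines; transcripts maps file name -> file text."""
    regex = re.compile(code_pattern)
    rows = []
    for name, text in sorted(transcripts.items()):
        counts = Counter()
        for line in text.splitlines():
            if line.startswith("%mor:"):
                counts.update(m.group(0) for m in regex.finditer(line))
        rows.extend((name, code, n) for code, n in sorted(counts.items()))
    return rows


def to_csv(rows):
    """Render (file, code, count) rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "code", "count"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical two-file corpus, using code shapes like those in the ACLA examples.
transcripts = {
    "a.cha": "*SKT: im yu\n%mor: pro|@32:3SG:im pro|@32:2SG:yu\n",
    "b.cha": "*SET: giv\n%mor: v:tran|=give@32:giv\n",
}
print(to_csv(count_codes(transcripts, r"pro\|")))
```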

Essential tool for collaborative work

CENTRAL REPOSITORY
• Longevity
• Access for all participants
• Catalogue
• Easy upload and download

We thank APAC and Stu Hungerford for providing a central repository for ACLA.

How does ACLA stack up?

• WANTED: someone to clean up after us, a data curator, someone who cares

ACLA and NAME CONVENTIONS

• We did try to get consistent names
  Researcher name - tape ID - session number - format
• BUT we only understood the importance of having hierarchically structured names with machine-usable delimiters for automating backup, after the researchers had already labelled
  – their tapes
  – their transcripts
  – their digital files

• AND we failed to get across the importance of consistent naming conventions which show the links between parts of a set of data.

• RESULT: ESS needed…

ACLA and CATALOGUE

• We worked with David Penton, Baden Hughes and Steven Bird on an open source catalogue
• Details: http://www.cs.mu.oz.au/research/lt/projects/acla-db/

• Reference:
• Hughes, Baden, Penton, David, Bird, Steven, Bow, Catherine, Wigglesworth, Gillian, McConvell, Patrick and Simpson, Jane. 2004. Management of metadata in linguistic fieldwork: experience from the ACLA project. 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal. http://eprints.unimelb.edu.au/archive/00001406/

ACLA and CATALOGUE
Good things

• Helped us clarify the links between kinds of information and meta-data in the catalogue

Details: http://www.cs.mu.oz.au/research/lt/projects/acla-db/database.png

• Researchers can upload and overwrite old versions of entries onto the central repository, after working off-line on the database. This was done by making each researcher’s machine into a little web-server which can talk directly to the central database.

• In principle the central catalogue could be on the central repository at ANU’s APAC mass storage.

ACLA and CATALOGUE
BUT…

• Localised database problem: The PHP software requires access to various system files, and this access has to be specified. BUT every time the system software is updated, the addresses change, and the user has to use SUDO in Terminal to re-establish the links. Too dangerous.

• New machines: A user requires professional help to re-install the database when they switch to another machine.

• Non-customisable searches: Users cannot define the combinations of fields to search on.

ACLA and CATALOGUE
Where to?

• Reluctantly: At end of current project

1. Export current catalogue as XML
2. Install Microsoft Access on users’ machines
3. Import catalogue into Microsoft Access
4. Somehow put web version of MS Access catalogue onto APAC

Looking for alternative ideas….

ACLA and Central Repository
Currently:

Media, transcripts stored at APAC Mass storage, ANU
1. Uploaded and downloaded by FTP
2. Nearline storage
3. Not a long-term archive
4. We haven’t put a catalogue on the repository
5. We need to clean up our file names and directory structure

Wish-list….
1. Need catalogue on repository linked to media
2. Streaming video and audio (e.g. localised version of Talkbank Viewer)
3. Working out a way for coding and determining access restrictions on viewing and using data
4. Select and download partial complex objects, e.g. extract of transcript with associated extract of media file

CLAN

• 1. CLAN is a tool in CHILDES, the Child Language Data Exchange System.
• 2. The idea is for researchers to mark up their data in similar ways so as to share it. This is a scheme called CHAT.

• 3. CLAN provides a means for transcribing, annotating, analysing, searching and quantifying data formatted in the CHAT system

• 4. Anyone can access the public marked-up data and streaming media stored on the central CHILDES database using CLAN.

• 5. CLAN is now rewritten to reflect an XML schema.

CLAN as a GENERAL TOOL

http://childes.psy.cmu.edu/

1. Simple to use: SOME BITS, SOME BITS NOT

2. Cheap: Free

3. Hard to break: Crashes occasionally but not unbearably

4. Installable: YES

5. Support network: WONDERFUL BUT NOT IN OZ

6. Mac/Windows: YES, but some functions are only available on Windows

CLAN as Transcription tool

• Mixture of fonts, including phonetic: YES, UNICODE UTF8

• Provision for overlaps: YES but clunky

• Ability to configure conversation transcript like a drama script (new line for each utterance): YES

• Time-coding linking transcript and media - linked for transcription and playback: YES - but time-consuming in multi-party fast speech. AND CLAN allows time-coding, but doesn’t require it.

• Metadata linked to catalogue: Metadata required, not linked

• Viewing wave-form while transcribing: The manual suggests you can’t do this while viewing video, but after the workshop Pat McConvell discovered you CAN have ‘sonic mode’ displaying the waveform together with video mode.

• Slowing sound and video down: NO

CLAN Time-coding example

• *SKT: im _kayi milk iya na . %mov:"SDETextract31.1.06"_3383_5272

• *SET: ye, yu giv _it im xx. %mov:"SDETextract31.1.06"_5272_7609
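
The time-codes above have a regular shape: a tier name (%mov or %snd), a quoted media name, then a start and an end offset. A minimal parsing sketch follows; the regular expression and function name are illustrative, and the offsets are assumed to be milliseconds from the start of the media file.

```python
import re

# Matches e.g. %mov:"SDETextract31.1.06"_3383_5272 or %snd:"FM007.A"_130572_136895
TIMECODE_RE = re.compile(
    r'%(?P<tier>snd|mov):"(?P<media>[^"]+)"_(?P<start>\d+)_(?P<end>\d+)'
)


def parse_timecode(text):
    """Return the first time-code found in text as a dict, or None if there is none."""
    match = TIMECODE_RE.search(text)
    if match is None:
        return None
    start, end = int(match.group("start")), int(match.group("end"))
    return {
        "tier": match.group("tier"),
        "media": match.group("media"),
        "start_ms": start,          # assumed to be milliseconds
        "end_ms": end,
        "duration_ms": end - start,
    }


line = '*SKT: im _kayi milk iya na . %mov:"SDETextract31.1.06"_3383_5272'
print(parse_timecode(line))
```

With start and end offsets in hand, an extract of the transcript can be paired with the matching extract of the media file.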

CLAN metadata given as HEADERS

• @Begin
• @Participants: SET Evonne Thompson, SLT Travis Thompson, SKT Khelsie Thompson, SMM Mikey Nappa, SD Researcher
• @Birth of SKT: 16-FEB-2002
• @Age of SKT: 2:6
• @Tape Location: SD044.A.DV
• @Date: 10-AUG-2004
• @Activities: Medical kit, baby doll.
• @Comment: All of SD044A.cha
• Focus Child SKT 2:6
• Focus interactant SET (20-34)
• *SKT: im _kayi milk iya na .

CLAN as ANNOTATION TOOL

• Searching on parallel tiers: YES, but cross-tier searching is restricted. Allows filters for searching tiers.

• Alignment of tiers: YES. Horizontal display of successive utterances and their dependent tiers. Alignment of speaker tier as main tier and dependent tiers (but no dependencies between dependent tiers)

• Automated coding: YES, post-processing with ‘MOR’ command comparing and inserting from lexicon file

• Consistency checking of codes: YES, post-processing with ‘Check’ command comparing to file of codes

MOR Tier in CLAN: example

*SKT: im _kayi milk iya na .
%mor: pro|@32:3SG:im case:poss|@4:_kari n:inanimate=milk@32:milk adv|=here@32:iya dis:interj|=now+focus@32:na.

*SET: ye, yu giv _it im xx.
%mor: dis:interj|=yes@32:ye pro|@32:2SG:yu v:tran|=give@32:giv suf|@32:TRANS:_im pro|@32:3SG:im ?| xx

MOR Tier lexicon

• Works by post-processing a transcribed file with reference to a lexicon with entries that can be polysemous.

a {[scat n:kin]} "@1:whatever"
b {[scat n]}
c {[scat n]}
c {[scat n:kin]}
d {[scat v:aux]}
_a {[scat suf]} "@2:whatever"

• allows several lexicons to be open at once - essential for multilingual data
• ACLA partly automated MOR coding by importing an existing Shoebox Kriol lexicon and marking it up

CLAN as SEARCH and COUNTING TOOL

• CLAN was designed for analysis of conversational data - initially children, later Conversation Analysis and Bilingual conversation

• Counting the number of instances of codes or combinations of codes was important to child language studies; generally field linguists were interested in searching for examples but not in counting instances. Now linguists in general are interested in frequency of occurrence.

• CLAN has 51 Analysis commands which can be run over large corpora. These are commonly needed prefabricated searches and counts.

• CLAN allows multiple files to be searched and counted simultaneously.

• BUT, metadata can’t be easily used as a filter in searching - e.g. age of participants

CLAN as COUNTING TOOL: some of the commands

• COMBO: Searches for complex string patterns [e.g. across tiers].

• KWAL: Searches for word patterns and prints the line [concordance].

• FREQ: Computes the frequencies of the words in a file or files.

• STATFREQ: Formats the output of FREQ for statistical analysis

• GEM: Finds areas of text that were marked with GEM markers (user defined note).

• MLU: Computes the mean length of utterance.
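
To make concrete what FREQ and MLU compute, here is a rough Python analogue that works on main-tier lines only. It is a sketch under stated assumptions, not CLAN's implementation: it counts whitespace-separated words rather than morphemes, strips basic punctuation, and treats "xx"/"xxx" as unintelligible material to be excluded, so its figures may differ from what CLAN reports.

```python
from collections import Counter

UNINTELLIGIBLE = {"xx", "xxx"}  # assumption: excluded from all counts


def utterances(text):
    """Yield (speaker, words) for each main-tier line such as '*SKT: milk .'"""
    for line in text.splitlines():
        if not line.startswith("*") or ":" not in line:
            continue
        speaker, _, rest = line[1:].partition(":")
        rest = rest.split("%", 1)[0]  # drop a trailing %mov/%snd time-code, if any
        words = [w.strip(".,?!") for w in rest.split()]
        yield speaker.strip(), [w for w in words if w and w not in UNINTELLIGIBLE]


def word_frequencies(text):
    """Word counts over all speakers (compare FREQ)."""
    counts = Counter()
    for _, words in utterances(text):
        counts.update(words)
    return counts


def mean_length_of_utterance(text, speaker):
    """Mean number of words per utterance for one speaker (compare MLU)."""
    lengths = [len(words) for who, words in utterances(text) if who == speaker]
    return sum(lengths) / len(lengths) if lengths else 0.0


sample = (
    '*SKT: im _kayi milk iya na . %mov:"SDETextract31.1.06"_3383_5272\n'
    "*SET: ye, yu giv _it im xx.\n"
    "*SKT: milk .\n"
)
print(word_frequencies(sample).most_common(2))
print(mean_length_of_utterance(sample, "SKT"))
```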

SCHISM ?

1. Can we get a cataloguing system?

2. Can we get a hygienic way of transmitting metadata between transcript files (e.g. CLAN headers) and the catalogue?

3. Can we get an interchange system between the transcription tools that preserves data?

4. Does it make sense to build on existing tools, rather than build new ones - e.g. SHOEBOX interlinearisation is better than CLAN interlinearisation - is there any way of bolting a SHOEBOX interlineariser onto CLAN?

5. Can we get access to, and extract partial complex objects from, a central repository of transcripts and streaming video/audio linked to the catalogue of our media?