Data management
DocLing 2016
David Nathan
Two most valuable strategies
- design and use a filename system
- work out (‘model’) your basic units of documentation and the relationships between them
- if you get these right, they will do the “heavy lifting” of your data management strategy
- data and metadata are intertwined: points on a spectrum rather than different things
Three most important qualities
- consistency
- documentation of conventions, structures, methods
- machine readability: “computer programs can act on data in terms of its proper structures and categories”
Data management
- understand and model the data (units, relationships)
- use appropriate data structure methods, in both file contents and organisation
- use appropriate and conventional data encoding methods (e.g. Unicode)
- be explicit and consistent
- plan for flow of data, working with others, across different systems
- document steps, decisions, conventions, structures
- think ahead to archiving
Managing data in your computer
- design a well-organised system of folders so that you can always find your stuff according to what it is, not:
  - where the software decided to put it
  - what the software decided to call it
  - when/where you last used it
  - what someone else called it
File structures and names
- design folder structure as a logical hierarchy that suits your goals, content and work style
- have documentary materials within one overall directory (e.g. for backup)
- make directories for relevant categories, e.g. sessions, media types, dates
- design it so that you will always be able to find things
- you may need to restructure at different points in your project, e.g. move from date-based to session-based structures
Designing a file/folder structure
- it should relate to reality
- locations should make sense, so you (and others) will know where to look for things (where do you keep your passport; favourite cup?)
- the best location is “the place that one would naturally look to find it”
3 methods of linking or ‘bundling’ related files
tree of distinguishing folder names
one folder with distinguishing filenames
one folder with numerical filenames
… what else is needed?
On identifiers
real world objects are uniquely identified because they are physically unique - an unlabelled cassette is poorly identified
digital objects have no physical existence - they depend on identifiers that we give them
three types of identifiers:
- semantic
- keys
- relative
On identifiers
semantic, e.g.
- Nelson Mandela
- The Sound of Music
- SA_JA_Bongo_Palace_Land Dispute Trial_015_29-04-2010.wav
On identifiers
keys (disambiguators), e.g.
- 1137204 (a student number)
- 0803 211 6148 (a telephone number)
- p12893fh23.pdf (some system's reference number)
On identifiers
relative, e.g.
- 67 High Street
- the secretary
- index.html
- metadata.xls
On identifiers
your collection may have a mix of these, but it is important to be aware of their differences and limitations, for example:
- semantic identifiers: invite name clashes
- keys: a program or process might depend on the identifier to work properly
- relative identifiers: if you move them, you probably change or destroy their meaning
Digital objects and identities
- a digital object’s identity includes its location
- a file’s full identity = path + filename
- the path is a representation of the volume and the directory (folder) hierarchy
- if the full identity is unambiguous then everything can be fine, compare:
  c:\dogs\spaniels\rover.jpg
  c:\cars\british\rover.jpg
  or
  lectures\syntax\2013-02-12\notes.doc
Digital objects and identities
- but semantic identifiers are potentially ‘dangerous’, because just adding more chunks to disambiguate them will not work:
  2015\rover.jpg
  2015\white_rover.jpg
- therefore, domains that do not offer semantic uniqueness may need identifiers which are either keys or relative identifiers
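A minimal sketch of "full identity = path + filename", using the two rover.jpg paths from the slide. It only uses Python's standard pathlib module; nothing here touches the disk.

```python
from pathlib import PureWindowsPath

# The two example paths from the slide above.
dog = PureWindowsPath(r"c:\dogs\spaniels\rover.jpg")
car = PureWindowsPath(r"c:\cars\british\rover.jpg")

# The filenames alone clash ...
same_name = dog.name == car.name

# ... but the full identities (path + filename) are different.
same_identity = dog == car

print(same_name, same_identity)
print(dog.parent)  # the path part of the identity
```

The filename on its own is a relative identifier: it only becomes unambiguous once the path is included.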
And now to file names
- (having said all that) filenames are only filenames, and do not necessarily provide information
- common mistaken assumptions:
  - that a filename “dp_verbs_39.wav” means there is an entity “dp_verbs_39”
  - that files are logically linked just by sharing some part of their filenames
- these are only true if your system ensures it (and if you state it explicitly)
File naming
- filenames that are unsystematic or non-standard will cause problems, eventually
- unsystematic file naming might be (just) OK if
  - you already have many files
  - you have a working method that already does everything you need to do
  - your “system” will do everything you need to do in the future
Manage file names from the start
a new file:
- don’t just accept the default filename or location suggested by the application when you first save the file
- put it where it belongs, immediately. If necessary, create the place (directory/path) where it belongs
- name it according to your naming system!
- if you have an inventory/index of files, add an entry for the new file
Filename rules
- all filenames should have correct extensions
- each filename should have only one ".", before the extension
- use only ASCII characters (US keyboard)
- use only letters, numbers, hyphens (-) and underscores (_)
- keep filenames short, just long enough to contain the necessary identifier; don't fill them up with lots of information about the content (that is metadata!)
- (advised) use only lower case letters
How about these file names?
1. ready.audio.wav
2. ReAlLyhArDtOReAd.txt
3. éclair.jpg
4. e'clair.jpg
5. french-cake.jpeg
6. french-cake.jaypeg
7. -2011.psd
8. lexicon-master
9. ɘɫIɲʰ.eaf
10. ice cream.doc
11. Obama.TXT
12. オバマ .txt
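The filename rules can be turned into a small checker and tried on names like the ones above. This is only a sketch: the function name check_filename and the KNOWN_EXTENSIONS set are assumptions made for illustration, and "short enough" is left to human judgement.

```python
import re

# Assumption: a hypothetical list of extensions accepted in this project.
KNOWN_EXTENSIONS = {"wav", "txt", "jpg", "jpeg", "psd", "eaf", "doc", "pdf"}


def check_filename(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    root, dot, ext = name.rpartition(".")
    if not dot:
        # No "." at all, so there is no extension.
        root, ext = name, ""
        problems.append("no extension")
    elif ext.lower() not in KNOWN_EXTENSIONS:
        problems.append("unknown extension")
    if "." in root:
        problems.append("more than one '.'")
    if not name.isascii():
        problems.append("non-ASCII characters")
    elif not re.fullmatch(r"[A-Za-z0-9_-]+", root.replace(".", "")):
        problems.append("characters other than letters, numbers, '-' and '_'")
    if name != name.lower():
        problems.append("upper case letters (advisory)")
    return problems


for candidate in ["ready.audio.wav", "french-cake.jpeg", "ice cream.doc", "Obama.TXT"]:
    print(candidate, check_filename(candidate))
```

Note that -2011.psd passes these checks even though a leading hyphen can be mistaken for a command-line option; the rules as listed do not cover that case.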
Make filenames sortable
make filenames usefully sortable:
20100119lecture.doc
20100203lecture.doc

gr_transcription_1.txt
gr_transcription_12.txt
gr_transcription_5.txt
gr_transcription_9.txt

gr_transcription_001.txt
gr_transcription_005.txt
gr_transcription_009.txt
gr_transcription_012.txt
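The difference between the two gr_transcription listings can be checked with a plain string sort, which is what most file browsers do. The helper pad_number is a hypothetical name, and the width of 3 digits is an assumption taken from the example.

```python
import re

unpadded = [
    "gr_transcription_5.txt",
    "gr_transcription_12.txt",
    "gr_transcription_1.txt",
    "gr_transcription_9.txt",
]


def pad_number(name, width=3):
    """Zero-pad the run of digits that sits directly before the extension."""
    return re.sub(
        r"(\d+)(\.[^.]+)$",
        lambda m: m.group(1).zfill(width) + m.group(2),
        name,
    )


# Plain string sort: 12 lands between 1 and 5.
print(sorted(unpadded))

# After padding, string order and numeric order agree.
padded = sorted(pad_number(name) for name in unpadded)
print(padded)
```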
Associating files
you can make resources sortable together by giving them the same filename root (the part before the extension), or part of the root:
document your conventions and system if you do this
gr_reefs.wav
gr_reefs.eaf
gr_reefs.txt

paaka_photo001.jpg
paaka_photo002.jpg
paaka_txt_conv203.wav
paaka_txt_conv203.eaf
paaka_txt_lex.doc
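If files are associated by a shared filename root, a program can bundle them automatically. A minimal sketch, assuming the root is everything before the final extension (group_by_root is a hypothetical helper name):

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_root(filenames):
    """Map each filename root to the list of files that share it."""
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePath(name).stem].append(name)
    return dict(groups)


bundles = group_by_root([
    "gr_reefs.wav",
    "gr_reefs.eaf",
    "gr_reefs.txt",
    "paaka_txt_conv203.wav",
    "paaka_txt_conv203.eaf",
    "paaka_txt_lex.doc",
])
for root, members in bundles.items():
    print(root, members)
```

This only works because the convention is applied consistently, which is why the convention needs to be documented.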
Avoid metadata in filenames
avoid putting metadata into filenames. A filename is an identifier, not a data container
better to use a simple (semantic) filename or a key (i.e. meaningless) filename, and then create a metadata table to contain all the relevant information
a table can properly express all the information, contain links etc, and is extensible for further metadata
Avoid metadata in filenames
e.g. Paaka_Reefs_Dan_BH_3Oct97.wav
better: paaka_063.wav plus paaka_063.txt

paaka_063.txt:
language    topic                 speaker       location      date
Paakantyi   Reefs at Mutawintyi   Dan Herbert   Broken Hill   1997-10-03
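One way to produce such a metadata table is as tab-delimited text, which opens in any spreadsheet. This is a sketch only: the extra "file" column, the tab delimiter and the helper name to_tsv are assumptions; the values are the ones from the example above.

```python
import csv
import io

# Assumption: a "file" column is added so one table can describe many files.
FIELDS = ["file", "language", "topic", "speaker", "location", "date"]

rows = [
    {
        "file": "paaka_063.wav",
        "language": "Paakantyi",
        "topic": "Reefs at Mutawintyi",
        "speaker": "Dan Herbert",
        "location": "Broken Hill",
        "date": "1997-10-03",
    },
]


def to_tsv(records):
    """Render the records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_tsv(rows))
```

Adding a new category of metadata later means adding a column, without renaming any file.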
A filenaming system
carefully design a filename system for your data and document the system so that somebody else can understand it
one documenter’s new system:
aaa_bb_cc_yyyy-mm-dd_nnn.wav
A filenaming system
aaa_bb_cc_yyyy-mm-dd_nnn.wav
aaa = village code
bb = (main) speaker code
cc = genre/event code
yyyy-mm-dd = date (why this order?)
nnn = optional number (e.g. 001)
.wav = correct extension for file content type
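A documented system like this is machine readable: a program can check names against it and pull out the parts. The sketch below assumes the codes are lower case letters of exactly the lengths shown (3, 2, 2) and that nnn is three digits; the example name kal_mt_st_2015-06-30_001.wav and its codes are made up for illustration.

```python
import re
from datetime import date

# Assumption: codes are lower case ASCII letters; nnn is optional.
PATTERN = re.compile(
    r"(?P<village>[a-z]{3})_(?P<speaker>[a-z]{2})_(?P<genre>[a-z]{2})_"
    r"(?P<date>\d{4}-\d{2}-\d{2})(?:_(?P<number>\d{3}))?\.wav"
)


def parse_name(filename):
    """Return the parts of a conforming filename, or None if it does not conform."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        return None
    parts = match.groupdict()
    # Raises ValueError for impossible dates such as 2015-02-30.
    parts["date"] = date.fromisoformat(parts["date"])
    return parts


print(parse_name("kal_mt_st_2015-06-30_001.wav"))    # hypothetical conforming name
print(parse_name("Paaka_Reefs_Dan_BH_3Oct97.wav"))   # does not conform
```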
Documenting the filename system
describe the system
- how would you describe it?
- where would you put the description?
document the codes – this is probably part of your metadata
On changing file names
- decide if it’s possible, benefits and side effects (e.g. loss of links in ELAN files)
- design a system first
- don’t change names in situ: copy data set and gradually migrate it to your new system
- document file name changes
- if possible, automate or copy and paste filenames
- if possible, use machine processes, e.g. system filename listings, XLS formulas
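The advice above (copy rather than rename in situ, document every change, automate) can be sketched as a small script. The function name migrate, the log columns and the example mapping are assumptions for illustration; the old-to-new mapping itself would come from your own system.

```python
import csv
import shutil
from pathlib import Path


def migrate(source_dir, target_dir, mapping, log_path):
    """Copy files into target_dir under new names and log each change.

    The originals in source_dir are never modified or deleted.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w", newline="", encoding="utf-8") as log:
        writer = csv.writer(log)
        writer.writerow(["old_name", "new_name"])
        for old_name, new_name in mapping.items():
            destination = target_dir / new_name
            if destination.exists():
                # Refuse to overwrite: two old names must not map to one new name.
                raise FileExistsError(destination)
            shutil.copy2(source_dir / old_name, destination)
            writer.writerow([old_name, new_name])


# Example call (paths are placeholders):
# migrate("old_data", "new_data",
#         {"Paaka_Reefs_Dan_BH_3Oct97.wav": "paaka_063.wav"},
#         "renames.csv")
```

The log file doubles as documentation of the change and as a lookup table for repairing links, e.g. in ELAN files that still point at the old names.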
Different types of metadata
- there are many types of metadata
- different types of materials may have different metadata
- e.g. metadata for photos and videos may have technical parameters, lists of people appearing
- e.g. metadata for transcriptions may have date, version, who transcribed, notes on progress
Meta-documentation
you should keep an updated description of the methods, conventions, abbreviations you use
.. so somebody could fully understand (and use) your data and methods in your absence
Your collection catalogue
- first, define your collection/corpus/project as some coherent (logical) set of materials
- your collection catalogue/inventory/index is a type of metadata
- this should list and describe all files in your collection
- it usually contains the categories of information that are relevant for many files
Your collection catalogue
you could have one large catalogue that covers every file, or
you could have a catalogue that is subdivided according to types of files, and/or groups of resources
there is no “one size fits all” solution!
Making an “active” catalogue
- this is not necessary, but may be useful
- if you use a spreadsheet, you can embed links to actual files to make using your collection easier
  - Excel formula: =hyperlink(address, display-text)
- useful methods for getting file listings
  - “Open command window here” (Win 7: SHIFT+right-click)
  - Karen’s Directory Printer
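A file listing with ready-made hyperlink formulas can also be generated by a short script. This is a sketch under assumptions: catalogue_rows is a hypothetical helper, the links are written relative to the collection folder (so the spreadsheet is assumed to be saved in that folder), and filenames are assumed not to contain double quotes.

```python
from pathlib import Path


def catalogue_rows(root):
    """List every file under root with an Excel HYPERLINK formula for it."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            formula = f'=HYPERLINK("{relative}", "{path.name}")'
            rows.append([relative, formula])
    return rows


# Example call (the folder name is a placeholder):
# for relative, formula in catalogue_rows("my_collection"):
#     print(relative, formula, sep="\t")
```

The tab-separated output can be pasted into a spreadsheet, where the second column should behave as clickable links.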
My cells have multiple values!
- example: speakers in a recording
- speakers are probably not ‘atomic’: they have other attributes
- create a separate “speakers” sheet
- give each speaker an ID (number or initials)
- use the IDs in the original sheet, with delimiter (implements one to many)
- (better) make another sheet to associate recordings with speakers (implements many to many)
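The "another sheet" approach can be pictured as three small tables. In this sketch the recording IDs, the speaker IDs and the second speaker are placeholders invented for illustration; each Python structure stands in for one sheet.

```python
# "speakers" sheet: one row per speaker, keyed by ID.
speakers = {
    "DH": {"name": "Dan Herbert"},
    "S2": {"name": "(second speaker, placeholder)"},
}

# Linking sheet: one row per (recording, speaker) pair.
recording_speakers = [
    ("paaka_063", "DH"),
    ("paaka_063", "S2"),
    ("paaka_064", "DH"),
]


def speakers_in(recording_id):
    """Names of all speakers in one recording."""
    return [speakers[s]["name"] for r, s in recording_speakers if r == recording_id]


def recordings_with(speaker_id):
    """IDs of all recordings that one speaker appears in."""
    return [r for r, s in recording_speakers if s == speaker_id]


print(speakers_in("paaka_063"))
print(recordings_with("DH"))
```

Because every cell holds a single value, both questions (who is in this recording, which recordings is this person in) are answered without splitting delimited strings.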
Data/file versions
- whether you need to distinguish or keep versions depends on your purposes
- by suffixing filename, e.g.
  fugu1.txt
  fugu2.txt
  or
  fugu_1.txt
  fugu_2.txt
- which of the above methods is better?
Data/file versions
fugu_14022013.txt
fugu_20130214.txt
14022013_fugu.txt
20130214_fugu.txt
which of the above would be best?
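One way to explore the question is to sort a few dated names and see which date order keeps them chronological. The three dates below are arbitrary examples chosen for illustration.

```python
from datetime import date

# Three versions, listed in chronological order.
dates = [date(2013, 2, 14), date(2013, 3, 1), date(2014, 1, 20)]

ddmmyyyy = [f"fugu_{d:%d%m%Y}.txt" for d in dates]
yyyymmdd = [f"fugu_{d:%Y%m%d}.txt" for d in dates]

# Day-first names: alphabetical order is not chronological order.
print(sorted(ddmmyyyy) == ddmmyyyy)

# Year-first names: alphabetical order and chronological order agree.
print(sorted(yyyymmdd) == yyyymmdd)
```

Putting the date before or after "fugu" is a separate choice: it decides whether a folder listing groups files by date or by name.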
Managing data/file versions
- do you need to keep every version?
- it may be OK to keep “original” plus current
- if information is regularly updated, corrected, you can keep 1 filename and put dates in the document itself, or record dates in a catalogue/metadata file
- however, a series of files may have inherent value, e.g. your transcriptions/annotations, as your understanding and analysis changes, so date and keep files
- use different tiers in ELAN?
Character encoding
- if your document contains any characters other than those on a US keyboard, use Unicode (UTF-8) character encoding
- how can I tell if characters in my MS Word document are encoded as UTF-8?
  - save as plain text and check options
  - copy into plain text editor such as Notepad++
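For plain text files, a quick programmatic check is to try decoding the bytes as UTF-8. This is a minimal sketch (is_valid_utf8 is a hypothetical helper); it does not apply to .doc/.docx files, which are not plain text.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word saved in two different encodings.
print(is_valid_utf8("éclair".encode("utf-8")))
print(is_valid_utf8("éclair".encode("latin-1")))

# Example for a real file (the filename is a placeholder):
# with open("transcript.txt", "rb") as handle:
#     print(is_valid_utf8(handle.read()))
```

Passing the check shows the bytes are well-formed UTF-8, not that every character is the one you intended, so it is still worth looking at the text in an editor.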
Character encoding, useful tools
Notepad++ http://notepad-plus-plus.org/
for Mac, use: TextWrangler http://www.barebones.com/products/textwrangler/
SIL ViewGlyph http://scripts.sil.org/cms/scripts/page.php?item_id=ViewGlyph_home
BabelMap http://www.babelstone.co.uk/software/babelmap.html
TypeIt (view and write IPA) http://ipa.typeit.org/full/
browsers such as Firefox and Chrome are useful for checking and reporting character encoding
Transferring data
- ensure your computer is not a “walled garden”
- you can use
  - drives/devices (but avoid DVDs!!)
  - email
  - upload to website (where available)
  - send links
  - “cloud” e.g. Carbonite, Dropbox, collaboration software
- some of these could be considered backup but not true archiving
Sharing
- can we work in a shared, collaborative space?
  - Google Docs
  - Dropbox
  - blogs, Tumblr, wikis etc can have shared “authors”, and contributors with particular roles
- also there is dedicated collaboration software (usually $$$)
Exercise - now it’s your turn!
Practical exercise for DocLing 2016 Data management & archiving
- Work in pairs
- Go to http://www.el-training.org/courses/docling/2016/exercise/
- Download the file, unzip it, and place it in a working folder: exercise.zip
- This is dummy data - the content is not important for the exercise
- Look through all the files to see what files are present
- Find the metadata file
- Do the following:
  - identify the problems and errors with the data set
  - work out strategies for dealing with the problems
  - work out strategies for documenting the changes you make
  - fix the problems and errors (as much as possible)
  - add columns to the metadata for date and location
  - modify the metadata to create links to the audio files