Care henk vd Heuvel
CARE: Curation of Dutch Regional Dialect Dictionaries
Nicoline van der Sijs, Henk van den Heuvel, Roeland van Hout, Eric Sanders
CLS/CLST, Radboud University Nijmegen, The Netherlands
The CARE project is funded by CLARIN-NL under grant number 15-004
Aim of project
• Definition of a generic database structure for dialect dictionaries (LMF)
• Link the structure to Woordenboek van de Vlaamse Dialecten (WVD) and other regional dictionaries
• Curation of Woordenboek van de Brabantse Dialecten (WBD) and Woordenboek van de Limburgse Dialecten (WLD), Parts I and II
• Update curation of WBD and WLD, Part III
• Include Woordenboek van de Gelderse Dialecten (WGD)
Where we start
• OCR version of PDF files (WBD & WLD, Parts I and II)
• Formerly curated TSV files for WBD & WLD, Part III
• FP5 files of WGD
What we deliver
• Generic LMF model for dialect dictionaries
• WBD, WLD as CSV files and LMF files
- For at least 32 of 42 books of Parts I and II
- For all 28 books of Part III
• Original PDFs of books
• CMDI files per Part
• Curation Reports
Generic aspects
• LMF model suited for all sorts of dialect dictionaries
• CMDI metadata profile
• Very flexible LMF conversion script
Workflow diagram (labels): PDF book, CSV files, LMF files, CMDI files, CLARIN Data Centre
CLARIN Data Centre: Meertens Institute
• Adding Persistent Identifiers
• Storage
CMDI
- Metadata profile includes:
- Link to LMF
LMF script
- Converts CSV file into LMF
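The CSV-to-LMF step can be illustrated with a minimal sketch. The slides do not give the CSV column names or the exact LMF element inventory, so the columns `lemma`, `concept`, `dialect_form` and `kloeke_code`, the function name `csv_to_lmf`, and the sample row are all assumptions made for illustration. The element names (`LexicalResource`, `Lexicon`, `LexicalEntry`, `Lemma`, `WordForm`, `Sense`, `feat`) follow the common LMF XML serialization and may differ from the project's actual model.

```python
import csv
import io
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an LMF-style <feat att="..." val="..."/> element to parent."""
    ET.SubElement(parent, "feat", att=att, val=val)


def csv_to_lmf(csv_text, language="nl"):
    """Build an LMF-like XML tree from CSV text.

    The column names used here are hypothetical; the real CARE CSV
    layout is not described in the source.
    """
    root = ET.Element("LexicalResource")
    lexicon = ET.SubElement(root, "Lexicon")
    feat(lexicon, "language", language)
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = ET.SubElement(lexicon, "LexicalEntry")
        lemma = ET.SubElement(entry, "Lemma")
        feat(lemma, "writtenForm", row["lemma"])
        sense = ET.SubElement(entry, "Sense")
        feat(sense, "concept", row["concept"])
        form = ET.SubElement(entry, "WordForm")
        feat(form, "dialectForm", row["dialect_form"])
        feat(form, "kloekeCode", row["kloeke_code"])
    return root


# Invented placeholder row, not taken from any of the dictionaries.
sample = (
    "lemma,concept,dialect_form,kloeke_code\n"
    "voorbeeld,example,veurbeeld,L001\n"
)
print(ET.tostring(csv_to_lmf(sample), encoding="unicode"))
```

Keeping every property in a generic `feat` attribute-value pair is what would let one script serve dictionaries with different column sets: a new column only adds another `feat`, not a new element type.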
CSV script
- Converts typographically coded text file into CSV file by:
- Typographic & text cleaning
- Categorization of information based on typography
- Recoding dialect forms
- Checking and expanding Kloeke codes
- Logfile is used for iterative manual correction
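The checking-plus-logfile loop above could look roughly like the sketch below. It assumes that a Kloeke code consists of one capital letter, an optional space, three digits and an optional lowercase letter; that shape, the function name `check_kloeke_codes`, and the log message format are assumptions, not the project's actual rules. The "expanding" step is not attempted here because the slides do not say what it involves.

```python
import re

# Assumed shape of a Kloeke code: capital letter, optional space,
# three digits, optional lowercase suffix. The real check may be stricter.
KLOEKE_PATTERN = re.compile(r"^([A-Z])\s?(\d{3})([a-z]?)$")


def check_kloeke_codes(rows):
    """Split (line_no, code) pairs into normalized codes and log messages.

    Codes that match the assumed pattern are normalized by removing the
    optional space; all others are reported for manual correction.
    """
    valid, log = [], []
    for line_no, code in rows:
        match = KLOEKE_PATTERN.match(code.strip())
        if match:
            valid.append((line_no, "".join(match.groups())))
        else:
            log.append(f"line {line_no}: suspicious Kloeke code {code!r}")
    return valid, log


# Invented placeholder input, not real dictionary data.
valid, log = check_kloeke_codes([(1, "L 001"), (2, "K123a"), (3, "L1x")])
for message in log:
    print(message)
```

In an iterative workflow the log would be handed to the assistants, the text file corrected by hand, and the script rerun until the log is empty.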
Manual preprocessing by trained assistants, gratefully acknowledged:
Aukje Borkent, Maaike Borst, Eline Dimmendaal, Jorik van Engeland and Inge Otto
- Addition of typographic codes for Comments (“Toelichting”) in text file
- Correcting script errors