ISOcat to LMF to TEI
www.isocat.org
TEI Lexical workshop - Würzburg, Germany
ISOcat -> LMF -> TEI (Dictionaries)
Menzo Windhouwer, The Language Archive – MPI-PL
12 October 2011
Outline
• Introduction to ISOcat, an ISO 12620:2009 compliant Data Category Registry (DCR)
• ISOcat and the Lexical Markup Framework (LMF; ISO 24613:2008)
• ISOcat and TEI (Dictionaries)
ISO 12620:2009
• Terminology and other content and language resources - Specification of data categories and management of a Data Category Registry for language resources
– An ISO TC 37/SC 3 standard
– Replaces ISO 12620:1999, a hardcoded list of Data Categories, with a registry for (standardized) Data Categories
What is a Data Category?
• The result of the specification of a given data field
– A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
• Specification consists of 3 main parts:
– Administrative part
  • Administration and identification
– Descriptive part
  • Documentation in various working languages
– Linguistic part
  • Conceptual domain(s) for various object languages
Data category example
• Data category: /grammatical gender/
– Administrative part:
  • Identifier: grammaticalGender
  • PID: http://www.isocat.org/datcat/DC-1297
– Descriptive part:
  • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
  • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
– Linguistic part:
  • Morphosyntax conceptual domain: /male/, /feminine/, /neuter/
  • French conceptual domain: /male/, /feminine/
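The three-part specification can be pictured as a small record type. The following is only a sketch in Python: the class name `DataCategory`, its field names, the dictionary keys and the helper `allowed_values` are illustrative assumptions and are not taken from ISO 12620 or from the ISOcat data model. The values are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataCategory:
    """Hypothetical record mirroring the three parts of a specification."""
    # Administrative part: administration and identification
    identifier: str
    pid: str
    # Descriptive part: definitions keyed by working language
    definitions: Dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domains keyed by profile or object language
    conceptual_domains: Dict[str, List[str]] = field(default_factory=dict)


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
        "fr": "Catégorie fondée (selon la langue) sur la distinction "
              "naturelle entre les sexes ou d'autres critères formels.",
    },
    conceptual_domains={
        "morphosyntax": ["/male/", "/feminine/", "/neuter/"],
        "fr": ["/male/", "/feminine/"],
    },
)


def allowed_values(dc: DataCategory, domain: str) -> List[str]:
    """Return the values listed for one conceptual domain (empty if unknown)."""
    return dc.conceptual_domains.get(domain, [])
```

The point of the split is that the same Data Category can carry a narrower conceptual domain for a specific object language (here French) than for the general morphosyntax domain.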
What is a Data Category Registry?
• A (coherent) set of Data Categories, in our case for linguistic resources
• A system to manage this set:
– Create and edit Data Categories
– Share Data Categories, e.g., resolve PID references
– Standardize Data Categories
• Grass roots approach
ISOcat and LMF
• § 4.4 ISO 12620 Data Category Registry (DCR)
– “The designers of an LMF conformant lexicon shall use data categories from the ISO 12620 Data Category Registry (DCR) located at www.isocat.org.”
• § 5.4 LMF data category selection procedures
– Create a Data Category Selection
– Add Data Categories to ISOcat if needed
! Missing: how to refer to ISOcat Data Categories?
Data Category identifiers are ambiguous
…
<LexicalEntry>
  <feat att="partOfSpeech" val="commonNoun"/>
…
ISOcat contains two exact matches for “commonNoun” and one close match.
Why are identifiers ambiguous?
• Several thematic domains can use the same name for a (slightly) different Data Category
– This was already true in the predecessor of ISOcat, SYNTAX (legacy)
• There may be multiple versions of the same Data Category
– Due to semantic drift or rot the name cannot just point to the latest version
• Users can also create Data Categories with the same name
– In the future even copy a Data Category to extend its conceptual domain
Identifier should have been renamed, e.g., to mnemonic
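The ambiguity can be illustrated with a toy lookup that only succeeds when an identifier maps to exactly one Data Category. This is a sketch, not ISOcat behaviour: the functions `index_by_identifier` and `resolve` are made up for illustration, and the `example.org` PID is a hypothetical stand-in for a second "commonNoun" entry. Only DC-1256 and DC-1345 come from the slides.

```python
from collections import defaultdict

# (identifier, PID) pairs. The example.org PID is hypothetical and only
# stands in for another Data Category that reuses the same identifier.
ENTRIES = [
    ("commonNoun", "http://www.isocat.org/datcat/DC-1256"),
    ("commonNoun", "http://example.org/datcat/hypothetical-other-commonNoun"),
    ("partOfSpeech", "http://www.isocat.org/datcat/DC-1345"),
]


def index_by_identifier(entries):
    """Group PIDs by identifier; one identifier may map to several PIDs."""
    index = defaultdict(list)
    for identifier, pid in entries:
        index[identifier].append(pid)
    return index


def resolve(index, identifier):
    """Return the single PID for an identifier, or fail if it is ambiguous."""
    pids = index.get(identifier, [])
    if len(pids) != 1:
        raise LookupError(
            f"{identifier!r} matches {len(pids)} Data Categories: {pids}"
        )
    return pids[0]


INDEX = index_by_identifier(ENTRIES)
```

With this data, `resolve(INDEX, "partOfSpeech")` returns a PID, while `resolve(INDEX, "commonNoun")` raises `LookupError`, because a name alone cannot say which Data Category is meant.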
ISOcat Data Category PIDs are unique
• Each ISOcat Data Category (version) has a unique PID
– http://www.isocat.org/datcat/DC-1256 /common noun/ by Gil Francopoulo
• ISO 12620:2009 Annex A provides a small vocabulary to annotate an XML document with Data Category PID references:
<feat att="partOfSpeech"
      dcr:datcat="http://www.isocat.org/datcat/DC-1345"
      val="commonNoun"
      dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
Preferably annotate the schema of the resource
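A consumer of such an annotated document can read the PIDs instead of the ambiguous names. The sketch below uses Python's standard `xml.etree.ElementTree`. The namespace URI bound to the `dcr:` prefix is an assumption (the slides do not state it), and the wrapping `<LexicalEntry>` element and the function name `feature_pids` are only there to make the example self-contained.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dcr: prefix; not stated on the slides.
DCR_NS = "http://www.isocat.org/ns/dcr"

SNIPPET = f"""
<LexicalEntry xmlns:dcr="{DCR_NS}">
  <feat att="partOfSpeech"
        dcr:datcat="http://www.isocat.org/datcat/DC-1345"
        val="commonNoun"
        dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
</LexicalEntry>
""".strip()


def feature_pids(xml_text):
    """Map each (att, val) pair to its (attribute PID, value PID) pair."""
    root = ET.fromstring(xml_text)
    result = {}
    for feat in root.iter("feat"):
        key = (feat.get("att"), feat.get("val"))
        result[key] = (
            # ElementTree addresses namespaced attributes as {uri}localname
            feat.get(f"{{{DCR_NS}}}datcat"),
            feat.get(f"{{{DCR_NS}}}valueDatcat"),
        )
    return result
```

Annotating every instance like this is verbose, which is one reason the slide prefers annotating the schema of the resource once.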
TEI feature structures
<tei:f name="partOfSpeech"
       dcr:datcat="http://www.isocat.org/datcat/DC-1345"
       fVal="commonNoun"
       dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
TEI feature structure declarations
<tei:fDecl name="partOfSpeech" dcr:datcat="http://www.isocat.org/datcat/DC-1345">
  <tei:vRange>
    <tei:vAlt>
      <tei:symbol value="commonNoun" dcr:datcat="http://www.isocat.org/datcat/DC-1256"/>
      …
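When the references sit in the feature structure declaration, a tool can collect them once and apply them to every instance. The following sketch reads a closed version of the declaration above with Python's standard `xml.etree.ElementTree`. The closing tags are filled in only to get a parseable example, the `dcr:` namespace URI is an assumption, and `value_pids` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumed namespace URI for the dcr: prefix; not stated on the slides.
DCR_NS = "http://www.isocat.org/ns/dcr"

FDECL = f"""<tei:fDecl xmlns:tei="{TEI_NS}" xmlns:dcr="{DCR_NS}"
    name="partOfSpeech"
    dcr:datcat="http://www.isocat.org/datcat/DC-1345">
  <tei:vRange>
    <tei:vAlt>
      <tei:symbol value="commonNoun"
                  dcr:datcat="http://www.isocat.org/datcat/DC-1256"/>
    </tei:vAlt>
  </tei:vRange>
</tei:fDecl>"""


def value_pids(fdecl_xml):
    """Return ((feature name, feature PID), {symbol value: value PID})."""
    root = ET.fromstring(fdecl_xml)
    feature = (root.get("name"), root.get(f"{{{DCR_NS}}}datcat"))
    values = {
        symbol.get("value"): symbol.get(f"{{{DCR_NS}}}datcat")
        for symbol in root.iter(f"{{{TEI_NS}}}symbol")
    }
    return feature, values
```

Note that in the declaration both the feature and its values carry `dcr:datcat`, since each is annotated on its own element, whereas the instance-level `<feat>` needs the separate `dcr:valueDatcat` attribute.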
TEI and ISOcat Data Category PIDs
1. Is TEI open to attributes from foreign namespaces?
– dcr:* attributes can already be used
2. Or can the dcr:* attributes be part of the global attribute list?
– It would make it possible to annotate any TEI element, incl. Dictionary elements, with a Data Category reference
– The DCR data model now also includes container Data Categories and can thus also cover inner nodes
– Could also (partially?) be done by <equiv/> statements in the ODD files
– Scripts to do this (semi-)automatically have already been created
3. Or can at least the TEI/ISO feature structure part accept dcr:* attributes?
– Add a DCR specific attribute list?
– Would make the ISO TC 37 standards ISO 24610-1, ISO 24613:2008 and ISO 12620:2009 consistent
– Could also be another TEI attribute that expresses equivalence with an external (URI) specification (like <equiv/> in ODD) and which isn’t as much bound to ISOcat as the dcr:* attributes imply
Thank you for your attention!
Visit www.isocat.org