ISOcat to LMF to TEI


Tightening the representation of lexical data, a TEI perspective (TEI 2011 workshop), 12 October 2011, Würzburg, Germany

Transcript of ISOcat to LMF to TEI

Page 1: ISOcat to LMF to TEI

www.isocat.org

TEI Lexical workshop - Würzburg, Germany

ISOcat -> LMF -> TEI (Dictionaries)

Menzo Windhouwer
The Language Archive – MPI-PL

[email protected]

12 October 2011

Page 2: ISOcat to LMF to TEI

Outline

• Introduction to ISOcat, an ISO 12620:2009 compliant Data Category Registry (DCR)

• ISOcat and the Lexical Markup Framework (LMF; ISO 24613:2008)

• ISOcat and TEI (Dictionaries)

Page 3: ISOcat to LMF to TEI

ISO 12620:2009

• Terminology and other language and content resources — Specification of data categories and management of a Data Category Registry for language resources
  – An ISO TC 37/SC 3 standard
  – Replaces ISO 12620:1999, a hardcoded list of Data Categories, with a registry for (standardized) Data Categories

Page 4: ISOcat to LMF to TEI

What is a Data Category?

• The result of the specification of a given data field
  – A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
• Specification consists of 3 main parts:
  – Administrative part
    • Administration and identification
  – Descriptive part
    • Documentation in various working languages
  – Linguistic part
    • Conceptual domain(s) for various object languages

Page 5: ISOcat to LMF to TEI

Data category example

• Data category: /grammatical gender/
  – Administrative part:
    • Identifier: grammaticalGender
    • PID: http://www.isocat.org/datcat/DC-1297
  – Descriptive part:
    • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
    • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
  – Linguistic part:
    • Morphosyntax conceptual domain: /male/, /feminine/, /neuter/
    • French conceptual domain: /male/, /feminine/
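The three-part specification can be pictured as a small record type. The following is only a sketch: the class name `DataCategory` and its field names are hypothetical and are not taken from ISO 12620 or from the ISOcat data model; the values are the ones shown on this slide.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical record mirroring the three parts of a specification."""

    # Administrative part: administration and identification
    identifier: str
    pid: str
    # Descriptive part: definitions keyed by working language
    definitions: dict = field(default_factory=dict)
    # Linguistic part: conceptual domains keyed by domain or object language
    conceptual_domains: dict = field(default_factory=dict)


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
        "fr": "Catégorie fondée (selon la langue) sur la distinction "
              "naturelle entre les sexes ou d'autres critères formels.",
    },
    conceptual_domains={
        "morphosyntax": ["male", "feminine", "neuter"],
        "French": ["male", "feminine"],
    },
)

# The language-specific domain narrows the general one.
french = set(grammatical_gender.conceptual_domains["French"])
general = set(grammatical_gender.conceptual_domains["morphosyntax"])
is_subset = french <= general
```

Keeping the conceptual domains per object language in one mapping makes the narrowing for French visible as a simple subset check.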

Page 6: ISOcat to LMF to TEI

What is a Data Category Registry?

• A (coherent) set of Data Categories, in our case for linguistic resources

• A system to manage this set:
  – Create and edit Data Categories
  – Share Data Categories, e.g., resolve PID references
  – Standardize Data Categories

• Grass roots approach

Page 7: ISOcat to LMF to TEI

ISOcat and LMF

• § 4.4 ISO 12620 Data Category Registry (DCR)
  – “The designers of an LMF conformant lexicon shall use data categories from the ISO 12620 Data Category Registry (DCR) located at www.isocat.org.”
• § 5.4 LMF data category selection procedures
  – Create a Data Category Selection
  – Add Data Categories to ISOcat if needed
! Missing: how to refer to ISOcat Data Categories?

Page 8: ISOcat to LMF to TEI

Data Category identifiers are ambiguous

…
<LexicalEntry>
  <feat att="partOfSpeech" val="commonNoun"/>
…

ISOcat contains two exact matches for “commonNoun” and one close match:

Page 9: ISOcat to LMF to TEI

Why are identifiers ambiguous?

• Several thematic domains can use the same name for a (slightly) different Data Category
  – This was already true in the predecessor of ISOcat, SYNTAX (legacy)
• There may be multiple versions of the same Data Category
  – Due to semantic drift or rot the name cannot just point to the latest version
• Users can also create Data Categories with the same name
  – In the future even copy a Data Category to extend its conceptual domain
Identifier should have been renamed, e.g., to mnemonic
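The difference between looking up by name and looking up by PID can be sketched with two small indexes. This is an illustration only: the record list is invented, and of the three PIDs only DC-1256 appears in this talk; the two `example.org` entries are hypothetical placeholders standing in for other Data Categories that share or resemble the name.

```python
from collections import defaultdict

# (identifier, PID) pairs. Only DC-1256 comes from the slides; the two
# example.org PIDs are made-up placeholders for illustration.
records = [
    ("commonNoun", "http://www.isocat.org/datcat/DC-1256"),
    ("commonNoun", "http://example.org/datcat/hypothetical-A"),
    ("common noun", "http://example.org/datcat/hypothetical-B"),
]

# Index by identifier: one name may map to several Data Categories.
by_identifier = defaultdict(list)
for identifier, pid in records:
    by_identifier[identifier].append(pid)

# Index by PID: each PID maps to exactly one Data Category.
by_pid = {pid: identifier for identifier, pid in records}

candidates = by_identifier["commonNoun"]  # several PIDs, so ambiguous
unique = by_pid["http://www.isocat.org/datcat/DC-1256"]  # one identifier
```

A resource that stores only the name leaves the reader to choose among `candidates`; a resource that stores the PID does not.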

Page 10: ISOcat to LMF to TEI

ISOcat Data Category PIDs are unique

• Each ISOcat Data Category (version) has a unique PID
  – http://www.isocat.org/datcat/DC-1256 /common noun/ by Gil Francopoulo
• ISO 12620:2009 Annex A provides a small vocabulary to annotate an XML document with Data Category PID references:

<feat
  att="partOfSpeech"
  dcr:datcat="http://www.isocat.org/datcat/DC-1345"
  val="commonNoun"
  dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"
/>

Preferably annotate the schema of the resource
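A consumer of such an annotated lexicon could read the PID references back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the namespace URI bound to the `dcr` prefix is an assumption, since the slides do not show the namespace declaration, and the wrapping `<LexicalEntry>` is added only to make the fragment a complete document.

```python
import xml.etree.ElementTree as ET

# Assumption: the dcr prefix is bound to this namespace URI.
DCR_NS = "http://www.isocat.org/ns/dcr"

doc = f"""
<LexicalEntry xmlns:dcr="{DCR_NS}">
  <feat att="partOfSpeech"
        dcr:datcat="http://www.isocat.org/datcat/DC-1345"
        val="commonNoun"
        dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
</LexicalEntry>
""".strip()

root = ET.fromstring(doc)
feat = root.find("feat")

# ElementTree exposes namespaced attributes in {uri}local form.
att_pid = feat.get(f"{{{DCR_NS}}}datcat")
val_pid = feat.get(f"{{{DCR_NS}}}valueDatcat")

print(feat.get("att"), "->", att_pid)
print(feat.get("val"), "->", val_pid)
```

With the PIDs in hand, the names `partOfSpeech` and `commonNoun` become local labels, and the references carry the meaning.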

Page 11: ISOcat to LMF to TEI

TEI feature structures

<tei:f
  name="partOfSpeech"
  dcr:datcat="http://www.isocat.org/datcat/DC-1345"
  fVal="commonNoun"
  dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"
/>
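Because the LMF `<feat>` and the TEI `<f>` shown here carry the same four pieces of information, a conversion between them can be sketched in a few lines. This is a minimal illustration, not an established converter: the function name `feat_to_f` is hypothetical, the `dcr` namespace URI is an assumption, and the attribute names on the output simply mirror the slide rather than asserting how the TEI Guidelines define `fVal`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DCR_NS = "http://www.isocat.org/ns/dcr"  # assumption: not shown on the slides


def feat_to_f(feat):
    """Hypothetical helper: copy an LMF <feat> into a TEI-style <f>."""
    f = ET.Element(f"{{{TEI_NS}}}f")
    f.set("name", feat.get("att"))
    f.set("fVal", feat.get("val"))
    # Carry the Data Category references over unchanged, if present.
    for local in ("datcat", "valueDatcat"):
        key = f"{{{DCR_NS}}}{local}"
        if feat.get(key) is not None:
            f.set(key, feat.get(key))
    return f


feat = ET.fromstring(
    f'<feat xmlns:dcr="{DCR_NS}" att="partOfSpeech" val="commonNoun" '
    'dcr:datcat="http://www.isocat.org/datcat/DC-1345" '
    'dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>'
)
f = feat_to_f(feat)

# Use readable prefixes when serializing.
ET.register_namespace("tei", TEI_NS)
ET.register_namespace("dcr", DCR_NS)
print(ET.tostring(f, encoding="unicode"))
```

The PID references survive the conversion untouched, which is the point of anchoring both formats to the same registry.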

Page 12: ISOcat to LMF to TEI

TEI feature structure declarations

<tei:fDecl name="partOfSpeech" dcr:datcat="http://www.isocat.org/datcat/DC-1345">
  <tei:vRange>
    <tei:vAlt>
      <tei:symbol value="commonNoun" dcr:datcat="http://www.isocat.org/datcat/DC-1256"/>
      …
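Annotating the declaration once means the PIDs can be looked up for every instance that uses it. The sketch below reads a closed version of the declaration and builds a value-to-PID table. It contains only the one `<tei:symbol>` visible on the slide (the remaining alternatives are elided in the original and are not reconstructed here), and the `dcr` namespace URI is an assumption.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DCR_NS = "http://www.isocat.org/ns/dcr"  # assumption: not shown on the slides

# Closed version of the slide's fragment, with only the symbol it shows.
decl = ET.fromstring(f"""
<tei:fDecl xmlns:tei="{TEI_NS}" xmlns:dcr="{DCR_NS}"
           name="partOfSpeech"
           dcr:datcat="http://www.isocat.org/datcat/DC-1345">
  <tei:vRange>
    <tei:vAlt>
      <tei:symbol value="commonNoun"
                  dcr:datcat="http://www.isocat.org/datcat/DC-1256"/>
    </tei:vAlt>
  </tei:vRange>
</tei:fDecl>
""".strip())

DATCAT = f"{{{DCR_NS}}}datcat"

# PID of the feature itself.
feature_pid = decl.get(DATCAT)

# PID of each permitted value, keyed by the symbol's value.
value_pids = {
    sym.get("value"): sym.get(DATCAT)
    for sym in decl.iter(f"{{{TEI_NS}}}symbol")
}
```

Instances then only need the plain names, since `feature_pid` and `value_pids` supply the references from the schema side.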

Page 13: ISOcat to LMF to TEI

TEI and ISOcat Data Category PIDs

1. Is TEI open to attributes from foreign namespaces?
   – dcr:* attributes can already be used
2. Or can the dcr:* attributes be part of the global attribute list?
   – It would make it possible to annotate any TEI element, incl. Dictionary elements, with a Data Category reference
   – The DCR data model now also includes container Data Categories and can thus also cover inner nodes
   – Could also (partially?) be done by <equiv/> statements in the ODD files
   – Scripts to do this (semi-)automatically have already been created
3. Or can at least the TEI/ISO feature structure part accept dcr:* attributes?
   – Add a DCR specific attribute list?
   – Would make the ISO TC 37 standards consistent: ISO 24610-1, ISO 24613:2008 and ISO 12620:2009
   – Could also be another TEI attribute that expresses equivalence with an external (URI) specification (like <equiv/> in ODD) and which isn't as much bound to ISOcat as the dcr:* attributes imply

Page 14: ISOcat to LMF to TEI

Thank you for your attention!

Visit www.isocat.org

[email protected]