ISOcat - Max Planck Institute Leipzig | Home · 2015-03-17 · TDG4: Syntax TDG5: Machine Readable...



www.isocat.org

ISOcat: An ISO 12620:2009 Data Category Registry

Marc Kemps-Snijders (a), Menzo Windhouwer (a), Sue Ellen Wright (b)

(a) Max Planck Institute for Psycholinguistics, (b) Kent State University

[email protected], [email protected], [email protected]

20/8/2010, DGfS-CNRS Summer School on Linguistic Typology


Outline

• ISO 12620:2009
  – What are Data Categories?
  – How can you use Data Categories?
  – What is a Data Category Registry?
  – How can you use a Data Category Registry?
• ISOcat
  – Demonstration/Tutorial
• Future work


ISO 12620:2009

• Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources
  – An ISO TC 37/SC 3 standard (see [1])
  – Successor to ISO 12620:1999, which contained a hardcoded list of Data Categories


What is a Data Category?

• The result of the specification of a given data field
  – A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
• Specification consists of 3 main parts:
  – Administrative part
    • Administration and identification
  – Descriptive part
    • Documentation in various working languages
  – Linguistic part
    • Conceptual domain(s for various object languages)


Data category example

• Data category: /Grammatical gender/
  – Administrative part:
    • Identifier: grammaticalGender
    • PID: http://www.isocat.org/datcat/DC-1297
  – Descriptive part:
    • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
    • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
  – Linguistic part:
    • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
    • French conceptual domain: /masculine/, /feminine/
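As a rough illustration, the three parts of the /Grammatical gender/ specification could be held in a small record like the one below. This is only a sketch: the class name DataCategory and its field names are hypothetical and are not taken from ISO 12620:2009 or from ISOcat, and the value names follow the "Data Category types" slide.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical container for the three parts of a specification."""

    # Administrative part: administration and identification
    identifier: str
    pid: str
    # Descriptive part: definitions keyed by working language
    definitions: dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domains keyed by profile or object language
    conceptual_domains: dict[str, list[str]] = field(default_factory=dict)


gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

# The French domain narrows the profile-wide domain; it adds no new values.
french = set(gender.conceptual_domains["French"])
print(french <= set(gender.conceptual_domains["Morphosyntax"]))
```

Keeping the conceptual domains keyed by profile or object language mirrors the way the linguistic part can restrict the values for one language.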


Data Category specification – Administrative part


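[Figure: class diagram of the administrative part. Classes shown: DCR, Data Category, Global Information, Administration Information Section, Administration Record, Registration Group, Submission Group, Stewardship Group, Decision Group, Change. The multiplicities on the associations (0..1, 1, 1..*) cannot be assigned to specific associations from this transcript.]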


Data Category specification – Descriptive part


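[Figure: class diagram of the descriptive part. Classes shown: Data Category, Description Section, Language Section, Name Section, Definition, Example, Explanation, Data Element Name. The multiplicities on the associations (0..*, 1, 1..*) cannot be assigned to specific associations from this transcript.]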


Data Category specification – Linguistic part


[Figure: class diagram of the linguistic part. Classes shown: Data Category, Complex Data Category, Simple Data Category, Closed Data Category, Open Data Category, Constrained Data Category, Linguistic Section, Closed Linguistic Section, Constrained Linguistic Section, Conceptual Domain, Value Domain, Open Conceptual Domain, Schema Specific Domain, Profile Value Domain, Example, Explanation. The multiplicities on the associations (0..1, 0..*, 1, 1..*) cannot be assigned to specific associations from this transcript.]


Mandatory parts of the specification

• For each data category:
  – a mnemonic identifier
  – an English definition
  – an English name
• For complex data categories:
  – a conceptual domain
• For standardization candidates:
  – a profile
  – a justification
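The mandatory parts listed above can be read as a simple checklist. The sketch below assumes a draft specification is held in a plain dictionary; the key names (identifier, english_name, type, standardization_candidate, ...) and the function name are hypothetical and only serve the illustration.

```python
def missing_mandatory_parts(spec: dict) -> list[str]:
    """Return the names of mandatory parts that `spec` lacks."""
    missing = []
    # Required for every data category.
    for key in ("identifier", "english_name", "english_definition"):
        if not spec.get(key):
            missing.append(key)
    # Complex data categories also need a conceptual domain.
    if spec.get("type") == "complex" and not spec.get("conceptual_domain"):
        missing.append("conceptual_domain")
    # Standardization candidates also need a profile and a justification.
    if spec.get("standardization_candidate"):
        for key in ("profile", "justification"):
            if not spec.get(key):
                missing.append(key)
    return missing


draft = {
    "identifier": "grammaticalGender",
    "english_name": "grammatical gender",
    "type": "complex",
    "standardization_candidate": True,
    "profile": "Morphosyntax",
}
print(missing_mandatory_parts(draft))
```

For this draft the function reports the English definition, the conceptual domain and the justification as still missing.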


Guidelines for the specification


(see [2])


More guidelines


• Name Section in a Language Section
  – legible name
    • ‘part of speech’ in the English language section
    • ‘partie du discours’ in the French language section
• Definition:
  – intensional definitions (ISO 704)
  – should consist of a single sentence fragment
• Source:
  – add a source for any quoted material


More guidelines


• Justification:
  – a simple statement justifying the relevance of the data category to the field of language resources
  – especially needed for standardization


Data Category types


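complex:
  • open: writtenForm (data type: string)
  • closed: grammaticalGender (data type: string), with the values neuter, masculine, feminine
  • constrained: email (data type: string), Constraint: .+@.+

simple:
  • neuter, masculine, feminine (the values of grammaticalGender)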
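To make the difference between the three complex types concrete, here is a minimal sketch of how a tool might check a value against each of them. The function name and its parameters are hypothetical, and treating the constraint .+@.+ as a regular expression that must match the whole value is an assumption.

```python
import re


def is_valid(value: str, kind: str, *, allowed=(), pattern=None) -> bool:
    """Check `value` against an open, closed or constrained data category."""
    if kind == "open":
        # Open: any value of the data type (here: string) is accepted.
        return isinstance(value, str)
    if kind == "closed":
        # Closed: the value must be one of the listed simple data categories.
        return value in allowed
    if kind == "constrained":
        # Constrained: the value must satisfy the constraint (assumed regex).
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unknown data category type: {kind}")


genders = ("neuter", "masculine", "feminine")
print(is_valid("nihongo", "open"))
print(is_valid("common", "closed", allowed=genders))
print(is_valid("a@b", "constrained", pattern=r".+@.+"))
```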


Data Category relationships


• Value domain membership
• Subsumption relationships between simple data categories
• Relationships between complex data categories are not stored in the DCR

[Figure: partOfSpeech (string) with the value pronoun; personal pronoun shown below pronoun.]
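The two relationship kinds can be pictured as two small lookup tables. The sketch below is one possible reading of the figure, not ISOcat behaviour: the table contents, the camel-case name personalPronoun and the rule that a subsumed value is acceptable wherever its broader category is listed are all assumptions made for illustration.

```python
# Hypothetical data: subsumption links between simple data categories.
BROADER = {"personalPronoun": "pronoun"}
# Hypothetical data: value domain membership for a complex data category.
VALUE_DOMAIN = {"partOfSpeech": {"pronoun"}}


def ancestors(simple):
    """Yield `simple` and every broader simple data category above it."""
    while simple is not None:
        yield simple
        simple = BROADER.get(simple)


def is_acceptable(complex_dc, simple):
    """True if `simple`, or a category subsuming it, is in the value domain."""
    return any(a in VALUE_DOMAIN[complex_dc] for a in ancestors(simple))


print(list(ancestors("personalPronoun")))
print(is_acceptable("partOfSpeech", "personalPronoun"))
print(is_acceptable("partOfSpeech", "noun"))
```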


How can you use Data Categories?


[Figure: two schemas that reference data categories. An LMF (ISO 24613:2008) compliant (schema for a) lexicon: Lexicon, Lexical Entry, Form, Sense, Word Form, Lemma, annotated with partOfSpeech, writtenForm, grammaticalGender and lexicalType. A (schema for a) typological database: Language, BWO, genders, annotated with grammaticalGender and wordOrder. Callouts on both schemas: Shared semantics!]


How?


<lmf:lexicon xml:lang="ja" alphabet="ipa">
  <lmf:entry>
    <lmf:lemma>
      <lmf:writtenForm>nihongo</…>
      …
    </…>
    …
  </…>
  …
</…>


Referencing Data Categories

• Each Data Category should be uniquely identifiable
  – Ambiguity: different domains use the same term but mean different ‘things’
  – Semantic rot: even in the same domain the meaning of a term changes over time
  – Persistence: for archived resources Data Category references should still be resolvable and point to the specification as it was at/close to time of creation
• ISO/DIS 24619 Language resource management -- Persistent identification and access in language technology applications


Data Categories Persistent IDentifiers

• persistent identifier (PID)
  – “unique Uniform Resource Identifier (URI) that ensures permanent access for a digital object by providing access to it independently of its physical location or current ownership” (see [1])
• For Data Categories this digital object is a specific version of a Data Category specification, i.e., each version of a Data Category has its own PID


Where do you put these references?

• Preferably in a schema:

<rng:attribute name="alphabet" dcr:datcat="http://www.isocat.org/datcat/…">
  <rng:value dcr:datcat="http://www.isocat.org/datcat/…">ipa</…>
</…>
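Once references are embedded in a schema like this, a tool can collect them again. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the namespace URI bound to the dcr prefix is an assumption, and the sample schema is hypothetical (it reuses the grammaticalGender PID from the earlier example because the PIDs above are elided).

```python
import xml.etree.ElementTree as ET

DCR_NS = "http://www.isocat.org/ns/dcr"  # assumed namespace URI for dcr:

# Hypothetical Relax NG fragment carrying one data category reference.
SCHEMA = """\
<rng:attribute xmlns:rng="http://relaxng.org/ns/structure/1.0"
               xmlns:dcr="http://www.isocat.org/ns/dcr"
               name="grammaticalGender"
               dcr:datcat="http://www.isocat.org/datcat/DC-1297">
  <rng:text/>
</rng:attribute>
"""


def datcat_references(xml_text):
    """Return (local element name, PID) pairs for every dcr:datcat attribute."""
    root = ET.fromstring(xml_text)
    key = f"{{{DCR_NS}}}datcat"
    refs = []
    for element in root.iter():
        pid = element.get(key)
        if pid is not None:
            # Strip the "{namespace}" prefix ElementTree puts on tag names.
            local_name = element.tag.rsplit("}", 1)[-1]
            refs.append((local_name, pid))
    return refs


print(datcat_references(SCHEMA))
```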


ISO TC 37 standards using Data Categories

• Terminological Markup Framework (TMF; ISO 16642)
• Lexical Markup Framework (LMF; ISO 24613)
• TermBase eXchange (TBX; ISO 30042)
• Morpho-syntactic Annotation Framework (MAF; ISO 24611)
• Linguistic Annotation Framework (LAF; ISO 24612)

• Meta models which can be instantiated into a specific model with data categories
• However, some still refer to ISO 12620:1999 Data Categories and some don’t support all types (see [3])


Other uses of Data Categories

• CLARIN Component Metadata Infrastructure (CMDI)
• ISO 12620:2009 provides a small XML vocabulary, DC Reference (see [4]), which provides elements and attributes to embed Data Category references in arbitrary XML documents
  – Including: XML Schema, Relax NG, TEI/ISO feature structures, …
• The references can be used in URI based ‘mappings’:
  – Including: ODD, RDF-based vocabularies (OWL, SKOS), …


What is a Data Category Registry?

• A (coherent) set of Data Categories, in our case for linguistic resources
• A system to manage this set:
  – Create and edit Data Categories
  – Share Data Categories, e.g., resolve PID references
  – Standardize Data Categories


Standardize Data Categories


[Figure: standardization workflow. Labels shown, in reading order: Submission group, Data Category Registry Board, Validation, Thematic Domain Group, Evaluation, Stewardship group, Decision Group, Publication, with two exits labelled "rejected".]


Thematic Domain Groups


TDG 1: Metadata

TDG 2: Morphosyntax

TDG 3: Semantic Content Representation

TDG 4: Syntax

TDG 5: Machine Readable Dictionary

TDG 6: Language Resource Ontology

TDG 7: Lexicography

TDG 8: Language Codes

TDG 9: Terminology

TDG 11: Multilingual Information Management

TDG 12: Lexical Resources

TDG 13: Lexical Semantics

TDG 14: Source Identification

• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of judges (assigned by SC P members)
• A number of expert members (up to 50%)
• TDGs are constituted at the TC 37/SC plenary
• New TDGs need to be proposed by an SC

1. Translation
2. Sign language
3. Audio


How can you use a Data Category Registry?

• You can:
  – Find Data Categories relevant for your resources and embed references to them so the semantics of (parts of) your resources are made explicit
    • This can be supported by tools you use, e.g., ELAN, LEXUS and the CMDI Component Editor directly interact with ISOcat
  – Interact with Data Category owners to improve (the coverage of) their Data Categories
  – Create (together with others) new Data Categories needed for your resources and share those
  – Submit (your) Data Categories for standardization
  – Free of charge


ISOcat

• Reference implementation of ISO 12620:2009

• The TC 37 Data Category Registry


A glimpse of ISOcat

Data Category Interchange Format (DCIF)

• Simplified XML serialization of the data model (see [4])


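[Figure: DCIF element tree. Elements and cardinalities shown, in reading order: Data Category Selection [1]; Global Information [0..1]; Data Category [1..*]; Administration Information Section [1]; Description Section [1]; Administration Record [1]; Language Section [1..*]; Name Section [0..*]; Data Element Name Section [0..*]; Conceptual Domain [0..*]; Linguistic Section [0..*]; Conceptual Domain [0..*]. The nesting of the elements is not recoverable from this transcript.]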


RESTful Web Services

• read-only programming interface to the DCR (see [5])
• allows tools to interact with ISOcat to help a user embed PIDs in their resources
• mainly based on DCIF
• uses authentication to access private/shared Data Categories
• currently used by:
  – LEXUS: populate an LMF model
  – ELAN: create controlled vocabularies
  – CMDI Component Editor: create concept links for component elements


Persistent IDentifiers

• ISOcat uses ‘cool URIs’ as PIDs (see [6])
  – these URIs will never change, but resolve to the current location in the current implementation, e.g., in ISOcat they resolve to a RESTful Web Service call
  – the isocat.org domain is bound to ISO 12620:2009 and the Registration Authority, currently the MPI, is obliged to keep the PIDs associated with this domain resolvable


Future work

• Finish first complete version of ISOcat:
  – Standardization process
• Cleanup of the current set of Data Categories
  – TDGs clean up their profiles
  – Standardize first sets of Data Categories
• Interaction with other TC 37 standards:
  – Migration from ISO 12620:1999
  – Full support for all types of Data Categories


More future work

• Additional Data Category types
  – Container Data Categories
    • Complex and Simple only cover ‘leaves’ and their values
  – Data Category Concepts
    • Basic building blocks for knowledge bases
• Relation Registries
  – Store (your) (semantic) relationships between Data Categories


Registry network


[Figure: registry network in three layers. Linguistic resources: MPI archive, TDS database, resource. Data category registries: MPI DCR, ISO DCR. Relation registries: Typological Database System RR, MPI RR.]


Thank you for your attention!

Visit www.isocat.org

Questions? www.isocat.org/forum/

[email protected]


References

[1] ISO 12620, Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources.

[2] http://www.isocat.org/manual/DCRGuidelines.pdf

[3] M.A. Windhouwer, S.E. Wright, M. Kemps-Snijders. Referencing ISOcat data categories. In proceedings of the LREC 2010 LRT standards workshop. Malta, May 18, 2010.

[4] http://www.isocat.org/12620/

[5] http://www.isocat.org/rest/help.html

[6] Tim Berners-Lee, Cool URIs don't change, 1998.
