ISOCAT: Proposed solutions for problems encountered in DUELME-LMF. Jan Odijk, Nijmegen, 21 Sep 2010.

Overview

• General
• Standardized DCs?
• Multiple relevant DCs in ISOCAT
• Overlap with other projects
• Container Data Categories
• Almost Identical DCs
• Language Sections
• Existing Tagsets

General

• Always try to map to an existing ISOCAT DC
  – Where possible
  – Irrespective of whether the ISOCAT DC is part of an official standard
• If not possible, or if there is uncertainty
  – Create a new DC, but
  – Also specify the relation with existing closely related ISOCAT DCs. Provide:
    • Type of the relation
      – Dropdown list to be provided by RELCAT developers
        » E.g. equals, almost-equals, is hyponym of, is hyperonym of, etc.
    • Textual clarification of the deviation

General

• Relation to be entered into the Relation Registry (RR) as soon as it is available
• Temporarily proposed notation:
  – Record set in CSV format with records consisting of 4 fields:
    • Relation type (from drop-down list; should be ISOCAT DCs themselves)
    • Data category 1 (ISOCAT PID)
    • Data category 2 (ISOCAT PID)
    • Clarification (rich text)
    • Plus some administrative info: user id, creation date, etc.
  – To import into RR as soon as available
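
A minimal sketch of what such a CSV record set could look like, assuming Python's standard `csv` module. The column names, the helper functions `write_relations` and `read_relations`, the relation-type label, the user id, and the PID ending in `DC-0000` are hypothetical and chosen only for illustration; the slide fixes only the four fields plus administrative info.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class RelationRecord:
    relation_type: str   # from the (future) drop-down list, e.g. "almost-equals"
    dc1: str             # ISOCAT PID of data category 1
    dc2: str             # ISOCAT PID of data category 2
    clarification: str   # textual clarification of the deviation
    user_id: str         # administrative info (assumed column)
    created: str         # administrative info, ISO date (assumed column)


FIELDNAMES = [f.name for f in fields(RelationRecord)]


def write_relations(records):
    """Serialize records to CSV text with a header row."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


def read_relations(text):
    """Parse CSV text produced by write_relations back into records."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [RelationRecord(**row) for row in reader]


example = [
    RelationRecord(
        relation_type="almost-equals",
        dc1="http://www.isocat.org/datcat/DC-2704",
        dc2="http://www.isocat.org/datcat/DC-0000",  # hypothetical placeholder PID
        clarification="Polish-specific noun, narrower than the general definition",
        user_id="user-1",       # hypothetical
        created="2010-09-21",
    )
]
csv_text = write_relations(example)
```

Keeping every field as plain text means the file can be imported into the Relation Registry later without any conversion step beyond parsing the CSV.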

Standardized DCs?

• Ignore whether or not a DC has standard status in ISOCAT

• If needed, use relations in Relation Registry

Multiple ISOCAT DCs

• Map to an existing DC that is identical (wherever possible)

• Use relations to relate it to almost identical DCs in ISOCAT

Overlap with other projects

• Consult with other projects

• Registry of topics people/projects are working on
  – Dieter took some initiative
  – http://spreadsheets.google.com/ccc?key=0Al5Lw-npZ6ZTdDZlT2VjeGhwZm5iRW5IM3BTZFI5WEE&hl=en&authkey=CL_Wl4ID

• This workshop (and others if needed)

Container data categories

• ISOCAT might be extended for this

• Probably not really a problem in the short term(?)

Almost identical DCs

• For ill-defined DCs in ISOCAT
  – Suggest better definitions and submit them to the Thematic Domain Group
  – Use relations to relate your DC to existing slightly different DCs (see later)

Almost identical DCs

• Example: Noun
• Noun is a Part of Speech assigned to words which share specific morphosyntactic (inflectional), morphological, syntactic (and semantic) properties
  – Morphosyntactic (inflectional) properties: person, number, gender/class, declension class, case, …
  – Specific morphological combinatorial potential (derivation, compounding), in particular diminutives, augmentatives
  – Specific syntactic combinatorial potential
• Where each language selects a specific subset of these properties (as illustrated in the language sections)

Language Sections?

• The highly (Polish) language-specific DC
  – http://www.isocat.org/datcat/DC-2704 (noun)
    • Noun [subst] contains lexemes inflecting for number and case, with a lexically determined grammatical gender, which do not have the category of person, e.g., woda 'water', profesor 'professor', pięciokrotność 'fivefoldness'; this class also contains defective plurale tantum and singulare tantum lexemes, but not depreciative lexemes. Grammatical categories of noun [subst]: number (http://www.isocat.org/datcat/DC-2709), case (http://www.isocat.org/datcat/DC-2720), gender (http://www.isocat.org/datcat/DC-2728).
• Can now be part of the Polish language section of the DC Noun with the definition given on the previous slide
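
The idea of one general DC with language-specific sections might be modelled roughly as follows. This is an illustrative sketch, not the actual ISOCAT data model: the class names `DataCategory` and `LanguageSection`, the method names, and the use of the language code "pl" as a key are all assumptions, and the definitions are abbreviated from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LanguageSection:
    language: str                 # language code, e.g. "pl" (assumed key format)
    definition: str               # language-specific definition
    categories: List[str] = field(default_factory=list)  # PIDs of relevant DCs


@dataclass
class DataCategory:
    name: str
    definition: str               # general, language-independent definition
    language_sections: Dict[str, LanguageSection] = field(default_factory=dict)

    def add_language_section(self, section: LanguageSection) -> None:
        self.language_sections[section.language] = section

    def definition_for(self, language: str) -> str:
        """Return the language-specific definition, else the general one."""
        section = self.language_sections.get(language)
        return section.definition if section else self.definition


noun = DataCategory(
    name="noun",
    definition=(
        "Part of Speech assigned to words which share specific "
        "morphosyntactic, morphological, syntactic (and semantic) properties"
    ),
)
noun.add_language_section(
    LanguageSection(
        language="pl",
        definition=(
            "Noun [subst] contains lexemes inflecting for number and case, "
            "with a lexically determined grammatical gender"
        ),
        categories=[
            "http://www.isocat.org/datcat/DC-2709",  # number
            "http://www.isocat.org/datcat/DC-2720",  # case
            "http://www.isocat.org/datcat/DC-2728",  # gender
        ],
    )
)
```

With this layout, a lookup for a language that has no section of its own simply falls back to the general definition of Noun.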

Existing Tagsets

• Make sure all DCs of an existing de facto standard tag set are in ISOCAT
  – Either existing DCs
  – Or newly added DCs
• Assign all DCs from such a tag set to a new closed complex category
  – E.g. DC d-coiTagset, ipipanTagset, etc.
  – (and/or to data category set?)
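
The two steps above (check that every tag has a DC, then group the DCs in a closed complex category) can be sketched as follows. The class `ClosedComplexDC`, the function `build_tagset_dc`, and the tag-to-PID dictionary are hypothetical illustrations, not an ISOCAT API; the only mapping taken from the slides is the tag `subst` and DC-2704.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class ClosedComplexDC:
    name: str                     # e.g. "ipipanTagset"
    value_domain: FrozenSet[str]  # closed set of PIDs of the member DCs

    def contains(self, pid: str) -> bool:
        return pid in self.value_domain


def build_tagset_dc(name: str, tag_to_pid: Dict[str, Optional[str]]) -> ClosedComplexDC:
    """Build the closed complex DC once every tag maps to an ISOCAT DC.

    A value of None marks a tag for which no DC exists yet, so a new DC
    must be added first.
    """
    missing = sorted(tag for tag, pid in tag_to_pid.items() if pid is None)
    if missing:
        raise ValueError(f"tags without an ISOCAT DC: {missing}")
    return ClosedComplexDC(name, frozenset(tag_to_pid.values()))


ipipan = build_tagset_dc(
    "ipipanTagset",
    {"subst": "http://www.isocat.org/datcat/DC-2704"},
)
```

Raising an error for unmapped tags enforces the first bullet: the tag set is only registered once all of its DCs are in ISOCAT.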

More…

• Problems and proposed solutions
  – Odijk, J. (2009), "Data Categories and ISOCAT: some remarks from a simple linguist", presentation held at the FLaReNet/CLARIN Standards Workshop, Helsinki, September 27, 2009
  – Odijk, J. (2010), "Relations between Data Categories", presentation held at the CLARIN Relation Registry Workshop, MPI, Nijmegen, January 8, 2010
• Both to be found (inter alia) on http://www.clarin.nl/node/80

CLARIN-NL

Thanks for your attention!

http://www.clarin.nl/