CMD and TEI

CMD and TEI CMDI interoperability workshop 2013-06-04 - Utrecht Matej Ďurčo, ICLTT, Vienna

Page 1: CMD  and  TEI

CMD and TEI

CMDI interoperability workshop 2013-06-04 - Utrecht

Matej Ďurčo, ICLTT, Vienna

Page 2: CMD  and  TEI

TEI at ICLTT

• AAC – Austrian Academy Corpus
  – diachronic corpus, ~500 million tokens
  – being converted into TEI

• C4 – distributed corpus of 20th-century German
  – Basel, Berlin, Bozen, Wien
  – harmonized format (TEI/teiHeader)

• Dict-Gate – TEI-encoded multilingual lexicons (Persian, Arabic, German, English)
  – however, described with LexicalResourceProfile

• Abacus – Austrian Baroque Corpus
  – 3 (5) historical texts encoded in TEI
  – elaborate teiHeader

Page 3: CMD  and  TEI

TEI (and friends?) in CMD

Project | Author, Year | Profile | Comp/Elem/Datcats | Instances
Deutsches Text Archiv | ? | teiHeader #clarin.eu:cr1:p_1345180279115 (NOT in CompReg!) | 56/82/10 | 857
ICLTT | Durco, 2010 | teiHeader #clarin.eu:cr1:p_1282306194508 | 16/35/13 (7 dublincore, 6 isocat) | 467
Leipzig Corpora | Eckart, 2012 | TEIDocumentDescription #clarin.eu:cr1:p_1337778924992 | 4/17/17 (isocat) | ?
Nederlab | Zhang 2013 ? | DBNL_Tekst #clarin.eu:cr1:p_1361876010678; DBNL_Tekst_Onzelfstandig #clarin.eu:cr1:p_1366279029218 (private) | 20/38/15; 20/47/21 | ?

• overview of currently existing TEIish CMD-profiles
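The Comp/Elem/Datcats figures could be derived from a profile specification roughly as follows. This is a minimal sketch, assuming a CMDI 1.1 style specification in which components are `CMD_Component` elements, elements are `CMD_Element` elements, and the data category is given in a `ConceptLink` attribute; it also assumes that "Datcats" means the number of distinct data categories. The sample specification is hypothetical and heavily shortened; it is not one of the registry profiles listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened profile specification (illustration only).
SPEC = """
<CMD_ComponentSpec isProfile="true">
  <CMD_Component name="teiHeader">
    <CMD_Component name="fileDesc">
      <CMD_Element name="title" ConceptLink="http://purl.org/dc/elements/1.1/title"/>
      <CMD_Element name="author" ConceptLink="http://purl.org/dc/elements/1.1/creator"/>
      <CMD_Element name="note"/>
    </CMD_Component>
    <CMD_Component name="profileDesc">
      <CMD_Element name="language" ConceptLink="http://www.isocat.org/datcat/DC-2482"/>
      <CMD_Element name="mainTitle" ConceptLink="http://purl.org/dc/elements/1.1/title"/>
    </CMD_Component>
  </CMD_Component>
</CMD_ComponentSpec>
"""


def profile_stats(spec_xml):
    """Return (components, elements, distinct datcats) for one specification."""
    root = ET.fromstring(spec_xml.strip())
    components = list(root.iter("CMD_Component"))
    elements = list(root.iter("CMD_Element"))
    # Elements without a ConceptLink contribute no data category.
    datcats = {e.get("ConceptLink") for e in elements if e.get("ConceptLink")}
    return len(components), len(elements), len(datcats)


comps, elems, cats = profile_stats(SPEC)
print(f"{comps}/{elems}/{cats}")  # prints 3/5/3
```

Counting distinct `ConceptLink` values rather than annotated elements means that two elements pointing at the same data category are counted once.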

Page 4: CMD  and  TEI

teiHeader (ICLTT)

size = reuse in other profiles

Page 5: CMD  and  TEI

teiHeader (DTA)

size = count of elements in instance data

Page 6: CMD  and  TEI

datcats in teiHeader (DTA)

Page 7: CMD  and  TEI

TEI and ISOcat

• a special DCS: TEI Header (2.1.0) – Windhouwer, 2012
  – a datcat for every element of the teiHeader (135 datcats)
  – based on an ODD file (ODD2DCIF.xsl and DCIF2ODD.xsl available)
  – owed to CLARIN-NL projects using the TEI header

• an enriched schema was generated, i.e. annotated with these new data categories (dcr:datcat attribute), and put in SCHEMAcat: http://lux13.mpi.nl/schemacat/schema/teiHeader

• define relations between TEI and other data categories in RELcat (the relation registry)
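To illustrate what "annotated with data categories" means in practice, the following sketch reads the `dcr:datcat` attributes from a schema and lists which element links to which data category. It assumes the annotation sits directly on `xs:element` declarations and that the `dcr` prefix is bound to `http://www.isocat.org/ns/dcr`; the schema fragment and the `example.org` datcat URIs are hypothetical placeholders, not content of the SCHEMAcat schema.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DCR = "http://www.isocat.org/ns/dcr"  # assumed namespace for dcr:datcat

# Hypothetical schema fragment; the datcat URIs are placeholders.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <xs:element name="teiHeader" dcr:datcat="http://example.org/datcat/teiHeader"/>
  <xs:element name="fileDesc" dcr:datcat="http://example.org/datcat/fileDesc"/>
  <xs:element name="unannotated"/>
</xs:schema>
"""


def datcat_links(xsd):
    """Map element names to the data category they are annotated with."""
    root = ET.fromstring(xsd.strip())
    links = {}
    for el in root.iter(f"{{{XS}}}element"):
        link = el.get(f"{{{DCR}}}datcat")
        if link:  # skip element declarations without an annotation
            links[el.get("name")] = link
    return links


for name, link in datcat_links(SCHEMA).items():
    print(name, "->", link)
```

A mapping of this kind is the raw material for a relation set: each annotated element is one endpoint that can be related to a data category elsewhere.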

Page 8: CMD  and  TEI

Next Step(s) ?

• create (or adapt existing) teiHeader profile
  – as a union of the existing profiles?
  – based on the enriched schema
  – i.e. linking to the new TEI data categories
  – define a relation set in RELcat between TEI and ISOcat (and dublincore) data categories
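How a relation set would help when building a union of profiles can be sketched as below. Everything here is hypothetical: the prefixed identifiers, the `sameAs` relation type, and the datcat sets assigned to the two profiles are invented for illustration and do not reflect the actual RELcat vocabulary or the real profile contents.

```python
# Hypothetical relation set: (subject, relation type, object).
RELATIONS = [
    ("tei:title", "sameAs", "dc:title"),
    ("tei:author", "sameAs", "dc:creator"),
    ("tei:language", "sameAs", "isocat:languageId"),
]

# Hypothetical datcat sets for two profiles.
PROFILES = {
    "teiHeader (ICLTT)": {"dc:title", "dc:creator", "isocat:languageId"},
    "teiHeader (DTA)": {"tei:title", "tei:author", "tei:publisher"},
}


def canonical_map(relations):
    """Map each subject of a sameAs relation to its object."""
    return {s: o for s, p, o in relations if p == "sameAs"}


def merged_datcats(profiles, relations):
    """Union of all datcats, with related categories collapsed into one."""
    canon = canonical_map(relations)
    merged = set()
    for datcats in profiles.values():
        merged |= {canon.get(d, d) for d in datcats}
    return merged


print(sorted(merged_datcats(PROFILES, RELATIONS)))
# ['dc:creator', 'dc:title', 'isocat:languageId', 'tei:publisher']
```

Without the relation set the plain union would contain six categories; with it, the three related pairs collapse and only the category that has no counterpart (`tei:publisher` in this made-up data) is added.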

Page 9: CMD  and  TEI

profile: data (LINDAT)

dublincore + metashare

Page 11: CMD  and  TEI

dublincore I

• 2 profiles with dc-terms (55 data categories)
• 2 profiles with dc-elements (called „dc-terms“)

as of 2013-01

Page 13: CMD  and  TEI

dublincore III

(almost) all datcats shared by all

Page 14: CMD  and  TEI

dublincore IV

1 profile has extra component: DANS-DC-metadata

example: language