DELAMAN / DAM-LR - the vision -

Peter Wittenburg, MPI for Psycholinguistics. Digital Endangered Languages and Music Archives Network; Distributed Access Management for Language Resources (EU project, started on 1.1.05)


Page 1: DELAMAN / DAM-LR - the vision -

DELAMAN / DAM-LR - the vision -

Peter Wittenburg
MPI for Psycholinguistics

Digital Endangered Languages and Music Archives Network

Distributed Access Management for Language Resources (EU project, started on 1.1.05)

Page 2: DELAMAN / DAM-LR - the vision -

When did “we” start?

• it is just 5 years since we in our discipline started speaking about
– large digital online collections
– standardizing the formats
– open metadata to arrive at browsable and searchable domains
– using open metadata to create well-organized archives

• LREC Athens 2000
– first workshop on these issues
– start of the ISLE project (linguistic concepts, lexicon, metadata, …)
– start of the work on the IMDI metadata infrastructure

• in late 2000 also first LDC workshop with OLAC as focus

• this is a very short time when you want to convince a community

Page 3: DELAMAN / DAM-LR - the vision -

What did we achieve?

• have “large” on-line digital archives/collections/Digital Libraries
– MPI ~40.000 session bundles (> 100.000 objects) / ~11 TB
– DOBES ~1.500 session bundles / 1500 h
– AILLA archive
– PARADISEC archive
– Lund corpus archive
– also in HLT domain larger data centers
– also “traditional” archives (Phonogramm Archiv, NAA, …)
– etc

• idea of web visibility and online accessibility spreads
• necessity of central data collection and preservation spreads

Page 4: DELAMAN / DAM-LR - the vision -

What did we achieve?

• much evangelization and agreement about standards

• “everyone” agrees on XML, UNICODE and linear PCM
• “everyone” understands the relevance of schemas to make linguistic structure and encoding explicit
• wrt JPEG and MPEG we are shooting at a moving target, but don’t yet have real alternatives

Page 5: DELAMAN / DAM-LR - the vision -

What did we achieve?

• interoperability is still a dream, however …
– have metadata gateways in our discipline (OLAC-IMDI)
– increasingly often tools are producing correct XML, UNICODE, …
– have filters for character encodings and formats, although we miss well-designed and comprehensive services
– have started with ontology work to tackle the linguistic aspects
  • GOLD ontology from E-Meld
  • ISO TC37/SC4 Data Category Registry
  • TDS (Dutch Typology Project) meta-language
  • EAGLES/ISLE/TEI specifications

• we are at the beginning
• cannot speak yet about fully operational infrastructures, but there are island tools like FIELD, LEXUS, ONTO-ELAN, …

Page 6: DELAMAN / DAM-LR - the vision -

Changing role of Language Archives

[Diagram: different groups of people contribute to The Archive; different groups of people use the content; specialists maintain, unify, check quality, etc]

• at the MPI it is understood that the archive is the capital to build on
• in the DOBES programme the point is to make results explicit and accessible

• only works if we don’t have “inert, dusty” archives
• language archives are dynamic!

Page 7: DELAMAN / DAM-LR - the vision -

DOBES / MPI Archives as Example

Page 8: DELAMAN / DAM-LR - the vision -

Vision for a single archive

[Architecture diagram of The Archive; legend: done, in progress, to start]

• User
• Primary Resources: Texts, Images, Sound, Movies
• Domain of Registered Primary and Secondary Resources
• Domain of Descriptive Metadata
• Archive Utility Layer
• Metadata Tools
• Data Ingestion & Management
• User Authentication
• Access Rights
• Web-based Archive Exploration
– Annotation Exploration
– Lexicon Exploration
– Text Exploration
• Ontological Knowledge
• (Web-based) Archive Enrichment
– Media Annotation
– Lexical Encoding
– Web Commentary

Page 9: DELAMAN / DAM-LR - the vision -

Content Organization

[Diagram: The Archive with its User; Primary Resources: Texts, Images, Sound, Movies; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata]

Archive Contents

• IMDI domain (IMDI schema)
– mydomain: Kilivila, Trumai
– yourdomain: Tofa, Tseltal
• mytext, mysound, myimage, mymovie, myannotations
• yourtext, yoursound, yourimage, yourmovie, yourannotations
• info files
• info files: t-lexicon, grammar, ….
• info files: k-lexicon, grammar
• EAF schema: myanno structure, youranno structure
• LEXUS schema: t-lex structure, k-lex structure, …..

Page 10: DELAMAN / DAM-LR - the vision -

IMDI Based Virtual Layer (corp man)

• researcher free to define structure
• MD descriptions have to be correct (IMDI schema and CV)

Access Management, Nijmegen, November 2004

[Diagram: IMDI domain tree. mydomain: Kilivila, Trumai; yourdomain: Tofa, Tseltal; resources: mytext, mysound, myimage, mymovie, myannotations and yourtext, yoursound, yourimage, yourmovie, yourannotations; info files: t-lexicon, k-lexicon, grammar, ….]

• fully distributed domain
• sufficient to register the root URL
• searching requires harvesting
• HTML browsing requires harvesting
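Registering only the root URL can be sufficient because every metadata node links to its children, so a harvester can walk the whole tree and build a local index for searching. The following is a minimal sketch of that idea, not the IMDI implementation: the URLs, the dict layout of a node and the `fetch` function are hypothetical stand-ins for real metadata files on distributed servers.

```python
from collections import deque

# Hypothetical stand-in for metadata files spread over two servers.
# Each URL maps to a node with a title and links to its child nodes.
FAKE_WEB = {
    "http://a.example/root": {
        "title": "IMDI domain",
        "links": ["http://a.example/mydomain", "http://b.example/yourdomain"],
    },
    "http://a.example/mydomain": {
        "title": "mydomain",
        "links": ["http://a.example/kilivila", "http://a.example/trumai"],
    },
    "http://b.example/yourdomain": {
        "title": "yourdomain",
        "links": ["http://b.example/tofa"],
    },
    "http://a.example/kilivila": {"title": "Kilivila", "links": []},
    "http://a.example/trumai": {"title": "Trumai", "links": []},
    "http://b.example/tofa": {"title": "Tofa", "links": []},
}


def fetch(url):
    """Hypothetical fetch: return the metadata node stored at `url`."""
    return FAKE_WEB[url]


def harvest(root_url, fetch=fetch):
    """Breadth-first walk starting at the registered root URL.

    Returns a local index {url: title} that can be searched offline.
    """
    index = {}
    queue = deque([root_url])
    while queue:
        url = queue.popleft()
        if url in index:  # skip nodes that were already harvested
            continue
        node = fetch(url)
        index[url] = node["title"]
        queue.extend(node["links"])
    return index


def search(index, term):
    """Case-insensitive substring search over the harvested titles."""
    term = term.lower()
    return sorted(url for url, title in index.items() if term in title.lower())
```

Browsing could follow the links live, but searching needs the harvested index because no single server knows the content of the others.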

Page 11: DELAMAN / DAM-LR - the vision -

Ingestion & Management

[Diagram: The Archive with its User, resource and metadata domains, and the Data Ingestion & Management component. IMDI domain tree: MPI (Kilivila, Trumai) and DOBES (Tofa, Tseltal); resources: text, sound, image, movie, annotations, eye movements; info files: lexica, grammar, ….]

Resource Ingestion

1. upload/define structure

2. upload/define sessions

3. upload resources

4. link resources

5. define access policy

6. system to carry out checks
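The six steps above can be read as a small pipeline in which the last step validates what the earlier steps produced. The sketch below is only an illustration under stated assumptions: the `Session` class, its fields and the specific checks are hypothetical and are not taken from LAMUS.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    """Hypothetical session bundle built up during steps 1-5."""
    name: str
    node: str                                     # place in the corpus structure (steps 1-2)
    uploaded: set = field(default_factory=set)    # files uploaded in step 3
    linked: set = field(default_factory=set)      # files linked in step 4
    access_policy: Optional[str] = None           # policy defined in step 5


def check_session(session, structure):
    """Step 6: return a list of problems; an empty list means the session passes."""
    problems = []
    if session.node not in structure:
        problems.append(f"unknown structure node: {session.node}")
    for name in sorted(session.linked - session.uploaded):
        problems.append(f"linked but not uploaded: {name}")
    for name in sorted(session.uploaded - session.linked):
        problems.append(f"uploaded but not linked: {name}")
    if session.access_policy is None:
        problems.append("no access policy defined")
    return problems
```

Returning all problems at once, rather than stopping at the first, lets a depositor fix a whole session in one pass.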

LAMUS Light almost ready

[Diagram: three text/sound resource pairs]

Page 12: DELAMAN / DAM-LR - the vision -

IMDI Metadata Infrastructure

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management]

Metadata Tools:
• Editor
• XML Browser
• HTML Browser
• Treebuilder

Page 13: DELAMAN / DAM-LR - the vision -

Access & User Management

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights]

Page 14: DELAMAN / DAM-LR - the vision -

Access Management

[Diagram: three domains. Domain of control delegation: MPI CM, personX, personY, personZ. Domain of open metadata descriptions. Domain of resources to be protected: text, sound, image, movie, annotations, eye movements; info files]

• current solution is centralized – one database
• has delegation mechanism to make administration tractable
• association of declarations etc is possible
• powerful commands from any node to give rights to groups
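Delegation keeps administration tractable because whoever manages a node can hand control of a sub-tree to someone else, and rights granted at a node apply to everything below it. A minimal sketch of such a scheme follows; the `AccessTree` class, its method names and the node paths are hypothetical and do not describe the MPI database.

```python
class AccessTree:
    """Hypothetical corpus tree with delegated management and group read rights."""

    def __init__(self):
        self.parent = {}     # node -> parent node (None for the root)
        self.managers = {}   # node -> users allowed to administer that sub-tree
        self.readers = {}    # node -> groups allowed to read that sub-tree

    def add_node(self, node, parent=None):
        self.parent[node] = parent

    def _ancestors(self, node):
        """Yield the node itself and then every node above it."""
        while node is not None:
            yield node
            node = self.parent[node]

    def can_manage(self, user, node):
        return any(user in self.managers.get(n, set()) for n in self._ancestors(node))

    def delegate(self, by, to, node):
        """`by` hands management of the sub-tree at `node` to `to`."""
        if not self.can_manage(by, node):
            raise PermissionError(f"{by} may not delegate {node}")
        self.managers.setdefault(node, set()).add(to)

    def grant_read(self, by, group, node):
        """`by` gives `group` read access to the sub-tree at `node`."""
        if not self.can_manage(by, node):
            raise PermissionError(f"{by} may not grant rights on {node}")
        self.readers.setdefault(node, set()).add(group)

    def can_read(self, group, node):
        return any(group in self.readers.get(n, set()) for n in self._ancestors(node))
```

Because both checks walk up the ancestors, one command at a high node covers every session below it.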

Page 15: DELAMAN / DAM-LR - the vision -

Web-based Annotation Exploitation

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation]

Page 16: DELAMAN / DAM-LR - the vision -

Web-based Lexicon Exploitation

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation]

Page 17: DELAMAN / DAM-LR - the vision -

Web-based Text Exploitation

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation, Text Exploitation]

• Text Exploitation: meant for field notes, grammars, ethno notes, etc
• nothing concrete yet, but least complex to implement

Page 18: DELAMAN / DAM-LR - the vision -

Web-based Archive Exploitation

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation, Text Exploitation, and a further component marked “?”]

Page 19: DELAMAN / DAM-LR - the vision -

Ontology Support Necessary

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation, Text Exploitation, “?”, Ontological Knowledge]

Ontological Knowledge: mo = morpho, n = noun, …

Page 20: DELAMAN / DAM-LR - the vision -

The Problem

[Diagram: annotation and lexicon resources encoding the same word with different labels]

• trans: dog; POS: noun
• ortho: dog; PS: n
• form: dog; wordclass: no

?? this is not the same for a stupid search engine

Page 21: DELAMAN / DAM-LR - the vision -

Central Solution

[Diagram: the three encodings of “dog” (trans / POS noun, ortho / PS n, form / wordclass no) mapped to a central registry]

Central ISO DCR

• cat 107 = orthographic transcription
• cat 229 = part-of-speech
• cat 531 = noun

• trans = cat 107, POS = cat 229, noun = cat 531
• ortho = cat 107, PS = cat 229, n = cat 531
• form = cat 107, wordclass = cat 229, no = cat 531

• contains all relevant linguistic definitions
• can refer to them
• given linguistic differences not realistic
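The central approach can be pictured as every resource mapping its own labels onto shared category numbers, so a search engine compares numbers instead of strings. The sketch below uses the category numbers from this page; the resource names and the function names are hypothetical.

```python
# Central registry: category number -> definition (numbers as given on this page).
CENTRAL_DCR = {
    107: "orthographic transcription",
    229: "part-of-speech",
    531: "noun",
}

# Each resource declares how its local labels map to central categories.
# The resource names are hypothetical.
LOCAL_TO_CENTRAL = {
    "resource_1": {"trans": 107, "POS": 229, "noun": 531},
    "resource_2": {"ortho": 107, "PS": 229, "n": 531},
    "resource_3": {"form": 107, "wordclass": 229, "no": 531},
}


def normalize(resource, label):
    """Translate a local label into the central category definition."""
    return CENTRAL_DCR[LOCAL_TO_CENTRAL[resource][label]]


def same_category(res_a, label_a, res_b, label_b):
    """True if two local labels refer to the same central category."""
    return LOCAL_TO_CENTRAL[res_a][label_a] == LOCAL_TO_CENTRAL[res_b][label_b]
```

This only works if every label really has a counterpart in the one central registry, which is the assumption the page calls unrealistic.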

Page 22: DELAMAN / DAM-LR - the vision -

Individual Solution

[Diagram: the three encodings of “dog” (trans / POS noun, ortho / PS n, form / wordclass no) related through a linguist’s mapping file]

Linguist’s mapping file

• trans = ortho = form
• POS = PS = gramcat
• n = noun = no

• means a lot of work for all individuals
• given time constraints not realistic
• will start with this version
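A linguist’s mapping file of this kind amounts to a list of equivalence classes that a search engine can use to expand a query. The sketch below assumes a plain-text format with one `a = b = c` line per class, copied from the lines on this page; the file format and the function names are assumptions, not an existing tool.

```python
# Hypothetical mapping file content, one equivalence class per line.
MAPPING_FILE = """\
trans = ortho = form
POS = PS = gramcat
n = noun = no
"""


def load_mapping(text):
    """Map every label to the first label on its line (its canonical form)."""
    canonical = {}
    for line in text.splitlines():
        labels = [part.strip() for part in line.split("=") if part.strip()]
        for label in labels:
            canonical[label] = labels[0]
    return canonical


def expand_query(canonical, label):
    """Return all labels that should be treated as equal to `label`."""
    target = canonical.get(label, label)
    matches = sorted(l for l, c in canonical.items() if c == target)
    return matches or [label]  # unknown labels only match themselves
```

Each linguist has to write and maintain such a file, which is the workload the page points out.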

Page 23: DELAMAN / DAM-LR - the vision -

Proper Solution

• how long will it take to be there?
• nevertheless – have to start now!

[Diagram: Search Engine connected through relations to the central ISO DCR, the MPI DCR and a personal DCR in the Domain of Ontologies]

• there will be many knowledge sources
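One way to read the layered picture is that a label in a personal registry is related to an entry in an institutional registry, which in turn is related to a central category, and a search engine follows these relations until it reaches a definition. The sketch below illustrates that reading only; the registry contents, the relation format and the `resolve` function are all hypothetical.

```python
_MISSING = object()

# Each registry maps a label to (target_registry, target_label).
# None means the label is defined in that registry itself.
# All entries are hypothetical examples.
REGISTRIES = {
    "personal": {"no": ("MPI", "noun"), "wordclass": ("MPI", "POS")},
    "MPI": {"noun": ("ISO", "cat 531"), "POS": ("ISO", "cat 229")},
    "ISO": {"cat 531": None, "cat 229": None},
}


def resolve(registry, label, max_hops=10):
    """Follow relations until a label defined in its own registry is reached."""
    for _ in range(max_hops):
        entry = REGISTRIES[registry].get(label, _MISSING)
        if entry is _MISSING:
            raise KeyError(f"{label!r} not found in {registry} registry")
        if entry is None:
            return registry, label
        registry, label = entry
    raise RuntimeError("relation chain too long (possible cycle)")
```

The hop limit guards against cyclic relations, which become likely once many independent knowledge sources refer to each other.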

Page 24: DELAMAN / DAM-LR - the vision -

Web-Based Annotation

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation, Text Exploitation, “?”, Ontological Knowledge, Archive Enrichment, Media Annotation]

• for now: first download, annotate and upload
• online annotation later

Page 25: DELAMAN / DAM-LR - the vision -

Web-based Lexicon Editing

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation, Text Exploitation, “?”, Ontological Knowledge, Archive Enrichment, Media Annotation, Lexical Encoding]

Page 26: DELAMAN / DAM-LR - the vision -

Web-based Commentary

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation, Text Exploitation, “?”, Ontological Knowledge, Archive Enrichment, Media Annotation, Lexical Encoding, Web Commentary]

Example commentary:
Comment: This is an interesting relation
Type: Semantic
Author: Peter Wittenburg
Date: 27.9.2004
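The example comment suggests a small record that is stored next to, not inside, the archived resource it refers to. A minimal sketch of such a record follows; the field names `target`, `kind` and `created`, the target identifier and the helper function are hypothetical, while the comment text, type, author and date come from this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commentary:
    """Hypothetical web commentary attached to an archived object."""
    target: str     # identifier of the commented resource or relation (assumed)
    comment: str
    kind: str       # the "Type" field of the example
    author: str
    created: date


def comments_for(store, target):
    """Return all comments in `store` that refer to `target`."""
    return [c for c in store if c.target == target]


store = [
    Commentary(
        target="example-relation-1",  # hypothetical identifier
        comment="This is an interesting relation",
        kind="Semantic",
        author="Peter Wittenburg",
        created=date(2004, 9, 27),
    )
]
```

Keeping the record immutable and separate from the resource means comments enrich the archive without altering the deposited data.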

Page 27: DELAMAN / DAM-LR - the vision -

Language Archives – The Vision

[Architecture diagram of The Archive with its User, resource and metadata domains. Components shown: Archive Utility Layer, Metadata Tools, Data Ingestion & Management, User Authentication, Access Rights, Archive Exploitation, Annotation Exploitation, Lexicon Exploitation, Text Exploitation, “?”, Ontological Knowledge, Archive Enrichment, Media Annotation, Lexical Encoding, Web Commentary]

Page 28: DELAMAN / DAM-LR - the vision -

Cross-Archive Dimension

DELAMAN / DAM-LR Visions

Page 29: DELAMAN / DAM-LR - the vision -

DELAMAN / DAM-LR Map

[Map with the following sites: MPI, AILLA, EMELD, ANLC, LACITO, ELAR, PARADISEC, AMPM, Lund, INL, AIATSIS]

Page 30: DELAMAN / DAM-LR - the vision -

Exchange Resources

• have to take care of long-term data preservation
• only chance is world-wide distribution

[Diagram: archive A and archive B, each holding Raw Data and Metadata; data exchange between them for data survival reasons]

Page 31: DELAMAN / DAM-LR - the vision -

Joint Access Domain

• Users want to work across administrative boundaries

[Diagram: DOBES Archive and AILLA Archive, each with Raw Data and Metadata; “my personal Trumai archive” combines AILLA Trumai and DOBES Trumai]

• not just copies but result of own creative process

Page 32: DELAMAN / DAM-LR - the vision -

Goals

• it’s about future usage scenarios with distributed archives
• it’s about federated language resource archives
• it’s about eScience scenarios in linguistics

• want to exchange data automatically (list driven)
• want to allow people to create integrated virtual working spaces
• want to have an integrated access management domain (one identity, rights go with the copies, …)

• first talks in Nijmegen and at HRELP workshops 2003
• foundation at PARADISEC meeting in Sydney 2003

• last workshop in Nijmegen November 2004
– linguists
– archivists
– (GRID) technologists
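The goal of list-driven automatic exchange can be sketched as two archives comparing manifests and copying only what the agreed list names and the partner lacks. This is an illustration under stated assumptions: archives are modelled as in-memory dicts, and the function names and the use of SHA-256 checksums are choices made for the example, not part of DELAMAN or DAM-LR.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Content fingerprint used to detect missing or differing copies."""
    return hashlib.sha256(data).hexdigest()


def manifest(archive):
    """archive: {resource_id: bytes} -> {resource_id: checksum}."""
    return {rid: checksum(data) for rid, data in archive.items()}


def plan_exchange(source_manifest, target_manifest, exchange_list):
    """Ids from the list that the target lacks or holds in a different version."""
    return sorted(
        rid
        for rid in exchange_list
        if rid in source_manifest
        and target_manifest.get(rid) != source_manifest[rid]
    )


def exchange(source, target, exchange_list):
    """Copy the planned resources from source to target; return what was copied."""
    todo = plan_exchange(manifest(source), manifest(target), exchange_list)
    for rid in todo:
        target[rid] = source[rid]
    return todo
```

Running the exchange a second time copies nothing, so it can be repeated on a schedule without moving data that the partner already holds.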

Page 33: DELAMAN / DAM-LR - the vision -

Technologies

• much technology to achieve our goals is available
– A-Select authentication system
– Shibboleth authorization system
– Handle System for URID resolving
– Distributed metadata environment such as IMDI
– Storage Resource Broker for federated resources
– Web-Services for layered services
– …

Page 34: DELAMAN / DAM-LR - the vision -


Links

DELAMAN Web-Site www.delaman.org

DELAMAN Workshop-Site www.mpi.nl/delaman/workshop

DOBES Web-Site www.mpi.nl/DOBES

MPI Archive Web-Site www.mpi.nl/world/corpus

MPI Tools Web-Site www.mpi.nl/tools