DELAMAN / DAM-LR - the vision -
1
DELAMAN / DAM-LR - the vision -
Peter Wittenburg, MPI for Psycholinguistics
Digital Endangered Languages and Music Archives Network
Distributed Access Management for Language Resources (EU project, started 1.1.05)
2
When did “we” start?
• only 5 years ago we started in our discipline to speak about
  – large digital online collections
  – standardizing the formats
  – open metadata to come to browsable and searchable domains
  – using open metadata to create well-organized archives
• LREC Athens 2000
  – first workshop on these issues
  – start of the ISLE project (linguistic concepts, lexicon, metadata, …)
  – start of the work on the IMDI metadata infrastructure
• in late 2000 also the first LDC workshop, with OLAC as focus
• this is a very short time when you want to convince a community
3
What did we achieve?
• have “large” on-line digital archives/collections/Digital Libraries
  – MPI ~40.000 session bundles (> 100.000 objects) / ~11 TB
  – DOBES ~1.500 session bundles / 1500 h
  – AILLA archive
  – PARADISEC archive
  – Lund corpus archive
  – also in HLT domain larger data centers
  – also “traditional” archives (Phonogramm Archiv, NAA, …)
  – etc
• idea of web visibility and online accessibility spreads
• necessity of central data collection and preservation spreads
4
What did we achieve?
• much evangelization and agreement about standards
• “everyone” agrees with XML, UNICODE and linear PCM
• “everyone” understands the relevance of schemas to make linguistic structure and encoding explicit
• wrt JPEG and MPEG we are shooting at a moving target, but don’t yet have real alternatives
5
What did we achieve?
• interoperability is still a dream, however …
  – have metadata gateways in our discipline (OLAC-IMDI)
  – increasingly often tools are producing correct XML, UNICODE, …
  – have filters for character encodings and formats, although we miss well-designed and comprehensive services
  – have started with ontology work to tackle the linguistic aspects
    • GOLD ontology from E-Meld
    • ISO TC37/SC4 Data Category Registry
    • TDS (Dutch Typology Project) meta-language
    • EAGLES/ISLE/TEI specifications
• we are at the beginning
• cannot yet speak about fully operational infrastructures, but there are island tools like FIELD, LEXUS, ONTO-ELAN, …
6
Changing role of Language Archives
[Diagram: different groups of people contribute to The Archive; different groups of people use the content; specialists maintain, unify, check quality, etc.]
• at the MPI it is understood that the archive is the capital to build on
• in the DOBES programme the point is to make results explicit and accessible
• only works if we don’t have an “inert, dusty” archive
• language archives are dynamic!
7
DOBES / MPI Archives as Example
8
Vision for a single archive
[Diagram: The Archive as seen by the User. Components: Primary Resources (texts, images, sound, movies); Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Archive Utility Layer; Metadata Tools; Data Ingestion & Management; User Authentication; Access Rights; Web-based Archive Exploration (Annotation Exploration, Lexicon Exploration, Text Exploration); Ontological Knowledge; (Web-based) Archive Enrichment (Media Annotation, Lexical Encoding, Web Commentary). Legend: done / in progress / to start]
9
Content Organization
[Diagram: Archive Contents within The Archive (Primary Resources: texts, images, sound, movies; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata)]
IMDI domain
  – mydomain: Kilivila, Trumai
  – yourdomain: Tofa, Tseltal
linked resources
  – mytext, mysound, myimage, mymovie, myannotations
  – yourtext, yoursound, yourimage, yourmovie, yourannotations
  – info files: t-lexicon, k-lexicon, grammar, …
schemas
  – IMDI schema
  – EAF schema: myanno structure, youranno structure
  – LEXUS schema: t-lex structure, k-lex structure, …
10
IMDI Based Virtual Layer (corp man)
• researcher is free to define the structure
• MD descriptions have to be correct (IMDI schema and CV)
Access Management, Nijmegen, November 2004
[Diagram: IMDI domain tree (mydomain: Kilivila, Trumai; yourdomain: Tofa, Tseltal) with linked resources (text, sound, image, movie, annotations) and info files (t-lexicon, k-lexicon, grammar, …)]
• fully distributed domain
• sufficient to register the root URL
• searching requires harvesting
• HTML browsing requires harvesting
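The point that only the root URL is registered, while searching needs a harvesting pass, can be illustrated with a minimal sketch. It does not use the real IMDI file format or tools: the `FAKE_WEB` dictionary, the URLs, the field names and the functions `fetch`, `harvest` and `search` are all assumptions made for illustration, standing in for metadata files hosted on different servers.

```python
from collections import deque

# Hypothetical stand-in for metadata files on different servers.
# Corpus nodes only link to children; session nodes carry descriptions.
FAKE_WEB = {
    "http://a.example/root.imdi": {
        "type": "corpus",
        "links": ["http://a.example/kilivila.imdi", "http://b.example/tofa.imdi"],
    },
    "http://a.example/kilivila.imdi": {
        "type": "session", "language": "Kilivila", "title": "mytext",
    },
    "http://b.example/tofa.imdi": {
        "type": "session", "language": "Tofa", "title": "yourtext",
    },
}


def fetch(url):
    """Pretend to download and parse one metadata file."""
    return FAKE_WEB[url]


def harvest(root_url, fetch=fetch):
    """Breadth-first walk from the registered root URL; collect session records."""
    seen, queue, sessions = {root_url}, deque([root_url]), []
    while queue:
        url = queue.popleft()
        node = fetch(url)
        if node["type"] == "session":
            record = {k: v for k, v in node.items() if k != "type"}
            sessions.append({"url": url, **record})
        for child in node.get("links", []):
            if child not in seen:  # guard against cycles and shared sub-corpora
                seen.add(child)
                queue.append(child)
    return sessions


def search(sessions, **criteria):
    """Search runs on the harvested copy, not on the distributed files."""
    return [s for s in sessions if all(s.get(k) == v for k, v in criteria.items())]
```

Usage: `search(harvest("http://a.example/root.imdi"), language="Tofa")` returns the one Tofa session record, although that file lives on a different (fictional) host than the root.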
11
Ingestion & Management
[Diagram: The Archive with Data Ingestion & Management highlighted. IMDI domain with MPI (Kilivila, Trumai) and DOBES (Tofa, Tseltal); resources: text, sound, image, movie, annotations, eye movements; info files: lexica, grammar, …]
Resource Ingestion
1. upload/define structure
2. upload/define sessions
3. upload resources
4. link resources
5. define access policy
6. system to carry out checks
LAMUS Light: almost ready
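The six ingestion steps can be sketched as a small bookkeeping class. This is not LAMUS code: the class and method names, the `ALLOWED_KINDS` set and the particular checks are assumptions chosen to show what "system to carry out checks" could mean in the simplest case.

```python
from dataclasses import dataclass, field

# Assumed set of resource kinds, loosely following the slide's resource types.
ALLOWED_KINDS = {"text", "sound", "image", "movie", "annotation"}


@dataclass
class Session:
    name: str
    resources: dict = field(default_factory=dict)  # file name -> resource kind


@dataclass
class Ingest:
    structure: dict = field(default_factory=dict)  # corpus node -> list of sessions
    uploaded: set = field(default_factory=set)     # file names received so far
    policy: dict = field(default_factory=dict)     # corpus node -> access policy

    def define_node(self, node):                   # step 1: define structure
        self.structure.setdefault(node, [])

    def define_session(self, node, name):          # step 2: define sessions
        session = Session(name)
        self.structure[node].append(session)
        return session

    def upload(self, filename):                    # step 3: upload resources
        self.uploaded.add(filename)

    def link(self, session, filename, kind):       # step 4: link resources
        session.resources[filename] = kind

    def set_policy(self, node, policy):            # step 5: define access policy
        self.policy[node] = policy

    def check(self):                               # step 6: system checks
        problems, linked = [], set()
        for node, sessions in self.structure.items():
            if node not in self.policy:
                problems.append(f"{node}: no access policy")
            for session in sessions:
                for filename, kind in session.resources.items():
                    linked.add(filename)
                    if kind not in ALLOWED_KINDS:
                        problems.append(f"{filename}: unknown kind {kind!r}")
                    if filename not in self.uploaded:
                        problems.append(f"{filename}: linked but not uploaded")
        for filename in sorted(self.uploaded - linked):
            problems.append(f"{filename}: uploaded but not linked")
        return problems
```

An empty list from `check()` means that every node has a policy and that uploads and links agree; anything else is a list of readable problem descriptions.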
12
IMDI Metadata Infrastructure
[Diagram: the archive architecture with the Metadata Tools highlighted: Editor, XML Browser, HTML Browser, Treebuilder]
13
Access & User Management
[Diagram: the archive architecture with User Authentication and Access Rights highlighted]
14
Access Management
[Diagram: MPI CM delegates control to person X, person Y and person Z (domain of control delegation); domain of open metadata descriptions; domain of resources to be protected (text, sound, image, movie, annotations, eye movements, info files)]
• current solution is centralized – one database
• has delegation mechanism to make administration tractable
• association of declarations etc is possible
• powerful commands from any node to give rights to groups
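How delegation and group rights "from any node" might interact can be shown with a small tree model. This is a sketch under stated assumptions, not the MPI access management system: the `AccessTree` class, its method names and the rule that rights and management authority are inherited down the tree are all hypothetical.

```python
class AccessTree:
    """Toy model: rights granted at a node apply to the whole subtree below it."""

    def __init__(self):
        self.parent = {}    # node -> parent node (None for the root)
        self.managers = {}  # node -> users allowed to administer that subtree
        self.readers = {}   # node -> users or group names with read access
        self.groups = {}    # group name -> set of member users

    def add_node(self, node, parent=None):
        self.parent[node] = parent

    def _ancestors(self, node):
        # Yields the node itself, then its parent, up to the root.
        while node is not None:
            yield node
            node = self.parent[node]

    def can_manage(self, user, node):
        return any(user in self.managers.get(n, ()) for n in self._ancestors(node))

    def delegate(self, by, to, node):
        """A manager hands administration of a subtree to another person."""
        if not self.can_manage(by, node):
            raise PermissionError(f"{by} may not delegate at {node}")
        self.managers.setdefault(node, set()).add(to)

    def grant_read(self, by, who, node):
        """Give a user or a whole group read access from this node downwards."""
        if not self.can_manage(by, node):
            raise PermissionError(f"{by} may not grant rights at {node}")
        self.readers.setdefault(node, set()).add(who)

    def can_read(self, user, node):
        principals = {user} | {g for g, members in self.groups.items() if user in members}
        return any(principals & self.readers.get(n, set()) for n in self._ancestors(node))
```

In this model the corpus manager only has to delegate once per sub-corpus; the person receiving the delegation can then open a whole subtree to a group with a single command.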
15
Web-based Annotation Exploitation
[Diagram: the archive architecture with Archive Exploitation / Annotation Exploitation added]
16
Web-based Lexicon Exploitation
[Diagram: the archive architecture with Archive Exploitation / Lexicon Exploitation added]
17
Web-based Text Exploitation
[Diagram: the archive architecture with Archive Exploitation / Text Exploitation added]
• meant for field notes, grammars, ethno notes, etc
• nothing concrete yet, but least complex to implement
18
Web-based Archive Exploitation
[Diagram: the archive architecture with Archive Exploitation (Annotation Exploitation, Lexicon Exploitation, Text Exploitation) and an open slot marked “?”]
19
Ontology Support Necessary
[Diagram: the archive architecture with Ontological Knowledge added next to Archive Exploitation; example entries: mo = morpho, n = noun, …]
20
The Problem
[Diagram: the same information encoded differently in three resources (labelled Annotation, Lexicon, Annotation)]
  trans: dog     POS: noun
  ortho: dog     PS: n
  form: dog      wordclass: no
??
this is not the same for a stupid search engine
21
Central Solution
[Diagram: the three encodings of “dog” (trans / POS: noun; ortho / PS: n; form / wordclass: no)]
cat 107 = orthographic transcription
cat 229 = part-of-speech
cat 531 = noun
trans = cat 107, POS = cat 229, noun = cat 531
ortho = cat 107, PS = cat 229, n = cat 531
form = cat 107, wordclass = cat 229, no = cat 531
Central ISO DCR: contains all relevant linguistic definitions, so everyone can refer to them
given the linguistic differences, not realistic
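A minimal sketch of how a search engine could use such a central registry follows. The category numbers and the local labels are the ones on the slide; the resource names, the `CLOSED` set and the functions `normalize` and `find` are assumptions for illustration and do not reflect the actual ISO DCR interface.

```python
# Central registry: numbers and names as given on the slide.
DCR = {107: "orthographic transcription", 229: "part-of-speech", 531: "noun"}

# Each resource declares how its local labels refer to registry entries.
# The resource names are hypothetical.
DECLARATIONS = {
    "resource_1": {"trans": 107, "POS": 229, "noun": 531},
    "resource_2": {"ortho": 107, "PS": 229, "n": 531},
    "resource_3": {"form": 107, "wordclass": 229, "no": 531},
}

# Assumption: only values of closed-vocabulary categories are mapped,
# so a word form that happens to be spelled "n" or "no" is left alone.
CLOSED = {229}

RECORDS = {
    "resource_1": {"trans": "dog", "POS": "noun"},
    "resource_2": {"ortho": "dog", "PS": "n"},
    "resource_3": {"form": "dog", "wordclass": "no"},
}


def normalize(resource, record):
    """Rewrite local field names (and closed-vocabulary values) as registry IDs."""
    local = DECLARATIONS[resource]
    out = {}
    for field_name, value in record.items():
        category = local[field_name]
        out[category] = local.get(value, value) if category in CLOSED else value
    return out


def find(category, value):
    """Return every resource whose normalized record has category == value."""
    return sorted(
        name for name, record in RECORDS.items()
        if normalize(name, record).get(category) == value
    )
```

After normalization all three records are identical, so `find(229, 531)` ("part-of-speech is noun") matches all three resources even though none of them shares a label with the others.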
22
Individual Solution
[Diagram: the three encodings of “dog” (trans / POS: noun; ortho / PS: n; form / wordclass: no)]
trans = ortho = form
POS = PS = gramcat
n = noun = no
Linguist’s mapping file
means a lot of work for all individuals
given time constraints not realistic
will start with this version
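A linguist's mapping file of this kind can be read as a list of equivalence classes, which a search engine then uses to expand a query label. The sketch below takes the three lines from the slide as the file content; the file syntax (one class per line, labels separated by "=") and the function names are assumptions.

```python
# The three equivalences from the slide, in an assumed one-class-per-line syntax.
MAPPING_FILE = """\
trans = ortho = form
POS = PS = gramcat
n = noun = no
"""


def load_equivalences(text):
    """Map every label to the frozenset of labels declared equivalent to it."""
    classes = {}
    for line in text.splitlines():
        labels = [part.strip() for part in line.split("=") if part.strip()]
        group = frozenset(labels)
        for label in labels:
            classes[label] = group
    return classes


def expand(label, classes):
    """Labels to search for; an unmapped label only matches itself."""
    return classes.get(label, frozenset([label]))
```

With this, a query for `PS` is expanded to `POS`, `PS` and `gramcat`, so the search no longer depends on which label a given resource happens to use. The cost noted on the slide remains: every linguist has to write and maintain such a file.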
23
Proper Solution
how long will it take to get there? nevertheless – have to start now!
[Diagram: the Search Engine draws on a Domain of Ontologies in which the central ISO DCR, the MPI DCR and the personal DCR are linked by relations]
there will be many knowledge sources
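One way to read the layered picture is that each registry only states relations to the next, more central one, and the search engine follows those relations until it can go no further. The sketch below assumes exactly that. The registry keys, the MPI entry names `pos` and `noun`, and the function names are hypothetical; only the numbers 229 and 531 come from the Central Solution slide.

```python
# registry -> {local term: (target registry, target term)}
RELATIONS = {
    "personal": {"PS": ("mpi", "pos"), "n": ("mpi", "noun")},
    "mpi": {"pos": ("iso", 229), "noun": ("iso", 531)},
    "iso": {},  # the central registry is the end of every chain
}


def resolve(registry, term, relations=RELATIONS):
    """Follow relations until no further relation exists (or a cycle is hit)."""
    seen = set()
    while (registry, term) not in seen and term in relations.get(registry, {}):
        seen.add((registry, term))
        registry, term = relations[registry][term]
    return registry, term


def same_concept(a, b, relations=RELATIONS):
    """Two (registry, term) pairs match if they resolve to the same entry."""
    return resolve(*a, relations) == resolve(*b, relations)
```

A term without any relation simply resolves to itself, so a personal category that has not been related to anything yet stays searchable under its own name, which fits the idea of starting now and adding relations over time.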
24
Web-Based Annotation
[Diagram: the archive architecture with Archive Enrichment / Media Annotation added]
for now: first download, annotate and upload; online annotation later
25
Web-based Lexicon Editing
[Diagram: the archive architecture with Archive Enrichment / Lexical Encoding added]
26
Web-based Commentary
[Diagram: the archive architecture with Archive Enrichment / Web Commentary added]
Example commentary:
  Comment: This is an interesting relation
  Type: Semantic
  Author: Peter Wittenburg
  Date: 27.9.2004
27
Language Archives – The Vision
[Diagram: the complete archive architecture. Primary Resources (texts, images, sound, movies); Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Archive Utility Layer; Metadata Tools; Data Ingestion & Management; User Authentication; Access Rights; Archive Exploitation (Annotation, Lexicon, Text, “?”); Ontological Knowledge; Archive Enrichment (Media Annotation, Lexical Encoding, Web Commentary)]
28
Cross-Archive Dimension
DELAMAN / DAM-LR Visions
29
DELAMAN / DAM-LR Map
[Map of participating archives: MPI, AILLA, EMELD, ANLC, LACITO, ELAR, PARADISEC, AMPM, Lund, INL, AIATSIS]
30
Exchange Resources
• have to take care of long-term data preservation
• only chance is world-wide distribution
[Diagram: archive A and archive B each hold Raw Data and Metadata; data exchange between them for data survival reasons]
31
Joint Access Domain
• Users want to work across administrative boundaries
[Diagram: the DOBES Archive and the AILLA Archive each hold Raw Data and Metadata; “my personal Trumai archive” combines AILLA Trumai and DOBES Trumai]
not just copies but the result of one’s own creative process
32
Goals
• it’s about future usage scenarios with distributed archives
• it’s about federated language resource archives
• it’s about eScience scenarios in linguistics
• want to exchange data automatically (list driven)
• want to allow people to create integrated virtual working spaces
• want to have an integrated access management domain (one identity, rights go with the copies, …)
• first talks in Nijmegen and at HRELP workshops 2003
• foundation at PARADISEC meeting in Sydney 2003
• last workshop in Nijmegen November 2004
  – linguists
  – archivists
  – (GRID) technologists
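The slides do not spell out what "list driven" exchange would look like. One possible reading, sketched here purely as an assumption, is that each archive publishes a list of its files with checksums and the partner archive copies whatever is missing or differs. The function names and the use of SHA-256 are illustrative choices, not part of the DELAMAN / DAM-LR plans.

```python
import hashlib


def manifest(files):
    """Build the exchange list: path -> SHA-256 digest of the file content.

    `files` maps a path to the file's bytes (an in-memory stand-in for a disk).
    """
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def to_transfer(source_manifest, target_manifest):
    """Paths the target archive lacks or holds with different content."""
    return sorted(
        path for path, digest in source_manifest.items()
        if target_manifest.get(path) != digest
    )
```

For example, if archive A holds `trumai/s1.wav` and `trumai/s1.eaf` and archive B holds only an outdated `trumai/s1.eaf`, then `to_transfer(manifest(a_files), manifest(b_files))` lists both paths: one is missing and one has changed.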
33
Technologies
• much technology to achieve our goals is available
  – A-Select authentication system
  – Shibboleth authorization system
  – Handle System for URID resolving
  – Distributed metadata environment such as IMDI
  – Storage Resource Broker for federated resources
  – Web-Services for layered services
  – …
34
Links
DELAMAN Web-Site www.delaman.org
DELAMAN Workshop-Site www.mpi.nl/delaman/workshop
DOBES Web-Site www.mpi.nl/DOBES
MPI Archive Web-Site www.mpi.nl/world/corpus
MPI Tools Web-Site www.mpi.nl/tools