The current state of Metadata - as far as we understand it -
Peter Wittenburg
The Language Archive - Max Planck Institute
CLARIN Research Infrastructure
Nijmegen, The Netherlands
Old Concept
• of course "metadata" is an old concept
• library cards were introduced to cope with mass and anonymity
• not surprising that library people started thinking about this to describe all kinds of web-accessible resources
• DC and qualified DC were the results
• however, the research world is different - not just search
• therefore solutions were developed in many domains
• 2 years ago CLARIN revised its 15-year-old set & framework
Big Ideas
• of course managing increasing amounts of data
• of course finding valuable data in the growing haystacks
• but also
• machine usage of metadata
• automatic profile matching
• research statistics - virtual sub-collection building
• etc.
• multilinguality in a multilingual European society
• interdisciplinary research
(biodiversity people should find information in linguistic archives, etc.)
• linking with contextual information
• document lifecycle management (provenance)
Big Change
• until now researchers informed each other
• culture of personal exchange
• claim: this will only work partially in the future
• have distributed centers storing lots of data
(national and discipline dimensions)
• depositors upload their data into these centers
• will have an anonymous landscape of data & tools
(all offered as services)
• what do we have to find things:
• proper metadata descriptions
• social tagging by virtual organizations
• content to operate on by "smart" data mining
Big Question
• are we ready to meet these wishes and changes?
• probably not
• some major issues
• quality
• interoperability
• registry and reference stability
• functional
• multilingual
• scalability
• IT principles
Quality Issue
• lack of quality in descriptions
• not all elements filled in
(researchers are lazy, lack of tool support)
• often not schema based (XLS) thus inconsistent
• lack of agreed and standardized vocabularies
• ISO 639-3 - about 6000 language codes
• what about subject classification schemes
• what about institution names
• thus many errors and inconsistencies
• ontologies are expensive to maintain
• misinterpretations/misuse of element semantics
• etc.
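
The two most mechanical quality problems above (elements left empty, values outside an agreed vocabulary) can be caught by a simple check at deposit time. The following is only a minimal sketch: the element names in REQUIRED_ELEMENTS and the record layout are hypothetical, and KNOWN_LANGUAGE_CODES is a tiny illustrative subset of ISO 639-3, not the full code list.

```python
# Hypothetical required elements; a real metadata profile would define its own.
REQUIRED_ELEMENTS = ("title", "creator", "language", "institution")

# Tiny illustrative subset of ISO 639-3 codes (assumption: a real checker
# would load the complete code list from the standard).
KNOWN_LANGUAGE_CODES = {"eng", "deu", "nld", "fra"}


def check_record(record):
    """Return a list of human-readable quality problems for one record."""
    problems = []

    # Completeness: every required element must be present and non-empty.
    for element in REQUIRED_ELEMENTS:
        value = record.get(element)
        if value is None or not str(value).strip():
            problems.append(f"missing element: {element}")

    # Vocabulary: a filled-in language must be a known code, not free text.
    language = str(record.get("language") or "").strip()
    if language and language not in KNOWN_LANGUAGE_CODES:
        problems.append(f"unknown language code: {language}")

    return problems


if __name__ == "__main__":
    sample = {"title": "Field recordings", "creator": "",
              "language": "Dutch", "institution": "MPI"}
    for problem in check_record(sample):
        print(problem)
```

A check like this does not address the harder problems on the slide (misused element semantics, missing classification schemes); it only shows that schema-based descriptions make the easy errors detectable.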
Interoperability Issue
• hampered by different approaches (closed DB, no modularity, embedded ontologies)
• structural difficulties up to context dependency
• difficult semantic mapping
• different description dimensions
• bad element definitions
• bad vocabulary definitions
• only little support of OAI-PMH
• reliance on DC semantics - but useless for research etc.
• often "hardwired" mappings
• lack of a flexible framework to create/share/use relations
• little is standardized - what about lifetime then
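
For readers unfamiliar with OAI-PMH: a harvester asks a repository for records with a plain HTTP request that names a verb and a metadata format. The sketch below only builds such a request URL and makes no network call. The base URL is a made-up placeholder and the helper name list_records_url is hypothetical; the `verb`, `metadataPrefix` and `set` parameters and the `oai_dc` prefix come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no request is sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting of one set of the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint, not a real archive.
    print(list_records_url("https://archive.example.org/oai"))
```

The default prefix `oai_dc` is the unqualified Dublin Core format that every OAI-PMH repository has to offer, which is one reason harvested metadata so often falls back on DC semantics.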
Registry and Reference Stability Issue
• flexibility only when we separate things
• define & register all concepts in open registries
(we are using ISO 12620 - ISOcat)
• define & register all components/profiles
(we are using CLARIN registry)
• register all mappings (nothing yet)
• but if we do this we need to refer
• are our references stable??
• some are using Cool URIs - are they just URLs?
• some using explicit Handles - are they maintained?
• who takes care?
(we are using EPIC - European PID Consortium)
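
The idea of separating element definitions from mappings can be illustrated with a small sketch. Instead of hardwiring "element X in schema A equals element Y in schema B", each element points at a registered concept, and a mapping is derived by going through that concept. Everything named here is hypothetical: the schema names, the element names, and the "concept:..." identifiers stand in for stable references to registry entries and are not real ISOcat identifiers.

```python
# Hypothetical registry links: (schema, element) -> registered concept ID.
ELEMENT_TO_CONCEPT = {
    ("dc", "language"): "concept:language-id",
    ("profileA", "Language.Id"): "concept:language-id",
    ("dc", "title"): "concept:resource-title",
    ("profileA", "Name"): "concept:resource-title",
}


def map_element(source_schema, element, target_schema):
    """Map an element to another schema via the concept both refer to.

    Returns the target element name, or None when the source element is
    not registered or the target schema has no element for the concept.
    """
    concept = ELEMENT_TO_CONCEPT.get((source_schema, element))
    if concept is None:
        return None
    for (schema, target_element), target_concept in ELEMENT_TO_CONCEPT.items():
        if schema == target_schema and target_concept == concept:
            return target_element
    return None


if __name__ == "__main__":
    print(map_element("dc", "language", "profileA"))
    print(map_element("dc", "subject", "profileA"))
```

With this arrangement, adding a new schema means registering its elements once against existing concepts rather than writing a pairwise mapping to every other schema, which is why the stability of the concept references matters so much.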
Functional Issue
• do we address new functional requirements
• what about provenance information - is it automatically generated
• what about versions - are they visible
• what about LTP (long-term preservation) information
• what about formal access information
• do we know what is needed for the web services scenario
(profile matching, deployment information, etc.)
Multilingual Issue
• what does it really include?
• localizing all software
• multilingual definitions of all concepts, elements and vocabulary terms
(no translations of proper names of course, or?)
• or do we simply rely on some lingua franca
• answer probably discipline dependent
• how much is (or should be) the public involved
• whatever we do it is a lot of work
• CLARIN: ISOcat covers almost all major EU languages
Scalability Issue
• are our solutions scalable?
• in EUROPEANA millions of metadata records
• in CLARIN about 270,000
• how to structure the offer
• how to present this to naive users
• do we share the same granularity (metadata at collection and/or resource level)
• can we deal with aggregations in the same way
• can we apply semantic web technology
• automatic mapping
• automatic quality improvement
IT Principles
• we need to disseminate the message of some basic IT principles
• define and register your semantics
• specify and register your syntax
• use a stable reference scheme
• in some areas separate definitions and relations
• get things standardized or use standards such as
• XML, some schema language
• ISO 12620, etc.
• URI, Handles
What can we do?
• listen to each other first
• increase awareness about metadata and basic principles
• see how we can create an interoperable landscape
• harmonizing approaches
• harmonizing along major issues
• making things explicit and scalable
• look for proper interdisciplinary solutions
Üm nicht to end in Babylonish scenario nous avons still algo time om sistemas te improve.
Thanks for your attention.
moving towards an ideal e-Science domain