Report from the Europeana Semantic Enrichment Taskforce
Roxanne Wyns, LIBIS KU Leuven
5 May 2014, Europeana Vlaanderen
Context
• EuropeanaTech Task Force on Multilingual and Semantic Enrichment Strategy
– Task Force chairs: Juliane Stiller, Antoine Isaac, Vivien Petras
– Contributors:
Agnès Simon, Bibliothèque Nationale de France
Daniel Vila Suero, Universidad Politécnica de Madrid
Eero Hyvönen, Aalto University
Esther Guggenheim, National Library of Israel
Lars G. Svensson, Deutsche Nationalbibliothek
Nuno Freire, The European Library
Rainer Simon, AIT Austrian Institute of Technology
Rodolphe Bailly, Musical Instrument Museums Online
Roxanne Wyns, LIBIS
Seth van Hooland, Université Libre de Bruxelles
Shenghui Wang, Online Computer Library Center
Vladimir Alexiev, Ontotext
Context
• Task Force objectives
– Analyze datasets in Europeana
– Evaluate them with regard to their enrichment potential and the quality of the enrichments that were executed
– Derive a strategy for enriching metadata fields that would add value for users
• One-day workshop in Berlin (8/11/2013)
– Analyzing randomly selected datasets, their metadata fields and enrichments
– Providing a set of recommendations concerning enrichment, metadata quality, vocabularies, documentation, training …
• Task Force report: http://pro.europeana.eu/documents/468623/8b75b054-712e-432b-a0f7-761898e6f60e
Context
Semantic enrichment process at Europeana:
1. Matching the metadata of Europeana objects to external semantic data resources
– Such as controlled vocabularies or Wikipedia (DBpedia)
– Resulting in links between Europeana objects and external resources
– Can be re-used by users of the Europeana data services (e.g. API)
– e.g. ‘reciepiente cerámico’ is matched with the concept ‘ceramics’ in GEMET
2. Exploitation of these links by adding this data to the index behind the portal
– Enrichment provides multilingual search and retrieval
– The multilingual labels from the external reference vocabularies are pulled in
– e.g. a search on ‘céramique’ will also return the object with ‘reciepiente cerámico’
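The two steps above can be illustrated with a minimal Python sketch. It is not Europeana's implementation: the vocabulary, the concept URI and the record structure are hypothetical stand-ins, and the Spanish label is simply the example value from the slide.

```python
from collections import defaultdict

# Hypothetical mini-vocabulary: concept URI -> {language: label}.
# The URI and labels are illustrative, not actual GEMET data.
VOCABULARY = {
    "http://example.org/concept/ceramics": {
        "en": "ceramics",
        "fr": "céramique",
        "es": "reciepiente cerámico",
    },
}


def enrich(record, vocabulary):
    """Step 1: link a record to every concept with a label equal to a subject value."""
    links = []
    for uri, labels in vocabulary.items():
        known = {label.lower() for label in labels.values()}
        if any(value.lower() in known for value in record["subjects"]):
            links.append(uri)
    return links


def build_index(records, vocabulary):
    """Step 2: index each record under all labels of its linked concepts."""
    index = defaultdict(set)
    for record in records:
        for uri in enrich(record, vocabulary):
            for label in vocabulary[uri].values():
                index[label.lower()].add(record["id"])
    return index


records = [{"id": "obj-1", "subjects": ["reciepiente cerámico"]}]
index = build_index(records, VOCABULARY)
print(sorted(index["céramique"]))  # the French query finds the Spanish-described object
```

Because the index stores every label of the linked concept, a query in any of the vocabulary's languages reaches the record, which is the multilingual retrieval effect described on the slide.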
Context
• Europeana uses the AnnoCultor tool. Cached at http://europeanalabs.eu/wiki/EDMPrototypingTask21Annocultor
• Interprets values
• Searches for corresponding terms in specialised vocabularies
• Adds links to matching terms (dcterms:spatial = Venise is linked to the place http://sws.geonames.org/3164603)
• Pulls in additional information about this record (βενετία, velence, венеция, venice, etc.)
Enrichments take place on edm_place, skos_concept, edm_agent, and edm_timespan
Enriched elements: edm_place
• Place enrichment (edm_place:*)
– Using a subset of GeoNames (www.geonames.org)
– Limited to European geographic locations
– Limited to the feature-code prefixes "A", "P.PPL", "S.CSTL", "S.ANS", "S.MNMT", "S.LIBR", "S.HSTS", "S.OPRA", "S.AMTH", "S.TMPL", "T.ISL" (http://www.geonames.org/statistics/total.html)
– Enrichment limited to EDM fields “dcterms:spatial” and “dc:coverage”
– Enrichment rules: exact matching?
– Result: 5.8M objects enriched, provides multilingual search on places http://europeana.eu/portal/search.html?query=edm_place%3A*
– Results include content provider (CP) enrichments
Issues?
– Appear to be limited
– But only places in Europe are enriched
– And only for the geographical coverage EDM elements
– Sometimes the record is linked to the wider region (e.g. region instead of city)
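The prefix restriction can be sketched as a simple filter. The prefixes are the ones listed on the slide; reading each as "feature class, optionally followed by the start of a feature code" is an assumption, and the function name is hypothetical.

```python
# Prefixes quoted on the slide: "<feature class>" or "<class>.<start of feature code>".
ALLOWED_PREFIXES = (
    "A", "P.PPL", "S.CSTL", "S.ANS", "S.MNMT", "S.LIBR",
    "S.HSTS", "S.OPRA", "S.AMTH", "S.TMPL", "T.ISL",
)


def is_enrichable(feature_class, feature_code):
    """Return True if a GeoNames feature falls under one of the allowed prefixes."""
    full_code = f"{feature_class}.{feature_code}"
    # str.startswith accepts a tuple and checks each prefix in turn.
    return full_code.startswith(ALLOWED_PREFIXES)


print(is_enrichable("P", "PPLC"))  # capital city: under "P.PPL"
print(is_enrichable("H", "LK"))    # lake: class H is not in the list
```

Under this reading, whole classes such as hydrographic features (H) are excluded, which is one reason place enrichments can appear limited.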
Label ‘Venecia’ is linked to http://sws.geonames.org/3164003
Enriched elements: skos_concept
• Concept (topic) enrichment
– Using GEMET thesaurus (http://www.eionet.europa.eu/gemet/)
– 12 concepts removed to avoid linking with homonyms (e.g. Druck)
– Some WWI battles and the two categories “World War I” and “art” are taken from DBpedia
– Enrichment limited to EDM fields “dc:subject” and “dc:type”
– Enrichment rules: exact matching?
– Result: 8.8M objects enriched, http://www.europeana.eu/portal/search.html?query=skos_concept%3A*
Issues?
– Exact matching not limited to the language of the record (Dutch “Tegel”, meaning tile, mapped to the Swedish “Tegel”, meaning brick)
– No suitable multilingual concept thesauri for the cultural domain
– Noise because of metadata quality (dc:type “photo”, “book”, “video”, …)
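The “Tegel” problem comes from matching labels without looking at the language of the record. A minimal sketch of the difference, assuming the record language is known; the label table and URIs are hypothetical, not GEMET content.

```python
# Hypothetical label table: (language, lowercase label) -> concept URI.
LABELS = {
    ("sv", "tegel"): "http://example.org/concept/brick",
    ("en", "brick"): "http://example.org/concept/brick",
    ("nl", "baksteen"): "http://example.org/concept/brick",
}


def match_any_language(value):
    """Exact match that ignores the record language (the behaviour criticised above)."""
    wanted = value.lower()
    return sorted({uri for (_, label), uri in LABELS.items() if label == wanted})


def match_same_language(value, record_language):
    """Exact match restricted to labels in the record's own language."""
    uri = LABELS.get((record_language, value.lower()))
    return [uri] if uri else []


print(match_any_language("Tegel"))         # Dutch value wrongly linked to 'brick'
print(match_same_language("Tegel", "nl"))  # no Dutch label 'tegel', so no link
```

Restricting the lookup by language trades some recall for precision: records without a reliable language tag would need a fallback rule, which this sketch leaves out.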
Enriched elements: edm_agent
• Agent (person) enrichment
– Small set of artists (painters) from DBpedia
– Enrichment limited to EDM fields “dc:creator” and “dc:contributor”
– Enrichment rules: exact matching?
– Result: 136K objects enriched http://www.europeana.eu/portal/search.html?query=edm_agent%3A*
Issues?
– Quality or structure of the provided metadata (e.g. Role: Actor)
– Limited list of resources to link to
Enriched elements: edm_timespan
• Time period enrichment
– Using Semium time periods vocabulary (http://semium.org/time/)
– Partly automatically generated (3rd quarter of 15th century) / manually generated (Roman empire)
– Enrichment limited to EDM fields: dc:date, dc:coverage, dc:temporal, edm:year
– Enrichment rules: exact matching?
– Result: 13.3M objects enriched http://www.europeana.eu/portal/search.html?query=edm_timespan%3A*
Issues?
– Some words (qualifiers to dates, e.g. “made”, “printed”…) have to be removed from fields prior to enrichment, but this is only done for English records
– So again a problem of quality or structure of the provided metadata
– Huge issues with BC dates, but also date ranges (e.g. "1701/1800" is mapped to "1701" only)
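The two timespan issues, language-specific date qualifiers and truncated date ranges, can be sketched as a small pre-processing step. This is an illustration only: the function names are hypothetical, and the Dutch qualifier list is an assumed example (only the English words come from the slide).

```python
import re

# Assumed qualifier lists per language; only the English words come from the slide.
QUALIFIERS = {
    "en": ("made", "printed"),
    "nl": ("gemaakt", "gedrukt"),
}


def strip_qualifiers(value, language):
    """Remove date qualifiers for the given language from a field value."""
    unwanted = QUALIFIERS.get(language, ())
    return " ".join(w for w in value.split() if w.lower() not in unwanted)


def parse_year_range(value):
    """Parse 'YYYY' or 'YYYY/YYYY' into a (start, end) tuple; return None otherwise."""
    match = re.fullmatch(r"(\d{4})(?:/(\d{4}))?", value.strip())
    if match is None:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return (start, end)


print(parse_year_range("1701/1800"))                             # keeps both ends of the range
print(parse_year_range(strip_qualifiers("gedrukt 1850", "nl")))  # qualifier removed first
print(parse_year_range("300 BC"))                                # unsupported here: None
```

BC dates are deliberately left unparsed in this sketch; handling them needs an explicit convention for negative or era-qualified years.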
How to find your collection
Overview of providers: http://www.europeana.eu/portal/europeana-providers.html
• Europeana Inside Sweden (57349)
– Europeana Inside Sweden (57349)
– Murberget Länsmuseet Västernorrland (34377)
– Stockholms läns museum (22972)
• CultureGrid
– Imperial War Museums (26851)
How to find your collection
Query your Inside collection:
• Query: DATA_PROVIDER:"Murberget Länsmuseet Västernorrland"
– Exact spelling, including use of uppercase
– Also returns results from other content deliveries (e.g. EuropeanaLocal Sweden + Europeana Inside Sweden)
How to find your collection
• On collection number: “europeana_collectionName: 2032001*”
• Start from overview of providers: http://www.europeana.eu/portal/europeana-providers.html
How to find your collection
• Copy the collection number “2032001” from the URL
How to find your collection
• Query on collection number: europeana_collectionName: 2032001*
Query examples for enrichments
• On concept/topic: europeana_collectionName: 2032001* AND skos_concept:*
Query examples for enrichments
• On place: europeana_collectionName: 2032001* AND edm_place:*
Query examples for enrichments
• On agent: europeana_collectionName: 2032001* AND edm_agent:*
Query examples for enrichments
• On timespan: europeana_collectionName: 2032001* AND edm_timespan:*
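The enrichment queries for one collection can also be generated rather than typed. The sketch below only builds URLs and sends nothing; it reuses the portal search URL pattern and field names from the slides, the helper names are hypothetical, and dropping the space after the colon is an assumption about the query syntax.

```python
from urllib.parse import quote

# Search URL pattern as shown on the slides.
SEARCH_URL = "http://www.europeana.eu/portal/search.html?query="
ENRICHMENT_FIELDS = ("skos_concept", "edm_place", "edm_agent", "edm_timespan")


def enrichment_query(collection_number, field):
    """Build the query for records of one collection enriched on one field."""
    return f"europeana_collectionName:{collection_number}* AND {field}:*"


def enrichment_urls(collection_number):
    """Return one portal search URL per enrichment field."""
    return {
        field: SEARCH_URL + quote(enrichment_query(collection_number, field), safe="*")
        for field in ENRICHMENT_FIELDS
    }


for field, url in enrichment_urls("2032001").items():
    print(field, url)
```

Keeping `*` unescaped mirrors the URLs shown earlier in the slides (for example `query=edm_place%3A*`).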
Further tips
• View metadata in a human-readable form by adding ?format=labels
Quality of the enrichments - IWM
• On edm_place: 2,520 out of 26,851 records got enriched for IWM
Quality of the enrichments - IWM
• Only the first in the list got enriched: http://sws.geonames.org/3017382/
Quality of the enrichments - IWM
• All available languages are pulled in
Quality of the enrichments - IWM
• On edm_timespan: 3,635 out of 26,851 records got enriched for IWM (semium.org)
Quality of the enrichments - IWM
• “Early 20th century” pulled from semium.org in EN and RU
Quality of the enrichments - IWM
• On skos_concept: 4,233 out of 26,851 records got enriched for IWM (GEMET)
Quality of the enrichments - IWM
• Linked to the label ‘Art’ in GEMET?
Quality of the enrichments - IWM
• On edm_agent: 0 out of 26,851 records got enriched (DBpedia)
Quality of the enrichments
• For ‘Murberget Länsmuseet Västernorrland’ 24,312 out of 34,377 records got enriched, but it appears that most of these enrichments were provided by the CP
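Expressed as coverage rates, the counts from these slides are easier to compare. The short calculation below uses only the figures quoted above; the dictionary layout and function name are illustrative.

```python
# (enriched records, total records) as quoted on the slides.
COUNTS = {
    ("IWM", "place"): (2520, 26851),
    ("IWM", "timespan"): (3635, 26851),
    ("IWM", "concept"): (4233, 26851),
    ("IWM", "agent"): (0, 26851),
    ("Murberget Länsmuseet Västernorrland", "all"): (24312, 34377),
}


def coverage(enriched, total):
    """Share of enriched records as a percentage, rounded to one decimal."""
    return round(100 * enriched / total, 1)


for (provider, kind), (enriched, total) in COUNTS.items():
    print(f"{provider} / {kind}: {coverage(enriched, total)}%")
```

For IWM this gives roughly 9.4% (place), 13.5% (timespan), 15.8% (concept) and 0% (agent), against about 70.7% for Murberget, where most enrichments appear to come from the provider itself.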
Recommendations
• 3 levels that influence the quality of enrichments
– Metadata level (the source of the enrichments)
– Vocabulary level (the target of the enrichments)
– Workflow level (the process to create enrichments)
Recommendations
• Metadata level (the source of the enrichments)
– Shortcomings in original metadata
– Difficult to spot, only by manual review
    Feedback form to flag incorrect metadata and enrichments
– Enrichments missed because of syntactic aspects (e.g. different subjects separated by comma or semicolon, inconsistent date formats, combined data ‘role: actor’ …)
    Apply formal formatting rules; enrich only if the format is valid
– Low-value enrichments because of lack of precision (e.g. enrichments on a broad concept like ‘photo’ have little value)
    Quality score, for example by checking the diversity of the metadata
– Providers should provide URIs for conceptual resources where possible rather than mere strings or codes
    Europeana could point to interesting reference repositories
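The slide does not define the quality score. One possible reading of "checking the diversity of the metadata" is the share of distinct values in a field across a dataset; the sketch below implements that reading as an assumption, with a hypothetical function name.

```python
def diversity_score(values):
    """Share of distinct values among all non-empty values (0.0 to 1.0)."""
    cleaned = [v.strip().lower() for v in values if v and v.strip()]
    if not cleaned:
        return 0.0
    return len(set(cleaned)) / len(cleaned)


# A dc:type field that always says 'photo' scores low ...
print(diversity_score(["photo", "photo", "Photo", "photo"]))
# ... while a more varied field scores higher.
print(diversity_score(["portrait", "landscape", "still life", "portrait"]))
```

A low score on a field such as dc:type would suggest that enriching it adds little, as with the broad concept ‘photo’ mentioned above.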
Recommendations
• Mapping to EDM
– Enrichment flaw introduced by mapping
    Better (non-technical) documentation, FAQ …
    Inform on the fields getting enriched and how CP could provide more appropriate metadata by direct communication and training
    Present more clearly the difference between the metadata stored in the production database and what is displayed on the portal and indexed in the current search engine
    Supporting tools for testing mappings in a realistic setting
    Strengthen awareness on display issues and over-fitting of metadata for display purposes, which often makes it less suitable for enrichments and data exchange (API, Linked Data)
    Change current representation on the portal where Europeana enrichments are shown together with the CP enrichments
Recommendations
• Checking metadata at ingestion time
– Enrichment flaw introduced by mapping
    Enforce some recommendations regarding the quality of metadata at ingestion time
    Fields that do not respect the agreed best practices should be flagged to the providers (e.g. dates not in the preferred format)
    Quality scores?
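A check at ingestion time could look like the sketch below, which flags dc:date values that do not follow a preferred format. The accepted formats (YYYY, YYYY-MM, YYYY-MM-DD and YYYY/YYYY) are an assumption, since the slide does not say which format is preferred, and the record structure is hypothetical.

```python
import re

# Assumed "preferred" formats: YYYY, YYYY-MM, YYYY-MM-DD, or a YYYY/YYYY range.
PREFERRED_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?|\d{4}/\d{4}")


def flag_records(records):
    """Return (record id, offending value) pairs to report back to the provider."""
    flagged = []
    for record in records:
        for value in record.get("dc:date", []):
            if PREFERRED_DATE.fullmatch(value.strip()) is None:
                flagged.append((record["id"], value))
    return flagged


records = [
    {"id": "a", "dc:date": ["1914-08-04"]},
    {"id": "b", "dc:date": ["printed 1850", "1701/1800"]},
]
print(flag_records(records))  # only 'printed 1850' is reported
```

Flagging rather than rejecting keeps the record in the ingestion flow while giving the provider a concrete list of values to fix.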
Recommendations
• Summary of findings taken from “EuropeanaTech Task Force on a Multilingual and Semantic Enrichment Strategy: final report”
Recommendations
• Vocabularies:
– The vocabulary should fit the context of the record to be enriched (e.g. in GEMET, especially the broader concepts are often not precise and in some cases totally off topic)
    Skip some of these concepts
    Link labels to the reference labels of the same language
– A contextual vocabulary coming from the provider’s context is often the best solution, but during mapping it often gets lost
    Envision carrying out own efforts of vocabulary mapping
    Create a reference resource for certain metadata fields with a limited number of values (e.g. format, language, country…)
Recommendations
• Enrichment process:
– Biggest problem arose from strings separated by comma or semicolon
    Documentation on the enrichment rules so CP can take this into account
    Automatic enrichment should try to match all keywords in a field (not just the first before the comma or semicolon)
    Matches shouldn’t happen between metadata field values in one language and labels of semantic resources in another language
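The difference between matching only the first keyword and matching every keyword in a delimited field can be sketched as follows. The concept table and URIs are hypothetical, and the split rule (comma or semicolon) follows the slide.

```python
import re

# Hypothetical label lookup: lowercase label -> concept URI.
CONCEPTS = {
    "ceramics": "http://example.org/concept/ceramics",
    "glass": "http://example.org/concept/glass",
}


def split_keywords(field_value):
    """Split a field on commas and semicolons, dropping empty parts."""
    return [part.strip() for part in re.split(r"[;,]", field_value) if part.strip()]


def match_first_only(field_value):
    """Match only the first keyword (the behaviour criticised above)."""
    keywords = split_keywords(field_value)
    first = keywords[0].lower() if keywords else ""
    return [CONCEPTS[first]] if first in CONCEPTS else []


def match_all(field_value):
    """Match every keyword in the field, as recommended."""
    return [CONCEPTS[k.lower()] for k in split_keywords(field_value) if k.lower() in CONCEPTS]


value = "Ceramics; Glass, Pottery"
print(match_first_only(value))  # one link
print(match_all(value))         # two links; 'Pottery' has no concept here
```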
Recommendations
• Involve users in improving metadata quality and enrichment quality (crowd-source and validate links)
Opportunities
• Multilingual access to over 28 million records
• More enriched elements
• Freely available for re-use (DEA)
• Closer to original metadata thanks to EDM
• Data can be contextualized, semantically linked to other data
• Allows for richer semantic query expansion & cross-collection browsing