Report from the Europeana Semantic Enrichment Taskforce

Roxanne Wyns, LIBIS KU Leuven


5 May 2014, Europeana Vlaanderen


Context
• EuropeanaTech Task Force on Multilingual and Semantic Enrichment Strategy
  – Task Force chairs: Juliane Stiller, Antoine Isaac, Vivien Petras
  – Contributors:

Agnès Simon, Bibliothèque Nationale de France

Daniel Vila Suero, Universidad Politécnica de Madrid

Eero Hyvönen, Aalto University

Esther Guggenheim, National Library of Israel

Lars G. Svensson, Deutsche Nationalbibliothek

Nuno Freire, The European Library

Rainer Simon, AIT Austrian Institute of Technology

Rodolphe Bailly, Musical Instrument Museums Online

Roxanne Wyns, LIBIS

Seth van Hooland, Université Libre de Bruxelles

Shenghui Wang, Online Computer Library Center

Vladimir Alexiev, Ontotext


Context
• Task Force objectives
  – Analyze datasets in Europeana
  – Evaluate them with regard to their enrichment potential and the quality of the enrichments that were executed
  – Derive a strategy for enriching metadata fields that would add value for users
• One-day workshop in Berlin (8/11/2013)
  – Analyzing randomly selected datasets, their metadata fields and enrichments
  – Providing a set of recommendations concerning enrichment, metadata quality, vocabularies, documentation, training …
• Task Force report: http://pro.europeana.eu/documents/468623/8b75b054-712e-432b-a0f7-761898e6f60e


Context
Semantic enrichment process at Europeana:
1. Matching the metadata of Europeana objects to external semantic data resources
   – Such as controlled vocabularies or Wikipedia (DBpedia)
   – Resulting in links between Europeana objects and external resources
   – Can be re-used by users of the Europeana data services (e.g. API)
   – e.g. ‘recipiente cerámico’ is matched with the concept ‘ceramics’ in GEMET
2. Exploitation of these links by adding this data to the index behind the portal
   – Enrichment provides multilingual search and retrieval
   – The multilingual labels from the external reference vocabularies are pulled in
   – e.g. a search on ‘céramique’ will also return the object with ‘recipiente cerámico’


Context
• Europeana uses the AnnoCultor tool. Cached at http://europeanalabs.eu/wiki/EDMPrototypingTask21Annocultor
• Interprets values
• Searches for corresponding terms in specialised vocabularies
• Adds links to matching terms (dcterms:spatial = Venise, linked to place: http://sws.geonames.org/3164603)
• Pulls in additional information about the matched resource (βενετία, velence, венеция, venice, etc.)

Enrichments take place on edm_place, skos_concept, edm_agent, and edm_timespan
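
The steps listed for AnnoCultor can be illustrated with a minimal Python sketch. This is not AnnoCultor's actual code: the in-memory `VOCABULARY`, the function names and the case-insensitive exact match are assumptions made for illustration. Only the GeoNames URI and the labels are taken from the slides.

```python
# Toy vocabulary: one GeoNames place with the multilingual labels shown on the slides.
# A real enrichment would load a full vocabulary dump instead (assumption).
VOCABULARY = {
    "http://sws.geonames.org/3164603": [
        "Venise", "Venecia", "βενετία", "velence", "венеция", "venice",
    ],
}


def build_label_index(vocabulary):
    """Map every case-folded label to the URI of the resource it belongs to."""
    index = {}
    for uri, labels in vocabulary.items():
        for label in labels:
            index[label.casefold()] = uri
    return index


def enrich(value, vocabulary):
    """Exact-match a metadata value against the vocabulary labels.

    Returns the matched URI plus all its labels, or None when nothing matches.
    """
    index = build_label_index(vocabulary)
    uri = index.get(value.strip().casefold())
    if uri is None:
        return None
    return {"uri": uri, "labels": vocabulary[uri]}


if __name__ == "__main__":
    # dcterms:spatial = "Venise" gets a link plus the labels in other languages.
    print(enrich("Venise", VOCABULARY))
```

Because all labels of the matched resource are returned, indexing them alongside the record is what would make a search on “velence” find a record that only says “Venise”.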


Enriched elements: edm_place
• Place enrichment (edm_place:*)
  – Using a subset of GeoNames (www.geonames.org)
  – Limited to European geographic locations
  – Limited to the GeoNames feature-code prefixes "A", "P.PPL", "S.CSTL", "S.ANS", "S.MNMT", "S.LIBR", "S.HSTS", "S.OPRA", "S.AMTH", "S.TMPL", "T.ISL" (http://www.geonames.org/statistics/total.html)
  – Enrichment limited to EDM fields “dcterms:spatial” and “dc:coverage”
  – Enrichment rules: exact matching?
  – Result: 5.8M objects enriched, provides multilingual search on places: http://europeana.eu/portal/search.html?query=edm_place%3A*
  – Results include content provider (CP) enrichments

Issues?
  – Appear to be limited
  – But only places in Europe are enriched
  – And only for the geographical coverage EDM elements
  – Sometimes the record is linked to the wider region (e.g. region instead of city)
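
The prefix restriction on GeoNames features could look like the sketch below. The prefixes are the ones listed on the slide. Treating them as string prefixes of a "class.code" key (so "P.PPL" also admits codes such as PPLC) is an assumption, as is the function name.

```python
# Feature prefixes as listed on the slide.
ALLOWED_PREFIXES = (
    "A", "P.PPL", "S.CSTL", "S.ANS", "S.MNMT", "S.LIBR",
    "S.HSTS", "S.OPRA", "S.AMTH", "S.TMPL", "T.ISL",
)


def is_allowed(feature_class, feature_code):
    """Return True when a GeoNames feature falls under one of the allowed prefixes.

    The key format "<class>.<code>" is an assumption about how the prefixes are applied.
    """
    key = f"{feature_class}.{feature_code}"
    return key.startswith(ALLOWED_PREFIXES)


if __name__ == "__main__":
    for cls, code in [("P", "PPLC"), ("S", "CSTL"), ("H", "LK"), ("T", "MT")]:
        print(cls, code, is_allowed(cls, code))
```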


Label ‘Venecia’ is linked to http://sws.geonames.org/3164603


GeoNames data pulled in by Europeana


Provides multilingual search: e.g. a search on “velence” also gives “venecia”


Enriched elements: skos_concept
• Concept (topic) enrichment
  – Using GEMET thesaurus (http://www.eionet.europa.eu/gemet/)
  – 12 concepts removed to avoid linking with homonyms (e.g. Druck)
  – Some WWI battles and the two categories “World War I” and “art” are taken from DBpedia
  – Enrichment limited to EDM fields “dc:subject” and “dc:type”
  – Enrichment rules: exact matching?
  – Result: 8.8M objects enriched, http://www.europeana.eu/portal/search.html?query=skos_concept%3A*

Issues?
  – Exact matching not limited to the language of the record (Dutch “Tegel”, meaning tile, mapped to the Swedish “Tegel”, meaning brick)
  – No suitable multilingual concept thesauri for the cultural domain
  – Noise because of metadata quality (dc:type “photo”, “book”, “video”, …)
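
The “Tegel” problem can be reproduced with a small sketch. The label table, the placeholder URI and the `record_lang` parameter are hypothetical, and GEMET itself is not queried here. The point is only that restricting the match to the record's language turns a wrong link into no link.

```python
# Toy label table: (label, language, concept URI). The URI is a placeholder, not a GEMET URI.
LABELS = [
    ("tegel", "sv", "http://example.org/concept/brick"),
]


def match(value, record_lang=None):
    """Exact-match a value against the labels.

    With record_lang=None the language is ignored (the behaviour described as an issue);
    with a language code, only labels in that language are considered.
    """
    value = value.strip().casefold()
    return [
        uri
        for label, lang, uri in LABELS
        if label == value and (record_lang is None or lang == record_lang)
    ]


if __name__ == "__main__":
    print(match("Tegel"))        # language ignored: Dutch value linked to the Swedish label
    print(match("Tegel", "nl"))  # language respected: no match for a Dutch record
```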


Enriched elements: edm_agent
• Agent (person) enrichment
  – Small set of artists (painters) from DBpedia
  – Enrichment limited to EDM fields “dc:creator” and “dc:contributor”
  – Enrichment rules: exact matching?
  – Result: 136K objects enriched, http://www.europeana.eu/portal/search.html?query=edm_agent%3A*

Issues?
  – Quality or structure of the provided metadata (e.g. Role: Actor)
  – Limited list of resources to link to


Enriched elements: edm_timespan
• Time period enrichment
  – Using Semium time periods vocabulary (http://semium.org/time/)
  – Partly automatically generated (3rd quarter of 15th century) / manually generated (Roman empire)
  – Enrichment limited to EDM fields: dc:date, dc:coverage, dc:temporal, edm:year
  – Enrichment rules: exact matching?
  – Result: 13.3M objects enriched, http://www.europeana.eu/portal/search.html?query=edm_timespan%3A*

Issues?
  – Some words (qualifiers to dates, e.g. “made”, “printed” …) have to be removed from fields prior to enrichment, but this is only done for English records
  – So again a problem of quality or structure of the provided metadata
  – Huge issues with BC dates, but also date ranges (e.g. "1701/1800" is mapped to "1701" only)
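
A minimal sketch of how a date field could be cleaned and parsed so that a range such as "1701/1800" keeps both ends. The qualifier list holds only the two English examples from the slide. The function name, the regular expression and the choice to return `None` for anything else (including BC dates) are assumptions, not the rules Europeana applies.

```python
import re

# English qualifiers mentioned on the slide; other languages would need their own lists.
QUALIFIERS = {"made", "printed"}

# One year, optionally followed by "/" and a second year.
YEAR_OR_RANGE = re.compile(r"(\d{1,4})(?:\s*/\s*(\d{1,4}))?")


def parse_years(value):
    """Return (start_year, end_year) for a year or a year range, else None."""
    words = [w for w in value.split() if w.casefold() not in QUALIFIERS]
    cleaned = " ".join(words)
    m = YEAR_OR_RANGE.fullmatch(cleaned)
    if m is None:
        return None  # unsupported format, e.g. BC dates: better no enrichment than a wrong one
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else start
    return (start, end)


if __name__ == "__main__":
    for raw in ["1701/1800", "printed 1850", "ca. 300 BC"]:
        print(raw, "->", parse_years(raw))
```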

How to find your collection
Overview of providers: http://www.europeana.eu/portal/europeana-providers.html
• Europeana Inside Sweden (57349)
  – Europeana Inside Sweden (57349)
  – Murberget Länsmuseet Västernorrland (34377)
  – Stockholms läns museum (22972)
• CultureGrid
  – Imperial War Museums (26851)


How to find your collection
Query your Inside collection:
• Query: DATA_PROVIDER:"Murberget Länsmuseet Västernorrland"
  – Exact spelling, including use of uppercase
  – Also gives results from other content deliveries (e.g. EuropeanaLocal Sweden + Europeana Inside Sweden)


How to find your collection
• On collection number: “europeana_collectionName: 2032001*”
• Start from overview of providers: http://www.europeana.eu/portal/europeana-providers.html
• Open 1 record
• Copy the collection number from the URL: “2032001”
• Query on collection number: europeana_collectionName: 2032001*


Query examples for enrichments
• On concept/topic: europeana_collectionName: 2032001* AND skos_concept:*
• On place: europeana_collectionName: 2032001* AND edm_place:*
• On agent: europeana_collectionName: 2032001* AND edm_agent:*
• On timespan: europeana_collectionName: 2032001* AND edm_timespan:*
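
The four queries follow one pattern, so they can be generated for any collection number. The sketch below only builds query strings and portal URLs and does not call any Europeana service. The helper names are hypothetical. The base URL and the query syntax are copied from the slides and reflect the portal as it was in 2014.

```python
from urllib.parse import urlencode

# Portal search URL as shown on the slides (2014); it may no longer resolve.
PORTAL_SEARCH = "http://www.europeana.eu/portal/search.html"

# Enrichment fields named in the slides.
ENRICHMENT_FIELDS = ("skos_concept", "edm_place", "edm_agent", "edm_timespan")


def enrichment_query(collection_number, field):
    """Build the query for all records of a collection that carry a given enrichment."""
    return f"europeana_collectionName: {collection_number}* AND {field}:*"


def search_url(query):
    """Percent-encode the query and append it to the portal search URL."""
    return PORTAL_SEARCH + "?" + urlencode({"query": query})


if __name__ == "__main__":
    for field in ENRICHMENT_FIELDS:
        print(search_url(enrichment_query("2032001", field)))
```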


Further tips
• View metadata in a human-readable form by adding ?format=labels


Quality of the enrichments - IWM
• On edm_place: 2,520 out of 26,851 records got enriched for IWM
  – Only the first in the list got enriched: http://sws.geonames.org/3017382/
  – All available languages are pulled in
• On edm_timespan: 3,635 out of 26,851 records got enriched for IWM (semium.org)
  – “Early 20th century” pulled from semium.org in EN and RU
• On skos_concept: 4,233 out of 26,851 records got enriched for IWM (GEMET)
  – Linked to the label ‘Art’ in GEMET?
• On edm_agent: 0 out of 26,851 records got enriched (GEMET)


Quality of the enrichments
• For ‘Murberget Länsmuseet Västernorrland’, 24,312 out of 34,377 records got enriched, but it appears that most of these enrichments were provided by the CP
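
To compare the two providers, the reported counts can be turned into coverage percentages. The counts below are the ones given on the slides. The dictionary keys and the `coverage` helper are illustrative names only.

```python
# (enriched, total) record counts as reported on the slides.
COUNTS = {
    "IWM edm_place": (2520, 26851),
    "IWM edm_timespan": (3635, 26851),
    "IWM concept (GEMET)": (4233, 26851),
    "IWM edm_agent": (0, 26851),
    "Murberget (all enrichments, mostly from the CP)": (24312, 34377),
}


def coverage(enriched, total):
    """Share of enriched records as a percentage."""
    return 100.0 * enriched / total


if __name__ == "__main__":
    for name, (enriched, total) in COUNTS.items():
        print(f"{name}: {coverage(enriched, total):.1f}%")
```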


Recommendations
• 3 levels that influence the quality of enrichments
  – Metadata level (the source of the enrichments)
  – Vocabulary level (the target of the enrichments)
  – Workflow level (the process to create enrichments)


Recommendations
• Metadata level (the source of the enrichments)
  – Shortcomings in original metadata
  – Difficult to spot, only by manual review
    – Feedback form to flag incorrect metadata and enrichments
  – Enrichments missed because of syntactic aspects (e.g. different subjects separated by comma or semicolon, inconsistent date formats, combined data ‘role: actor’ …)
    – Apply formal formatting rules; enrich only if the format is valid
  – Low-value enrichments because of lack of precision (e.g. enrichments on a broad concept like ‘photo’ have little value)
    – Quality score, for example by checking the diversity of the metadata
  – Providers should provide URIs for conceptual resources where possible rather than mere strings or codes
    – Europeana could point to interesting reference repositories
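
The slides do not define the quality score. One possible reading of “diversity of the metadata” is the share of distinct values a field takes across a dataset, sketched below. The function name and the formula are assumptions for illustration, not the Task Force's definition.

```python
def diversity_score(values):
    """Share of distinct values among the non-empty values of one field (0.0 to 1.0).

    A low score means the same few values repeat, e.g. dc:type = "photo" on every record.
    """
    cleaned = [v.strip().casefold() for v in values if v and v.strip()]
    if not cleaned:
        return 0.0
    return len(set(cleaned)) / len(cleaned)


if __name__ == "__main__":
    print(diversity_score(["photo", "Photo", "photo", "photo"]))
    print(diversity_score(["castle", "church", "bridge"]))
```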


Recommendations
• Mapping to EDM
  – Enrichment flaw introduced by mapping
    – Better (non-technical) documentation, FAQ …
    – Inform on the fields getting enriched and how CP could provide more appropriate metadata by direct communication and training
    – Present more clearly the difference between the metadata stored in the production database and what is displayed on the portal and indexed in the current search engine
    – Supporting tools for testing mappings in a realistic setting
    – Strengthen awareness of display issues and over-fitting of metadata for display purposes, which often makes it less suitable for enrichments and data exchange (API, Linked Data)
    – Change current representation on the portal where Europeana enrichments are shown together with the CP enrichments


Recommendations
• Checking metadata at ingestion time
  – Enrichment flaw introduced by mapping
    – Enforce some recommendations regarding the quality of metadata at ingestion time
    – Fields that do not respect the agreed best practices should be flagged to the providers (e.g. dates not in the preferred format)
    – Quality scores?
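
Flagging at ingestion time could be as simple as the check below. The slides do not say what the preferred date format is. The sketch assumes ISO 8601 style values (YYYY, YYYY-MM or YYYY-MM-DD), and the record identifiers and function name are made up.

```python
import re

# Assumed preferred format: YYYY, YYYY-MM or YYYY-MM-DD (not confirmed by the slides).
PREFERRED_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def flag_dates(records):
    """Return the (record_id, date_value) pairs whose date is not in the preferred format."""
    flagged = []
    for record_id, date_value in records:
        if not PREFERRED_DATE.fullmatch(date_value.strip()):
            flagged.append((record_id, date_value))
    return flagged


if __name__ == "__main__":
    sample = [("r1", "1914-07-28"), ("r2", "made 1701"), ("r3", "1850")]
    print(flag_dates(sample))
```

The flagged list is what would be reported back to the provider instead of being silently enriched or skipped.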


Recommendations
• Summary of findings taken from “EuropeanaTech Task Force on a Multilingual and Semantic Enrichment Strategy: final report”


Recommendations
• Vocabularies:
  – The vocabulary should fit the context of the record to be enriched (e.g. GEMET, especially the broader concepts, is often not precise and in some cases totally off topic)
    – Skip some of these concepts
    – Link labels to the reference labels of the same language
  – A contextual vocabulary coming from the provider’s context is often the best solution, but during mapping it often gets lost
    – Envision carrying out own efforts of vocabulary mapping
    – Create a reference resource for certain metadata fields with a limited number of values (e.g. format, language, country …)


Recommendations
• Enrichment process:
  – Biggest problem arose from strings separated by comma or semicolon
    – Documentation on the enrichment rules so CP can take this into account
    – Automatic enrichment should try to match all keywords in a field (not just the first before the comma or semicolon)
    – Matches shouldn’t happen between metadata field values in one language and labels of semantic resources in another language
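
Matching every keyword in a delimited field, rather than only the first, can be sketched as follows. The label index and its placeholder URIs are hypothetical, and splitting on every comma or semicolon is an assumption that would need exceptions in practice.

```python
import re


def split_keywords(field_value):
    """Split a field on commas and semicolons, dropping empty parts."""
    return [part.strip() for part in re.split(r"[;,]", field_value) if part.strip()]


def match_all(field_value, label_index):
    """Return (keyword, uri) for every keyword in the field that has an exact label match."""
    matches = []
    for keyword in split_keywords(field_value):
        uri = label_index.get(keyword.casefold())
        if uri is not None:
            matches.append((keyword, uri))
    return matches


if __name__ == "__main__":
    # Placeholder index: case-folded label -> concept URI (not real vocabulary URIs).
    index = {
        "ceramics": "http://example.org/concept/ceramics",
        "glass": "http://example.org/concept/glass",
    }
    print(match_all("Ceramics; glass, unknown term", index))
```

Splitting blindly on commas would break values such as “Surname, First name” in dc:creator, so a rule like this would probably have to be applied per field.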


Recommendations
• Involve users in improving metadata quality and enrichment quality (crowd-source and validate links)


Opportunities
• Multilingual access to over 28 million records
• More enriched elements
• Freely available for re-use (DEA)
• Closer to original metadata thanks to EDM
• Data can be contextualized, semantically linked to other data
• Allows for richer semantic query expansion & cross-collection browsing


Questions?
