Representation and Absence in Digital Resources: The Case of Europeana Newspapers
Alastair Dunning, The European Library, @alastairdunning
Clemens Neudecker, National Library of the Netherlands, @cneudecker
DH2014, Lausanne
Source: http://www.nytimes.com/2007/03/10/business/yourmoney/11archive.html
Source: Europeana Strategic Plan, 2015-2020, currently unpublished. See also Enumerate Project, enumerate.eu
The estimated total cost of digitising the collections of Europe’s
museums, archives and libraries, including the audiovisual material
they hold is approximately €100bn, or €10bn per annum for the next 10
years, factoring in a cumulative efficiency gain of 0.5%
per annum.
The Research & Development Budget for the Joint Strike Fighter
programme is estimated at €40.34bn.
It would cost between 10% and 40% of the Joint Strike Fighter R&D
budget to digitise every eligible title in Europe’s libraries
Source: Nick Poole, Collections Trust, http://nickpoole.org.uk/wp-content/uploads/2011/12/digiti_report.pdf
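As a quick check of the comparison above, the following sketch converts the 10% to 40% range into euro amounts. It uses only the two figures quoted on the slide; the function and variable names are illustrative and not part of the cited report.

```python
# Figures quoted on the slide (billions of euros).
JSF_RD_BUDGET_BN = 40.34
SHARE_LOW, SHARE_HIGH = 0.10, 0.40


def library_digitisation_range(budget_bn, low, high):
    """Return the (low, high) cost range implied by a share of a budget."""
    return budget_bn * low, budget_bn * high


low_bn, high_bn = library_digitisation_range(JSF_RD_BUDGET_BN, SHARE_LOW, SHARE_HIGH)
print(f"Implied cost of digitising library titles: EUR {low_bn:.2f}bn to EUR {high_bn:.2f}bn")
```

On these figures the implied range is roughly €4bn to €16bn.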
Currently: 2 million pages of full text
By 2015: 10 million pages of full text
Search by keyword, and organise results by language, date, source library or title
Link: http://www.theeuropeanlibrary.org/tel4/newspapers
Currently: Metadata records relating to 1.12m issues
By 2015: Metadata records relating to up to 4m issues
Browse by date or map
Link: http://www.theeuropeanlibrary.org/tel4/newspapers
Full text from the following libraries
• Bibliotheque nationale de France / National Library of France
• Koninklijke Bibliotheek / National Library of the Netherlands
• Landesbibliothek Dr. Friedrich Teßmann / Teßmann Library
• Eesti Rahvusraamatukogu / Estonian National Library
• Kansalliskirjasto / National Library of Finland
• Latvijas Nacionala Biblioteka / National Library of Latvia
• Biblioteka Narodowa / National Library of Poland
• Milli Kutuphane Baskanligi / National Library of Turkey
• Österreichische Nationalbibliothek / Austrian National Library
• Staatsbibliothek zu Berlin / Berlin State Library
• Staats- und Universitätsbibliothek Hamburg / State and University Library
• Univerzitet u Beogradu / University Library of Belgrade
Searching by title
Issue-level records from the following libraries
• National Library of Wales
• St. Cyril and Methodius National Library / The National Library of Bulgaria
• National Library of Czech Republic
• National and University Library in Zagreb
• Koninklijke Bibliotheek van België / Bibliothèque royale de Belgique
• Narodna in univerzitetna knjižnica / National and University Library of Slovenia
• National Library of Portugal
• National Library of Romania
• Landsbókasafn Íslands - Háskólabókasafn / National and University Library of Iceland
• National Library of Spain
• Bibliothèque nationale de Luxembourg / National Library of Luxembourg
Finding matching results in single or multiple issues
Highlighting search terms
So far, okay. Similar functionality to other national and regional digital libraries of newspapers
See other archives via: https://www.google.com/maps/ms?msid=217164746645697066594.0004c3d764fcb71ed2314&msa=0
But what was the user response to an aggregation of European newspaper libraries ?
Results of Usability Testing: http://www.europeana-newspapers.eu/wp-content/uploads/2014/05/The-European-Library-Newspaper-Archive-Usability-testing-Report-April-2014.pdf
“Aggregated view of content
from many sources highly valued.
There was a strong positive reaction to the availability of
the archive.”
“Many saying they would be keen to return to the site as
the content expands.”
“Ability to search over geographic map was highly valued”
Plenty of quibbles about design
- positions of advanced options
- re-order list of results
- manipulating facets
Much greater expectations of functionality once logged in
For example:
- Saved searches
- New content notification
“Much of the value of the site to participants was provided by the images of the documents.
Participants expected to be able to save a 'local' copy once they
had located content of relevance.
As no download facility is provided, this led to some frustration and undermined the overall potential value of the site for some
participants.”
Timetable for rest of project
Now – Prototype version of interface shared with project
Throughout 2014 - Ongoing creation of OCR, and other
related technical work (OLR, Named Entities)
Throughout 2014 – Live version of website improved /
usability testing / added content
Autumn 2014 - Final project conference
Late 2014 - Newspaper browser completed with content and
tools from project
More information at http://www.europeana-newspapers.eu/
Interface at http://www.theeuropeanlibrary.org/tel4/newspapers/
Things the users didn’t say (but we thought they would)
Why can’t I edit the text ?
(Our sample was researchers; maybe it is other communities that are interested in crowdsourcing?)
Note: If time permits, The European Library will develop a crowdsourcing feature
Can I download text for data mining?
Remember: Digital Humanists are still a small percentage of humanists and users
Note: Many, although not all, of the texts are marked public domain, so this is feasible in legal terms
Number of digitised pages in interface: c.2m
Number of digitised pages in European libraries: c.130m
Number of physical pages in European libraries: 1.5bn+
Source: European Newspaper Survey Report http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf
Quantities of newspapers – a) in project b) digitised in total c) in physical libraries
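To make the scale of the three quantities concrete, the following sketch works out the proportions from the approximate page counts quoted above. The names are illustrative. Because the physical total is given as "1.5bn+", the shares computed against it are upper bounds.

```python
# Approximate page counts quoted on the slide.
PAGES_IN_INTERFACE = 2_000_000
PAGES_DIGITISED = 130_000_000
PAGES_PHYSICAL = 1_500_000_000  # quoted as "1.5bn+", so treated here as a lower bound


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


interface_of_digitised = share(PAGES_IN_INTERFACE, PAGES_DIGITISED)
interface_of_physical = share(PAGES_IN_INTERFACE, PAGES_PHYSICAL)
digitised_of_physical = share(PAGES_DIGITISED, PAGES_PHYSICAL)

print(f"Interface as share of all digitised pages: {interface_of_digitised:.2f}%")
print(f"Interface as share of physical pages:      {interface_of_physical:.3f}%")
print(f"Digitised as share of physical pages:      {digitised_of_physical:.2f}%")
```

On these figures the interface holds about 1.5% of what has been digitised and at most about 0.13% of the physical holdings, while all digitised pages together cover at most about 8.7% of the physical holdings.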
The project digital library is only a fraction of the newspaper
archive of the continent, indeed the world
As libraries, how should we represent that
absence to users ?
Should such absence be represented in the interface itself ?
Vast white spaces in the list of results ?
Difficult to represent ‘archival gaps’ when so little has been digitised: it creates a needle-in-a-haystack problem.
One placeholder for several metadata records ?
But in many cases the metadata does not even
exist, so a search interface has less to work with
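One way to picture the placeholder idea is the sketch below, which collapses each run of missing issues into a single row in a result list. It rests on a strong assumption that the slide itself questions: that the expected publication schedule is known (here, one issue per day). The function names and the sample dates are hypothetical, and a gap found this way only means no digitised copy is present; it cannot tell whether an issue was ever published.

```python
from datetime import date, timedelta


def find_gaps(issue_dates, step=timedelta(days=1)):
    """Return (first_missing, last_missing, count) for each run of missing issues.

    Assumes the title was expected to appear once per `step`.
    """
    gaps = []
    ordered = sorted(set(issue_dates))
    for previous, current in zip(ordered, ordered[1:]):
        missing = (current - previous) // step - 1
        if missing > 0:
            gaps.append((previous + step, current - step, missing))
    return gaps


def results_with_placeholders(issue_dates):
    """Interleave available issues with one placeholder row per gap."""
    rows = [(d, f"{d.isoformat()}  issue available") for d in set(issue_dates)]
    for first, last, count in find_gaps(issue_dates):
        label = f"{first.isoformat()} to {last.isoformat()}  {count} issue(s) with no digitised copy"
        rows.append((first, label))
    return [text for _, text in sorted(rows)]


# Hypothetical dates for a single daily title.
sample_dates = [date(1914, 8, 1), date(1914, 8, 2), date(1914, 8, 5)]
for row in results_with_placeholders(sample_dates):
    print(row)
```

Where the schedule is unknown or no metadata exists at all, this approach has nothing to compare against, which is the limitation noted above.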
Standardised information for every digital resource, describing its collections, extent of content, licensing and re-use conditions ?
Standardised information? For every digital resource produced in the world ?
Are you kidding ?
Charts and graphs external to the interface ?
Graphs are the most obvious way of adding context
but still very reliant on the library producing such charts
How to derive a representative (random) sample from a digital collection?
Source: http://dilbert.com/strips/comic/2001-10-25/
Pieter Francois, winner of BL Labs competition 2013:
“How representative are the historical texts humanities scholars study of the overall body of ‘surviving’ texts that are held in the various library collections?”
labs.bl.uk/Sample+Generator
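A minimal sketch of drawing a simple random sample from a list of record identifiers follows. It is not the BL Labs Sample Generator's method, which the slides do not describe; the identifiers, the function name and the seed are hypothetical. A sample drawn this way is representative only of the digitised records it is drawn from, not of the physical holdings.

```python
import random


def draw_sample(record_ids, sample_size, seed=None):
    """Draw a simple random sample of identifiers without replacement."""
    rng = random.Random(seed)
    ids = sorted(record_ids)  # fixed order so that a given seed is reproducible
    return rng.sample(ids, min(sample_size, len(ids)))


# Hypothetical identifiers standing in for a catalogue export.
catalogue = [f"issue-{n:05d}" for n in range(1, 10001)]
sample = draw_sample(catalogue, 100, seed=2014)
print(len(sample), "records sampled, for example", sample[:3])
```

Recording the seed alongside the sample lets another researcher reproduce the same selection from the same catalogue export.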
There are other issues in the project content too
Major issues
- OCR quality varies
- Different licensing statements from different countries
- Copyright date boundaries differ in each country
There are other issues in the interface too
Minor Issues
Some pages (2m by 2015) have article segmentation
Some library content has named entity extraction affecting search results
What impact do OCR errors have when text mining a large digital collection?
Source: http://homepages.inf.ed.ac.uk/balex/publications/slides-DATeCH.pdf
10M pages, 7 billion words – how much you are actually ignoring when using only the “good” OCR
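The question of how much text is ignored can be put in numbers once each page carries some OCR quality score. The sketch below assumes a per-page confidence value between 0 and 1 and a cut-off chosen by the researcher; both are assumptions for illustration, and the sample pages are invented rather than taken from the project data.

```python
def ignored_share(pages, min_confidence):
    """Return the fraction of words on pages below the confidence cut-off.

    `pages` is a list of (word_count, ocr_confidence) pairs.
    """
    total_words = sum(words for words, _ in pages)
    if total_words == 0:
        return 0.0
    dropped_words = sum(words for words, conf in pages if conf < min_confidence)
    return dropped_words / total_words


# Hypothetical pages: (word count, OCR confidence).
pages = [(700, 0.95), (650, 0.60), (800, 0.85), (500, 0.40)]
print(f"Share of words ignored at cut-off 0.8: {ignored_share(pages, 0.8):.1%}")
```

Reporting this share next to any text-mining result makes clear how much of the collection the result actually rests on.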
How should we allow users better ways to
understand the digital library ?
What role can the API play in this?
Can opening up the data in the digital library allow it to be explored in different ways ?
Traditional Model: Interface (Created by Library) / Data (Published by Library)
With an API: Interface (Created by Third Party) / Data (Published by Library)
API – Application Programming Interface
Pioneering work of Trove API (or rather of Tim Sherratt)
Interface (Created by Library)
Data (Published by Library)
Trove Newspapers site as published by the National Library of Australia, and based on data provided by the library
http://trove.nla.gov.au/newspaper
Trove Newspapers statistics developed by a third party, based on data provided by the library
http://wraggelabs.com/shed/trove/graphs/
Interface (Created by Third Party)
Data (Published by Library)
Headline Roulette, developed by a third party, based on data provided by the library
http://wraggelabs.com/shed/headline-roulette/
Interface (Created by Third Party)
Data (Published by Library)
Word Count of Articles, developed by a third party, based on data provided by the library
http://dhistory.org/frontpages/53/words/
Interface (Created by Third Party)
Data (Published by Library)
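The third-party examples above share one pattern: take records the library publishes and present them in a new way. The sketch below illustrates that pattern without calling any real service. The JSON shape, field names and titles are hypothetical and do not describe the Trove API or The European Library API; a real client would fetch the response over HTTP and follow the documented schema.

```python
import json
from collections import Counter

# Hypothetical response body; a real library API will use its own field names.
SAMPLE_RESPONSE = """
{"records": [
  {"title": "Example Gazette", "date": "1914-08-01"},
  {"title": "Example Gazette", "date": "1914-08-02"},
  {"title": "Example Courier", "date": "1915-01-10"}
]}
"""


def issues_per_year(response_text):
    """Count records per publication year from a JSON response body."""
    records = json.loads(response_text)["records"]
    return Counter(record["date"][:4] for record in records)


def text_chart(counts):
    """Render the counts as a minimal text bar chart, one row per year."""
    return [f"{year} {'#' * counts[year]} ({counts[year]})" for year in sorted(counts)]


for row in text_chart(issues_per_year(SAMPLE_RESPONSE)):
    print(row)
```

Even this small example requires parsing JSON and writing functions, which bears on the question of who is able to build on an API.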
Sounds great! But …?
How many people in this audience would know how to build an interface on top of an API?
How many users do you know who could build on top of an API ?
That is the problem I leave you to discuss
Thank you.
http://www.theeuropeanlibrary.org/tel4/newspapers
Thanks !
Project Details: http://europeana-newspapers.eu
Interface: http://www.theeuropeanlibrary.org/tel4/newspapers
Credits
Desert: https://www.flickr.com/photos/aigle_dore/5952236932/sizes/l
Borges Sign: https://www.flickr.com/photos/monceau/7705020640/
Map: http://gallica.bnf.fr/ark:/12148/btv1b530299707
Strike Fighter: http://en.wikipedia.org/wiki/Strike_fighter