Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Alastair Dunning, The European Library, @alastairdunning
Clemens Neudecker, National Library of the Netherlands, @cneudecker
DH2014, Lausanne

description

Presentation at Digital Humanities 2014, Lausanne. It looks at some of the issues involved in digitising historic newspapers in Europe, particularly how to build a website that can search across all of them.

Transcript of Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Page 1: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Alastair Dunning, The European Library, @alastairdunning

Clemens Neudecker, National Library of the Netherlands, @cneudecker

DH2014, Lausanne

Page 2: Representation and Absence in Digital Resources: The Case of Europeana Newspapers
Page 3: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Source: http://www.nytimes.com/2007/03/10/business/yourmoney/11archive.html

Page 4: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Source: Europeana Strategic Plan, 2015-2020, currently unpublished. See also Enumerate Project, enumerate.eu

Page 5: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

The estimated total cost of digitising the collections of Europe’s museums, archives and libraries, including the audiovisual material they hold, is approximately €100bn, or €10bn per annum for the next 10 years, factoring in a cumulative efficiency gain of 0.5% per annum.

The Research & Development Budget for the Joint Strike Fighter programme is estimated at €40.34bn.

It would cost between 10% and 40% of the Joint Strike Fighter R&D budget to digitise every eligible title in Europe’s libraries.

Source: Nick Poole, Collections Trust, http://nickpoole.org.uk/wp-content/uploads/2011/12/digiti_report.pdf
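
As a rough check, the 10% to 40% range can be turned into euro amounts using only the R&D budget figure quoted above. The sketch below is plain arithmetic on that figure; the function and variable names are illustrative and do not come from the report.

```python
# Figure quoted on the slide: Joint Strike Fighter R&D budget, in billions of euros.
JSF_RD_BUDGET_BN = 40.34


def share_of_budget(fraction, budget_bn=JSF_RD_BUDGET_BN):
    """Return the given fraction of the budget, in billions of euros."""
    return fraction * budget_bn


# The slide gives a range of 10% to 40% of the R&D budget.
low = share_of_budget(0.10)
high = share_of_budget(0.40)
print(f"Digitising every eligible library title: about EUR {low:.1f}bn to EUR {high:.1f}bn")
```

On these figures the range works out to roughly €4bn to €16bn.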

Page 6: Representation and Absence in Digital Resources: The Case of Europeana Newspapers
Page 7: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Currently: 2 million pages of full text

By 2015: 10 million pages of full text

Search by keyword, and organise results by language, date, source library or title

Link: http://www.theeuropeanlibrary.org/tel4/newspapers
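
The search-then-organise pattern described on this slide can be sketched as a keyword filter followed by facet counts. This is a minimal illustration, not the project's implementation: the records, titles, library names and field names below are all invented for the example.

```python
from collections import Counter

# Invented page records; the field names are assumptions for illustration only.
SAMPLE_PAGES = [
    {"title": "Example Gazette", "language": "fr", "year": 1914,
     "library": "Library A", "text": "War is declared"},
    {"title": "Sample Courier", "language": "nl", "year": 1914,
     "library": "Library B", "text": "Harvest report"},
    {"title": "Example Gazette", "language": "fr", "year": 1918,
     "library": "Library A", "text": "The war is over"},
]


def search(pages, keyword):
    """Return pages whose full text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [page for page in pages if needle in page["text"].lower()]


def facet_counts(results, field):
    """Count how many results fall under each value of one facet field."""
    return Counter(page[field] for page in results)


hits = search(SAMPLE_PAGES, "war")
for facet in ("language", "year", "library", "title"):
    print(facet, dict(facet_counts(hits, facet)))
```

Each facet is just a count over the current result set, which is why the interface can offer language, date, source library and title side by side.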

Page 8: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Currently: metadata records relating to 1.12m issues

By 2015: metadata records relating to up to 4m issues

Browse by date or map

Link: http://www.theeuropeanlibrary.org/tel4/newspapers

Page 9: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Full text from the following libraries

• Bibliothèque nationale de France / National Library of France
• Koninklijke Bibliotheek / National Library of the Netherlands
• Landesbibliothek Dr. Friedrich Teßmann / Teßmann Library
• Eesti Rahvusraamatukogu / Estonian National Library
• Kansalliskirjasto / National Library of Finland
• Latvijas Nacionala Biblioteka / National Library of Latvia
• Biblioteka Narodowa / National Library of Poland
• Milli Kutuphane Baskanligi / National Library of Turkey
• Österreichische Nationalbibliothek / Austrian National Library
• Staatsbibliothek zu Berlin / Berlin State Library
• Staats- und Universitätsbibliothek Hamburg / State and University Library Hamburg
• Univerzitet u Beogradu / University Library of Belgrade

Searching by title

Page 10: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Issue-level records from the following libraries

• National Library of Wales
• St. Cyril and Methodius National Library / The National Library of Bulgaria
• National Library of Czech Republic
• National and University Library in Zagreb
• Koninklijke Bibliotheek van België / Bibliothèque royale de Belgique
• Narodna in univerzitetna knjižnica / National and University Library of Slovenia
• National Library of Portugal
• National Library of Romania
• Landsbókasafn Íslands - Háskólabókasafn / National and University Library of Iceland
• National Library of Spain
• Bibliothèque nationale de Luxembourg / National Library of Luxembourg

Finding matching results in single or multiple issues

Page 11: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Highlighting search terms

Page 12: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

So far, okay. Similar functionality to other national and regional digital libraries of newspapers

See other archives via: https://www.google.com/maps/ms?msid=217164746645697066594.0004c3d764fcb71ed2314&msa=0

Page 13: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

But what was the user response to an aggregation of European newspaper libraries?

Results of Usability Testing: http://www.europeana-newspapers.eu/wp-content/uploads/2014/05/The-European-Library-Newspaper-Archive-Usability-testing-Report-April-2014.pdf

Page 14: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

“Aggregated view of content from many sources highly valued. There was a strong positive reaction to the availability of the archive.”

Page 15: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

“Many saying they would be keen to return to the site as the content expands.”

Page 16: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

“Ability to search over geographic map was highly valued”

Page 17: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Plenty of quibbles about design

- positions of advanced options
- re-order list of results
- manipulating facets

Page 18: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Much greater expectations of functionality once logged in

For example:

- Saved searches
- New content notification

Page 19: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

“Much of the value of the site to participants was provided by the images of the documents. Participants expected to be able to save a 'local' copy once they had located content of relevance. As no download facility is provided, this led to some frustration and undermined the overall potential value of the site for some participants.”

Page 20: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Timetable for rest of project

Now - Prototype version of interface shared with project

Throughout 2014 - Ongoing creation of OCR, and other related technical work (OLR, Named Entities)

Throughout 2014 - Live version of website improved / usability testing / added content

Autumn 2014 - Final project conference

Late 2014 - Newspaper browser completed with content and tools from project

More information at http://www.europeana-newspapers.eu/

Interface at http://www.theeuropeanlibrary.org/tel4/newspapers/

Page 21: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Things the users didn’t say (but we thought they would)

Page 22: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Why can’t I edit the text?

(Our sample was researchers; maybe it is other communities that are interested in crowdsourcing?)

Note: If time permits, The European Library will develop a crowdsourcing feature

Page 23: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Can I download text for data mining?

Remember: Digital Humanists are still a small percentage of humanists and users

Note: Many, although not all, of the texts are marked public domain, so this is feasible in legal terms

Page 24: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Number of digitised pages in interface: c.2m

Number of digitised pages in European libraries: c.130m

Number of physical pages in European libraries: 1.5bn+

Source: European Newspaper Survey Report http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf
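
To make the scale of the gap concrete, the three figures above can be expressed as proportions. The sketch below uses only the approximate counts quoted on this slide; because the physical total is given as "1.5bn+", the shares of the physical holdings are upper bounds.

```python
# Approximate page counts quoted on the slide ("c." and "+" mark them as rough).
PAGES_IN_INTERFACE = 2_000_000
PAGES_DIGITISED = 130_000_000
PAGES_PHYSICAL = 1_500_000_000  # lower bound ("1.5bn+")


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


interface_of_digitised = percentage(PAGES_IN_INTERFACE, PAGES_DIGITISED)
interface_of_physical = percentage(PAGES_IN_INTERFACE, PAGES_PHYSICAL)
digitised_of_physical = percentage(PAGES_DIGITISED, PAGES_PHYSICAL)

print(f"Interface as share of digitised pages: {interface_of_digitised:.2f}%")
print(f"Interface as share of physical pages:  {interface_of_physical:.2f}%")
print(f"Digitised as share of physical pages:  {digitised_of_physical:.2f}%")
```

On these figures the interface holds roughly 1.5% of what has been digitised and about 0.13% of the physical holdings, while less than 9% of the physical pages have been digitised at all.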

Page 25: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Source: European Newspaper Survey Report http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf

Quantities of newspapers – a) in project b) digitised in total c) in physical libraries

Page 26: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

The project digital library is only a fraction of the newspaper archive of the continent, indeed the world

Page 27: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

As libraries, how should we represent that absence to users?

Page 28: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Should such absence be represented in the interface itself?

Page 29: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Vast

white

spaces in the list of results ?

Page 30: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

….. Difficult to represent ‘archival gaps’ when seen in the context of how little has been digitised - creates a needle in the haystack ….

Page 31: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

One placeholder for several metadata records?

But in many cases the metadata does not even exist, so a search interface has less to work with
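
The "one placeholder for several records" idea can be sketched as grouping consecutive missing issue dates into a single range. This is only an illustration under strong assumptions: it supposes a daily title with a known run, which is exactly the metadata the slide notes is often missing. The function name and sample dates are invented.

```python
from datetime import date, timedelta


def missing_ranges(held_dates, start, end):
    """Group the dates in [start, end] with no held issue into (first, last) ranges."""
    held = set(held_dates)
    ranges = []
    run_start = None
    day = start
    while day <= end:
        if day not in held:
            # Open a new run of missing dates if one is not already open.
            if run_start is None:
                run_start = day
        elif run_start is not None:
            # A held issue closes the current run of missing dates.
            ranges.append((run_start, day - timedelta(days=1)))
            run_start = None
        day += timedelta(days=1)
    if run_start is not None:
        ranges.append((run_start, end))
    return ranges


# Invented holdings for one week of an assumed daily title.
held = [date(1914, 8, 1), date(1914, 8, 2), date(1914, 8, 6)]
gaps = missing_ranges(held, date(1914, 8, 1), date(1914, 8, 7))
for first, last in gaps:
    days = (last - first).days + 1
    print(f"[{days} issue(s) not digitised: {first} to {last}]")
```

A results list could then show one placeholder line per range rather than one blank entry per missing issue.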

Page 32: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Standardised information for every digital resource for representing collections, extent of content, licensing and re-use conditions

Page 33: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Standardised information? For every digital resource produced in the world?

Are you kidding?

Page 34: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Charts and graphs external to the interface?

Page 35: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Graphs are the most obvious way of adding context but still very reliant on the library producing such charts

Page 37: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Pieter Francois, winner of BL Labs competition 2013:

“How representative are the historical texts humanities scholars study of the overall body of ‘surviving’ texts that are held in the various library collections?”

labs.bl.uk/Sample+Generator

Page 38: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

There are other issues in the project content too

Major issues

- OCR quality varies
- Different licensing statements from different countries
- Date of copyright boundaries different in each country

Page 39: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

There are other issues in the interface too

Minor issues

- Some pages (2m by 2015) have article segmentation
- Some library content has named entity extraction affecting search results

Page 40: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

What impact do OCR errors have when text mining a large digital collection?

Page 41: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Source: http://homepages.inf.ed.ac.uk/balex/publications/slides-DATeCH.pdf

10M pages, 7 billion words: how much are you actually ignoring when using only the “good” OCR?
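
One way to put a number on that question is to measure what share of the words falls below a chosen OCR quality cut-off. The sketch below assumes each page carries a word count and a confidence score between 0 and 1; the records, field names and thresholds are invented for illustration and are not taken from the cited slides.

```python
# Invented per-page records: a word count and an assumed OCR confidence in [0, 1].
OCR_PAGES = [
    {"id": "p1", "words": 900, "ocr_confidence": 0.95},
    {"id": "p2", "words": 700, "ocr_confidence": 0.62},
    {"id": "p3", "words": 400, "ocr_confidence": 0.81},
]


def share_of_words_ignored(pages, threshold):
    """Fraction of all words that sit on pages below the confidence threshold."""
    total = sum(page["words"] for page in pages)
    if total == 0:
        return 0.0
    dropped = sum(page["words"] for page in pages if page["ocr_confidence"] < threshold)
    return dropped / total


for threshold in (0.7, 0.9):
    share = share_of_words_ignored(OCR_PAGES, threshold)
    print(f"threshold {threshold}: {share:.0%} of words ignored")
```

Raising the threshold improves the quality of what is mined but shrinks the corpus, so the ignored share is worth reporting alongside any text-mining result.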

Page 42: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

How should we allow users better ways to understand the digital library?

Page 43: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

What role can the API play in this?

Can opening up the data in the digital library, and allowing it to be explored in different ways, help?

Page 44: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Traditional Model:

Interface (Created by Library)

Data (Published by Library)

With an API:

Interface (Created by Third Party)

Data (Published by Library)

API: Application Programming Interface
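
The "With an API" model can be sketched as third-party code that takes data published by a library and builds its own view of it. The example below is a sketch under stated assumptions: the JSON string stands in for an API response, and its structure, field names and titles are hypothetical rather than those of any real service.

```python
import json
from collections import Counter

# Stand-in for a JSON response from a library API. The structure and field
# names are hypothetical, not those of any real service.
SAMPLE_RESPONSE = """
{"results": [
  {"title": "Example Gazette", "date": "1914-08-01"},
  {"title": "Example Gazette", "date": "1914-08-02"},
  {"title": "Sample Courier", "date": "1918-11-11"}
]}
"""


def issues_per_year(response_text):
    """Third-party logic: count issues per year from library-published data."""
    records = json.loads(response_text)["results"]
    return Counter(record["date"][:4] for record in records)


def render_bar_chart(counts):
    """Render the counts as a plain-text bar chart, one line per year."""
    return "\n".join(f"{year} {'#' * n}" for year, n in sorted(counts.items()))


print(render_bar_chart(issues_per_year(SAMPLE_RESPONSE)))
```

The library only has to publish the data; a view such as issues per year, which shows where the collection is thin, can be built by someone else.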

Page 45: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Pioneering work of Trove API (or rather of Tim Sherratt)

Page 46: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Interface (Created by Library)

Data (Published by Library)

Trove Newspapers site as published by National Library of Australia, and based on data provided by Library

http://trove.nla.gov.au/newspaper

Page 47: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Trove Newspapers statistics developed by third party, based on data provided by library

http://wraggelabs.com/shed/trove/graphs/

Interface (Created by Third Party)

Data (Published by Library)

Page 48: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Headline Roulette, developed by third party, based on data provided by library

http://wraggelabs.com/shed/headline-roulette/

Interface (Created by Third Party)

Data (Published by Library)

Page 49: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Word Count of Articles, developed by third party, based on data provided by library

http://dhistory.org/frontpages/53/words/

Interface (Created by Third Party)

Data (Published by Library)

Page 50: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Sounds great! But …?

Page 51: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

How many people in this audience would know how to build an interface on top of an API?

Page 52: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

How many users do you know who could build on top of an API?

Page 53: Representation and Absence in Digital Resources: The Case of Europeana Newspapers
Page 54: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

That is the problem I leave you to discuss

Thank you.


Project Details: http://europeana-newspapers.eu

Interface: http://www.theeuropeanlibrary.org/tel4/newspapers

Page 55: Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Credits

Desert: https://www.flickr.com/photos/aigle_dore/5952236932/sizes/l

Borges Sign: https://www.flickr.com/photos/monceau/7705020640/

Map: http://gallica.bnf.fr/ark:/12148/btv1b530299707

Strike Fighter: http://en.wikipedia.org/wiki/Strike_fighter