Keynote: Unexpected repurposing

Post on 11-Jan-2017



Unexpected Repurposing: the British Library's digital collections and UCL teaching, research and infrastructure

Professor Melissa Terras
Professor of Digital Humanities, UCL Dept of Information Studies
Director, UCL Centre for Digital Humanities
m.terras@ucl.ac.uk, @melissaterras

#openglam

British Library, 28th May 2008. https://web.archive.org/web/20110707135434/http://pressandpolicy.bl.uk/Press-Releases/The-British-Library-19th-Century-Book-Digitisation-Project-343.aspx

Returned to the Library in 2012 and placed under a CC0 public domain license for commercial and non-commercial use.

Scanned page

Optical Character Recognition (OCR) generated text

OCR XML generated by ABBYY FineReader

https://www.flickr.com/photos/britishlibrary

Image on Flickr Commons

https://goo.gl/AC43vs

http://blpublicdomain.wikispaces.com/home

https://historicaltexts.jisc.ac.uk/results?filter=service%7C%7Cbl&tab=date

Data: what can we do with 65,000 books?

224GB compressed ALTO XML
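As a rough illustration of what working with ALTO XML involves, the sketch below pulls plain text out of a page. It relies on the standard ALTO convention that each recognised word is a `String` element carrying a `CONTENT` attribute inside a `TextLine`. The sample fragment and the function names are hypothetical and are not taken from the British Library files or the project code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened ALTO fragment for illustration only.
# Real files are much larger and carry layout coordinates on every element.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="The"/><SP/><String CONTENT="British"/><SP/><String CONTENT="Library"/>
    </TextLine>
    <TextLine>
      <String CONTENT="London"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


page_lines = alto_to_lines(SAMPLE_ALTO)
```

At 224GB compressed, a real run would stream files one at a time rather than hold them in memory; this sketch only shows the shape of the data.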

http://www0.cs.ucl.ac.uk/staff/D.Mohamedally/

Staff and Students, working together

• James Baker, Adam Farquhar
• Melissa Terras, Dean Mohamedally, Tim Weyrich
• Stefan Alborzpour, Stelios Georgiou, Nektaria Stavrou, Wendy Wong, Jonathan Lloyd, Meral Sahin, Divya Surendran, James Durrant, Muhammad Rafdi, Ali Sarraf

Approach

• How can we search the dataset differently?
• Complex and multifaceted needs of humanities researchers
• Boolean and Advanced Search
• Microsoft Azure: 5 APIs were implemented that functionally scale to the data
• Offering unconventional services such as bulk download of text based on metadata queries, word frequency lists, and OCR text previews.
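The services listed above (text selected by a metadata query, word frequency lists, OCR text previews) can be sketched in a few lines. This is a minimal in-memory illustration, not the students' Azure implementation: the record layout, the sample books and every function name are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical book records; the real dataset's metadata fields may differ.
BOOKS = [
    {"id": "b1", "year": 1850, "text": "The cholera spread. The city suffered."},
    {"id": "b2", "year": 1790, "text": "A voyage to the islands."},
]


def select_books(books, start_year, end_year):
    """Metadata query: keep books published within the given year range."""
    return [b for b in books if start_year <= b["year"] <= end_year]


def word_frequencies(books):
    """Word frequency list over the OCR text of the selected books."""
    counts = Counter()
    for book in books:
        counts.update(re.findall(r"[a-z]+", book["text"].lower()))
    return counts


def preview(book, length=20):
    """OCR text preview: the first `length` characters of a book's text."""
    return book["text"][:length]


nineteenth_century = select_books(BOOKS, 1800, 1899)
frequencies = word_frequencies(nineteenth_century)
```

A bulk download would then simply serialise the `text` of whatever `select_books` returns.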

github.com/BL-publicdomain/blpublicdomain

picaguess.herokuapp.com, dx.doi.org/10.5281/zenodo.15980

James Baker, Tim Weyrich, Dean Mohamedally, Jonathan Lloyd, Meral Sahin, Divya Surendran

http://blbigdata.herokuapp.com/

James Baker, Tim Weyrich, Dean Mohamedally, Ali Sarraf, James Durrant, Muhammad Rafdi

github.com/UCL-dataspring

Method

• 65k books from the British Library:
• 17th - 19th century
• 224GB compressed ALTO XML
• UCL High Performance Computing
• Support from RITS and UCLDH
• 4 humanities researchers
• Turn research questions into computational queries
• Learn from the researchers about their needs, wants, desires, and method.

Results

Taking Humanities data to HPC…

https://www.flickr.com/photos/epublicist/3546059144

Case Study 1: History of Medicine, Oliver Duke-Williams, UCL

Case Study 2: History of Images, Will Finley, Sheffield

What did this tell us?

• Best practice recommendations:
– Derived datasets for home use
– Documenting decisions
– Fixed/defined dataset
– Normalisations
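One plausible reading of "Normalisations" is that raw hit counts should be scaled by how much text survives from each year, so that a rise in hits is not just a rise in the number of books printed. The sketch below assumes that reading; the function name, the per-million scale and the example figures are all hypothetical.

```python
def normalise_by_year(hits_per_year, words_per_year, per=1_000_000):
    """Convert raw hit counts into hits per `per` words for each year.

    Years with no recorded word total are skipped rather than guessed.
    """
    return {
        year: hits * per / words_per_year[year]
        for year, hits in hits_per_year.items()
        if words_per_year.get(year)
    }


# Hypothetical figures: ten times the hits, but twenty times the text.
raw_hits = {1800: 5, 1850: 50}
total_words = {1800: 1_000_000, 1850: 20_000_000}
rates = normalise_by_year(raw_hits, total_words)
```

With these made-up numbers the raw counts rise tenfold while the normalised rate halves, which is the kind of distortion normalisation is meant to expose.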

Common Queries

• searches for all variants of a word
• searches that return keywords in context traced over time
• NOT searches for a word or phrase that ignored another word or phrase
• searches for a word when in close proximity to a second word
• searches based on image metadata
… All returned in a derived dataset, in context.
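Several of the query types above reduce to simple operations on a token list. The sketch below illustrates three of them (variants with keywords in context, proximity, and NOT searches) under simplifying assumptions: lower-cased alphabetic tokens, a regular expression standing in for "all variants of a word", and hypothetical function names. It is not the project's code, which ran against the full corpus on HPC.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keywords_in_context(tokens, pattern, window=3):
    """Return (left, keyword, right) for each token fully matching `pattern`.

    A regex such as r"physicians?" covers simple variants of a word.
    """
    regex = re.compile(pattern)
    hits = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def near(tokens, first, second, distance=5):
    """True if `first` occurs within `distance` tokens of `second`."""
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        t == first and any(abs(i - j) <= distance for j in second_positions)
        for i, t in enumerate(tokens)
    )


def contains_but_not(tokens, wanted, excluded):
    """NOT search: `wanted` is present and `excluded` is absent."""
    return wanted in tokens and excluded not in tokens


sample = tokenize(
    "The physician treated cholera in London while physicians "
    "elsewhere feared the fever"
)
context_hits = keywords_in_context(sample, r"physicians?", window=2)
```

Tracing results over time would mean running the same functions per book and grouping the hits by publication year, then saving them as a derived dataset.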

Do try this at home…

1. Invest in research software engineer capacity to deploy and maintain openly licensed large-scale digital collections from across the GLAM sector in order to facilitate research in the arts, humanities and social and historical sciences.

2. Invest in training library staff to run these initial queries in collaboration with humanities faculty, to support work with subsets of data that are produced, and to document and manage resulting code and derived data.

github.com/UCL-dataspring

With thanks to

• BL Labs and Digital Curators: James Baker, Adam Farquhar, Mahendra Mahey, Ben O’Steen, Hana Lewis

• UCL CS Student Project Team: James Baker, Tim Weyrich, Dean Mohamedally

• Bluclobber Project Team: James Baker, James Hetherington, David Beavan, Anne Welsh, Helen O’Neill, Will Finley, Oliver Duke-Williams, Adam Farquhar.

• UCL Research IT Services: James Hetherington, Clare Gryce, Raquel Algere.