Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter...

20
Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National Library of New Zealand Te Puna Mātauranga o Aotearoa

Transcript of Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter...

Page 1: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o AotearoaPeter McKinney Digital Preservation Policy AnalystNational Library of New Zealand Te Puna Mātauranga o Aotearoa

Page 2: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Agenda

• Relationship with National Library of China

• National Library mandate• Collection analysis• Preservation of those collections• Future plans

Page 3: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Agreement on Cooperation: NLZ and NLNZ

- Drawn up in the spirit of friendship- Principles of equity and reciprocity- Purpose is to enhance the ability to

contribute to improvement of cultural and economic life of their respective nations

- Cooperation on digital library matters Share knowledge and information in relation to Digital Preservation

Page 4: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Mandate“The Minister may, by notice in the Gazette, authorise the

National Librarian to make a copy, at any time or times and at his or her discretion, of public documents that are internet documents in accordance with any terms and conditions as to format, public access, or other matters that are specified in the notice.” National Library Act 2003

Te Wehenga, by Cliff Whiting, 1974. CAC-readingroomdetail_00002493

Page 5: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Two web archiving programmes

http://topics.breitbart.com/fishing+pole/ http://www.trimarinegroup.com/operations/fleet.php

Domain Harvesting of the entire “New Zealand Internet”(2008, 2010, 2013)

Selective harvesting of specific websites or parts of websites (since 1999)

Page 6: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Collections

Selective Harvests (since 1999)No. of Websites 14,943

No. of ARC files 71,374

Size on disk (Tb) 5.63

Total number of files in harvest IE (excluding those within the ARC)

c.291,000

No. of Files contained within ARCs 87,112,551

Page 7: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

2008 2010 20130

2

4

6

8

10

12

0

50,000,000

100,000,000

150,000,000

200,000,000

250,000,000

No. of files (within WARCs)

Data size

Tera

byte

s

No

of fi

les

Whole of Domain Harvests

Collections

Page 8: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Digital preservationThoughts for today1. How much information?2. Whole of domain into preservation system3. ARC to WARC migration

Page 9: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Digital preservationSimply:1. What are the technical characteristics of the content?

- What format is it in?- What are the properties the file exhibits?- Is it a valid expression of the specification?

2. What is the institution’s capability in terms of rendering the content?

Page 10: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Digital preservationWe use that information for:• Repository analysis• Risk assessment• Preservation planning• Preservation action• Evaluation

Rice, Anne Estelle, 1879-1959. Artist unknown :[Embroidered Chinese silk shawl belonging to Katherine Mansfield. Made ca 1900. Detail of bird in flight]. Artist unknown :[Embroidered Chinese silk shawl belonging to Katherine Mansfield] [made ca 1900]. Ref: D-014-007-detail-2. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/22688625

Page 11: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Validation stack - current

Page 12: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Validation stack - current

Thoughts1. Mimetype is insufficient2. Not enough information about

the website files to support preservation management

Page 13: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Extracted information - mimetypesTotal number of mime types across selective harvests= 543

text/h

tml

imag

e/jpeg

imag

e/gif

imag

e/png

applica

tion/pdf

text/p

lain

text/d

ns

text/c

ss

applica

tion/x-jav

ascrip

t

text/x

ml

no-type

applica

tion/x-sh

ockwav

e-flash

applica

tion/msw

ord

text/j

avasc

ript

applica

tion/java

script

applica

tion/atom+x

ml

applica

tion/xml

applica

tion/rss+x

ml

applica

tion/octe

t-stre

am100000

1000000

10000000

100000000

Top 20 Mime Types - log scale

Qty

Page 14: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Validation stack – proposed

Page 15: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Validation stack – proposed

Thoughts1. METS file is ‘large’2. Take only format and fixity?3. Take only format summary at

ARC level?

Page 16: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

NDHA Overview

Publishing

Delivery

Search ToolsEg. Find / Tapuhi /

Voyager /Beta

PERMANENT REPOSITORY

DEPOSIT

Management UI Back-Office UI

STAGINGTA

AssessorArranger Approver

Web Curator Tool

EnrichmentValidation Stack

Rosetta

Domain Harvests into preservation system

Page 17: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Whole of Domain legal considerations

- Films, Videos, and Publications Classifications Act 1993; - Criminal Procedure Act 2011; - Defamation Act 1992; - Human Rights Act 1993; - Copyright Act 1994;- Privacy Act 1993

Page 18: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Legal- Objectionable/illegal

material- Copyright - Is giving access

“republishing”

Technical/Intellectual- One corpus?- Full text search?- Themed?- What is the data

model?

Whole of Domain legal considerations

Page 19: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

ARC to WARC conversionWhy?• Take advantage of most up to

date standard• Normalise all web content

Critical considerations- Stakeholder engagement- Development of tools- Proofs of conversion

Page 20: Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.

Conclusions

• Scale of harvests challenge processes for description, access and preservation

• Preservation of harvests at NLNZ is at a basic level• Want to move to fuller preservation management of the

website files