Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability
Transcript of Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability
![Page 1: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/1.jpg)
Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability
Nicholas Taylor (@nullhandle)
Web Archiving Service Manager
Stanford University Libraries
Archives 2016
209 - Balancing Quality of Life and Quality Assurance
August 4, 2016
![Page 2: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/2.jpg)
QA panelists
• Dory Bower, Government Publishing Office
• Lori Donovan, Internet Archive / Archive-It
• Dallas Pillen, Bentley Historical Library
• Nicholas Taylor, Stanford University Libraries
• Alex Thurman, Columbia University Libraries
![Page 3: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/3.jpg)
balancing QA + quality of life?
“Tab Tatham "junk. balance scales."” by ▓▒░ TORLEY ░▒▓ under CC BY-SA 2.0
![Page 4: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/4.jpg)
overheard re: QA @ SAA 2015
• “we set and forget; I’m just glad we’re doing something”
• “did more QA at the beginning but, well, I don’t really look at the reports any more”
• “steady, ongoing QA is challenging”
• “occasionally I set aside a lunch hour to do some QA”
• “my strategy right now is to let the big schools figure it out”
![Page 5: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/5.jpg)
2015 SAA WebArchRT discussion
• if you could only apply 3 QA practices to your web archives, which 3?
• do you apply different QA practices to web archives created for different use cases?
• how do you ensure that staff time allocated to QA is best spent?
![Page 6: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/6.jpg)
quality assurance in the lifecycle
Archive-It: “The Web Archiving Life Cycle Model”
![Page 7: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/7.jpg)
quality assurance, expansively
typical QA
• parsing robots.txt
• scoping rules
• object count limits
• test crawling
• inspecting archived site
• reviewing reports
• patch crawling
and more
• seed selection
• assessing live site
• capture tool selection
• crawl scheduling
• crawl duration limits
• monitoring crawl
• archivability advocacy
• training
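As an illustration of the "parsing robots.txt" item, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample robots.txt, the `example-crawler` user agent, the example.org URLs, and the `blocked_urls` helper are all assumptions made up for this example; they are not taken from the slides or from any particular crawler.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that the given robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt for a seed site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

# Hypothetical URLs a crawl of this seed might request.
candidates = [
    "https://example.org/",
    "https://example.org/private/report.html",
    "https://example.org/about",
]

for url in blocked_urls(SAMPLE_ROBOTS, "example-crawler", candidates):
    print("blocked by robots.txt:", url)
```

A check like this before a test crawl may help distinguish a complete robots.txt block of a seed from a partial block of some paths.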
![Page 8: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/8.jpg)
3rd highest desired skill
[Bar chart; percentage axis from 0% to 80%. Categories: Appraisal + Selection; Archiving Tools; Collaboration + Communication; Domain Expertise; Metadata; Quality Assurance; Software Development; Web Technologies; Other]
NDSA: “2015 NDSA Web Archiving Survey”
![Page 9: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/9.jpg)
low perceived programmatic progress
[Bar chart; percentage axis from 0% to 60%. Categories: Vision + Objectives; Policy; Resources + Workflow; Risk Management; Appraisal + Selection; Scoping; Data Capture; QA + Analysis; Storage + Organization; Preservation; Metadata/Description; Access/Use/Reuse]
NDSA: “2015 NDSA Web Archiving Survey”
![Page 10: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/10.jpg)
greatest collaboration interest
[Bar chart; percentage axis from 0% to 70%. Categories: Policy + Risk Management; Capture Configuration; Collaborative Collection Dev; Input on APIs + Standards; Metadata Standards; QA Techniques + Strategies; Tool Dev; Other]
NDSA: “2015 NDSA Web Archiving Survey”
![Page 11: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/11.jpg)
RETHINKING QA AT STANFORD
“stanford13” by Paradoxotaur under CC BY-SA 2.0
![Page 12: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/12.jpg)
web archiving at Stanford
• 7 Archive-It accounts
• Heritrix, Webrecorder
• local preservation, discovery, access
• program manager, curators, students
• tens of collections
• thousands of seeds
Internet Archive: “Stanford University Homepage”
![Page 13: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/13.jpg)
quality assurance goals
• maximize impact + efficiency of QA efforts
• enable diverse, distributed, + approachable contributions
• calibrate investments in quality based on tool capabilities
“Goals” by Eric Peacock under CC BY-NC-SA 2.0
![Page 14: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/14.jpg)
capture, behavior, appearance
[Diagram labels: appearance; behavior; capture]
NYARC: “I. Introduction - NYARC Documentation”
![Page 15: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/15.jpg)
capture, behavior, appearance
[Diagram labels: appearance; behavior; capture]
NYARC: “I. Introduction - NYARC Documentation”
![Page 16: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/16.jpg)
in practice
care more about…
• report data
• crawl finishing
• 4xx, 5xx, complete robots.txt block
• plausible duration
• plausible object counts
• scoping out extraneous content
• new seeds
care less about…
• visual inspection
• reviewing every capture
• appearance fidelity
• behavior fidelity
• partial content out of scope
• partial content blocked by robots.txt
• ongoing seeds
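The "care more about" checks above lend themselves to automation over crawl report data. The following is a rough sketch only: the `CrawlReport` fields, the `triage` function, and every threshold are hypothetical and would need to be mapped onto whatever a given capture tool actually reports.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlReport:
    """Hypothetical per-seed summary of one crawl."""
    seed: str
    finished: bool
    duration_minutes: float
    object_count: int
    status_counts: dict = field(default_factory=dict)  # HTTP status -> count
    is_new_seed: bool = False


def triage(report, *, max_error_share=0.2, min_objects=10,
           max_objects=500_000, min_minutes=1.0, max_minutes=72 * 60):
    """Return reasons a crawl deserves human attention (thresholds are assumed)."""
    flags = []
    if not report.finished:
        flags.append("crawl did not finish")
    total = sum(report.status_counts.values())
    errors = sum(n for code, n in report.status_counts.items()
                 if 400 <= code < 600)
    if total and errors / total > max_error_share:
        flags.append("high share of 4xx/5xx responses")
    if not (min_minutes <= report.duration_minutes <= max_minutes):
        flags.append("implausible duration")
    if not (min_objects <= report.object_count <= max_objects):
        flags.append("implausible object count")
    if report.is_new_seed:
        flags.append("new seed")
    return flags


# Made-up example data: one unremarkable crawl, one that needs a look.
healthy = CrawlReport("https://example.org/", True, 45, 1200,
                      {200: 1100, 404: 80, 500: 20})
suspect = CrawlReport("https://example.net/", False, 0.5, 3,
                      {200: 1, 403: 2})

for report in (healthy, suspect):
    print(report.seed, triage(report) or "ok")
```

Only flagged crawls would then go to a person for review, which is one way to spend limited QA time on report data rather than on visual inspection of every capture.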
![Page 17: Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability](https://reader036.fdocuments.us/reader036/viewer/2022062412/58e74b781a28ab91558b5105/html5/thumbnails/17.jpg)
more next from Lori, Alex, Dallas, Dory
“Olympic Relay Handoff” by Dr. Mark Kubert under CC BY-NC-ND 2.0