Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
![Page 1: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/1.jpg)
CLEAR+: a Credible Live Evaluation Method of Website Archivability
Vangelis Banos, Yannis Manolopoulos
3 JUNE 2015
NATIONAL DIGITAL INFORMATION INFRASTRUCTURE AND PRESERVATION PROGRAM, LIBRARY OF CONGRESS
Data Engineering Lab, Department of Informatics, Aristotle University, Thessaloniki, Greece
ARCHIVEREADY.COM
![Page 2: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/2.jpg)
Table of Contents
1. Motivation and problem definition, related work,
2. Website Archivability,
3. CLEAR+: A Credible Live method to Evaluate Website Archivability,
4. Demonstration: http://archiveready.com/,
5. Experimental Evaluation,
6. Use Cases,
7. Web Content Management Systems Archivability,
8. Discussion – conclusions.
![Page 3: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/3.jpg)
1. Motivation
• Web developer: I’m building a website. Is it going to be archived correctly by a web archive? I don’t know until I see the archived snapshot…
• Web archivist: Can I archive that website? I don’t know, let’s crawl it and we’ll see the results…
• Professor: How can I teach my students about web archiving? 100s of standards but not many relevant apps online…
![Page 4: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/4.jpg)
Problem definition
• Web content acquisition is a critical step in the process of web archiving;
• If the initial Submission Information Package lacks completeness and accuracy for any reason (e.g. missing or invalid web content), the rest of the preservation processes are rendered useless;
• There is no guarantee that web bots dedicated to retrieving website content can access and retrieve it successfully;
• Web bots face increasing difficulties in harvesting websites.
![Page 5: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/5.jpg)
Problem definition
• Web harvesting is automated while Quality Assurance (QA) is mostly manual.
• Web archives perform test crawls.
• Humans review the results, resources are spent.
• After web harvesting, administrators manually review the content and endorse or reject the harvested material.
• Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the bottleneck is.
– (IIPC GA 2012 Crowdsourcing Workshop)
![Page 6: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/6.jpg)
2. Our Contributions
1. The introduction of the notion of Website Archivability,
2. The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure Website Archivability,
3. ArchiveReady.com, a web application which implements the proposed method.
Publications:
• Banos V., Manolopoulos Y.: A quantitative approach to evaluate Website Archivability using the CLEAR+ method, International Journal on Digital Libraries (IJDL), 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a credible method to evaluate website archivability, iPRES’2013, Lisbon, 2013.
![Page 7: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/7.jpg)
Our Aims
1. Mechanism to improve the quality of web archives.
2. Expand and optimize the knowledge and practices of web archivists, supporting them in their decision making and risk management.
3. Standardize the web aggregation practices of web archives, especially QA.
4. Foster good practices in web development, make sites more amenable to harvesting, ingesting, and preserving.
5. Raise awareness among web professionals regarding preservation.
6. Support web archiving training.
![Page 8: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/8.jpg)
What is Website Archivability?
Website Archivability (WA) captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy.
Attention! It must not be confused with website dependability, reliability, availability, safety, security, survivability, maintainability.
![Page 9: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/9.jpg)
CLEAR+: A Credible Live Method to Evaluate Website Archivability
• An approach to producing on-the-fly measurement of Website Archivability,
• Web archives communicate with target websites via standard HTTP,
• Information such as file types, content and transfer errors could be used to support archival decisions,
• We combine this kind of information with an evaluation of the website's compliance with recognized practices in digital curation,
• We generate a credible score representing the archivability of target websites.
![Page 10: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/10.jpg)
The main components of CLEAR+
1. WA Facets: the factors that come into play and need to be taken into account to calculate total WA.
2. Website Attributes: the website homepage elements analysed to assess the WA Facets (e.g. the HTML markup code).
3. Evaluations: the tests executed on the website attributes (e.g. HTML code validation against W3C HTML standards) and the approach used to combine the test results to calculate the WA metrics.
![Page 11: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/11.jpg)
CLEAR+: A Credible Live Method to Evaluate Website Archivability
Accessibility
Cohesion
Standards Compliance
Metadata
![Page 12: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/12.jpg)
Website attributes evaluated using CLEAR+
![Page 13: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/13.jpg)
CLEAR+ Evaluations
1. Perform specific Evaluations on Website Attributes,
2. In order to calculate each Archivability Facet’s score:
• Scores range from 0 to 100%,
• Evaluation significance varies:
– High: critical issues which prevent web crawling or may cause highly problematic web archiving results.
– Medium: issues which are not critical but may affect the quality of web archiving results.
– Low: minor details which do not cause any issues when they are missing but will help web archiving when available.
3. Website Archivability is the average of all Facets’ scores.
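The scoring steps above can be sketched in Python. This is a minimal illustration, not the ArchiveReady implementation: the numeric significance weights (High = 1.0, Medium = 0.5, Low = 0.25), the function names and the sample evaluation results are all assumptions, since the slides only state that significance varies and that WA is the average of the Facet scores.

```python
from statistics import mean

# Assumed significance weights; the slides do not give numeric values.
SIGNIFICANCE_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.25}


def facet_score(evaluations):
    """Weighted average (0-100) of the evaluation results for one facet.

    `evaluations` is a list of (result, significance) pairs, where
    result is a percentage between 0 and 100.
    """
    total_weight = sum(SIGNIFICANCE_WEIGHTS[sig] for _, sig in evaluations)
    weighted_sum = sum(res * SIGNIFICANCE_WEIGHTS[sig] for res, sig in evaluations)
    return weighted_sum / total_weight


def website_archivability(facets):
    """WA is the plain average of all facet scores."""
    return mean(facet_score(evals) for evals in facets.values())


# Made-up evaluation results, not measurements from the presentation.
example_facets = {
    "accessibility": [(100, "high"), (50, "medium")],
    "cohesion": [(80, "medium"), (100, "low")],
    "metadata": [(100, "medium"), (0, "low")],
    "standards_compliance": [(90, "high"), (100, "low")],
}

wa = website_archivability(example_facets)
print(f"Website Archivability: {wa:.2f}%")
```

Weighting inside a facet while averaging facets equally keeps a single failed high-significance test from being diluted by many passed low-significance ones.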
![Page 14: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/14.jpg)
Accessibility
![Page 15: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/15.jpg)
Accessibility
• A website is considered accessible only if web crawlers are able to visit its home page, traverse its content and retrieve it via standard HTTP requests.
• Performance is also an important aspect of web archiving. Faster performance means faster web content ingestion.
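One accessibility question that can be answered without a full crawl is whether a site's robots.txt lets a crawler fetch the home page at all. The sketch below uses Python's standard `urllib.robotparser` on robots.txt text that has already been downloaded. The function name, the user agent string and the sample rules are hypothetical, and the slides do not state that ArchiveReady performs the check in this way.

```python
from urllib.robotparser import RobotFileParser


def crawler_may_fetch(robots_txt, user_agent, url):
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content for illustration.
sample_robots = """User-agent: *
Disallow: /private/
"""

homepage_ok = crawler_may_fetch(
    sample_robots, "example-archive-bot", "http://example.org/"
)
private_ok = crawler_may_fetch(
    sample_robots, "example-archive-bot", "http://example.org/private/page.html"
)
print(homepage_ok, private_ok)
```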
![Page 16: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/16.jpg)
Accessibility Evaluations
![Page 17: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/17.jpg)
Cohesion
![Page 18: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/18.jpg)
Cohesion
• Relevant to:
– Efficient operation of web crawlers,
– Management of dependencies within digital curation.
• If files constituting a single website are dispersed across different web locations, acquisition and ingest are likely to suffer if one or more web locations fail.
• Changes that occur outside the website are not going to affect it if it does not use 3rd party resources.
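The idea of cohesion can be illustrated by measuring how many of a page's resources are served from the page's own host. The following is a rough sketch using only the Python standard library. The class and function names, the choice of resource types (images, scripts, stylesheets) and the sample HTML are assumptions for illustration, and the actual CLEAR+ cohesion evaluations may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects URLs of images, scripts and stylesheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            # Only count stylesheet links, not e.g. canonical or feed links.
            if "stylesheet" in (attrs.get("rel") or "").lower().split():
                self.urls.append(attrs["href"])


def local_resource_ratio(page_url, html):
    """Percentage of referenced resources hosted on the page's own host."""
    collector = ResourceCollector()
    collector.feed(html)
    if not collector.urls:
        return 100.0  # nothing external to depend on
    page_host = urlparse(page_url).netloc
    local = sum(
        1
        for url in collector.urls
        if urlparse(urljoin(page_url, url)).netloc == page_host
    )
    return 100.0 * local / len(collector.urls)


# Made-up page content for illustration.
sample_html = (
    '<link rel="stylesheet" href="/css/site.css">'
    '<script src="https://cdn.example.net/lib.js"></script>'
    '<img src="logo.png">'
    '<img src="http://example.org/images/a.png">'
)
ratio = local_resource_ratio("http://example.org/index.html", sample_html)
print(f"Local resources: {ratio:.1f}%")
```

Relative URLs are resolved against the page URL before the host comparison, so `logo.png` and `/css/site.css` count as local.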
![Page 19: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/19.jpg)
Cohesion Evaluations
![Page 20: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/20.jpg)
Metadata
![Page 21: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/21.jpg)
Metadata
• The adequate provision of metadata has been a continuing concern within digital curation.
• The lack of metadata impairs the archive’s ability to manage, organise, retrieve and interact with content effectively.
• Metadata may include descriptive or technical information.
• Metadata increases the probability of successful information extraction and reuse in web archives after ingestion.
![Page 22: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/22.jpg)
Metadata Evaluations
![Page 23: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/23.jpg)
Standards Compliance
![Page 24: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/24.jpg)
Standards Compliance
• Compliance with standards is a recurring theme in digital curation practices. For digital resources to be preserved, it is recommended that they be represented in known and transparent standards.
![Page 25: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/25.jpg)
Standards Compliance Evaluations
![Page 26: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/26.jpg)
4. Demonstration: ArchiveReady.com
- Web application implementing CLEAR+,
- Web interface and also a Web API in JSON,
- Running on Linux, Python, Nginx, Redis, MySQL, PhantomJS headless browser.
![Page 27: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/27.jpg)
archiveready.com DEMO
![Page 28: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/28.jpg)
5. Experimental Evaluation
![Page 29: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/29.jpg)
5. Experimental evaluation
• Questions:
– How can we prove the validity of the Website Archivability metric?
– Is it possible to calculate the WA of a website by evaluating a single webpage?
![Page 30: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/30.jpg)
Experiment 1: Evaluation using datasets
![Page 31: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/31.jpg)
Experiment 1: Evaluation using datasets
![Page 32: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/32.jpg)
Experiment 2: Evaluation by experts
• Experts rank 200 websites according to the quality of their snapshots at the Internet Archive.
• We evaluate the same websites with archiveready.com.
• We calculate Pearson’s correlation coefficient between the expert rankings and the WA scores to check whether they are correlated.
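As an illustration of the last step, Pearson's correlation coefficient can be computed directly from its definition. The sketch below is not the authors' analysis code, and the expert ratings and WA scores in it are made-up values, not data from the experiment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical data: expert quality ratings and WA scores for five websites.
expert_ratings = [2, 4, 3, 5, 1]
wa_scores = [61.0, 85.5, 72.0, 90.0, 55.0]

r = pearson(expert_ratings, wa_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that websites rated highly by the experts also tend to receive high WA scores.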
![Page 33: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/33.jpg)
Experiment 3: WA variance in the pages of the same website
• We evaluate only a single webpage to calculate website archivability. Is this correct?
• Is the homepage WA representative of the whole website WA?
• We use a set of 800 websites and calculate the WA of 10 different webpages for each website to find out.
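One simple way to quantify this question for a single website is to compare the homepage WA with the mean and standard deviation of the WA of its other pages. The sketch below is an assumption about how such a comparison could be done, not the analysis used in the experiment, and the function name and scores are made up.

```python
from statistics import mean, stdev


def wa_variation(homepage_wa, page_was):
    """Summarise how WA varies across pages of one website.

    `page_was` holds the WA scores of the sampled pages (at least two).
    """
    page_mean = mean(page_was)
    return {
        "mean_page_wa": page_mean,
        "stdev_page_wa": stdev(page_was),
        # Positive offset: the homepage scores higher than the average page.
        "homepage_offset": homepage_wa - page_mean,
    }


# Made-up scores for one website, not results from the presentation.
summary = wa_variation(80.0, [78.0, 82.0, 80.0])
print(summary)
```

A small standard deviation and a homepage offset near zero would suggest that the homepage WA is representative of the whole website.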
![Page 34: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/34.jpg)
Experiment 3: WA variation in the pages of the same website
![Page 35: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/35.jpg)
6. Use Cases
![Page 36: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/36.jpg)
Use Case 1: Deutsches Literaturarchiv, Marbach, Germany
• German literature web archiving project,
• http://www.dla-marbach.de/dla/bibliothek/literatur_im_netz/netzliteratur/
• ~3,000 websites are preserved,
• An evaluation of the archivability characteristics of these websites was necessary before crawling,
• The archiveready.com API was used to gain an insight into their properties: http://archiveready.com/docs/api.html
![Page 37: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/37.jpg)
Use Case 2: Academia
• Used by digital curation units, researchers and teachers.
– University of Newcastle, UK,
– Columbia University Libraries,
– Stanford University Libraries,
– University of Michigan, Bentley Historical Library,
– Old Dominion University.
![Page 38: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/38.jpg)
7. Web CMS Archivability
![Page 39: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/39.jpg)
Web CMS Archivability
• CMS dominate the web
– (WordPress, Drupal, Joomla, MovableType, +++)
• CMS constitute a common technical framework for web publishing.
• If a CMS is ‘incompatible’ with some web archiving aspect, millions of websites are affected and web archives suffer.
![Page 40: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/40.jpg)
Web CMS Archivability
• Our contribution:
– We study 12 prominent web CMS.
– We conduct experiments with a sample of ~5,800 websites based on these CMS.
– We make specific observations on the Website Archivability characteristics of each CMS.
• Paper (under review):
– Web Content Management Systems Archivability, Banos V., Manolopoulos Y., ADBIS 2015
![Page 41: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/41.jpg)
Web CMS Archivability
![Page 42: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/42.jpg)
Web CMS Archivability
• Indicative results:
– Drupal has the third highest WA score (82.08%). It has good overall performance and the only issue is the existence of too many inline scripts per instance (15.09).
– DotNetNuke has the second worst WA score in our evaluation (77.2%). We suggest that they look into their RSS feeds (13% correct) and their lack of HTTP caching support (5%).
– Typo3 WA score is average (79%). It has the largest number of invalid URLs per instance (12%).
![Page 43: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/43.jpg)
8. Discussion & Conclusions
![Page 44: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/44.jpg)
Discussion and conclusions
• Introducing a new metric to quantify the previously unquantifiable notion of WA is not an easy task.
• CLEAR+ and Website Archivability capture the core aspects of a website crucial in diagnosing whether it has the potential to be archived with correctness and accuracy.
• Archiveready.com is a reference implementation of the CLEAR+ method.
• Archiveready.com provides a REST API for 3rd parties.
![Page 45: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/45.jpg)
Discussion and conclusions
1. Web professionals
- evaluate the archivability of their websites in an easy but thorough way,
- become aware of web preservation concepts,
- embrace preservation-friendly practices.
2. Web archive operators
- make informed decisions on archiving websites,
- perform large scale website evaluations with ease,
- automate web archiving Quality Assurance,
- minimise wasted resources on problematic websites.
3. Academics
- teach students about web archiving.
![Page 46: Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03](https://reader035.fdocuments.us/reader035/viewer/2022062420/55b5ad94bb61eb7a448b47d3/html5/thumbnails/46.jpg)
THANK YOU
Visit: http://archiveready.com
Contact: [email protected]
Learn More:
• Banos V., Manolopoulos Y.: A quantitative approach to evaluate Website Archivability using the CLEAR+ method, International Journal on Digital Libraries, 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a credible method to evaluate website archivability, 10th International Conference on Preservation of Digital Objects (iPRES’2013), Lisbon, 2013.
ANY QUESTIONS?