Workshop on Web Archiving€¦ · Ditigal Impact on the Humanities 8 (based on Alvin Toffler:...
Transcript of Workshop on Web Archiving€¦ · Ditigal Impact on the Humanities 8 (based on Alvin Toffler:...
Workshop KU
02.11.2016 netlab.dk
Workshop on Web Archiving
MODULE 1: WEB ARCHIVING
Niels Brügger
Asger Harlung
netlab.dk
Workshop KU
02.11.2016
Program 02.11.2016, KU
2
12.00-13.00 Workshop part 1: Web archiving
13.00-13.15 Coffee
13.15-14.00 Workshop part 2: Existing collections
14.00-14.10 Short break
14.10-15.00 Workshop part 3: Doing your own archiving (incl. hands-on)
15.00-15.05 Wrap-up; feedback
15.05-15.30 Open discussion: Thoughts and questions
netlab.dk
Workshop KU
02.11.2016
Module 1: Web Archiving
3
• Introducing ourselves and NetLab
• Why archive the web
• Two examples; dr.dk’s history and Probing a Nations Web
• Three kinds of digital content
• WWW as technology
• What is web archiving?
• Methods of web archiving
• Challenges for the web crawler
• Crawling — advantages/disadvantages
• Characteristics of the archived web
netlab.dk
Workshop KU
02.11.2016
Introducing Ourselves and NetLab
4
Niels Brügger – Professor (MSO, with special responsibilities) in Internet Studies and Digital Humanities, Head of NetLab, and of the Centre for Internet Studies, specialising in internet research since 1997.
Asger Harlung – MA in ICT and learning, has previously worked with research in digital rhetoric, and supporting creativity development in learning processes
… and the rest of NetLab is:
Janne Nielsen – Assistant Professor, PhD, Media Studies, AU; author of the book: Using Web Archives in Research – an Introduction (2016)
Ulrich Have – IT Architect, MA in Information Studies. At NetLab, he is developer of tools for research using archived web material and does data analytics.
netlab.dk
Workshop KU
02.11.2016
Introducing Ourselves and NetLab
5
A research infrastructure for internet research.
Part of the Danish research infrastructure Digital Humanities
Lab (DIGHUMLAB).
Established in 2012.
Research driven development of research infrastructures.
Digital Humanities Lab
Language Tools
(KU)
Media Tools
(AU)
Interaction & Design
(AAU & SDU)
Audio and visual
materials
NetLab
Online
Archived
Netarkivet
(the Danish
national web
archive)
NetLab
Forum
IT architect
Collecting
data for
specific
projects
2017-18, micro stipends
6
netlab.dk
Workshop KU
02.11.2016
Ditigal Impact on the Humanities
7
We are living in an
ever more digitised
world …
We know that …
why are they
stating the
obvious?
netlab.dk
Workshop KU
02.11.2016
Ditigal Impact on the Humanities
8
(based on Alvin Toffler: TheThird Wave (1980))
• Neolithic revolution, ca. 12,500 years ago: Agriculture is the first
technological revolution, and replaces hunter-gatherer culture.
Hundreds or thousands of years between technogical advances.
• Industrial revolution (I and II), ca. 1760-1950’s. Decades
between technological advances, but faster advancements
since late in the 19th century (”second industrial revolution”)
• Post-industrial age, information age, digital age … 1950’s and
onwards. Years, months, sometimes weeks between
technological leaps. State of the art today may be obsolete
before the end of the year (a state of perpetual ”future shock”).
netlab.dk
Workshop KU
02.11.2016
Ditigal Impact on the Humanities
9
• 2000: 75% of the world’s data was stored in analog form
(paper, film, photographic prints, vinyl, magnetic
casette tapes, etc.),
• 2007: 7% analog, 93% digital
• 2012: Only 2% of all stored data was stored in analog
form.
Mayer-Schönberger & K. Cukier (2013):
Big Data: A revolution that will transform how we live, work, and think. Houghton Mifflin
Harcourt Publishing Company, New York, 2013, pp. 8-9
netlab.dk
Workshop KU
02.11.2016
Why Archive the Web?
10
• To preserve the cultural heritage
• To preserve a stable research object
• To be able to document and illustrate a study
12
• Started in 2007
• Analyse the historical development of
dr.dk
• Partly based on archived web content
• www.drdk.dk
Dr.dk’s history 1996-2006
netlab.dk
Probing a Nation’s Web Domain — from Small Data to Big Data
13
Website
Web page
Web element
Web sphere
Web sphere
netlab.dk
Probing a Nation’s Web Domain — from Small Data to Big Data
14
The historical development of an entire national web:
.dk 2005-2015
The project is a collaboration with Netarkivet.
2006 2009 2012 2015
netlab.dk
Probing a Nation’s Web Domain — from Small Data to Big Data
15
Brutto list of 'probes’:
• Size — e.g. bytes
• Space — e.g. geolocalisation
• Structure — e.g. network of hyperlinks
• Liveliness — e.g. domain names and updating
• Content — e.g. degrees of openness, files, software types,
language, website textual elements, semantics
Digitised Formerly analogous media, transferred to a digital form.
Born Digital Has not previously existed in any other form than digital.
Reborn Digital Born digital content which has been gathered and
preserved, and to some extent has been changed in the
process.
16
WWW — one among other internet protocols:
http — Hyper Text Transfer Protocol
URL — Uniform Resource Identifier (Locator)
html — Hyper Text Markup Language
Constructing a URL on WWW:
protocol://subdomain.domain.topdomain/path/page/
http://cc.au.dk/research/researchprograms/
20
Web pages = patched together in an ‘empty’ shell (stylesheet) of material from databases
21
The browser (Safari, Firefox...) translates html into writing, pictures etc.
Network of computers
html html html html html html html html html html html html
Computer (webserver)URL, dr.dk
Computer (user)
http
http
Computer (webserver) as database, CMS (Content Management System), URL dr.dk
Web pages = html-files
Images
Heading
Words
Computer (webserver) as database, URL, e.g. dmi.dk
Weather
Comp. X
Comp. Y
22
Small Exercise: Source Code
netlab.dk
Workshop KU
02.11.2016
What is Web Archiving?
23
International Internet Preservation Consortium’s definition:
”… the process of gathering up data that has been published
on the World Wide Web, storing it, ensuring the data is
preserved in an archive, and making the collected data
available for future research.”
(http://netpreserve.org/about-us)
”Any form of deliberate and purposive preserving of web
material.” (Brügger, 2011:25)
Brügger, Niels (2011): ”Web Archiving – Between Past, Present, and Future”. IN: Mia
Consalvo and Charles Ess (eds.) The Handbook of Internet Studies. Wiley-Blackwell.
netlab.dk
Workshop KU
02.11.2016
What is Web Archiving?
24
Macro archiving
• Cultural heritage institutions
• Preserve as much as possible
• Big and varied data
• IT expertise, advanced technology, computer power
Micro archiving
• Individual researcher/research group
• Stablize a concrete research object, here-and-now
• No experience, no advanced technology or computer
power
netlab.dk
Workshop KU
02.11.2016
Methods of Web Archiving
25
• Web crawling (hyperlink crawling)
• Screen image
• Screen filming
• Harvesting via API
• (Delivery from producers)
netlab.dk
Workshop KU
02.11.2016
Web Crawling
26
domain.com
page
page page
page
page page page
page
page
netlab.dk
Workshop KU
02.11.2016
Web Crawling
27
domain.com
page
page page
page
page page page
crawler
page
page
1
0
2
3
28
domain.dk
page page page page
page page page
page page page
page page page
page page page page page page
URL URL URL URL URL …
domain.dk
page
page
page page
page
page
page
page
page
page
page
page
page
page
page
page
page
page
page
page
domain.dk
page page page page
page page page
page page page
page page
page page page page page page
crawler
crawler
domain.dk
domain.com
netlab.dk
Workshop KU
02.11.2016
Challenges for the crawler
29
• JavaScripts
• Content based on Flash
• Interactive pages
• Streamed content
• Websites with access limitations (password, captcha)
• Cookies, adds, plugins etc.
• Robots.txt
• Deep web (e.g. databaser, ftp-server, password-protected
content, hidden content, pages not linked to, dynamic
content based on requests).
http://da.wikipedia.org/wi
ki/CAPTCHA
netlab.dk
Pages not being crawled
30
✔
domain
✔
✔ ✔ ✔ ✔
✔ ✔ ✔
✔ ✔ page
✔ ✔
✔ ✔ ✔ ✔ ✔ ✔
page page page page
Not crawled
– too deep
page page
Not crawled
– password
protected
domain
page page Not
crawled –
robots.txt
page page
Not crawled – script
31
Elements not crawled _ Netarkivet
32
Elements not crawled _ Netarkivet
33
Elements not crawled _ Internet Archive
netlab.dk
Workshop KU
02.11.2016
Crawling, Advantages
34
• The entire page in full length
• Hyperlinks, link source as well as target
• Look and feel of live web (with limitations)
• Automatic (partly, evaluation and trouble shooting)
• Machine readable, enables search, sorting, analysis
• Access to metadata (crawl logs)
• Robust format (html)
• Big data-analysis (content analysis, network analysis, etc.)
netlab.dk
Workshop KU
02.11.2016
Crawling, Disadvantages
35
• Some objects not archived, e.g. videos and streamed
content, and applications based on Flash, JavaScript etc.
• Temporal inconsistencies
• Difficult to delimit in terms of spatial extent
• Risk of web crawler being caught in ’bot traps’ (some
monitoring is necessary)
netlab.dk
Workshop KU
02.11.2016
Characteristics of the Archived Web
36
What is archived is not a 1:1 copy of the material one attempted to archive
It is versions/reconstructions:
• Created in the process of archiving
• On the basis of a number of choices made by the archiver
(harvesting strategy, settings, etc.)
• The choices made have consequences for what is
archived
• The archived objects are re-assembled in the archive
’replay’
netlab.dk
Workshop KU
02.11.2016
Characteristics of the Archived Web
37
The archived version is deficient because of:
• Technical challenges
• Web’s specific characteristics: dynamic, unpredictable
• Potential asynchronicity between updating and archiving
→ archiving takes time
→ certain elements cannot be archived
It is an added challenge that we do not know what is missing:
• Not much documentation
• No baseline to compare with
netlab.dk
Workshop KU
02.11.2016
Characteristics of the Archived Web
38
As scholars using archived web as an object of study, it is important that we are aware of the pitfalls and sources of error inherent in the material.
”During the Olympics in Sydney in 2000, I wanted to save the
website of the Danish newspaper JyllandsPosten. I began at the
first level, the front page, on which I could read that the Danish
badminton player Camilla Martin would play in the finals a half hour
later.
My computer took about an hour to save this first level, after which
time I wanted to download the second level, ’Olympics 2000’. But on
the front page of this section, I could already read the result of the
badminton finals (she lost).
The website was — as a whole — not the same as when I had
started; it had changed in the time it took to archive it, and I could
now read the result on the front page, where the match was
previously only announced.”
N. Brügger: Archiving Websites, 2005, pp. 22-23
39
netlab.dk
Workshop KU
02.11.2016
Characteristics of the Archived Web
40
Exactly what is preserved?
Content, user experience or something else entirely?
”…an archive also must be sure that the document is translated in an
authentic manner. In this case, authenticity means that the document
must both include the context and evoke the experience of the original.”
(Lyman, 2002 s. 41)
netlab.dk
Workshop KU
02.11.2016
Characteristics of the Archived Web
41
It is versions/reconstructions:
• The archived objects are re-assembled in the archive
’replay’
42
IN CONTRAST TO DIGITIZED COLLECTIONS: TO A LARGE EXTENT ARCHIVED WEB IS ALREADY MARKED UP — HTML, FILE NAMES...
html + files
Online web archiving
Link list Named entities